๐Ÿ Python & library/PyTorch

[PyTorch/Tensorflow v1, v2] Adding Gradient Clipping

복만 2022. 1. 12. 22:13

Gradient clipping is a technique that limits the values of gradients that grow too large, preventing the exploding gradient problem.

 

Exploding gradients occur especially often in RNNs, but gradient clipping can also be useful in other deep networks.

If the loss suddenly spikes during training and the weight updates start moving in a strange direction, it is worth trying.

 

์•„๋ž˜ ๊ธ€์„ ์ฐธ๊ณ ํ•˜์˜€๋‹ค.

 

https://neptune.ai/blog/understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem

 


 

 

Gradient clipping by value and by norm

Gradient clipping์„ ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์€ ๋‘ ๊ฐ€์ง€๊ฐ€ ์žˆ๋‹ค.

 

The first, clipping-by-value, simply clips every gradient to the range (min_threshold, max_threshold).
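
As a quick illustration, here is what clipping-by-value does, sketched in NumPy (this example is mine, not from the linked article):

import numpy as np

grad = np.array([-3.2, 0.5, 7.1])
clipped = np.clip(grad, -1.0, 1.0)  # components outside [-1, 1] are pinned to the nearest bound
print(clipped)  # [-1.   0.5  1. ]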

 

The other method, clipping-by-norm, applies when the gradient's norm reaches a threshold: the gradient is replaced by the threshold multiplied by the gradient's unit vector. That is:

 

$g \leftarrow \text{threshold} \cdot \frac{g}{\|g\|}$        if $\|g\| \geq \text{threshold}$
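
A minimal NumPy sketch of this rule (the helper name clip_by_norm is mine, for illustration only):

import numpy as np

def clip_by_norm(grad, threshold):
    # rescale grad down to the threshold norm when its L2 norm is too large
    norm = np.linalg.norm(grad)
    if norm >= threshold:
        grad = threshold * grad / norm  # threshold times the unit vector
    return grad

g = np.array([3.0, 4.0])     # ||g|| = 5
print(clip_by_norm(g, 2.0))  # [1.2 1.6], so the norm is now exactly 2

Unlike clipping-by-value, this preserves the direction of the gradient and only limits its magnitude.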

 

์•„๋ž˜์—์„œ Tensorflow์™€ PyTorch๋ฅผ ์ด์šฉํ•ด ์ด ๋‘ ๊ฐ€์ง€ gradient clipping์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค. ์ „์ฒด ์ฝ”๋“œ๋Š” ์œ„ ๋งํฌ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๊ธ€์—์„œ๋Š” gradient clipping ๋ถ€๋ถ„์˜ ์ฝ”๋“œ๋งŒ ์š”์•ฝํ•ด ๋‘์—ˆ๋‹ค.

 

 

Tensorflow (v1)

Use tf.clip_by_value.

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]  # the min and max clip values can be adjusted
train_op = optimizer.apply_gradients(capped_gvs)

 

For clipping-by-norm, replace line 3 with the following:

capped_gvs = [(tf.clip_by_norm(grad, clip_norm=2.0), var) for grad, var in gvs]
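
Tensorflow also offers tf.clip_by_global_norm, which scales all gradients jointly by their combined norm rather than clipping each tensor separately. A minimal sketch reusing gvs from the snippet above (this variant is my addition, not part of the original code):

grads, variables = zip(*gvs)
clipped_grads, global_norm = tf.clip_by_global_norm(grads, clip_norm=2.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))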

 

 

Tensorflow (v2)

The idea is the same as in v1.

with tf.GradientTape() as tape:
    predictions = model(inputs, training=True)
    loss = get_loss(targets, predictions)

gradients = tape.gradient(loss, model.trainable_variables)
gradients = [tf.clip_by_value(grad, -1.0, 1.0) for grad in gradients]
optimizer.apply_gradients(zip(gradients, model.trainable_variables))

 

Clipping-by-norm is likewise the same as in v1:

gradients = [tf.clip_by_norm(grad, clip_norm=2.0) for grad in gradients]
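
As a side note, Keras optimizers expose both clipping modes directly as constructor arguments, so the list comprehension can be skipped entirely (standard tf.keras behavior, though not part of the original snippets):

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipvalue=1.0)  # clipping-by-value
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=2.0)   # per-tensor clipping-by-norm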

 

 

PyTorch

Use nn.utils.clip_grad_value_. PyTorch clips gradients to the range (-clip_value, clip_value).

import torch.nn as nn

optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
optimizer.step()

 

For clipping-by-norm, replace the clip_grad_value_ call with the following:

nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
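
One handy detail: clip_grad_norm_ returns the total norm of the gradients, computed before clipping, so it can be logged to watch for explosions:

total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
print(f"grad norm before clipping: {float(total_norm):.4f}")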
๋ฐ˜์‘ํ˜•