
[Deep Learning Paper Review] AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights (Naver AI Lab, ICLR 2021)


Adam์˜ ๋ฌธ์ œ์ ์„ ๊ทน๋ณตํ•˜๋Š” ์ƒˆ๋กœ์šด optimizer์ธ AdamP๋ฅผ ์ œ์•ˆํ•˜๋Š” ๋…ผ๋ฌธ์œผ๋กœ, Naver AI Lab & Naver Clova์—์„œ ICLR 2021์— ๋ฐœํ‘œํ•˜์˜€๋‹ค.

 

Paper: https://arxiv.org/pdf/2006.08217.pdf

Project page: https://clovaai.github.io/AdamP/

Code: https://github.com/clovaai/adamp

1. The Problem with Adam

Summary: Momentum-based gradient descent optimizers, including Adam, greatly increase the weight norm during training.

๊ทธ ์›์ธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

 

๋Œ€๋ถ€๋ถ„์˜ ๋ชจ๋ธ๋“ค์—์„œ Batch normalization ๋“ฑ์˜ normalization ๊ธฐ๋ฒ•๋“ค์„ ์‚ฌ์šฉํ•ด weight๋ฅผ scale-invariantํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค. ๋”ฐ๋ผ์„œ weight๋“ค์˜ ํฌ๊ธฐ๋Š” ๋ชจ๋ธ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š๊ฒŒ ๋œ๋‹ค.

๋”ฐ๋ผ์„œ ๋ชจ๋ธ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฐ’์€, weight๋“ค์„ l2-norm์œผ๋กœ ๋‚˜๋ˆˆ ๊ฐ’์ด๋‹ค. ์ด๋“ค์„ effective weight $\hat w=\frac{w}{||w||_2}$์ด๋ผ๊ณ  ํ•˜์ž.

However, optimization does not happen in the space of effective weights; it happens in the nominal space where the original weights live.
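
To see what this scale invariance means in practice, here is a minimal PyTorch sketch (my own illustration, not from the paper): a convolution followed by batch normalization produces the same output after its weight is rescaled, so only the direction $\hat w$ matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(8)  # normalizes conv's output with batch statistics
bn.train()

x = torch.randn(16, 3, 32, 32)
out_before = bn(conv(x))

with torch.no_grad():
    conv.weight.mul_(10.0)  # same direction w_hat, 10x larger norm ||w||_2
out_after = bn(conv(x))

# The rescaling is cancelled by batch normalization, so the output is unchanged
# up to BN's numerical epsilon.
print(torch.allclose(out_before, out_after, atol=1e-3))  # True
```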

 

์ด๋Ÿฌํ•œ ์ด์œ ๋กœ effective step size์™€ ์‹ค์ œ nominal step size ๊ฐ„์— ์ฐจ์ด๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. ์•„๋ž˜ ๊ทธ๋ฆผ์—์„œ $w_t$๊ฐ€ $w_{t+1}$์œผ๋กœ ์—…๋ฐ์ดํŠธ ๋˜๋ฉด, ์‹ค์ œ ๋ชจ๋ธ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” effective step์€ ์ฃผํ™ฉ์ƒ‰์œผ๋กœ ํ‘œ์‹œ๋œ ๋ถ€๋ถ„๊ณผ ๊ฐ™๋‹ค.

 

 

 

 

The nominal step size and the effective step size differ by a factor of approximately $\frac{1}{||w_{t+1}||_2}$.
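
As a rough first-order check (my own derivation based on the definitions above, not the paper's exact statement): gradients of scale-invariant weights are orthogonal to the weights ($w^\top \nabla_w \mathcal{L} = 0$), so writing the update as $w_{t+1} = w_t + \Delta w_t$ gives

$$\Delta \hat w_t = \frac{w_{t+1}}{\|w_{t+1}\|_2} - \frac{w_t}{\|w_t\|_2} \approx \frac{\Delta w_t}{\|w_{t+1}\|_2},$$

so the larger the weight norm grows, the smaller the step the model actually experiences.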

 

 

 

 

์ผ๋ฐ˜์ ์ธ gradient descent (GD) ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜๋ฉด ํ•™์Šต ๋„์ค‘ weight norm์ด ์ฆ๊ฐ€ํ•˜๋Š” ํ˜„์ƒ์ด ๋ฐœ์ƒํ•˜๋Š”๋ฐ, (Lemma 2.1) ๋”ฐ๋ผ์„œ effective step size $\Delta \hat w_t$๊ฐ€ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ค„์–ด๋“ค๊ฒŒ ๋˜๋ฉฐ ์•ˆ์ •์ ์ธ ํ•™์Šต์„ ๋•๋Š”๋‹ค.

 

 

 

With a momentum-based optimizer such as Adam, however, the weight norm grows even faster (Lemma 2.2).

As a result, as shown in the paper's figure, GD converges slowly but does not increase the weight norm much, whereas GD + momentum moves quickly in the nominal space but also inflates the weight norm so much that the effective step size ends up not being very large.
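
Intuitively, the momentum buffer keeps pushing along accumulated past directions, which are no longer orthogonal to the current weight, so extra radial growth piles up. The toy experiment below (my own construction, not from the paper) optimizes a loss that depends only on the weight's direction and compares the final weight norm with and without momentum; the momentum run typically ends with a noticeably larger $\|w\|_2$.

```python
import torch

torch.manual_seed(0)

target = torch.nn.functional.normalize(torch.randn(100), dim=0)

def loss_fn(w):
    # Scale-invariant toy loss: it depends only on the direction w / ||w||_2.
    return 1.0 - torch.dot(w / w.norm(), target)

def train(momentum, lr=0.1, steps=500):
    w = torch.nn.functional.normalize(torch.randn(100), dim=0).requires_grad_(True)
    velocity = torch.zeros_like(w)
    for _ in range(steps):
        loss_fn(w).backward()
        with torch.no_grad():
            velocity = momentum * velocity + w.grad  # momentum accumulation
            w -= lr * velocity                       # nominal update
            w.grad = None
    return w.norm().item(), loss_fn(w).item()

print("GD            : ||w|| = %.2f, final loss = %.4f" % train(momentum=0.0))
print("GD + momentum : ||w|| = %.2f, final loss = %.4f" % train(momentum=0.9))
```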

 

 

 

์ด๋กœ ์ธํ•ด effective convergence์˜ ์†๋„๊ฐ€ ๋งค์šฐ ๋А๋ ค์ง€๊ฑฐ๋‚˜, sub-optimalํ•œ ํ•ด๋‹ต์„ ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๊ฒŒ ๋œ๋‹ค.

 

 

 

2. Method (SGDP, AdamP)

 

์œ„์—์„œ Adam๊ณผ ๊ฐ™์€ momentum ๊ธฐ๋ฐ˜ optimizer์˜ ๋ฌธ์ œ๋ฅผ ๋ฐํ˜€๋ƒˆ๋‹ค. ํ•™์Šต๊ณผ์ •์—์„œ weight norm์„ ๋งค์šฐ ํฌ๊ฒŒ ์ฆ๊ฐ€์‹œํ‚ค๋Š” ์„ฑํ–ฅ์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, effective convergence์˜ ์†๋„๊ฐ€ ๋งค์šฐ ๋А๋ ค์ง€๊ณ , sub-optimalํ•œ ํ•ด๋‹ต์„ ์ฐพ๊ฒŒ ๋œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

 

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋ฅผ projection์„ ์ด์šฉํ•˜์—ฌ, weight norm์˜ ์ฆ๊ฐ€๋ฅผ ๋ง‰์„ ์ˆ˜ ์žˆ๋Š” optimization ๋ฐฉ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ effective space์—์„œ์˜ update direction์€ ๊ฑด๋“œ๋ฆฌ์ง€ ์•Š๊ณ , effective step size๋งŒ์„ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

 

 

Put simply, the idea is to remove, at every update, the radial component that is parallel to the weight.

The update vector is first computed in the usual way ($p_t$), and then the component parallel to the weight vector is removed by projection ($\Pi_{w_t}(p_t)$).
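
Concretely, this is the usual orthogonal projection onto the hyperplane perpendicular to $w_t$ (written from the definitions above; the paper's notation may differ in details):

$$\Pi_{w_t}(p_t) = p_t - \left(\hat w_t \cdot p_t\right)\hat w_t, \qquad \hat w_t = \frac{w_t}{\|w_t\|_2},$$

so the projected update satisfies $w_t^\top \Pi_{w_t}(p_t) = 0$, i.e. it no longer has a radial component.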

 

This is because the component parallel to the weight vector does not contribute to minimizing the loss; it only increases the weight norm.

 

Here, the hyperparameter $\delta$ acts as a detector for scale-invariant weights, so that the projection is applied only to them; the paper sets $\delta = 0.1$.
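
To make the update rule concrete, below is a deliberately simplified sketch of one projected SGD-with-momentum step (my own illustration; the official SGDP/AdamP implementations at https://github.com/clovaai/adamp additionally handle per-channel views, Nesterov momentum, weight decay, and the Adam variant, and the exact thresholding rule may differ):

```python
import math
import torch

def projected_momentum_step(w, grad, velocity, lr=0.1, beta=0.9, delta=0.1, eps=1e-8):
    """One SGD-with-momentum step that removes the radial (parallel-to-w) component.

    Simplified sketch of the SGDP idea; w, grad, velocity are plain tensors of the
    same shape (no autograd here), and velocity is updated in place.
    """
    velocity.mul_(beta).add_(grad)           # usual momentum accumulation -> p_t
    p = velocity.view(-1).clone()
    w_flat, g_flat = w.view(-1), grad.view(-1)

    # delta acts as the scale-invariance detector: a scale-invariant weight has a
    # gradient that is (nearly) orthogonal to w, i.e. a very small cosine similarity.
    cos = torch.dot(w_flat, g_flat).abs() / (w_flat.norm() * g_flat.norm() + eps)
    if cos < delta / math.sqrt(w_flat.numel()):
        w_hat = w_flat / (w_flat.norm() + eps)
        p = p - torch.dot(w_hat, p) * w_hat  # Pi_{w_t}(p_t): drop the radial part

    w.sub_(lr * p.view_as(w))                # apply the (possibly projected) update
    return velocity
```

In practice you would use the authors' packaged optimizers rather than hand-rolling this; as far as I know, the repository above ships a pip package (`pip install adamp`) that exposes `AdamP` and `SGDP` as drop-in replacements for `Adam` and `SGD`.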

 

 

 

3. Experiments

 

The authors run experiments across a variety of domains and models, including image classification, object detection, audio classification, and language modeling.

Experiment 1. Image classification - Compared with SGD and Adam, SGDP and AdamP consistently achieved better performance.

Experiment 2. Object detection - AdamP consistently outperformed Adam.

Experiment 3. Adversarial learning - This experiment evaluates model robustness; AdamP shows higher robustness than Adam.

๋ฐ˜์‘ํ˜•