
[Deep Learning Paper Review] Deep Double Descent: Where Bigger Models and More Data Hurt (ICLR 2020)

๋ณต๋งŒ 2020. 12. 22. 10:48

This post is a summary based on the paper

Deep Double Descent: Where Bigger Models and More Data Hurt

presented at ICLR 2020.

 

The paper interprets the double-descent phenomenon observed across a variety of deep learning tasks from the perspective of model complexity,

and argues that in some cases increasing model complexity or the number of training epochs can actually degrade performance.

Classical Statistics vs. Modern Neural Networks

1) Classical statistics: according to the bias-variance trade-off, once model complexity grows beyond a certain point, overfitting sets in and performance actually degrades.

Source: https://www.analyticsvidhya.com/blog/2020/08/bias-and-variance-tradeoff-machine-learning/
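As a quick illustration of this classical picture (my own sketch, not from the paper; the sine target, noise level, and polynomial degrees are arbitrary choices), fitting polynomials of increasing degree to a small noisy dataset reproduces the U-shaped test error:

```python
import numpy as np

def poly_test_mse(degree, n_train=20, n_test=200, noise=0.3, trials=50):
    """Average test MSE of a degree-`degree` polynomial fit to noisy sine data."""
    errs = []
    for seed in range(trials):
        rng = np.random.default_rng(seed)
        x_tr = rng.uniform(-1, 1, n_train)
        x_te = rng.uniform(-1, 1, n_test)
        y_tr = np.sin(3 * x_tr) + noise * rng.standard_normal(n_train)
        y_te = np.sin(3 * x_te)  # noiseless targets for evaluation
        coefs = np.polyfit(x_tr, y_tr, degree)
        errs.append(np.mean((np.polyval(coefs, x_te) - y_te) ** 2))
    return float(np.mean(errs))

# Degree 1 underfits (high bias), degree 15 overfits (high variance),
# and a moderate degree like 3 sits near the bottom of the U-curve.
```

Here model complexity is just the polynomial degree, and both the too-simple and the too-complex model test worse than the moderate one, exactly as the classical trade-off predicts.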

2) Modern neural networks: in most cases, the larger the model, the better the performance. (Bigger models are better)

3) On training time, opinions are divided as well: in some cases early stopping gives better performance, while in others training for more epochs is better.

์™œ ์ƒํ™ฉ๋งˆ๋‹ค ์ด๋ ‡๊ฒŒ ๋‹ค๋ฅธ ๊ฒฐ๊ณผ๊ฐ€ ๋ฐœ์ƒํ• ๊นŒ์š”?

 

 

Two Regimes of Deep Learning Setting

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์˜ Setting์— ๋”ฐ๋ผ ๋‘ ๊ฐ€์ง€ Regime์ด ์กด์žฌํ•œ๋‹ค๊ณ  ๋งํ•ฉ๋‹ˆ๋‹ค.

 

 

 

 

1. Under-parameterized regime (classical regime): the model's complexity is small relative to the number of samples.

In this case, test error as a function of model complexity follows the classical U-shaped bias/variance trade-off curve.

 

2. Over-parameterized regime (modern regime): the model's complexity is large enough that the train error converges to 0.

In this case, increasing model complexity decreases the test error.

Unlike the regime above, this follows the modern intuition that "bigger models are better".

Simply put, as model complexity grows, the test error decreases, then increases, then decreases again.

Belkin et al. named this phenomenon "double descent" in 2018 and showed that it appears across a variety of ML and DL tasks.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š”

 

  • Effective Model Complexity (EMC)๋ผ๋Š” Complexity measurement๋ฅผ ๋„์ž…ํ•˜์—ฌ Double descent์— ๋Œ€ํ•œ ์ •์˜๋ฅผ ์„ธ์› ๊ณ ,
  • ์‹คํ—˜์„ ํ†ตํ•ด ์‹ค์ œ๋กœ ๋งค์šฐ ๋‹ค์–‘ํ•œ setting์—์„œ Double descent๊ฐ€ ๋ฐœ์ƒํ•จ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

 

 

 

Effective Model Complexity (EMC)

Effective Model Complexity (EMC) is a measure of model complexity. It is defined as follows.

EMC์˜ ์ •์˜

 

์—ฌ๊ธฐ์„œ S๋Š” n๊ฐœ์˜ Sample์„ ๊ฐ–๊ณ  ์žˆ๋Š” Train dataset์ž…๋‹ˆ๋‹ค.

 

 

 

Simply put, the EMC is the largest number of training samples on which the model's train error still converges to (near) 0.

The larger the EMC, the higher the model complexity.
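To make the definition concrete, here is a toy sketch (my own illustration, not from the paper) that estimates the EMC of one very simple "training procedure", unregularized least squares with d features, by growing n until the mean train error exceeds the tolerance eps:

```python
import numpy as np

def train_error(n, d, rng, trials=5):
    """Mean train MSE of a least-squares fit on n random samples, d features."""
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = rng.standard_normal(n)
        # Minimum-norm least-squares solution; interpolates whenever n <= d.
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        errs.append(np.mean((X @ w - y) ** 2))
    return float(np.mean(errs))

def estimate_emc(d, eps=1e-6, n_max=200, seed=0):
    """Largest n for which the (averaged) train error stays below eps."""
    rng = np.random.default_rng(seed)
    emc = 0
    for n in range(1, n_max + 1):
        if train_error(n, d, rng) > eps:
            break
        emc = n
    return emc
```

For this procedure the estimate comes out as the parameter count d, since least squares can fit any n ≤ d generic samples exactly. The paper estimates the EMC of deep networks in the same empirical spirit: check how many samples a given architecture and training budget can drive to ≈0 train error.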

Using the EMC, the paper states the Generalized Double Descent hypothesis as follows.

1) When the EMC is sufficiently smaller than the dataset size: Under-parameterized regime

In this case, increasing the EMC decreases the test error.

 

2) When the EMC is sufficiently larger than the dataset size: Over-parameterized regime

In this case, increasing the EMC decreases the test error.

 

3) When the EMC is close to the dataset size: Critically parameterized regime

In this case, increasing the EMC may either decrease or increase the test error.
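The three cases can be reproduced in a toy model. The sketch below (my own illustration, not from the paper; the widths, noise level, and tanh random-features model are arbitrary choices) fits a minimum-norm random-features regression at increasing width p with n = 30 training samples; averaged over trials, the test error typically peaks near the interpolation threshold p ≈ n and descends again beyond it:

```python
import numpy as np

def double_descent_curve(widths=(5, 15, 30, 60, 120), n_train=30,
                         n_test=500, d_in=10, noise=0.5, trials=20):
    """Average test MSE of minimum-norm random-features regression per width."""
    errs = {p: [] for p in widths}
    for seed in range(trials):
        rng = np.random.default_rng(seed)
        w_true = rng.standard_normal(d_in)          # linear ground truth
        X_tr = rng.standard_normal((n_train, d_in))
        X_te = rng.standard_normal((n_test, d_in))
        y_tr = X_tr @ w_true + noise * rng.standard_normal(n_train)
        y_te = X_te @ w_true                        # noiseless eval targets
        for p in widths:
            # Fixed random first layer; only the linear readout is trained.
            W = rng.standard_normal((d_in, p)) / np.sqrt(d_in)
            F_tr, F_te = np.tanh(X_tr @ W), np.tanh(X_te @ W)
            beta, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)  # min-norm fit
            errs[p].append(np.mean((F_te @ beta - y_te) ** 2))
    return {p: float(np.mean(v)) for p, v in errs.items()}
```

Here the width p plays the role of the EMC: the model first reaches ≈0 train error around p ≈ n, which is exactly where the fit is most ill-conditioned and the test error blows up.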

์œ„์˜ ๊ทธ๋ž˜ํ”„์— ๊ฐ ๊ตฌ๊ฐ„์„ ๋‹ค์‹œ ํ‘œ์‹œํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

 

์ด์™€ ๊ฐ™์ด EMC๋ผ๋Š” ๊ฐœ๋…์„ ๋„์ž…ํ•ด Double descent๋ฅผ ์„ค๋ช…ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

๋”ฐ๋ผ์„œ Interpolation threshold(EMC=n์ธ ์ง€์ )์„ ๊ธฐ์ค€์œผ๋กœ,

 

  • Critical interval ์™ธ๋ถ€์—์„œ๋Š” Model complexity๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์ด ์„ฑ๋Šฅ์— ๋„์›€์ด ๋˜๋‚˜
  • Critical interval ๋‚ด๋ถ€์—์„œ๋Š” Model complexity๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์ด ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์„ ๋–จ์–ดํŠธ๋ฆด ์ˆ˜ ์žˆ๋‹ค

๋ผ๊ณ  ์–˜๊ธฐํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

 

However, the width of the critical interval varies with the data distribution, the training procedure, and so on, and is not yet precisely understood; for this reason, the authors present the hypothesis only informally.

Experiments

The hypothesis is validated through a truly wide range of experiments; only a few results are introduced here. Detailed experimental setups and more results can be found in the paper.

 

  • Model-wise double descent: experiments across various datasets, model architectures, optimizers, numbers of train samples, and training procedures show a double-descent curve as a function of model size, with the test error peak appearing at the interpolation threshold. (Bigger models are worse)
  • Epoch-wise double descent: the double-descent phenomenon is observed not only over model size but also over training epochs. (Training longer can correct overfitting)

๋ชจ๋ธ ๋ณต์žก๋„์— ๋”ฐ๋ฅธ ์‹คํ—˜ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. Double Descent ํ˜„์ƒ์ด ๋‚˜ํƒ€๋‚˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
Training epoch์— ๋”ฐ๋ฅธ ์‹คํ—˜ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค.
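For the epoch-wise view, a minimal harness (again my own sketch, not the paper's setup; the model, step size, and noise level are arbitrary) simply records the test error after every epoch of full-batch gradient descent, so the early-stopping optimum can be compared with the end-of-training error. Whether the resulting curve is monotone, U-shaped, or double-descending depends on model size and label noise, which is exactly the paper's point:

```python
import numpy as np

def test_error_by_epoch(p=100, n_train=30, d_in=10, noise=0.5,
                        epochs=500, lr=0.05, seed=0):
    """Test MSE after each epoch of full-batch gradient descent on a
    random-features regression; lets early stopping be compared with
    training to the end."""
    rng = np.random.default_rng(seed)
    w_true = rng.standard_normal(d_in)
    X_tr = rng.standard_normal((n_train, d_in))
    X_te = rng.standard_normal((500, d_in))
    y_tr = X_tr @ w_true + noise * rng.standard_normal(n_train)
    y_te = X_te @ w_true
    W = rng.standard_normal((d_in, p)) / np.sqrt(d_in)  # fixed random layer
    F_tr, F_te = np.tanh(X_tr @ W), np.tanh(X_te @ W)
    beta = np.zeros(p)
    curve = []
    for _ in range(epochs):
        grad = F_tr.T @ (F_tr @ beta - y_tr) / n_train  # full-batch gradient
        beta -= lr * grad
        curve.append(float(np.mean((F_te @ beta - y_te) ** 2)))
    return curve
```

Comparing `min(curve)` (the early-stopping optimum) with `curve[-1]` (training to the end) for different widths p is the epoch-wise analogue of the model-wise sweep above.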

๋ฐ˜์‘ํ˜•