
[๋”ฅ๋Ÿฌ๋‹ ๋…ผ๋ฌธ๋ฆฌ๋ทฐ] AIM: Scalable Pre-training of Large Autoregressive Image Models (Apple, 2024)

๋ณต๋งŒ 2024. 1. 21. 23:03

Apple์—์„œ 2024๋…„ 1์›” large pretrained image model์ธ AIM(Autoregressive Image Models)์„ ๋ฐœํ‘œํ–ˆ๋‹ค. ์ฝ”๋“œ์™€ model weight์ด Github์— ๊ณต๊ฐœ๋˜์–ด ์žˆ๋‹ค.

 

๋…ผ๋ฌธ ๋งํฌ: https://arxiv.org/pdf/2401.08541.pdf

GitHub: https://github.com/apple/ml-aim/tree/main


AIM์€ LLM์— ์˜๊ฐ์„ ๋ฐ›์•„ ๋งŒ๋“ค์–ด์ง„ ๋Œ€๊ทœ๋ชจ vision ๋ชจ๋ธ์ด๋‹ค. BEiT (2021), Masked autoencoder(MAE) (2021) ๋“ฑ์ด masked language modeling (MLM)์„ ํ†ตํ•ด ์‚ฌ์ „ํ•™์Šต ์‹œํ‚จ ๊ฒƒ๊ณผ ๋‹ค๋ฅด๊ฒŒ, ์ฃผ์–ด์ง„ ํŒจ์น˜๋กœ ๋‹ค์Œ ํŒจ์น˜๋ฅผ ์˜ˆ์ธกํ•˜๋Š” autoregressive object๋ฅผ ์ด์šฉํ•˜์—ฌ ์‚ฌ์ „ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ๋‹ค.

 

AIM์˜ ์ฃผ์š” contribution์€ vision ๋ชจ๋ธ๋„ LLM๊ณผ ์œ ์‚ฌํ•œ scaling property๋ฅผ ๋ณด์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์ฆ๋ช…ํ–ˆ๋‹ค๋Š” ์ ์ด๋‹ค. DINOv2 (2023) ์—์„œ๋Š” 142M์žฅ์˜ ์ด๋ฏธ์ง€๋กœ 460M ๋ชจ๋ธ์„ ํ•™์Šต์‹œ์ผฐ์ง€๋งŒ, vision ๋ชจ๋ธ์€ LLM์—์„œ์˜ scaling law๋ฅผ ๋”ฐ๋ฅด์ง€ ์•Š๋Š”๋‹ค๊ณ  ์ฃผ์žฅํ–ˆ๊ณ , MAE์—์„œ๋„ ๋น„์Šทํ•˜๊ฒŒ ์–˜๊ธฐํ–ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ AIM์€ 2B์žฅ ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด 7B ๋ชจ๋ธ์„ autoregressive objective๋กœ ์„ฑ๊ณต์ ์œผ๋กœ ํ•™์Šต์‹œ์ผฐ์œผ๋ฉฐ, ์ด ์ •๋„์˜ ๊ทœ๋ชจ์—์„œ๋„ saturation์ด ์ผ์–ด๋‚˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์„ ํ† ๋Œ€๋กœ large-scale vision model์˜ ์ƒˆ๋กœ์šด ์ง€ํ‰์„ ์—ด ๊ฐ€๋Šฅ์„ฑ์„ ํ™•์ธํ–ˆ๋‹ค.


Related works

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” iGPT (2020) ์—์„œ ์‚ฌ์šฉํ•œ autoregressive objective๋ฅผ ์‚ฌ์ „ํ•™์Šต์— ์‚ฌ์šฉํ–ˆ๋‹ค. ๋˜๋‹ค๋ฅธ pretrained vision model์ธ BEiT, MAE ๋“ฑ์€ BERT์— ์˜๊ฐ์„ ๋ฐ›์€ MLM๋ฐฉ์‹์„ ์‚ฌ์šฉํ–ˆ๋‹ค. Contrastive method๋“ค๋„ ๋ผ๋ฒจ ์—†์ด ์‚ฌ์ „ํ•™์Šตํ•œ๋‹ค๋Š” ์ ์—์„œ ์œ ์‚ฌํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋“ค์€ ์ž‘์€ ๋ชจ๋ธ ์‚ฌ์ด์ฆˆ์—์„œ๋Š” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด์ง€๋งŒ scaling์—๋Š” ์–ด๋ ค์›€์ด ์žˆ๋‹ค. 


Method

Dataset

 

DFN dataset์—์„œ 2B์žฅ ์ด๋ฏธ์ง€๋ฅผ ์ถ”์ถœํ•˜๊ณ  ์—ฌ๊ธฐ์— ImageNet์„ ์„ž์–ด์„œ ์‚ฌ์šฉํ–ˆ๋‹ค.


Objective


์ด๋ฏธ์ง€๋ฅผ K๊ฐœ์˜ ํŒจ์น˜๋กœ overlap ์—†์ด ์ž๋ฅด๊ณ , next patch prediction์„ ํ•œ๋‹ค.

 

$-\sum_{x} \sum_{k=1}^{K} \log P(x_k \mid x_{<k})$

 

Similar to MAE, a normalized pixel-level regression loss is used.

 

$\min_\theta \frac{1}{K}\sum_{k=1}^{K}\|\hat{x}_k(\theta)-x_k\|_2^2$
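A minimal sketch of this loss, assuming each target patch is normalized by its own mean and standard deviation as in MAE. `pred` stands in for the model's predicted patches and is just a dummy array here.

```python
import numpy as np

# Minimal sketch of the normalized pixel regression loss (MAE-style targets).
# `pred` is a placeholder for the model's predicted patches.
def aim_regression_loss(pred: np.ndarray, target: np.ndarray,
                        eps: float = 1e-6) -> float:
    mu = target.mean(axis=-1, keepdims=True)    # per-patch mean
    sigma = target.std(axis=-1, keepdims=True)  # per-patch std
    norm_target = (target - mu) / (sigma + eps)
    # squared error, averaged over the K patches (and over pixels,
    # which only rescales the objective by a constant)
    return float(((pred - norm_target) ** 2).mean())

target = np.random.rand(256, 588)  # K=256 patches of 14*14*3 pixels
pred = np.zeros_like(target)       # dummy predictions
loss = aim_regression_loss(pred, target)
```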


Architecture

 

๊ธฐ๋ณธ์ ์œผ๋กœ ViT ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค. ์ƒ์„ธํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.


์ด์ „ sequence์˜ ํŒจ์น˜๋งŒ ์ด์šฉํ•ด attention์„ ์ˆ˜ํ–‰ํ•˜๋„๋ก ํ•˜๋Š” causal mask๋ฅผ ์ ์šฉํ–ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ downstream task์—์„œ๋Š” bidirectional self-attention์„ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ ์„ฑ๋Šฅ์„ ๋–จ์–ดํŠธ๋ฆฐ๋‹ค.

 

To fix this, a "prefix transformer" is introduced. The first S patches form the "prefix": attention within the prefix is bidirectional, the prefix serves as context for predicting the remaining patches, and the prefix patches are excluded from the autoregressive loss.
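A sketch of the prefix attention mask under these assumptions: it is an ordinary causal mask whose top-left S x S block is opened up, so the S prefix patches attend bidirectionally among themselves, and S = 0 recovers plain causal attention.

```python
import numpy as np

# Sketch of the prefix attention mask: causal everywhere, except that the
# first S (prefix) positions attend bidirectionally among themselves.
def prefix_mask(K: int, S: int) -> np.ndarray:
    """Return a (K, K) boolean mask; mask[q, k] = True means query position q
    may attend to key position k."""
    mask = np.tril(np.ones((K, K), dtype=bool))  # causal: attend to the past
    mask[:S, :S] = True                          # prefix: fully bidirectional
    return mask

print(prefix_mask(5, 2).astype(int))
# [[1 1 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

Training with a randomly sized bidirectional prefix is what lets the same backbone later run with full bidirectional attention downstream.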


Downstream tasks

 

For downstream evaluation the backbone is entirely frozen and only a classification head is trained. With models this large, finetuning the whole network for every task would be too wasteful.

 

However, since pretraining only ever predicts at the patch level, there is no image-level token. For image-level predictions such as classification one could simply global-average-pool the patch features, but instead a global descriptor is computed with an attention pooling operation.
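A simplified single-head sketch of attention pooling: a learnable query vector attends over the K patch features to form one global image descriptor. The paper's version is multi-head with learned projections; here `q` (a trained parameter in practice, random for illustration) attends directly over the features.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Simplified single-head sketch of attention pooling (the paper's version is
# multi-head with learned projections; `q` is random here for illustration).
def attention_pool(feats: np.ndarray, q: np.ndarray) -> np.ndarray:
    """feats: (K, D) patch features, q: (D,) query -> (D,) global descriptor."""
    scores = feats @ q / np.sqrt(feats.shape[-1])  # (K,) attention logits
    weights = softmax(scores)                      # sums to 1 over patches
    return weights @ feats                         # weighted mean of patches

feats = np.random.rand(256, 1024)  # K=256 patches, D=1024 features
q = np.random.rand(1024)
g = attention_pool(feats, q)
print(g.shape)  # (1024,)
```

Unlike plain global average pooling, the weighting lets the head emphasize the patches most relevant to the task.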


Results

Impact of scaling


A clear correlation between pretraining loss and classification accuracy is visible: downstream performance improves as the pretraining loss drops, which suggests the pretraining objective was well chosen. Performance also keeps improving as model size grows, mirroring the behavior of LLMs.


Ablations


Method์—์„œ ์„ค๋ช…ํ•œ ๋‹ค์–‘ํ•œ ๊ตฌ์กฐ์— ๋Œ€ํ•œ ablation. ์ฐธ๊ณ ๋กœ autoregression pattern์€ ํŒจ์น˜๋ฅผ ์–ด๋–ค ์ˆœ์„œ๋กœ ๋„ฃ์–ด์ค„ ๊ฒƒ์ธ๊ฐ€์— ๋Œ€ํ•œ ๋‚ด์šฉ์ธ๋ฐ, ์ผ๋ฐ˜์ ์œผ๋กœ ์ƒ๊ฐํ•˜๋Š” ๊ฐ€๋กœ-์„ธ๋กœ ์ˆœ์„œ๊ฐ€ ๊ฐ€์žฅ ์ข‹์•˜๋‹ค๊ณ  ํ•œ๋‹ค.


Pretrain objective

 

MLM ๋ฐฉ์‹๊ณผ๋„ ๋น„๊ต๋ฅผ ์ง„ํ–‰ํ–ˆ๋‹ค. MLM๋ณด๋‹ค autoregressive๊ฐ€ ์ข‹๋‹ค๊ณ  ํ•œ๋‹ค.


Comparison with other pretrained models


Comparison against other pretrained models. AIM beats everything except DINOv2, which, they note, used higher-resolution images. They also point out that DINOv2 depends heavily on all sorts of fiddly training tricks, whereas AIM's training recipe is very simple (...)

 

๊ทผ๋ฐ AIM์„ ์ œ์™ธํ•˜๊ณ  2๋“ฑ์ธ iBOT๊ณผ ๋น„๊ตํ•ด๋ด๋„ iBOT์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋Š” 300M๊ฐœ์ •๋„์ด๋‹ค. AIM-0.6B์˜ ์ ˆ๋ฐ˜ ์ •๋„์ธ๋ฐ๋„ ํ›จ์”ฌ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค. ์•„๋งˆ ํ‘œ์— ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐฏ์ˆ˜๋ฅผ ์•ˆ ์ ์–ด๋†“์€ ๊ฒƒ์€ ์ด๋Ÿฐ ๋ถˆ๋ฆฌํ•จ ๋•Œ๋ฌธ ์•„๋‹ˆ์—ˆ์„๊นŒ..


Closing thoughts

Earlier pretrained vision models mostly relied on MLM-style objectives, but this paper trains with the autoregressive task used by iGPT back in 2020. It shows that a scaling law like the one in LLMs, where performance keeps rising with model size, can also hold for vision models.

 

A few lingering questions, though:

  1. As the last result shows, the model is mostly just enormous; its performance is not overwhelmingly better.
  2. The training hardware and training time are not reported, and the GitHub repo contains neither the training code nor the loss code.
  3. They say performance did not saturate even at 7B parameters, so why not scale up once more until AIM beats DINOv2?
๋ฐ˜์‘ํ˜•