Paper Review: Ferret, Apple's Multimodal LLM


복만 2024. 1. 7. 23:06

This is the paper for Ferret, a multimodal LLM that Apple released in October 2023. The model comes in two sizes, 7B and 13B. The code and checkpoints are public on GitHub and may be used for non-commercial purposes.

Paper: https://arxiv.org/pdf/2310.07704.pdf

Github: https://github.com/apple/ml-ferret


Introduction

Two key capabilities of a vision-language model are referring and grounding.

Referring: understanding a given region of an image (location in, text out)
Grounding: locating in the image the region that a text describes (text in, location out)
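
To make the two directions concrete, here is a minimal sketch of their input and output shapes. The function names, the prompt wording, and the `[x1, y1, x2, y2]` text format are assumptions made for illustration, not Ferret's actual interface.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)

# Matches a box written as "[x1, y1, x2, y2]" inside free text (assumed format).
_BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def build_referring_prompt(question: str, box: Box) -> str:
    """Referring: a location goes in (here as coordinates in the prompt), text comes out."""
    x1, y1, x2, y2 = box
    return f"{question} [{x1}, {y1}, {x2}, {y2}]"


def parse_grounded_answer(answer: str) -> List[Box]:
    """Grounding: text goes in, locations come out (boxes embedded in the answer)."""
    return [tuple(int(v) for v in m.groups()) for m in _BOX_PATTERN.finditer(answer)]


prompt = build_referring_prompt("What is the object in this region?", (10, 20, 110, 220))
boxes = parse_grounded_answer("A cat [10, 20, 110, 220] sits next to a dog [150, 40, 300, 230].")
```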

Earlier work treated referring and grounding as separate tasks and trained for each one individually.

Ferret expects the two tasks to complement each other, so it unifies them in a single MLLM (Multimodal Large Language Model).

The paper's contributions are as follows.

Ferret is the first MLLM that can handle free-form region inputs.
GRIT (Ground-and-Refer Instruction Tuning dataset), a dataset built to train Ferret
Ferret-Bench, a new benchmark on which Ferret outperforms the best existing MLLM by 20.4% on average

The paper also describes GRIT and Ferret-Bench in detail, but this post skips them and focuses on Ferret's architecture.

Related work

The paper lists a wide range of MLLM work. A brief summary before moving on:

MLLMs

Image-text pre-training models: SimVLM, GIT, PaLI, PaLI-X, BLIP-2, Flamingo (the first to combine a pretrained CLIP with an LLM), PaLM-E, CM3, CM3Leon
LLM-based: LLaVA, MiniGPT-4, mPLUG-Owl, Otter, InstructBLIP
Image generation: FROMAGe, GILL, Emu
See also: Multimodal Foundation Models: From Specialists to General-Purpose Assistants (2023.10)

MLLMs for referring and grounding

Kosmos-2, Shikra, GPT4RoI, PVIT, BuboGPT, VisionLLM, ContextDET
The biggest difference from Ferret is that these models take bounding boxes (or at most points) as region input, not free-form shapes.

Unifying grounding and VL understanding

UniTAB, OFA, Unified-IO, Pix2Seq

Methods

Hybrid Region Representation

A region of an image can be specified in three forms.

point: [x, y]
box: [xmin, ymin, xmax, ymax]
free-form

To represent all three kinds of region uniformly, Ferret uses a visual sampler to build a hybrid region representation, which consists of coordinates plus a continuous visual feature. (The visual sampler is described in the next section.)

Continuous visual feature: the region is first converted into a 2D binary mask (like a segmentation mask), which is then fed into the visual sampler together with the feature map from the image encoder to extract the continuous visual feature.

* For a point, the region is a circle of fixed radius centered on that point, and the visual feature is extracted from it.
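
The conversion of each region type into a binary mask could look roughly like the sketch below. It is dependency-free for illustration. The function names, the inclusive box bounds, and the choice to approximate a free-form region by a polygon are assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # mask[y][x] is 1 inside the region, 0 outside


def point_to_mask(x: float, y: float, radius: float, width: int, height: int) -> Mask:
    """A point becomes a filled circle of fixed radius centered on it."""
    return [[1 if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2 else 0
             for px in range(width)] for py in range(height)]


def box_to_mask(x_min: int, y_min: int, x_max: int, y_max: int,
                width: int, height: int) -> Mask:
    """A box becomes a filled rectangle (bounds treated as inclusive here)."""
    return [[1 if x_min <= px <= x_max and y_min <= py <= y_max else 0
             for px in range(width)] for py in range(height)]


def polygon_to_mask(vertices: Sequence[Tuple[float, float]],
                    width: int, height: int) -> Mask:
    """A free-form region, approximated by a polygon and filled with an
    even-odd ray-casting test evaluated at each pixel center."""
    def inside(px: float, py: float) -> bool:
        hit = False
        n = len(vertices)
        for i in range(n):
            (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
            if (y1 > py) != (y2 > py):  # edge crosses the horizontal ray
                x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                if px < x_cross:
                    hit = not hit
        return hit

    return [[1 if inside(px + 0.5, py + 0.5) else 0
             for px in range(width)] for py in range(height)]
```

Once every region type is a mask, the rest of the pipeline no longer needs to know which form the user supplied.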

The final region input then looks like this.

point: {x, y, f}
box & free-form: {x_min, y_min, x_max, y_max, f}
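- x_min, x_max: minimum/maximum x-axis coordinates of the region (y_min, y_max likewise for the y-axis)
- f: the continuous visual feature of the region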
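
As a rough sketch of how such a representation could be assembled: the `HybridRegion` class and the helper names are hypothetical, and the feature `f` is passed in as a placeholder because it comes from the visual sampler.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class HybridRegion:
    coords: Tuple[int, ...]   # (x, y) for a point, (x_min, y_min, x_max, y_max) otherwise
    feature: Sequence[float]  # continuous visual feature f (placeholder values here)


def mask_bounds(mask: List[List[int]]) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) over the nonzero pixels of mask[y][x]."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        raise ValueError("empty region mask")
    return min(xs), min(ys), max(xs), max(ys)


def make_point_region(x: int, y: int, feature: Sequence[float]) -> HybridRegion:
    return HybridRegion((x, y), feature)


def make_area_region(mask: List[List[int]], feature: Sequence[float]) -> HybridRegion:
    """Boxes and free-form shapes share one layout: bounding coordinates plus f."""
    return HybridRegion(mask_bounds(mask), feature)


blob = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
region = make_area_region(blob, feature=[0.0, 0.0])
```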

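Model architecture

Ferret consists of three parts: an image encoder, a visual sampler, and an LLM. The visual sampler is effectively the core of the design. The main contribution the paper claims is free-form region input, and the image encoder and LLM are taken from existing models.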

Image encoder: CLIP-ViT-L/14
Visual sampler:
- Randomly sample 512 points from the binary region mask M.
- Extract each point's feature with bilinear interpolation.
- Pass the 512 points through blocks made up of three steps: sampling, gathering, and pooling. (The block is applied twice.)
  - Sampling: select N/4 points with the farthest point sampling (FPS) algorithm.
  - Gathering: for each sampled point x_i, run k-NN over the original N points to find its k nearest neighbors, then fuse each neighbor's feature with that of x_i. Each of the N/4 points ends up with k features.
  - Pooling: merge the k features of each point into one with max pooling.
  - Through this process, N points are reduced to N/4 points with denser features.
- Finally, flatten the features of the resulting 32 points and use them as the LLM embedding.
LLM:
- Vicuna (LLaMA + instruction tuning)
- For grounding, image regions appear in the output as box coordinates.
- Image embeddings are projected once through a linear layer before being fed in as input.
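
The visual sampler steps listed above can be sketched in plain Python as follows. This is only an illustration under several assumptions: the mask and the feature map share one grid, the fusion step is a simple average standing in for the learned layers, k=24 is an illustrative default, and all function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]
Vec = List[float]


def bilinear(feature_map: List[List[Vec]], x: float, y: float) -> Vec:
    """Bilinearly interpolate feature_map[y][x] (a channel list) at a real-valued (x, y)."""
    h, w = len(feature_map), len(feature_map[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    corners = zip(feature_map[y0][x0], feature_map[y0][x1],
                  feature_map[y1][x0], feature_map[y1][x1])
    return [(1 - ax) * (1 - ay) * a + ax * (1 - ay) * b
            + (1 - ax) * ay * c + ax * ay * d for a, b, c, d in corners]


def farthest_point_sampling(points: List[Point], m: int) -> List[int]:
    """Greedily pick m indices so each new point is farthest from those already chosen."""
    chosen = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen


def sampler_block(points: List[Point], feats: List[Vec], ratio: int = 4, k: int = 24):
    """Sampling -> gathering -> pooling: N points in, N // ratio points out."""
    out_points, out_feats = [], []
    for i in farthest_point_sampling(points, len(points) // ratio):
        # Gathering: the k nearest neighbors of the center among all N input points.
        nbrs = sorted(range(len(points)),
                      key=lambda j: math.dist(points[i], points[j]))[:k]
        # Fusion stand-in (assumption): average each neighbor feature with the center's.
        fused = [[(a + b) / 2 for a, b in zip(feats[j], feats[i])] for j in nbrs]
        # Pooling: channel-wise max over the k fused features.
        out_feats.append([max(channel) for channel in zip(*fused)])
        out_points.append(points[i])
    return out_points, out_feats


def visual_sampler(mask, feature_map, n_points: int = 512, n_blocks: int = 2, seed: int = 0) -> Vec:
    rng = random.Random(seed)
    inside = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    # Sample points uniformly inside the masked pixels (with replacement).
    points = []
    for _ in range(n_points):
        px, py = rng.choice(inside)
        points.append((px + rng.random(), py + rng.random()))
    feats = [bilinear(feature_map, x, y) for x, y in points]
    for _ in range(n_blocks):  # 512 -> 128 -> 32 points
        points, feats = sampler_block(points, feats)
    return [v for f in feats for v in f]  # flatten the remaining point features


mask = [[1 if 4 <= x < 12 and 4 <= y < 12 else 0 for x in range(16)] for y in range(16)]
feature_map = [[[float(x), float(y), 1.0] for x in range(16)] for y in range(16)]
embedding = visual_sampler(mask, feature_map)
print(len(embedding))  # 32 points x 3 channels = 96 values
```

In the real model the feature map comes from the image encoder and the fusion is learned, so only the control flow here is meant to carry over.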

Experiments

Training details

As mentioned above, the image encoder is CLIP-ViT-L/14@336p and the LLM is Vicuna. The projection layer is initialized from LLaVA's first-stage weights, and the visual sampler is randomly initialized.

The model was trained on the GRIT dataset for 3 epochs with the AdamW optimizer (Loshchilov & Hutter) at lr=2e-5.

With a batch size of 128 on 8 A100 GPUs, training took about 5 days for the 13B model and about 2.5 days for the 7B model.
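
For a quick sense of scale, the numbers above can be collected in a small config object. `TrainConfig` and its methods are hypothetical names, and the per-GPU batch size assumes no gradient accumulation, which the post does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 3
    learning_rate: float = 2e-5
    global_batch_size: int = 128
    num_gpus: int = 8

    def per_gpu_batch(self) -> int:
        """Samples per GPU per step, assuming no gradient accumulation."""
        return self.global_batch_size // self.num_gpus

    def gpu_days(self, wall_clock_days: float) -> float:
        """Total GPU-days for a run that lasts `wall_clock_days` on all GPUs."""
        return self.num_gpus * wall_clock_days


cfg = TrainConfig()
print(cfg.per_gpu_batch())  # 16
print(cfg.gpu_days(5.0))    # 40.0 GPU-days for the 13B model
print(cfg.gpu_days(2.5))    # 20.0 GPU-days for the 7B model
```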

Results

Only a few examples are included here. See the paper for detailed performance numbers.

Ferret-Bench

On Ferret-Bench, Ferret unsurprisingly performs best, and larger models do better.

It also performs well on LLaVA-Bench.

Ferret vs GPT-4V

The authors boldly take on GPT as well.

In the paper's words, Ferret shines especially in situations that require precise bounding boxes.
