GLaMM: Pixel Grounding Large Multimodal Model

Paper Review

minty_y 2025. 5. 21. 16:24

Background and Motivation

Large Multimodal Models (LMMs) extend Large Language Models into the vision domain
Early LMMs generated ungrounded textual responses based on holistic images
Recent region-level LMMs allow visually grounded responses
These region-level LMMs remain limited: they refer to a single object at a time, require manual region input, and lack dense pixel-wise grounding

Proposed Model: GLaMM

GLaMM is presented as the first model to generate natural language responses together with corresponding segmentation masks
Grounds objects mentioned in conversation at the pixel level
Accepts both textual input and optional visual prompts (regions of interest)
Enables flexible interaction across both text and visual domains at various levels of granularity
The architecture consists of five components: i) Global Image Encoder, ii) Region Encoder, iii) LLM, iv) Grounding Image Encoder, and v) Pixel Decoder
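
To make the role of the five components concrete, here is a minimal data-flow sketch. It is not GLaMM's actual code: every function is a hypothetical stub, and the `<SEG>` marker is an assumption about how the language model might signal which phrases need a mask. Only the order in which the components feed each other is meant to be illustrative.

```python
# Hypothetical stand-ins for the five components; none of this is GLaMM's real code.
SEG_TOKEN = "<SEG>"  # assumed marker emitted after each phrase that should be grounded


def global_image_encoder(image):
    """Stub: holistic image features that condition the LLM."""
    return f"global_feats({image})"


def region_encoder(global_feats, boxes):
    """Stub: one feature per optional user-supplied region of interest."""
    return [f"region_feats{tuple(box)}" for box in boxes]


def llm(prompt, global_feats, region_feats):
    """Stub: a real LLM would generate this text; here it is fixed."""
    return f"A dog {SEG_TOKEN} sits on a sofa {SEG_TOKEN}."


def grounding_image_encoder(image):
    """Stub: dense, high-resolution features used only for mask prediction."""
    return f"dense_feats({image})"


def pixel_decoder(dense_feats, seg_index):
    """Stub: one mask per grounded phrase (a string id instead of a real mask)."""
    return f"mask_{seg_index}"


def glamm_forward(image, prompt, boxes=()):
    """Run the assumed pipeline: encode, generate text, then decode one mask per SEG_TOKEN."""
    global_feats = global_image_encoder(image)
    region_feats = region_encoder(global_feats, boxes)  # empty if no visual prompt
    text = llm(prompt, global_feats, region_feats)
    dense_feats = grounding_image_encoder(image)
    masks = [pixel_decoder(dense_feats, i) for i in range(text.count(SEG_TOKEN))]
    return text, masks


text, masks = glamm_forward("img.jpg", "Describe the image in detail.")
```

In this sketch the visual prompt is optional: passing `boxes=[(10, 20, 50, 80)]` would route an extra region feature to the LLM stub, while the mask-decoding path stays the same.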

Task Definition: Visually Grounded Conversation Generation (GCG)

GLaMM targets the new task of Visually Grounded Conversation Generation
No existing benchmarks for this setting
Introduces a comprehensive evaluation protocol using curated grounded conversations
GCG requires dense grounding of concepts in large-scale natural scenes
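
As an illustration of what a grounded conversation sample might contain, the sketch below ties phrases in a caption to binary masks. The class names, fields, and the tiny 2x3 masks are all hypothetical; the post does not describe the actual data format.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[int]]  # binary mask as rows of 0/1; a real pipeline would use arrays or RLE


@dataclass
class GroundedPhrase:
    start: int  # character offset of the phrase in the caption
    end: int    # exclusive end offset
    mask: Mask  # pixels belonging to the object the phrase names


@dataclass
class GroundedCaption:
    text: str
    phrases: List[GroundedPhrase]

    def phrase_text(self, phrase: GroundedPhrase) -> str:
        return self.text[phrase.start:phrase.end]

    def is_valid(self, height: int, width: int) -> bool:
        """Check that every span lies inside the text and every mask matches the image size."""
        for p in self.phrases:
            if not (0 <= p.start < p.end <= len(self.text)):
                return False
            if len(p.mask) != height or any(len(row) != width for row in p.mask):
                return False
        return True


def ground(text: str, phrase: str, mask: Mask) -> GroundedPhrase:
    """Locate the first occurrence of `phrase` in `text` and attach a mask to it."""
    start = text.index(phrase)
    return GroundedPhrase(start, start + len(phrase), mask)


caption = "A dog sits on a sofa."
sample = GroundedCaption(
    text=caption,
    phrases=[
        ground(caption, "A dog", [[1, 1, 0], [0, 0, 0]]),
        ground(caption, "a sofa", [[0, 0, 0], [1, 1, 1]]),
    ],
)
```

Storing character offsets rather than copies of the phrase keeps each mask aligned with the caption text, which is the property dense grounding relies on.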

Dataset: GranD (Grounding-anything Dataset)

Performance and Applications

GLaMM performs well across multiple downstream tasks:
- Referring expression segmentation
- Image-level and region-level captioning
- Vision-language conversations
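
Segmentation tasks such as referring expression segmentation are commonly scored by the intersection over union (IoU) between a predicted mask and the ground-truth mask. The post does not say which metrics were used, so the following is a generic illustration of mask IoU, not the paper's evaluation code.

```python
def mask_iou(pred, target):
    """IoU of two equally sized binary masks given as rows of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention chosen here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = mask_iou(pred, target)  # 2 shared pixels out of 4 covered pixels -> 0.5
```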

In addition to GCG, GLaMM is shown to be effective on referring expression segmentation, image-level and region-level captioning, and vision-language conversation.