[20230504] Weekly VLM3 - FLAVA #7

SoongE opened this issue Jun 8, 2023 · 0 comments

Paper
FLAVA: A Foundational Language And Vision Alignment Model

Speaker
@SoongE

Literature review / Theoretical background

  • Drawbacks of contrastive learning
    • A purely cross-modal setup is not easy to reuse in multimodal settings.
    • It requires very large corpora, which makes it ill-suited to most research environments.
  • Prior work performs well when targeting individual downstream tasks, but it is not general-purpose.

Methods

[Figure: FLAVA model overview]

  • Introduces a foundation model that handles vision, language, and multimodal tasks at the same time
  • Three encoders (see the architecture sketch below)
    • image encoder: extracts unimodal image representations
    • text encoder: extracts unimodal text representations
    • multimodal encoder: fuses and aligns the image & text representations for multimodal reasoning
  • For downstream tasks, an FC head is added on top of the relevant encoder
  • Multimodal losses
    • Global contrastive (GC) loss: a CLIP-style contrastive loss (loss sketch below)
    • Masked multimodal modeling (MMM): masking applied to both the image and text features
    • Image-text matching (ITM)
  • Unimodal losses (shared masked-modeling sketch below)
    • Masked image modeling (MIM)
    • Masked language modeling (MLM)
  • Encoders are initialized from unimodal pre-training
    • MLM is trained first on unimodal data, followed by MIM
    • Multimodal training is performed afterwards
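
A minimal PyTorch sketch of the three-encoder layout and the per-encoder FC head for downstream use. All module choices, sizes, and the mean pooling are illustrative assumptions (the actual FLAVA encoders are ViT/BERT-style transformers with [CLS]-type tokens), not the paper's implementation.

```python
import torch
import torch.nn as nn

class FlavaStyleModel(nn.Module):
    """Sketch of the FLAVA layout: two unimodal encoders plus a multimodal fusion encoder."""

    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(make_layer(), num_layers=depth)        # unimodal image representations
        self.text_encoder = nn.TransformerEncoder(make_layer(), num_layers=depth)         # unimodal text representations
        self.multimodal_encoder = nn.TransformerEncoder(make_layer(), num_layers=depth // 2)  # fuses/aligns image & text

    def forward(self, image_feats, text_feats):
        # Inputs are assumed to be already-embedded patch/token features of shape (B, N, dim).
        h_img = self.image_encoder(image_feats)
        h_txt = self.text_encoder(text_feats)
        h_mm = self.multimodal_encoder(torch.cat([h_img, h_txt], dim=1))  # simple concat stands in for fusion
        return h_img, h_txt, h_mm


# Downstream use: attach an FC head on top of whichever encoder suits the task
# (mean pooling here is a stand-in for FLAVA's [CLS]-style tokens).
model = FlavaStyleModel()
task_head = nn.Linear(768, 10)                      # e.g. a hypothetical 10-way classifier
h_img, h_txt, h_mm = model(torch.randn(2, 196, 768), torch.randn(2, 32, 768))
logits = task_head(h_mm.mean(dim=1))                # (2, 10)
```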
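
A sketch of the CLIP-style contrastive objective behind the global contrastive (GC) loss. As I understand the paper, the "global" part means embeddings are gathered across all GPUs (with full backpropagation) before building the similarity matrix; that all-gather step is omitted here and the temperature is a placeholder value.

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over a batch of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                   # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)               # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```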
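
The masked objectives (MLM on text tokens, MIM on tokenized image patches, MMM on the fused sequence) all follow the same mask-and-predict recipe. A generic sketch, assuming a hypothetical encoder that takes integer token ids and a vocabulary-sized prediction head; masking rate and sizes are placeholders, not the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_modeling_loss(encoder, head, tokens, mask_token_id, mask_prob=0.15):
    """Randomly mask input tokens and predict the original ids at the masked positions only."""
    targets = tokens.clone()
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob   # (B, N) boolean mask
    corrupted = tokens.masked_fill(mask, mask_token_id)             # replace masked positions with [MASK]
    logits = head(encoder(corrupted))                               # (B, N, vocab_size)
    targets[~mask] = -100                                           # ignore loss on unmasked positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)

# Example wiring (hypothetical sizes): embedding + transformer as the encoder, linear vocab head.
vocab, dim = 30522, 768
enc = nn.Sequential(nn.Embedding(vocab, dim),
                    nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 12, batch_first=True), 4))
head = nn.Linear(dim, vocab)
tokens = torch.randint(0, vocab, (2, 32))
loss = masked_modeling_loss(enc, head, tokens, mask_token_id=103)
```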