[20230504] Weekly VLM3 - FLAVA #7

SoongE opened this issue Jun 8, 2023 · 0 comments

Paper
FLAVA: A Foundational Language And Vision Alignment Model

Speaker
@SoongE

Literature review / Theoretical background

  • Drawbacks of contrastive learning
    • A purely cross-modal setup is not easy to reuse in multimodal settings.
    • It requires very large corpora, which makes it ill-suited to most research environments.
  • Prior work performs well when targeting individual downstream tasks, but it is not general-purpose.

Methods

[Figure: FLAVA model overview]

  • Introduces a foundation model that handles vision, language, and multimodal tasks at the same time
  • Three encoders (see the architecture sketch below)
    • image encoder: extracts unimodal image representations
    • text encoder: extracts unimodal text representations
    • multimodal encoder: fuses and aligns the image & text representations for multimodal reasoning
  • For downstream tasks, an FC head is added on top of the relevant encoder
  • Multimodal losses
    • Global contrastive (GC) loss: a CLIP-style contrastive loss (loss sketch below)
    • Masked multimodal modeling (MMM): masking applied to both the image and text features
    • Image-text matching (ITM)
  • Unimodal losses (shared masked-modeling sketch below)
    • Masked image modeling (MIM)
    • Masked language modeling (MLM)
  • Encoders are initialized from unimodal pre-training
    • MLM is trained first on unimodal data, followed by MIM
    • Multimodal training is performed afterwards
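
A minimal PyTorch sketch of the three-encoder layout and the per-encoder FC head for downstream use. All module choices, sizes, and the mean pooling are illustrative assumptions (the actual FLAVA encoders are ViT/BERT-style transformers with [CLS]-type tokens), not the paper's implementation.

```python
import torch
import torch.nn as nn

class FlavaStyleModel(nn.Module):
    """Sketch of the FLAVA layout: two unimodal encoders plus a multimodal fusion encoder."""

    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(make_layer(), num_layers=depth)        # unimodal image representations
        self.text_encoder = nn.TransformerEncoder(make_layer(), num_layers=depth)         # unimodal text representations
        self.multimodal_encoder = nn.TransformerEncoder(make_layer(), num_layers=depth // 2)  # fuses/aligns image & text

    def forward(self, image_feats, text_feats):
        # Inputs are assumed to be already-embedded patch/token features of shape (B, N, dim).
        h_img = self.image_encoder(image_feats)
        h_txt = self.text_encoder(text_feats)
        h_mm = self.multimodal_encoder(torch.cat([h_img, h_txt], dim=1))  # simple concat stands in for fusion
        return h_img, h_txt, h_mm


# Downstream use: attach an FC head on top of whichever encoder suits the task
# (mean pooling here is a stand-in for FLAVA's [CLS]-style tokens).
model = FlavaStyleModel()
task_head = nn.Linear(768, 10)                      # e.g. a hypothetical 10-way classifier
h_img, h_txt, h_mm = model(torch.randn(2, 196, 768), torch.randn(2, 32, 768))
logits = task_head(h_mm.mean(dim=1))                # (2, 10)
```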
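
A sketch of the CLIP-style contrastive objective behind the global contrastive (GC) loss. As I understand the paper, the "global" part means embeddings are gathered across all GPUs (with full backpropagation) before building the similarity matrix; that all-gather step is omitted here and the temperature is a placeholder value.

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over a batch of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                   # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)               # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```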
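
The masked objectives (MLM on text tokens, MIM on tokenized image patches, MMM on the fused sequence) all follow the same mask-and-predict recipe. A generic sketch, assuming a hypothetical encoder that takes integer token ids and a vocabulary-sized prediction head; masking rate and sizes are placeholders, not the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_modeling_loss(encoder, head, tokens, mask_token_id, mask_prob=0.15):
    """Randomly mask input tokens and predict the original ids at the masked positions only."""
    targets = tokens.clone()
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob   # (B, N) boolean mask
    corrupted = tokens.masked_fill(mask, mask_token_id)             # replace masked positions with [MASK]
    logits = head(encoder(corrupted))                               # (B, N, vocab_size)
    targets[~mask] = -100                                           # ignore loss on unmasked positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)

# Example wiring (hypothetical sizes): embedding + transformer as the encoder, linear vocab head.
vocab, dim = 30522, 768
enc = nn.Sequential(nn.Embedding(vocab, dim),
                    nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 12, batch_first=True), 4))
head = nn.Linear(dim, vocab)
tokens = torch.randint(0, vocab, (2, 32))
loss = masked_modeling_loss(enc, head, tokens, mask_token_id=103)
```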