Training DALLE from scratch, utilizing the target language's pretrained language model (PLM) token embedding layer and position embedding layer as the text encoder.
For project details, please refer to README.pdf.
- Training a DALLE model from scratch demands a large paired dataset of images and captions. For example, OpenAI's DALLE was trained on more than 250 million text-image pairs.
- If the dataset isn't large enough or is limited to specific domains, the vocabulary of the trained DALLE model is insufficient. For instance, the 1 million text captions of the K-Fashion dataset consist of only around 300 unique tokens.
- Therefore, inference with such DALLE models can be problematic if the given sentence query is unrelated to the text dataset of the originally trained captions.
|  | OpenAI's DALLE | KoDALLE of HappyFace |
| --- | --- | --- |
| Train Dataset Size | 250 Million Pairs | 0.8 Million Pairs |
| #Params | 12 Billion | 428 Million |
| #Layers | 64 Layers | 16 Layers |
| Computing Resource | 1024 x V100 16GB | 1 x V100 32GB |
| Text Encoder | 16384 Vocab x 512 Dim BPE | 32000 Vocab x 1024 Dim klue/roberta-large |
| Image Encoder | VQVAE | VQGAN |
| Optimizer | AdamW | AdamW |
| Learning Rate | 4.5e-5 | 3.0e-5 |
| Weight Decay | 4.5e-3 | 3.0e-3 |
| LR Scheduler | ReduceLROnPlateau | - |
The team constructed a Text-to-Fashion-Design DALLE model in Korean with fewer than 100k sampled text-image pairs.
| Caption | 아우터는 색상이 카키 소재가 우븐 핏이 루즈인 코트이다. 하의는 색상이 네이비 소재가 데님 핏이 스키니인 청바지이다. (The outerwear is a coat whose color is khaki, material is woven, and fit is loose. The bottoms are jeans whose color is navy, material is denim, and fit is skinny.) |
| --- | --- |
| Generated Image | (generated sample image) |
Experiments were conducted with the embedding layers of the following Korean transformer models. The team selected klue/roberta-large as the baseline in the repository, considering the size of the model.
- klue/roberta-large: Vocab Size of 32000, Embedding Dimension of 1024.
- KoGPT Trinity of SKT: Vocab Size of 51200, Embedding Dimension of 1920.
- KoGPT of Kakao Brain: Vocab Size of 64512, Embedding Dimension of 4096.
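For reference, the vocabulary size and embedding dimension of a candidate PLM can be read off its Hugging Face configuration, and the token (wte) and position (wpe) embedding layers can be pulled out directly. Below is a minimal sketch with the `transformers` library for the klue/roberta-large baseline; the attribute paths follow the standard RoBERTa implementation in `transformers` and are not taken from this repository.

```python
from transformers import AutoConfig, AutoModel

# Inspect vocab size / embedding dimension from the config alone.
config = AutoConfig.from_pretrained("klue/roberta-large")
print(config.vocab_size, config.hidden_size)  # 32000, 1024

# Load the PLM and pull out its token (wte) and position (wpe) embedding layers.
plm = AutoModel.from_pretrained("klue/roberta-large")
wte = plm.embeddings.word_embeddings      # nn.Embedding(32000, 1024)
wpe = plm.embeddings.position_embeddings  # nn.Embedding(514, 1024) for RoBERTa-style models
```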
KoDALLE with klue/roberta-large's wpe (position embeddings) and wte (token embeddings) was trained in a single 32GB V100 GPU environment. The hyperparameters related to the DALLE model size are as follows.
- BATCH_SIZE: 40
- DEPTH: 16
- TEXT_SEQ_LEN: 128
- VOCAB_SIZE: 32000
- MODEL_DIM: 1024
- ATTN_TYPES: 'full'
- DIM_HEAD: 64
- HEADS: 8
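Under these hyperparameters, one plausible way to wire the PLM embeddings into the text side is to build the model with lucidrains' DALLE-pytorch (referenced below) and copy klue/roberta-large's token and position embedding weights into its text embedding tables. This is a sketch under assumptions, not the repository's training code: the `VQGanVAE` wrapper and the `text_emb`/`text_pos_emb` attribute names follow recent DALLE-pytorch versions and may differ, DALLE-pytorch reserves extra padding-token rows beyond `VOCAB_SIZE`, and whether the copied embeddings are then frozen or fine-tuned is a separate training choice.

```python
import torch
from dalle_pytorch import DALLE, VQGanVAE  # VQGAN image encoder (Taming Transformers)
from transformers import AutoModel

VOCAB_SIZE, MODEL_DIM, TEXT_SEQ_LEN = 32000, 1024, 128

# Default pretrained VQGAN; paths to a custom checkpoint/config can be passed instead.
vae = VQGanVAE()

dalle = DALLE(
    dim=MODEL_DIM,          # MODEL_DIM, matches klue/roberta-large's hidden size
    vae=vae,
    num_text_tokens=VOCAB_SIZE,
    text_seq_len=TEXT_SEQ_LEN,
    depth=16,               # DEPTH
    heads=8,                # HEADS
    dim_head=64,            # DIM_HEAD
    attn_types=("full",),   # ATTN_TYPES
)

# Copy the PLM's wte / wpe into DALLE's text embedding tables.
plm = AutoModel.from_pretrained("klue/roberta-large")
with torch.no_grad():
    # DALLE-pytorch appends TEXT_SEQ_LEN padding-token rows after the vocabulary,
    # so only the first VOCAB_SIZE rows are overwritten here.
    dalle.text_emb.weight[:VOCAB_SIZE].copy_(plm.embeddings.word_embeddings.weight)

    # Some DALLE-pytorch versions/settings use rotary embeddings instead of a learned
    # position table; copy wpe only when an nn.Embedding table is exposed.
    if isinstance(dalle.text_pos_emb, torch.nn.Embedding):
        n_pos = dalle.text_pos_emb.weight.shape[0]
        # RoBERTa reserves its first two position slots for padding offsets.
        dalle.text_pos_emb.weight.copy_(plm.embeddings.position_embeddings.weight[2:2 + n_pos])
```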
- The DALLE model is built on lucidrains' DALLE-pytorch.
- The image encoder is constructed based on VQGAN (Taming Transformers).
- Offers promising results for training from scratch on specific domains with a small dataset.
- Introduces a solution for making domain-specific DALLE & CLIP models robust to input sentences.
- Recommends an adequate text-to-image model size for a given computation resource.
- Suggests an effortless method of creating DALLE & CLIP models for one's own language when a pretrained language model is available.
- Add an image-caption reranker (EfficientNet + klue/roberta-large); see the sketch after this list for one possible design.
- Train the model with 500k text-image pairs.
- Modularize the Python code.
- Update the inference code.
- Update FID and IS metrics on the test and validation datasets.
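The planned reranker is not specified beyond its component names, but a common design is a CLIP-style dual encoder: embed each generated candidate image with EfficientNet and the query caption with klue/roberta-large, project both into a shared space, and rank candidates by cosine similarity. The sketch below only illustrates that idea; the `timm` model name, projection dimension, and [CLS] pooling are illustrative assumptions, and the projection heads would need to be trained on image-caption pairs before the scores are meaningful.

```python
import timm
import torch
from transformers import AutoModel, AutoTokenizer

class CaptionImageReranker(torch.nn.Module):
    """CLIP-style dual encoder: EfficientNet for images, klue/roberta-large for text."""

    def __init__(self, proj_dim: int = 512):
        super().__init__()
        self.image_encoder = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)
        self.text_encoder = AutoModel.from_pretrained("klue/roberta-large")
        self.image_proj = torch.nn.Linear(self.image_encoder.num_features, proj_dim)
        self.text_proj = torch.nn.Linear(self.text_encoder.config.hidden_size, proj_dim)

    def forward(self, images, input_ids, attention_mask):
        img = self.image_proj(self.image_encoder(images))  # (N, proj_dim) pooled image features
        txt = self.text_proj(
            self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
            .last_hidden_state[:, 0]                        # [CLS] pooling -> (1, proj_dim)
        )
        img = torch.nn.functional.normalize(img, dim=-1)
        txt = torch.nn.functional.normalize(txt, dim=-1)
        return img @ txt.t()  # cosine similarity of each candidate image to the caption

# Usage: score N generated candidates against the query caption and keep the best ones.
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
reranker = CaptionImageReranker().eval()
enc = tokenizer("아우터는 색상이 카키 소재가 우븐 핏이 루즈인 코트이다.", return_tensors="pt")
candidates = torch.randn(8, 3, 224, 224)  # stand-in for 8 generated images
with torch.no_grad():
    scores = reranker(candidates, enc["input_ids"], enc["attention_mask"]).squeeze(-1)
best = scores.argsort(descending=True)
```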
@misc{ramesh2021zeroshot,
title = {Zero-Shot Text-to-Image Generation},
author = {Aditya Ramesh and Mikhail Pavlov and Gabriel Goh and Scott Gray and Chelsea Voss and Alec Radford and Mark Chen and Ilya Sutskever},
year = {2021},
eprint = {2102.12092},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
@misc{esser2021taming,
title = {Taming Transformers for High-Resolution Image Synthesis},
author = {Patrick Esser and Robin Rombach and Björn Ommer},
year = {2021},
eprint = {2012.09841},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}