November 2021

tl;dr: Scalable unsupervised pretraining of vision models via masked image modeling.

Overall impression

This paper is very enlightening.

This paper seems to have rushed contemporary works such as SimMIM and iBOT to publication. The clarity of the message, the depth of insight, the engineering craft, and the coverage of the ablation studies make MAE significantly superior to the others.

Key ideas

  • Masking a high proportion of the input image yields a nontrivial and meaningful self-supervisory task.
  • Language and vision have very different information density.
    • Language is a human-generated signal that is highly semantic and information-dense.
    • Images are natural signals with heavy spatial redundancy, which is why a very high masking ratio still leaves a solvable task.
  • Asymmetric encoder and decoder
    • The encoder operates only on visible (unmasked) patches; mask tokens are introduced only in the lightweight decoder.
    • This saves significant computation for the transformer-based backbone (see the masking sketch after this list).
  • Downstream tasks (object detection, instance and semantic segmentation) all surpassed supervised pretraining.
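To make the asymmetric design concrete, below is a minimal sketch (not the official implementation) of MAE-style per-sample random masking in PyTorch. The function name `random_masking`, the 75% mask ratio default, and the tensor shapes are illustrative assumptions; it only shows how the heavy encoder can be fed just the visible patches.

```python
# Minimal sketch of MAE-style random masking, assuming the image has already
# been split into a sequence of patch embeddings of shape (batch, num_patches, dim).
import torch


def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patches per sample.

    Returns the visible patches (to be fed to the encoder), a binary mask
    over all patches (0 = visible, 1 = masked), and the indices needed to
    restore the original patch order in the decoder.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Per-patch random noise decides which patches are kept.
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = torch.argsort(noise, dim=1)        # ascending: lowest noise is kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))

    # Binary mask in the original patch order.
    mask = torch.ones(B, N, device=patches.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)  # e.g. 14x14 patches from a ViT-B-sized embedding
    visible, mask, ids_restore = random_masking(x, mask_ratio=0.75)
    # Only ~25% of the tokens are passed to the heavy encoder, which is where
    # the compute savings of the asymmetric encoder/decoder come from.
    print(visible.shape, mask.shape)  # torch.Size([2, 49, 768]) torch.Size([2, 196])
```

In the decoder, mask tokens would be appended to the encoded visible tokens and un-shuffled with `ids_restore` before reconstructing pixels; that part is omitted here to keep the sketch focused on the masking step.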

Technical details

  • Summary of technical details

Notes

  • Questions and notes on how to improve/revise the current work