In the last few years, there have been several breakthroughs in the methodologies used in Natural Language Processing (NLP) as well as Computer Vision (CV). Beyond these improvements to single-modality models, large-scale multi-modal approaches have become a very active area of research.
In this seminar, we reviewed these approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually. We then discuss modeling frameworks in which one modality is transformed into the other, as well as models in which one modality is used to enhance representation learning for the other. To conclude the second part, we introduce architectures that handle both modalities simultaneously. Finally, we also cover other modalities as well as general-purpose multi-modal models, which can handle different tasks on different modalities within one unified architecture. An interesting application, Generative Art, caps off this booklet.
This book is the result of a student seminar for the Master's programs in Statistics and Data Science at LMU Munich in the summer semester 2022. Each student wrote a specific chapter of the book to pass the seminar.
Step 0: Prerequisites
Make sure you have git and R up and running on your computer.
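If you want to verify this from within R, an optional quick check could look like the following (the required R version is not specified here, so this only confirms that R and git are available at all):

# optional sanity check: print the installed R version and locate the git executable
getRversion()      # installed R version
Sys.which("git")   # path to git; an empty string means git was not found on the PATH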
Step 1: Clone the repository to your machine
With RStudio: https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN
With command-line:
git clone git@github.com:slds-lmu/seminar_multimodal_dl.git
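If no SSH key is set up for GitHub, the repository can presumably also be cloned over HTTPS using the URL https://github.com/slds-lmu/seminar_multimodal_dl.git (assuming public access).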
Step 2: Install dependencies
Start R in the project folder:
install.packages("devtools")
devtools::install_dev_deps()
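devtools::install_dev_deps() installs the packages declared as dependencies in the project's DESCRIPTION file, so the remaining R packages needed to build the book should be installed automatically.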
Step 3: Render the book (R commands)
# HTML
bookdown::render_book('./', 'bookdown::gitbook')
# PDF
bookdown::render_book('./', 'bookdown::pdf_book')
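Note that rendering the PDF version requires a LaTeX distribution, which the steps above do not cover. If none is installed, one common option (an assumption on our part, not part of the original instructions) is tinytex:

# optional: install a minimal LaTeX distribution for the PDF output
# (assumes tinytex is an acceptable choice for this project)
install.packages("tinytex")
tinytex::install_tinytex()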