Hey fellow CV Course Contributors and Reviewers 🤗
This issue discusses an initial draft outline for the chapter Fusion of Text and Vision, which is part of Unit 4: Multimodal Models. Since this is an introductory chapter, we feel it should contain less code and place more emphasis on concepts, setting the stage for later sections in the unit. We would like this unit to be short and crisp: at most 3 sections, nothing more, unless other additions such as Spaces/demos are required.
Thought process: the previous unit is Unit 3 on Vision Transformers, and the next is Unit 5 on Generative Models. Content in this unit will therefore build on Unit 3's transformer models (rather than traditional approaches to these tasks, so we will refrain from adding too much historical background) and will also serve as a precursor to later sections and to Unit 5 on Generative Models.
1. Introduction
- Why Multimodality?
- Real-world data is often multimodal (a combination of different modalities)
- Short example of the human sensory feedback system (humans make decisions based on different sensory inputs and feedback)
- Multimodal in what sense? Data? Models? Fusion technique? Are spectrograms an example of multimodal data representation? (Input is multimodal, output is multimodal, or input and output are of different modalities; this part lays the foundation for multimodal tasks and models)
- Why data is multimodal in many real-life scenarios, and how multimodal content is essential for applications such as search (examples from Google and Bing)
- Some cool applications and examples of multimodality (Robotics: Vision Language Action models such as RT-2, RT-X, PaLM-E)
2. Multimodal Tasks and Models
A brief overview of different tasks and models (with more emphasis on tasks that will be taken up later in the course in sections like #29 and #28).
Briefly describe each task and its models (task, input and output, model names with links or Spaces). We can include other examples such as text-to-speech and speech-to-text, with a one-liner pointing to the HF Audio Course ("For more information on this, refer to the HF Audio Course"). After this, focus on Vision + Text/Audio.
Tasks and Models (each task, its input and output, and around 3-4 model names to go with it):
- Document Visual Question Answering (text + vision). Models: LayoutLM, Nougat, Donut
- Image-to-Text and Visual Question Answering. Models: DePlot, Pix2Struct, ViLT, TrOCR, BLIP (a minimal captioning sketch follows after this section's notes)
- Text-to-Image (synthesis and generation). Models: Stable Diffusion, IF, etc.
- Image and Video Captioning
- Text-to-Video. Models: CLIP-VTT, etc.
We can also create an infographic (a chart or hierarchy) that divides the models into categories such as text + vision, text + vision + audio, more than three modalities, etc.
Mention all tasks involving vision + X (audio, text) here, and focus on Vision Language Models (text + vision) in the next section.
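If we do want one tiny code touchpoint in this chapter, the image-to-text task could be illustrated with a minimal sketch like the one below (this assumes the `transformers` pipeline API and the `Salesforce/blip-image-captioning-base` checkpoint as an example; the image path is a placeholder and the final chapter can swap in whichever model we settle on):

```python
from transformers import pipeline

# Image captioning (image-to-text) with BLIP as an example checkpoint.
# Any image-to-text model from the Hub could be substituted here.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# The pipeline accepts a local path or a URL to an image.
result = captioner("path/to/image.jpg")  # hypothetical local image
print(result[0]["generated_text"])  # a short caption describing the image
```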
3. Vision Language Models
- Brief introduction to Vision Language Models and how they work
- Cool applications and examples (multimodal chatbots such as GILL, LLaVA, Video-ChatGPT, and the cool application being developed in #29)
- Emphasize tasks that involve CLIP and its relatives (#29); a minimal zero-shot CLIP sketch follows after this list
- A brief closing to the introduction that sets the stage for the next sections, such as CLIP and relatives and fine-tuning
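To anchor the CLIP discussion, a minimal zero-shot image classification sketch could look like the following (assuming the `openai/clip-vit-base-patch32` checkpoint and a placeholder image path; the exact model choice is open):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot image classification with CLIP:
# score an image against free-form text labels.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("path/to/image.jpg")  # hypothetical local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores;
# softmax turns them into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```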
References:
- Awesome Self-Supervised Multimodal Learning
- HF Tasks
- Multimodal Machine Learning Course, CMU
- Meta's ImageBind
- Multimodal Machine Learning: A Survey and Taxonomy
- Recent blog by Chip Huyen
Please feel free to share your views on the outline 🤗 🚀 🔥