We use BLIP2 as the multimodal pre-training method. BLIP2 is among the state-of-the-art models for multimodal pre-training, outperforming most existing methods on Visual Question Answering, Image Captioning, and Image-Text Retrieval. For the LLM, we use Llama 2.
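As a minimal sketch, running image captioning with BLIP2 through the Hugging Face `transformers` library might look like the following. The checkpoint name and example image URL are illustrative assumptions, not necessarily what this solution deploys:

```python
# Sketch: image captioning with BLIP2 via Hugging Face transformers.
# The checkpoint ("Salesforce/blip2-opt-2.7b") and image URL are
# illustrative; this solution may load different weights.
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Load a sample image (a COCO validation image).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Encode the image and generate a caption.
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```

The same processor/model pair also accepts a text prompt alongside the image, which is how BLIP2 handles Visual Question Answering.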
The overall solution and architecture are illustrated below.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.