
Parameter-Efficient Fine-tuning of InstructBLIP for Visual Reasoning Tasks

We inspect the effectiveness of PEFT methods on the Q-Former and LLM layers for visual reasoning tasks.

Overview

Visual language models have recently demonstrated enhanced capabilities in visual reasoning tasks by employing external modules on top of language models for visual-language alignment. InstructBLIP uses a Q-Former and a projection layer to convert input image embeddings into soft visual prompts that enhance the instruction-following capabilities of large language models (LLMs). Although fine-tuning InstructBLIP has shown great results on downstream tasks, previous works have been restricted to fully fine-tuning the Q-Former while freezing the LLM.

In this work, we investigate the performance of the PEFT method LoRA on both the Q-Former and the base LLMs, specifically Flan-T5-XL and Vicuna-7B, using the visual reasoning benchmarks ScienceQA and IconQA. We observe that, when the LLM is frozen, training the Q-Former with LoRA achieves performance comparable to full fine-tuning using under 2% of the trainable parameters. Furthermore, fine-tuning the LLM consistently results in better performance than InstructBLIP. Lastly, applying LoRA to both the LLM and the Q-Former surpasses the performance of fully fine-tuning only the Q-Former while using less than 12% of the trainable parameters. These results highlight the effectiveness of applying PEFT to visual language models for visual reasoning tasks. The code is available at https://github.com/AttentionX/InstructBLIP_PEFT.


Install

Install Code

  1. Clone this repository and navigate to the InstructBLIP_PEFT folder.
git clone https://github.com/AttentionX/InstructBLIP_PEFT.git
cd InstructBLIP_PEFT
  2. Install packages.
pip install -r requirements.txt

Install ScienceQA dataset

  1. Download the ScienceQA dataset from https://scienceqa.github.io/
  2. Run scienceqa_data_preprocess.py

This will save the preprocessed ScienceQA dataset in /input/scienceqa/.

This is the instruction format for the ScienceQA dataset.

<Image> Context: { {hint} {lecture} } Question: { {question} } Options: { {choices} } Answer: (a) { {answer} }
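
As an illustration only, here is how a record might be rendered into a template like the one above. The field names follow the placeholders in the template, but the sample record and the simplified brace style are made up for this sketch; the actual formatting lives in the preprocessing script.

```python
# Simplified version of the ScienceQA instruction template above
# (literal braces around fields omitted for clarity).
TEMPLATE = (
    "<Image> Context: {hint} {lecture} Question: {question} "
    "Options: {choices} Answer: ({label}) {answer}"
)

# Hypothetical sample record, for illustration only.
sample = {
    "hint": "The map shows four states.",
    "lecture": "States are shaded on the map.",
    "question": "Which state is highlighted?",
    "choices": "(a) Texas (b) Ohio (c) Utah",
    "label": "a",
    "answer": "Texas",
}

prompt = TEMPLATE.format(**sample)
print(prompt)
```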

Install IconQA dataset

  1. Download the multi-text-choice dataset from https://iconqa.github.io/
  2. Run iconqa_data_preprocess.py

This will save the preprocessed IconQA dataset in /input/iconqa/.

This is the instruction format for the IconQA dataset.

<Image> Question: { {question} } Options: { {choices} }. Short answer: (a) { {answer} }

Train

We train our model using a single A100 GPU.

Dataset

Datasets must be placed in the location specified in the file lavis/config/datasets/{dataset_name}/defaults.yaml.

This is an example of a dataset default.yaml file.

# lavis/config/datasets/scienceqa/default.yaml
datasets:
  scienceqa:
    # data_dir: ${env.data_dir}/datasets
    data_type: images # [images|videos|features]

    build_info:
      # Be careful not to append minus sign (-) before split to avoid itemizing
      annotations:
        train:
          storage: /input/scienceqa/scienceqa_train.json
        val:
          storage: /input/scienceqa/scienceqa_val.json
        test:
          storage: /input/scienceqa/scienceqa_test.json
      images:
        storage: /input
        train:
          storage: /input
        val:
          storage: /input
        test:
          storage: /input

In this case, the dataset JSON files (scienceqa_train.json, scienceqa_test.json, and scienceqa_val.json) should be located at /input/scienceqa.
Image files should be located at /input/scienceqa/images/train, /input/scienceqa/images/test, and /input/scienceqa/images/val, because those are the paths referenced in the JSON files.
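
A quick, illustrative sanity check (not part of the repository) that the dataset files sit where the config expects them; the paths below are taken from the YAML snippet and the layout described above.

```python
import os

def missing_paths(paths):
    """Return the subset of the given paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Locations taken from the default.yaml snippet and the JSON layout above.
REQUIRED = [
    "/input/scienceqa/scienceqa_train.json",
    "/input/scienceqa/scienceqa_val.json",
    "/input/scienceqa/scienceqa_test.json",
    "/input/scienceqa/images/train",
    "/input/scienceqa/images/val",
    "/input/scienceqa/images/test",
]

print(missing_paths(REQUIRED))  # empty list when the layout is complete
```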

Experiment ID

This is the table of IDs for each experiment.

|                                        | r = 1 | r = 2 | r = 4 | r = 8 |
|----------------------------------------|-------|-------|-------|-------|
| LLM LoRA (ffn, FlanT5-XL)              | 1     | 2     | 3     | 4     |
| LLM LoRA (attn, FlanT5-XL)             | 5     | 6     | 7     | 8     |
| LLM LoRA (all, FlanT5-XL)              | 9     | 10    | 11    | 12    |
| Q-Former LoRA (ffn, FlanT5-XL)         | 13    | 14    | 15    | 16    |
| Q-Former LoRA (self-attn, FlanT5-XL)   | 17    | 18    | 19    | 20    |
| Q-Former LoRA (cross-attn, FlanT5-XL)  | 21    | 22    | 23    | 24    |
| Q-Former LoRA (all, FlanT5-XL)         | 25    | 26    | 27    | 28    |
| Q-Former and LLM LoRA (all, FlanT5-XL) | 29    | 30    | 31    | 32    |
| LLM LoRA (ffn, Vicuna-7B)              | 33    | 34    | 35    | 36    |
| LLM LoRA (attn, Vicuna-7B)             | 37    | 38    | 39    | 40    |
| LLM LoRA (all, Vicuna-7B)              | 41    | 42    | 43    | 44    |
| Q-Former LoRA (ffn, Vicuna-7B)         | 45    | 46    | 47    | 48    |
| Q-Former LoRA (self-attn, Vicuna-7B)   | 49    | 50    | 51    | 52    |
| Q-Former LoRA (cross-attn, Vicuna-7B)  | 53    | 54    | 55    | 56    |
| Q-Former LoRA (all, Vicuna-7B)         | 57    | 58    | 59    | 60    |
| Q-Former and LLM LoRA (all, Vicuna-7B) | 61    | 62    | 63    | 64    |
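
The IDs in the table follow a regular pattern (rows in order, ranks 1/2/4/8 left to right), so they can be reconstructed programmatically. The helper below is hypothetical, not part of the repository; it simply encodes the table above as a lookup.

```python
# LoRA targets in table-row order; the same eight targets repeat for each model.
TARGETS = [
    "LLM LoRA (ffn)", "LLM LoRA (attn)", "LLM LoRA (all)",
    "Q-Former LoRA (ffn)", "Q-Former LoRA (self-attn)",
    "Q-Former LoRA (cross-attn)", "Q-Former LoRA (all)",
    "Q-Former and LLM LoRA (all)",
]
RANKS = [1, 2, 4, 8]

def experiment_id(target: str, model: str, rank: int) -> int:
    """Return the experiment ID for a LoRA target, base model, and rank.

    FlanT5-XL rows occupy IDs 1-32; Vicuna-7B rows occupy IDs 33-64.
    """
    row = TARGETS.index(target) + (8 if model == "Vicuna-7B" else 0)
    return row * 4 + RANKS.index(rank) + 1

print(experiment_id("Q-Former LoRA (ffn)", "FlanT5-XL", 4))
```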

Run Script

You can run an experiment with this command.

bash run_scripts/instructblip/train/run_finetune_instructblip_experiments.sh {dataset_name} {experiment_id}

The result will be saved in /input/results/{dataset_name}/{experiment_id}. You can change this in the script run_finetune_instructblip_experiments.sh.

For example, if you want to try experiment 15 on ScienceQA, you can use this command.

bash run_scripts/instructblip/train/run_finetune_instructblip_experiments.sh scienceqa 15

Citation

Acknowledgement

License

BSD 3-Clause License (from LAVIS)

Apache 2.0 License (from lit-llama)

