
lingo's Introduction

Efficient and Scalable Fine-Tune of Language Models for Genome Understanding

Parameter-efficient fine-tuning (PEFT) has become the de facto approach to fine-tuning pre-trained foundation models (PFMs) at reduced computational cost. Current PEFT approaches include:

  1. Prefix tuning methods, e.g., "Prefix-Tuning: Optimizing Continuous Prompts for Generation" and "P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks"
  2. Prompt tuning methods, e.g., "The Power of Scale for Parameter-Efficient Prompt Tuning"
  3. Low-rank adaptation methods, e.g., "LoRA: Low-Rank Adaptation of Large Language Models" and "AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning"
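To make the low-rank adaptation idea in item 3 concrete, here is a minimal pure-Python sketch of a LoRA-style layer. It is not code from this repository: the sizes, rank, scaling factor `alpha`, and helper names are hypothetical and chosen only for illustration. The frozen weight `W` is left untouched, and only the two small matrices `A` and `B` would be trained.

```python
import random

def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x), the LoRA-adapted layer output."""
    base = matvec(W, x)                 # frozen pre-trained path
    update = matvec(B, matvec(A, x))    # low-rank path; B @ A is never formed
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_out, d_in, r, alpha = 8, 8, 2, 4     # hypothetical sizes, rank, and scaling

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
B = [[0.0] * r for _ in range(d_out)]                                    # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

full_params = d_out * d_in             # weights updated by full fine-tuning
lora_params = r * d_in + d_out * r     # weights updated by the low-rank path
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and the number of trainable weights grows with the rank `r` rather than with `d_out * d_in`.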

Building on these methods, we use adaptive rank sampling to address data heterogeneity, and LINGO (Language prefix fINe-tuning for GenOmes) to leverage the in-context learning ability of LLMs. The framework is as follows:

The framework

The repository is organized as follows:

  1. dataset/: the data sets. We applied adaptive rank sampling to a comprehensive set of genome understanding tasks on various LLMs, i.e., promoter detection and epigenetic marks prediction in yeast and in multiple human cell types. The link is here.
  2. finetune/: fine-tuning of LLMs and pre-trained DNA foundation models for single-label and multi-label tasks, using DSP with BBPE-tokenized embeddings and one-hot embeddings.
  3. peftnew/: coupling of adaptive rank sampling (RS) with the AdaLoRA method.
  4. scripts/: SLURM batch scripts that run the .py files.
  5. demos/: minimal demos that run AdaLoRA + RS with DSP on OPT and 4-bit quantized Llama. See llama_dna_sequential_finetune_QLoRA.ipynb.
  6. Two fine-tuned checkpoints are also available; see the link. To use one, replace "/path/to/your/local/model" with the actual file path to your saved model on your local system:
model_name_or_path: Optional[str] = field(default="/path/to/your/local/model")
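The field above reads like a member of a Python dataclass holding the model arguments. A minimal sketch of how such a field can be overridden follows; the class name `ModelArguments`, the helper `with_checkpoint`, and the example path are hypothetical, and only the `model_name_or_path` field itself comes from this repository.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:  # hypothetical class name, for illustration only
    # Field as shown above; the default is a placeholder that must be replaced.
    model_name_or_path: Optional[str] = field(default="/path/to/your/local/model")

def with_checkpoint(path: str) -> ModelArguments:
    """Build the arguments with the placeholder swapped for a real checkpoint path."""
    return ModelArguments(model_name_or_path=path)

# Hypothetical local path to a downloaded fine-tuned checkpoint.
args = with_checkpoint("/data/checkpoints/lingo-opt")
```

Editing the default in place, as the instructions above describe, has the same effect; passing the path explicitly simply avoids modifying the source file.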

Setting up environment

Setup on a standard PC typically takes several tens of minutes.

conda env create -f dna_llm.yml

Fine-tuning

sbatch run_llm_lora.sh data_path

Models support matrix

Find models that are supported out of the box below.

Model | LoRA | AdaLoRA | Adaptive rank sampling | LINGO + one-hot | LINGO + BBPE
1000G-500M
DNABERT-2
OPT
LLaMA

Figures

The Pareto front

The MCC Changes over time
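For readers unfamiliar with the metric, and assuming MCC here denotes the Matthews correlation coefficient, the following sketch computes it from binary confusion-matrix counts. The function name and the convention of returning 0.0 when the denominator vanishes are illustrative choices, not taken from this repository.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when any marginal count is zero.
    return numerator / denominator if denominator else 0.0
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct), which makes it less sensitive to class imbalance than plain accuracy.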

Cite

@inproceedings{zhan2023parameter,
  title={Parameter-Efficient Fine-Tune on Open Pre-trained Transformers for Genomic Sequence},
  author={Zhan, Huixin and Zhang, Zijun Frank},
  booktitle={NeurIPS 2023 Generative AI and Biology (GenBio) Workshop},
  year={2023}
}
