
AIDocks

The AI Trainer's Dry Dock.

Features

  • 🚀 Fine-Tune Embeddings, ReRankers & Large Language Models (LLMs),
  • 🚀 Dataset templates,
  • 🚀 Build-Your-Own Mixture-of-Experts (MoE),
  • 🚀 Optimize LLMs with LaserRMT (LASER + Random Matrix Theory),
  • 🚀 Quantize models to reduce model size &
  • 🚀 Publish models to 🤗 HuggingFace Hub.

Roadmap

(unsorted)

  • Auto hardware detection -> model recommendations for fine-tuning and inference
  • Combined LLM & retrieval model fine-tuning with human feedback
  • The Truth Tables: distributed (private & shared) knowledge/document management in Chroma over a super- and sub-domain graph in Neo4j.
  • Model Conditioning: chat-based LLM alignment for domain expertise with automatic & human scoring of retrieval relevance, AI reasoning & conclusions.
    • Memory & history
    • Domain-specific knowledge retrieval & expert prompting
    • Multiple conversations
    • Multiple human & AI participants
    • General & agent-specific knowledge attachment by domain tags
    • Automatic & human evaluation of retrieval, reasoning & conclusion results
  • AI Task Library

Disclaimer: AIDocks is at a very early stage of development, so feedback and contributions are highly appreciated!

Pre-Requisites

  1. CUDA-capable GPU
  2. Docker & docker-compose
  3. NVIDIA Container Toolkit

Quick Start

git clone https://github.com/l4b4r4b4b4/AIDocks
cd AIDocks
docker-compose up -d && \
docker-compose ps && \
docker-compose logs -f

Go to the interactive API documentation to explore all available endpoints & features!

Services

Docks WebApp

Docks API

Vision

LLaVA 1.6 service including a Gradio frontend, controller & model worker

llm-inference

Endpoints 🚀

The following endpoints are exposed:

  1. /train
  2. /compose
  3. /optimize
  4. /quantize
  5. /publish
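
Each of these is a plain HTTP route served by the Docks API, so any HTTP client works. A minimal sketch with curl, assuming the API is reachable on localhost port 8000 (the host and port are assumptions here; check `docker-compose ps` for the actual mapping):

# Hypothetical call: POST a JSON request body to one of the routes.
# Host/port are assumptions; see the endpoint sections below for body formats.
curl -X POST http://localhost:8000/optimize \
  -H "Content-Type: application/json" \
  -d @request.json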

/train Training & Fine-Tuning

The training routes expose endpoints to fine-tune LLMs as well as the embedding and reranking models used for retrieval.

/train/llm LLM fine-tuning (DPO & SFT)

Try the API endpoint: fine-tune Mistral & Llama models 2-5x faster with 50% less memory using unsloth.

Example datasets when using ChatML:

  1. SFT
  2. DPO
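
For orientation, here is what individual records might look like; the exact field names the endpoint expects are assumptions on my part, so treat these as sketches and check the example datasets above for the authoritative schema. An SFT record typically carries a full ChatML conversation, while a DPO record pairs a prompt with a preferred and a rejected answer.

SFT record (ChatML conversation):

{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is LoRA?"},
        {"role": "assistant", "content": "LoRA trains small low-rank adapter matrices on top of frozen base weights, which makes fine-tuning much cheaper."}
    ]
}

DPO record (preference pair):

{
    "prompt": "What is LoRA?",
    "chosen": "LoRA trains small low-rank adapter matrices on top of frozen base weights.",
    "rejected": "LoRA just means training with a larger learning rate."
}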

Supported Models

  • Llama,
  • Yi,
  • Mistral,
  • CodeLlama,
  • Qwen (llamafied),
  • DeepSeek and their derived models (OpenHermes etc.).

Features

  1. All kernels written in OpenAI's Triton language. Manual backprop engine
  2. 0% loss in accuracy - no approximation methods - all exact
  3. No change of hardware required. Supports NVIDIA GPUs from 2018 onwards with a minimum CUDA capability of 7.0 (V100, T4, Titan V, RTX 20/30/40x, A100, H100, L40 etc.). Check your GPU! GTX 1070 and 1080 work, but are slow.
  4. Works on Linux and Windows via WSL
  5. Download 4-bit models 4x faster from 🤗 HuggingFace! E.g.: unsloth/mistral-7b-bnb-4bit
  6. Supports 4-bit and 16-bit QLoRA / LoRA fine-tuning via bitsandbytes

/train/emb Embeddings

LoRA-PEFT for embedding models using the peft and accelerate libraries.

Supported Models

Example datasets

/train/rerank ReRankers

LoRA-PEFT for re-ranking models.

Supported Models

Example datasets

/compose - BYO-MoE

Try the API endpoint.

/compose is an endpoint for combining Mistral or Llama models of the same size into Mixture-of-Experts models. The endpoint will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models.

The /compose endpoint can be used with minimal or no GPU.

The /compose endpoint uses its own JSON configuration syntax; an example request body:

{
    "base_model": "cognitivecomputations/dolphin-2.6-mistral-7b-dpo",
    "gate_mode": "hidden",
    "dtype": "bfloat16",
    "experts": [
        {
            "source_model": "teknium/OpenHermes-2.5-Mistral-7B",
            "positive_prompts": [
                "instruction",
                "solutions",
                "chat",
                "questions",
                "comprehension"
            ]
        },
        {
            "source_model": "openaccess-ai-collective/DPOpenHermes-7B",
            "positive_prompts": [
                "mathematics",
                "optimization",
                "code",
                "step-by-step",
                "science"
            ],
            "negative_prompts": [
                "chat",
                "questions"
            ]
        }
    ]
}

Options:

gate_mode: hidden, cheap_embed, or random

dtype: float32, float16, or bfloat16

Gate Modes

Three methods for populating the MoE gates are implemented.

"hidden"

Uses the hidden state representations of the positive/negative prompts for MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model so you might not be able to use this on constrained hardware (depending on the model).

Coming Soon: use --load-in-8bit or --load-in-4bit to reduce VRAM usage.
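
A plausible formalization of this construction, using my own notation rather than anything taken from the repository: for expert $e$ with positive prompts $P_e$ and negative prompts $N_e$, take the layer-$\ell$ hidden states $h_\ell(p)$ the base model produces for each prompt $p$ and use the difference of the prompt means as the gate vector:

$$g_{e,\ell} = \frac{1}{|P_e|} \sum_{p \in P_e} h_\ell(p) \;-\; \frac{1}{|N_e|} \sum_{p \in N_e} h_\ell(p)$$

The router then scores each token against $g_{e,\ell}$, which is why the choice of prompts directly shapes which expert a token is routed to.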

"cheap_embed"

Uses only the raw token embedding of the prompts, using the same gate parameters for every layer. Distinctly less effective than "hidden". Can be run on much, much lower end hardware.

"random"

Randomly initializes the MoE gates. Good for if you are going to fine tune the model afterwards, or maybe if you want something a little unhinged? I won't judge.

/optimize - LaserRMT

Try the API endpoint. Example request body:

{
    "base_model_name" : "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "laser_model_name": "TinyLaser",
    "top_k_layers": 15
}

LaserRMT optimizes LLMs by combining Layer-Selective Rank Reduction (LASER) with the Marchenko-Pastur law from Random Matrix Theory. This method targets model complexity reduction while maintaining or enhancing performance, making it more efficient than the traditional brute-force search method.

  1. LASER Framework Adaptation: LaserRMT adapts the LASER technique, which reduces the complexity of neural networks by selectively pruning the weights of a model's layers.
  2. Marchenko-Pastur Law Integration: The Marchenko-Pastur law, a concept from Random Matrix Theory used to determine the distribution of eigenvalues in large random matrices, guides the identification of redundant components in LLMs. This allows for effective complexity reduction without loss of key information (see the sketch after this list).
  3. Enhanced Model Performance: By systematically identifying and eliminating less important components in the model's layers, LaserRMT can potentially enhance the model's performance and interpretability.
  4. Efficient Optimization Process: LaserRMT provides a more efficient and theoretically robust framework for optimizing large-scale language models, setting a new standard for language model refinement.
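
For a rough sense of how the Marchenko-Pastur law applies here (the notation and normalization below are mine, not taken from the repository): model a weight matrix $W \in \mathbb{R}^{m \times n}$, $m \le n$, as signal plus i.i.d. noise of variance $\sigma^2$. The Marchenko-Pastur law says the eigenvalues of the noise covariance concentrate in

$$[\lambda_-, \lambda_+], \qquad \lambda_\pm = \sigma^2 \left(1 \pm \sqrt{\gamma}\right)^2, \qquad \gamma = m/n,$$

so singular-value components of $W$ whose normalized squared singular values fall below the upper edge $\lambda_+$ are statistically indistinguishable from noise and can be pruned via SVD truncation, while components above the edge are kept as signal.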

This approach opens new avenues for optimizing neural networks, underscoring the synergy between advanced mathematical theories and practical AI applications. LaserRMT sets a precedent for future developments in the field of LLM optimization.

/quantize/{method}

Try the API endpoint.

AWQ

Generate AWQ quantizations optimized for GPU inference.
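
The request body for this route is not documented here, so the following is only a hypothetical sketch (all field names are my assumptions; consult the interactive API docs for the real schema), with the quantization method selected via the path, e.g. /quantize/awq:

{
    "model_name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "output_name": "TinyLlama-1.1B-Chat-AWQ"
}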

/publish to HuggingFace 🤗

Try the API endpoint. Publish locally generated models to the 🤗 HuggingFace Hub.
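
Again a hypothetical request body sketch rather than the documented schema (field names are my assumptions; publishing presumably needs a local model reference, a target repository and a 🤗 access token):

{
    "local_model": "TinyLlama-1.1B-Chat-AWQ",
    "repo_id": "your-username/TinyLlama-1.1B-Chat-AWQ",
    "hf_token": "hf_..."
}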

Explaining Resources

Resources explaining the concepts, technologies and tools used in this repository.

  1. MergeKit Mixtral
  2. Mixture of Experts for Clowns (at a Circus)
  3. Fernando Fernandes Neto, David Golchinfar and Eric Hartford. "Optimizing Large Language Models Using Layer-Selective Rank Reduction and Random Matrix Theory." 2024.
  4. The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
  5. An Empirical view of Marchenko-Pastur Theorem

