mlabonne / llm-course Goto Github PK

View Code? Open in Web Editor NEW

29.1K 29.1K 3.0K 6.75 MB

Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.

Home Page: https://mlabonne.github.io/blog/

License: Apache License 2.0

Jupyter Notebook 100.00%

course large-language-models llm machine-learning roadmap

llm-course's Introduction

🗣️ Large Language Model Course

🐦 Follow me on X • 🤗 Hugging Face • 💻 Blog • 📙 Hands-on GNN

The LLM course is divided into three parts:

🧩 LLM Fundamentals covers essential knowledge about mathematics, Python, and neural networks.
🧑‍🔬 The LLM Scientist focuses on building the best possible LLMs using the latest techniques.
👷 The LLM Engineer focuses on creating LLM-based applications and deploying them.

For an interactive version of this course, I created two LLM assistants that will answer questions and test your knowledge in a personalized way:

🤗 HuggingChat Assistant: Free version using Mixtral-8x7B.
🤖 ChatGPT Assistant: Requires a premium account.

📝 Notebooks

A list of notebooks and articles related to large language models.

Tools

Notebook	Description	Notebook
🧐 LLM AutoEval	Automatically evaluate your LLMs using RunPod
🥱 LazyMergekit	Easily merge models using MergeKit in one click.
🦎 LazyAxolotl	Fine-tune models in the cloud using Axolotl in one click.
⚡ AutoQuant	Quantize LLMs in GGUF, GPTQ, EXL2, AWQ, and HQQ formats in one click.
🌳 Model Family Tree	Visualize the family tree of merged models.
🚀 ZeroSpace	Automatically create a Gradio chat interface using a free ZeroGPU.

Fine-tuning

Notebook	Description	Article
Fine-tune Llama 2 with SFT	Step-by-step guide to supervised fine-tune Llama 2 in Google Colab.	Article
Fine-tune CodeLlama using Axolotl	End-to-end guide to the state-of-the-art tool for fine-tuning.	Article
Fine-tune Mistral-7b with SFT	Supervised fine-tune Mistral-7b in a free-tier Google Colab with TRL.	Article
Fine-tune Mistral-7b with DPO	Boost the performance of supervised fine-tuned models with DPO.	Article
Fine-tune Llama 3 with ORPO	Cheaper and faster fine-tuning in a single stage with ORPO.	Article

Quantization

Notebook	Description	Article
1. Introduction to Quantization	Large language model optimization using 8-bit quantization.	Article
2. 4-bit Quantization using GPTQ	Quantize your own open-source LLMs to run them on consumer hardware.	Article
3. Quantization with GGUF and llama.cpp	Quantize Llama 2 models with llama.cpp and upload GGUF versions to the HF Hub.	Article
4. ExLlamaV2: The Fastest Library to Run LLMs	Quantize and run EXL2 models and upload them to the HF Hub.	Article

Other

Notebook	Description	Article
Decoding Strategies in Large Language Models	A guide to text generation from beam search to nucleus sampling	Article
Improve ChatGPT with Knowledge Graphs	Augment ChatGPT's answers with knowledge graphs.	Article
Merge LLMs with MergeKit	Create your own models easily, no GPU required!	Article
Create MoEs with MergeKit	Combine multiple experts into a single frankenMoE	Article

🧩 LLM Fundamentals

This section introduces essential knowledge about mathematics, Python, and neural networks. You might not want to start here, but refer to it as needed.

Toggle section

1. Mathematics for Machine Learning

Before mastering machine learning, it is important to understand the fundamental mathematical concepts that power these algorithms.

Linear Algebra: This is crucial for understanding many algorithms, especially those used in deep learning. Key concepts include vectors, matrices, determinants, eigenvalues and eigenvectors, vector spaces, and linear transformations.
Calculus: Many machine learning algorithms involve the optimization of continuous functions, which requires an understanding of derivatives, integrals, limits, and series. Multivariable calculus and the concept of gradients are also important.
Probability and Statistics: These are crucial for understanding how models learn from data and make predictions. Key concepts include probability theory, random variables, probability distributions, expectations, variance, covariance, correlation, hypothesis testing, confidence intervals, maximum likelihood estimation, and Bayesian inference.

📚 Resources:

3Blue1Brown - The Essence of Linear Algebra: Series of videos that give a geometric intuition to these concepts.
StatQuest with Josh Starmer - Statistics Fundamentals: Offers simple and clear explanations for many statistical concepts.
AP Statistics Intuition by Ms Aerin: List of Medium articles that provide the intuition behind every probability distribution.
Immersive Linear Algebra: Another visual interpretation of linear algebra.
Khan Academy - Linear Algebra: Great for beginners as it explains the concepts in a very intuitive way.
Khan Academy - Calculus: An interactive course that covers all the basics of calculus.
Khan Academy - Probability and Statistics: Delivers the material in an easy-to-understand format.

2. Python for Machine Learning

Python is a powerful and flexible programming language that's particularly good for machine learning, thanks to its readability, consistency, and robust ecosystem of data science libraries.

Python Basics: Python programming requires a good understanding of the basic syntax, data types, error handling, and object-oriented programming.
Data Science Libraries: It includes familiarity with NumPy for numerical operations, Pandas for data manipulation and analysis, Matplotlib and Seaborn for data visualization.
Data Preprocessing: This involves feature scaling and normalization, handling missing data, outlier detection, categorical data encoding, and splitting data into training, validation, and test sets.
Machine Learning Libraries: Proficiency with Scikit-learn, a library providing a wide selection of supervised and unsupervised learning algorithms, is vital. Understanding how to implement algorithms like linear regression, logistic regression, decision trees, random forests, k-nearest neighbors (K-NN), and K-means clustering is important. Dimensionality reduction techniques like PCA and t-SNE are also helpful for visualizing high-dimensional data.

📚 Resources:

Real Python: A comprehensive resource with articles and tutorials for both beginner and advanced Python concepts.
freeCodeCamp - Learn Python: Long video that provides a full introduction into all of the core concepts in Python.
Python Data Science Handbook: Free digital book that is a great resource for learning pandas, NumPy, Matplotlib, and Seaborn.
freeCodeCamp - Machine Learning for Everybody: Practical introduction to different machine learning algorithms for beginners.
Udacity - Intro to Machine Learning: Free course that covers PCA and several other machine learning concepts.

3. Neural Networks

Neural networks are a fundamental part of many machine learning models, particularly in the realm of deep learning. To utilize them effectively, a comprehensive understanding of their design and mechanics is essential.

Fundamentals: This includes understanding the structure of a neural network such as layers, weights, biases, and activation functions (sigmoid, tanh, ReLU, etc.)
Training and Optimization: Familiarize yourself with backpropagation and different types of loss functions, like Mean Squared Error (MSE) and Cross-Entropy. Understand various optimization algorithms like Gradient Descent, Stochastic Gradient Descent, RMSprop, and Adam.
Overfitting: Understand the concept of overfitting (where a model performs well on training data but poorly on unseen data) and learn various regularization techniques (dropout, L1/L2 regularization, early stopping, data augmentation) to prevent it.
Implement a Multilayer Perceptron (MLP): Build an MLP, also known as a fully connected network, using PyTorch.

📚 Resources:

3Blue1Brown - But what is a Neural Network?: This video gives an intuitive explanation of neural networks and their inner workings.
freeCodeCamp - Deep Learning Crash Course: This video efficiently introduces all the most important concepts in deep learning.
Fast.ai - Practical Deep Learning: Free course designed for people with coding experience who want to learn about deep learning.
Patrick Loeber - PyTorch Tutorials: Series of videos for complete beginners to learn about PyTorch.

4. Natural Language Processing (NLP)

NLP is a fascinating branch of artificial intelligence that bridges the gap between human language and machine understanding. From simple text processing to understanding linguistic nuances, NLP plays a crucial role in many applications like translation, sentiment analysis, chatbots, and much more.

Text Preprocessing: Learn various text preprocessing steps like tokenization (splitting text into words or sentences), stemming (reducing words to their root form), lemmatization (similar to stemming but considers the context), stop word removal, etc.
Feature Extraction Techniques: Become familiar with techniques to convert text data into a format that can be understood by machine learning algorithms. Key methods include Bag-of-words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and n-grams.
Word Embeddings: Word embeddings are a type of word representation that allows words with similar meanings to have similar representations. Key methods include Word2Vec, GloVe, and FastText.
Recurrent Neural Networks (RNNs): Understand the working of RNNs, a type of neural network designed to work with sequence data. Explore LSTMs and GRUs, two RNN variants that are capable of learning long-term dependencies.

📚 Resources:

RealPython - NLP with spaCy in Python: Exhaustive guide about the spaCy library for NLP tasks in Python.
Kaggle - NLP Guide: A few notebooks and resources for a hands-on explanation of NLP in Python.
Jay Alammar - The Illustration Word2Vec: A good reference to understand the famous Word2Vec architecture.
Jake Tae - PyTorch RNN from Scratch: Practical and simple implementation of RNN, LSTM, and GRU models in PyTorch.
colah's blog - Understanding LSTM Networks: A more theoretical article about the LSTM network.

🧑‍🔬 The LLM Scientist

This section of the course focuses on learning how to build the best possible LLMs using the latest techniques.

1. The LLM architecture

While an in-depth knowledge about the Transformer architecture is not required, it is important to have a good understanding of its inputs (tokens) and outputs (logits). The vanilla attention mechanism is another crucial component to master, as improved versions of it are introduced later on.

High-level view: Revisit the encoder-decoder Transformer architecture, and more specifically the decoder-only GPT architecture, which is used in every modern LLM.
Tokenization: Understand how to convert raw text data into a format that the model can understand, which involves splitting the text into tokens (usually words or subwords).
Attention mechanisms: Grasp the theory behind attention mechanisms, including self-attention and scaled dot-product attention, which allows the model to focus on different parts of the input when producing an output.
Text generation: Learn about the different ways the model can generate output sequences. Common strategies include greedy decoding, beam search, top-k sampling, and nucleus sampling.

📚 References:

The Illustrated Transformer by Jay Alammar: A visual and intuitive explanation of the Transformer model.
The Illustrated GPT-2 by Jay Alammar: Even more important than the previous article, it is focused on the GPT architecture, which is very similar to Llama's.
Visual intro to Transformers by 3Blue1Brown: Simple easy to understand visual intro to Transformers
LLM Visualization by Brendan Bycroft: Incredible 3D visualization of what happens inside of an LLM.
nanoGPT by Andrej Karpathy: A 2h-long YouTube video to reimplement GPT from scratch (for programmers).
Attention? Attention! by Lilian Weng: Introduce the need for attention in a more formal way.
Decoding Strategies in LLMs: Provide code and a visual introduction to the different decoding strategies to generate text.

2. Building an instruction dataset

While it's easy to find raw data from Wikipedia and other websites, it's difficult to collect pairs of instructions and answers in the wild. Like in traditional machine learning, the quality of the dataset will directly influence the quality of the model, which is why it might be the most important component in the fine-tuning process.

Alpaca-like dataset: Generate synthetic data from scratch with the OpenAI API (GPT). You can specify seeds and system prompts to create a diverse dataset.
Advanced techniques: Learn how to improve existing datasets with Evol-Instruct, how to generate high-quality synthetic data like in the Orca and phi-1 papers.
Filtering data: Traditional techniques involving regex, removing near-duplicates, focusing on answers with a high number of tokens, etc.
Prompt templates: There's no true standard way of formatting instructions and answers, which is why it's important to know about the different chat templates, such as ChatML, Alpaca, etc.

📚 References:

Preparing a Dataset for Instruction tuning by Thomas Capelle: Exploration of the Alpaca and Alpaca-GPT4 datasets and how to format them.
Generating a Clinical Instruction Dataset by Solano Todeschini: Tutorial on how to create a synthetic instruction dataset using GPT-4.
GPT 3.5 for news classification by Kshitiz Sahay: Use GPT 3.5 to create an instruction dataset to fine-tune Llama 2 for news classification.
Dataset creation for fine-tuning LLM: Notebook that contains a few techniques to filter a dataset and upload the result.
Chat Template by Matthew Carrigan: Hugging Face's page about prompt templates

3. Pre-training models

Pre-training is a very long and costly process, which is why this is not the focus of this course. It's good to have some level of understanding of what happens during pre-training, but hands-on experience is not required.

Data pipeline: Pre-training requires huge datasets (e.g., Llama 2 was trained on 2 trillion tokens) that need to be filtered, tokenized, and collated with a pre-defined vocabulary.
Causal language modeling: Learn the difference between causal and masked language modeling, as well as the loss function used in this case. For efficient pre-training, learn more about Megatron-LM or gpt-neox.
Scaling laws: The scaling laws describe the expected model performance based on the model size, dataset size, and the amount of compute used for training.
High-Performance Computing: Out of scope here, but more knowledge about HPC is fundamental if you're planning to create your own LLM from scratch (hardware, distributed workload, etc.).

📚 References:

LLMDataHub by Junhao Zhao: Curated list of datasets for pre-training, fine-tuning, and RLHF.
Training a causal language model from scratch by Hugging Face: Pre-train a GPT-2 model from scratch using the transformers library.
TinyLlama by Zhang et al.: Check this project to get a good understanding of how a Llama model is trained from scratch.
Causal language modeling by Hugging Face: Explain the difference between causal and masked language modeling and how to quickly fine-tune a DistilGPT-2 model.
Chinchilla's wild implications by nostalgebraist: Discuss the scaling laws and explain what they mean to LLMs in general.
BLOOM by BigScience: Notion page that describes how the BLOOM model was built, with a lot of useful information about the engineering part and the problems that were encountered.
OPT-175 Logbook by Meta: Research logs showing what went wrong and what went right. Useful if you're planning to pre-train a very large language model (in this case, 175B parameters).
LLM 360: A framework for open-source LLMs with training and data preparation code, data, metrics, and models.

4. Supervised Fine-Tuning

Pre-trained models are only trained on a next-token prediction task, which is why they're not helpful assistants. SFT allows you to tweak them to respond to instructions. Moreover, it allows you to fine-tune your model on any data (private, not seen by GPT-4, etc.) and use it without having to pay for an API like OpenAI's.

Full fine-tuning: Full fine-tuning refers to training all the parameters in the model. It is not an efficient technique, but it produces slightly better results.
LoRA: A parameter-efficient technique (PEFT) based on low-rank adapters. Instead of training all the parameters, we only train these adapters.
QLoRA: Another PEFT based on LoRA, which also quantizes the weights of the model in 4 bits and introduce paged optimizers to manage memory spikes. Combine it with Unsloth to run it efficiently on a free Colab notebook.
Axolotl: A user-friendly and powerful fine-tuning tool that is used in a lot of state-of-the-art open-source models.
DeepSpeed: Efficient pre-training and fine-tuning of LLMs for multi-GPU and multi-node settings (implemented in Axolotl).

📚 References:

The Novice's LLM Training Guide by Alpin: Overview of the main concepts and parameters to consider when fine-tuning LLMs.
LoRA insights by Sebastian Raschka: Practical insights about LoRA and how to select the best parameters.
Fine-Tune Your Own Llama 2 Model: Hands-on tutorial on how to fine-tune a Llama 2 model using Hugging Face libraries.
Padding Large Language Models by Benjamin Marie: Best practices to pad training examples for causal LLMs
A Beginner's Guide to LLM Fine-Tuning: Tutorial on how to fine-tune a CodeLlama model using Axolotl.

5. Reinforcement Learning from Human Feedback

After supervised fine-tuning, RLHF is a step used to align the LLM's answers with human expectations. The idea is to learn preferences from human (or artificial) feedback, which can be used to reduce biases, censor models, or make them act in a more useful way. It is more complex than SFT and often seen as optional.

Preference datasets: These datasets typically contain several answers with some kind of ranking, which makes them more difficult to produce than instruction datasets.
Proximal Policy Optimization: This algorithm leverages a reward model that predicts whether a given text is highly ranked by humans. This prediction is then used to optimize the SFT model with a penalty based on KL divergence.
Direct Preference Optimization: DPO simplifies the process by reframing it as a classification problem. It uses a reference model instead of a reward model (no training needed) and only requires one hyperparameter, making it more stable and efficient.

📚 References:

An Introduction to Training LLMs using RLHF by Ayush Thakur: Explain why RLHF is desirable to reduce bias and increase performance in LLMs.
Illustration RLHF by Hugging Face: Introduction to RLHF with reward model training and fine-tuning with reinforcement learning.
StackLLaMA by Hugging Face: Tutorial to efficiently align a LLaMA model with RLHF using the transformers library.
LLM Training: RLHF and Its Alternatives by Sebastian Rashcka: Overview of the RLHF process and alternatives like RLAIF.
Fine-tune Mistral-7b with DPO: Tutorial to fine-tune a Mistral-7b model with DPO and reproduce NeuralHermes-2.5.

6. Evaluation

Evaluating LLMs is an undervalued part of the pipeline, which is time-consuming and moderately reliable. Your downstream task should dictate what you want to evaluate, but always remember Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."

Traditional metrics: Metrics like perplexity and BLEU score are not as popular as they were because they're flawed in most contexts. It is still important to understand them and when they can be applied.
General benchmarks: Based on the Language Model Evaluation Harness, the Open LLM Leaderboard is the main benchmark for general-purpose LLMs (like ChatGPT). There are other popular benchmarks like BigBench, MT-Bench, etc.
Task-specific benchmarks: Tasks like summarization, translation, and question answering have dedicated benchmarks, metrics, and even subdomains (medical, financial, etc.), such as PubMedQA for biomedical question answering.
Human evaluation: The most reliable evaluation is the acceptance rate by users or comparisons made by humans. Logging user feedback in addition to the chat traces (e.g., using LangSmith) helps to identify potential areas for improvement.

📚 References:

Perplexity of fixed-length models by Hugging Face: Overview of perplexity with code to implement it with the transformers library.
BLEU at your own risk by Rachael Tatman: Overview of the BLEU score and its many issues with examples.
A Survey on Evaluation of LLMs by Chang et al.: Comprehensive paper about what to evaluate, where to evaluate, and how to evaluate.
Chatbot Arena Leaderboard by lmsys: Elo rating of general-purpose LLMs, based on comparisons made by humans.

7. Quantization

Quantization is the process of converting the weights (and activations) of a model using a lower precision. For example, weights stored using 16 bits can be converted into a 4-bit representation. This technique has become increasingly important to reduce the computational and memory costs associated with LLMs.

Base techniques: Learn the different levels of precision (FP32, FP16, INT8, etc.) and how to perform naïve quantization with absmax and zero-point techniques.
GGUF and llama.cpp: Originally designed to run on CPUs, llama.cpp and the GGUF format have become the most popular tools to run LLMs on consumer-grade hardware.
GPTQ and EXL2: GPTQ and, more specifically, the EXL2 format offer an incredible speed but can only run on GPUs. Models also take a long time to be quantized.
AWQ: This new format is more accurate than GPTQ (lower perplexity) but uses a lot more VRAM and is not necessarily faster.

📚 References:

Introduction to quantization: Overview of quantization, absmax and zero-point quantization, and LLM.int8() with code.
Quantize Llama models with llama.cpp: Tutorial on how to quantize a Llama 2 model using llama.cpp and the GGUF format.
4-bit LLM Quantization with GPTQ: Tutorial on how to quantize an LLM using the GPTQ algorithm with AutoGPTQ.
ExLlamaV2: The Fastest Library to Run LLMs: Guide on how to quantize a Mistral model using the EXL2 format and run it with the ExLlamaV2 library.
Understanding Activation-Aware Weight Quantization by FriendliAI: Overview of the AWQ technique and its benefits.

8. New Trends

Positional embeddings: Learn how LLMs encode positions, especially relative positional encoding schemes like RoPE. Implement YaRN (multiplies the attention matrix by a temperature factor) or ALiBi (attention penalty based on token distance) to extend the context length.
Model merging: Merging trained models has become a popular way of creating performant models without any fine-tuning. The popular mergekit library implements the most popular merging methods, like SLERP, DARE, and TIES.
Mixture of Experts: Mixtral re-popularized the MoE architecture thanks to its excellent performance. In parallel, a type of frankenMoE emerged in the OSS community by merging models like Phixtral, which is a cheaper and performant option.
Multimodal models: These models (like CLIP, Stable Diffusion, or LLaVA) process multiple types of inputs (text, images, audio, etc.) with a unified embedding space, which unlocks powerful applications like text-to-image.

📚 References:

Extending the RoPE by EleutherAI: Article that summarizes the different position-encoding techniques.
Understanding YaRN by Rajat Chawla: Introduction to YaRN.
Merge LLMs with mergekit: Tutorial about model merging using mergekit.
Mixture of Experts Explained by Hugging Face: Exhaustive guide about MoEs and how they work.
Large Multimodal Models by Chip Huyen: Overview of multimodal systems and the recent history of this field.

👷 The LLM Engineer

This section of the course focuses on learning how to build LLM-powered applications that can be used in production, with a focus on augmenting models and deploying them.

1. Running LLMs

Running LLMs can be difficult due to high hardware requirements. Depending on your use case, you might want to simply consume a model through an API (like GPT-4) or run it locally. In any case, additional prompting and guidance techniques can improve and constrain the output for your applications.

LLM APIs: APIs are a convenient way to deploy LLMs. This space is divided between private LLMs (OpenAI, Google, Anthropic, Cohere, etc.) and open-source LLMs (OpenRouter, Hugging Face, Together AI, etc.).
Open-source LLMs: The Hugging Face Hub is a great place to find LLMs. You can directly run some of them in Hugging Face Spaces, or download and run them locally in apps like LM Studio or through the CLI with llama.cpp or Ollama.
Prompt engineering: Common techniques include zero-shot prompting, few-shot prompting, chain of thought, and ReAct. They work better with bigger models, but can be adapted to smaller ones.
Structuring outputs: Many tasks require a structured output, like a strict template or a JSON format. Libraries like LMQL, Outlines, Guidance, etc. can be used to guide the generation and respect a given structure.

📚 References:

Run an LLM locally with LM Studio by Nisha Arya: Short guide on how to use LM Studio.
Prompt engineering guide by DAIR.AI: Exhaustive list of prompt techniques with examples
Outlines - Quickstart: List of guided generation techniques enabled by Outlines.
LMQL - Overview: Introduction to the LMQL language.

2. Building a Vector Storage

Creating a vector storage is the first step to build a Retrieval Augmented Generation (RAG) pipeline. Documents are loaded, split, and relevant chunks are used to produce vector representations (embeddings) that are stored for future use during inference.

Ingesting documents: Document loaders are convenient wrappers that can handle many formats: PDF, JSON, HTML, Markdown, etc. They can also directly retrieve data from some databases and APIs (GitHub, Reddit, Google Drive, etc.).
Splitting documents: Text splitters break down documents into smaller, semantically meaningful chunks. Instead of splitting text after n characters, it's often better to split by header or recursively, with some additional metadata.
Embedding models: Embedding models convert text into vector representations. It allows for a deeper and more nuanced understanding of language, which is essential to perform semantic search.
Vector databases: Vector databases (like Chroma, Pinecone, Milvus, FAISS, Annoy, etc.) are designed to store embedding vectors. They enable efficient retrieval of data that is 'most similar' to a query based on vector similarity.

📚 References:

LangChain - Text splitters: List of different text splitters implemented in LangChain.
Sentence Transformers library: Popular library for embedding models.
MTEB Leaderboard: Leaderboard for embedding models.
The Top 5 Vector Databases by Moez Ali: A comparison of the best and most popular vector databases.

3. Retrieval Augmented Generation

With RAG, LLMs retrieves contextual documents from a database to improve the accuracy of their answers. RAG is a popular way of augmenting the model's knowledge without any fine-tuning.

Orchestrators: Orchestrators (like LangChain, LlamaIndex, FastRAG, etc.) are popular frameworks to connect your LLMs with tools, databases, memories, etc. and augment their abilities.
Retrievers: User instructions are not optimized for retrieval. Different techniques (e.g., multi-query retriever, HyDE, etc.) can be applied to rephrase/expand them and improve performance.
Memory: To remember previous instructions and answers, LLMs and chatbots like ChatGPT add this history to their context window. This buffer can be improved with summarization (e.g., using a smaller LLM), a vector store + RAG, etc.
Evaluation: We need to evaluate both the document retrieval (context precision and recall) and generation stages (faithfulness and answer relevancy). It can be simplified with tools Ragas and DeepEval.

📚 References:

Llamaindex - High-level concepts: Main concepts to know when building RAG pipelines.
Pinecone - Retrieval Augmentation: Overview of the retrieval augmentation process.
LangChain - Q&A with RAG: Step-by-step tutorial to build a typical RAG pipeline.
LangChain - Memory types: List of different types of memories with relevant usage.
RAG pipeline - Metrics: Overview of the main metrics used to evaluate RAG pipelines.

4. Advanced RAG

Real-life applications can require complex pipelines, including SQL or graph databases, as well as automatically selecting relevant tools and APIs. These advanced techniques can improve a baseline solution and provide additional features.

Query construction: Structured data stored in traditional databases requires a specific query language like SQL, Cypher, metadata, etc. We can directly translate the user instruction into a query to access the data with query construction.
Agents and tools: Agents augment LLMs by automatically selecting the most relevant tools to provide an answer. These tools can be as simple as using Google or Wikipedia, or more complex like a Python interpreter or Jira.
Post-processing: Final step that processes the inputs that are fed to the LLM. It enhances the relevance and diversity of documents retrieved with re-ranking, RAG-fusion, and classification.
Program LLMs: Frameworks like DSPy allow you to optimize prompts and weights based on automated evaluations in a programmatic way.

📚 References:

LangChain - Query Construction: Blog post about different types of query construction.
LangChain - SQL: Tutorial on how to interact with SQL databases with LLMs, involving Text-to-SQL and an optional SQL agent.
Pinecone - LLM agents: Introduction to agents and tools with different types.
LLM Powered Autonomous Agents by Lilian Weng: More theoretical article about LLM agents.
LangChain - OpenAI's RAG: Overview of the RAG strategies employed by OpenAI, including post-processing.
DSPy in 8 Steps: General-purpose guide to DSPy introducing modules, signatures, and optimizers.

5. Inference optimization

Text generation is a costly process that requires expensive hardware. In addition to quantization, various techniques have been proposed to maximize throughput and reduce inference costs.

Flash Attention: Optimization of the attention mechanism to transform its complexity from quadratic to linear, speeding up both training and inference.
Key-value cache: Understand the key-value cache and the improvements introduced in Multi-Query Attention (MQA) and Grouped-Query Attention (GQA).
Speculative decoding: Use a small model to produce drafts that are then reviewed by a larger model to speed up text generation.

📚 References:

GPU Inference by Hugging Face: Explain how to optimize inference on GPUs.
LLM Inference by Databricks: Best practices for how to optimize LLM inference in production.
Optimizing LLMs for Speed and Memory by Hugging Face: Explain three main techniques to optimize speed and memory, namely quantization, Flash Attention, and architectural innovations.
Assisted Generation by Hugging Face: HF's version of speculative decoding, it's an interesting blog post about how it works with code to implement it.

6. Deploying LLMs

Deploying LLMs at scale is an engineering feat that can require multiple clusters of GPUs. In other scenarios, demos and local apps can be achieved with a much lower complexity.

Local deployment: Privacy is an important advantage that open-source LLMs have over private ones. Local LLM servers (LM Studio, Ollama, oobabooga, kobold.cpp, etc.) capitalize on this advantage to power local apps.
Demo deployment: Frameworks like Gradio and Streamlit are helpful to prototype applications and share demos. You can also easily host them online, for example using Hugging Face Spaces.
Server deployment: Deploy LLMs at scale requires cloud (see also SkyPilot) or on-prem infrastructure and often leverage optimized text generation frameworks like TGI, vLLM, etc.
Edge deployment: In constrained environments, high-performance frameworks like MLC LLM and mnn-llm can deploy LLM in web browsers, Android, and iOS.

📚 References:

Streamlit - Build a basic LLM app: Tutorial to make a basic ChatGPT-like app using Streamlit.
HF LLM Inference Container: Deploy LLMs on Amazon SageMaker using Hugging Face's inference container.
Philschmid blog by Philipp Schmid: Collection of high-quality articles about LLM deployment using Amazon SageMaker.
Optimizing latence by Hamel Husain: Comparison of TGI, vLLM, CTranslate2, and mlc in terms of throughput and latency.

7. Securing LLMs

In addition to traditional security problems associated with software, LLMs have unique weaknesses due to the way they are trained and prompted.

Prompt hacking: Different techniques related to prompt engineering, including prompt injection (additional instruction to hijack the model's answer), data/prompt leaking (retrieve its original data/prompt), and jailbreaking (craft prompts to bypass safety features).
Backdoors: Attack vectors can target the training data itself, by poisoning the training data (e.g., with false information) or creating backdoors (secret triggers to change the model's behavior during inference).
Defensive measures: The best way to protect your LLM applications is to test them against these vulnerabilities (e.g., using red teaming and checks like garak) and observe them in production (with a framework like langfuse).

📚 References:

OWASP LLM Top 10 by HEGO Wiki: List of the 10 most critic vulnerabilities seen in LLM applications.
Prompt Injection Primer by Joseph Thacker: Short guide dedicated to prompt injection for engineers.
LLM Security by @llm_sec: Extensive list of resources related to LLM security.
Red teaming LLMs by Microsoft: Guide on how to perform red teaming with LLMs.

Acknowledgements

This roadmap was inspired by the excellent DevOps Roadmap from Milan Milanović and Romano Roth.

Special thanks to:

Thomas Thelen for motivating me to create a roadmap
André Frade for his input and review of the first draft
Dino Dunn for providing resources about LLM security
Magdalena Kuhn for improving the "human evaluation" part
Odoverdose for suggesting 3Blue1Brown's video about Transformers

Disclaimer: I am not affiliated with any sources listed here.

llm-course's People

Contributors

Stargazers

Watchers

Forkers

bestcourses-ai alinandrei74 elfxxx hellojio tendaishoko care4truth gkakavel nathalierocelle allthingsllm okeefe4 harshul-24 son1128 deveshdutt2710 jeffara jackylens wendi-code seshakiran aaditya29 mohit-choithwani coinhubx apollohuang1 drgonzalomora spankyed ngthanhtin aqdesk yolantele fayezalhussein deltavml neeland nikolausn blackhawkee marloncepeda nguyenvanson1998 auyez-khassenov techthiyanes kustomzone krzysiaczek99 afirez manu87ds stephanrempel zjc17 kwasganguly jrhumberto f901107 songkq gemelgb patrickcnkm divein2learning abdoiiii abhinav23484 cmokale balakreshnan gulsmyigit elpolini manojpatilmoutgrowdigital lian-ai ale9806 charlie-xiaoqi finalyearplacementbc muhammadsajid1997 chengkai-huang webwahab rogetxtap aganiezgoda d-t-n mohan-chinnappan-n ebridge-llm 0x11c11e grv805 schosmiel maxreiss123 revanks chenyh19 owami onejune2018 enesbol ixrst riiduan lrochetta ymg2007 cesarcalvocobo samee99 jaislp3 openmindx syq23719034 hurricanejin akashad98 laper01 ukaserge kumar045 hal22 anil2k nirvanesque sshuster look4pritam chenzongxiong asheraz6019 allenkaichen jerryyu ritesh1991

llm-course's Issues

Issue with pad_token == eos_token : model not "learning when to stop"

Hey @mlabonne thanks a lot for the great resources!

I have been reading the Fine_tune_Llama_2_in_Google_Colab.ipynb notebook and I am encountering an issue.

Just to play around I have tried adapting your notebook to fine-tune a model to perform PII masking using this dataset (to do it very quickly I adapted the format such that examples look like this: <s>[INST] Mise à jour : l'heure de début de la thérapie physique a été modifiée à 8:46 AM. Lieu : Suite 348 Iva Junctions. Veuillez nous excuser pour le désagrément. [/INST] Mise à jour : l'heure de début de la thérapie physique a été modifiée à [TIME_1]. Lieu : [SECONDARYADDRESS_1] [STREET_1]. Veuillez nous excuser pour le désagrément. </s>).

After fine-tuning the model I noticed that it was continuously generating text, effectively never producing the EOS_TOKEN and thus only stopping at the max sequence length.

By looking online it seems that this might be related to the default DataCollatorForLanguageModeling (which gets passed to the SFTTrainer class by default).During training with that collator I think that the PAD tokens are getting masked out and excluded from the loss computation, thus leading the model not to "learn when to stop", and I see that you have added the PAD token to be the same as the EOS token with the following lines:

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

Do you know if this might actually be the issue here / do you have an idea for a fix? I tried to comment out the line where you set the 2 tokens to be the same, but in that case my model trains for a while and then the loss suddendly drops to 0 so something must be wrong!

Error when trying to quantize GGUF.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas-stubs 2.0.3.230814 requires numpy>=1.25.0; python_version >= "3.9", but you have numpy 1.24.4 which is incompatible.
tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 4.25.3 which is incompatible.
torchaudio 2.2.1+cu121 requires torch==2.2.1, but you have torch 2.1.2 which is incompatible.
torchtext 0.17.1 requires torch==2.2.1, but you have torch 2.1.2 which is incompatible.
torchvision 0.17.1+cu121 requires torch==2.2.1, but you have torch 2.1.2 which is incompatible.
Successfully installed einops-0.7.0 gguf-0.6.0 numpy-1.24.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.1.105 protobuf-4.25.3 torch-2.1.2 triton-2.1.0
WARNING: The following packages were previously imported in this runtime:
[numpy]
You must restart the runtime in order to use newly installed versions.

Collaboration: Unsloth + llm-course

Hey @mlabonne! Actually found this repo via Linkedin! :) Happy New Year!

Had a look through your notebooks - they look sick! Interestingly I was trying myself to run axolotl via Google Colab to no avail.

Anyways I'm the maintainer of Unsloth, which makes QLoRA 2.2x faster and use 62% less memory! It would be awesome if we could somehow collaborate :)

I have a few examples:

Mistral 7b + Alpaca: https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing
DPO Zephyr replication: https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing
TinyLlama automatic RoPE Scaling from 2048 to 4096 tokens + full Alpaca dataset in 80 minutes. https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing (still running since TinyLlama was just released!)

Anyways great work again!

Security

Pls help, stuck with AutoGGUF

I tried to make ggufs of different models (one that was already available and one which I made using the lazymergekit).

I always get the same error how ever. It's this one (I edited the model name out but it happens with both I tested. They are Mistral 7b based ones):

GML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
main: build = 2151 (704359e2)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'ModelName/modelname.fp16.bin' to 'ModelName/modelname.Q4_K_S.gguf' as Q4_K_S
llama_model_quantize: failed to quantize: failed to open ModelName/modelname.fp16.bin: No such file or directory
main: failed to quantize model from 'ModelName/modelname.fp16.bin'

Also, before that error, I get another error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires fastapi, which is not installed.
lida 0.0.10 requires kaleido, which is not installed.
lida 0.0.10 requires python-multipart, which is not installed.
lida 0.0.10 requires uvicorn, which is not installed.
tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 4.25.2 which is incompatible.
torchaudio 2.1.0+cu121 requires torch==2.1.0, but you have torch 2.1.2 which is incompatible.
torchdata 0.7.0 requires torch==2.1.0, but you have torch 2.1.2 which is incompatible.
torchtext 0.16.0 requires torch==2.1.0, but you have torch 2.1.2 which is incompatible.
torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 2.1.2 which is incompatible.
Successfully installed gguf-0.6.0 numpy-1.24.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.3.101 nvidia-nvtx-cu12-12.1.105 protobuf-4.25.2 torch-2.1.2

WARNING: The following packages were previously imported in this runtime:
  [numpy]
You must restart the runtime in order to use newly installed versions.

Is there any solution? I would like to try the model I merged locally, I was even able to evaluate it in the leaderboard but I can't turn it into GGUF.
Also is there a dedicated GitHub page for that notebook?

Mikor es hogyan

Egyes itt jelenlevők tudnak segiteni? Uj vagyok

moe version update? and llama pro?

hi please tell me the approach to solve this problem

You have to solve a multi-label classification problem statement.
It contains two files: train.csv and test.csv.
The dataset contains the following columns:
- LossDescription: Description of Event
- ResultingInjuryDesc: Injury Description
- PartInjuredDesc: Body Part Injured Description
- Cause - Hierarchy 1: Cause Hierarchy 1
- Body Part - Hierarchy 1: Body Part Hierarchy 1
- Index: Identifier
Tasks:
- Perform exploratory data analysis (EDA) on the dataset.
- Train multi-label classification models to predict "Cause - Hierarchy 1" and "Body Part - Hierarchy 1" when other columns are given.
  Two models will be required to predict each target variable.

Dependency Map and minimum path for each category

This repo is stunning! Kudos for the creators and maintainers, foremost!

I want to contribute with a suggestion.

For each "path" make a visual guidance categorizing with colors or other approach the minimal path.

Also, I want to know if in this course as it is organized it's possible to start with LLM Engineer path without LLM Fundamentals and LLM Fundamentals Path as many audiences is developers without math and data sciences skills just want to create applications with API, Vectors DBs all boundaries tools and technics to use LLMs models but not get deep on them or how LLMs models works under the root.

Best roadmap and must follow Repo for this decade for everyone that's needs to acquire knowledge in this field. Or learning at most how to use AI and LLMs or be unemployed in near future. Sad but true.

Add resources about training and finetuning for MOE models

Issue after finetuning

Hi I have finetuned my custom dataset but finding diificulties to load the model during inference. can u help me regarding that

Llm

New colab : Fine-tune LLMs with Axolotl End-to-end guide to the state-of-the-art tool for fine-tuning

Hi, I've uploaded colab that follows your article.

The Merge operation is missing, because I didn't know if you were interested.

Downloads.zip

Cannot quantize after fine tuning on colab

Getting this error after on quantizing after fine tuning with the instructions for colab.

FileNotFoundError: Could not find tokenizer.model in llama-2-7b-meditext or its parent; if it's in another directory, pass the directory as --vocab-dir
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5
main: build = 1267 (bc9d3e3)
main: built with cc (Ubuntu 11.4.0-1ubuntu122.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'llama-2-7b-meditext/llama-2-7b-meditext.gguf.fp16.bin' to 'llama-2-7b-meditext/llama-2-7b-meditext.gguf.q4_k_m.bin' as Q4_K_M
llama_model_quantize: failed to quantize: failed to open llama-2-7b-meditext/llama-2-7b-meditext.gguf.fp16.bin: No such file or directory
main: failed to quantize model from 'llama-2-7b-meditext/llama-2-7b-meditext.gguf.fp16.bin'
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5
main: build = 1267 (bc9d3e3)
main: built with cc (Ubuntu 11.4.0-1ubuntu122.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'llama-2-7b-meditext/llama-2-7b-meditext.gguf.fp16.bin' to 'llama-2-7b-meditext/llama-2-7b-meditext.gguf.q5_k_m.bin' as Q5_K_M
llama_model_quantize: failed to quantize: failed to open llama-2-7b-meditext/llama-2-7b-meditext.gguf.fp16.bin: No such file or directory
main: failed to quantize model from 'llama-2-7b-meditext/llama-2-7b-meditext.gguf.fp16.bin'

DPO with Axolotl

It is possible to perform DPO with Axolotl. If I were to create a notebook for DPO fine-tuning, do you think it would be suitable for your repository?

cuda out of memory

VRAM doesn't clear even when I run cell [7].

INST problems in mistral 7b DPO script

See discussion here: https://huggingface.co/CultriX/NeuralTrix-7B-dpo/discussions/1

Turkish Version..

Hi! Is it okay for you, if we try to do similar one for Turkish users e.g referencing this repo and using the sources as well?
Best Regards,
Zaur Samedov!

`ref_model` not needed in `Fine_tune_a_Mistral_7b_model_with_DPO.ipynb`

Hi here @mlabonne! Congratulations on your awesome work with this course 🤝🏻

After going through Fine_tune_a_Mistral_7b_model_with_DPO.ipynb I realised that there's no need to define the ref_model required by DPO, since when fine-tuning using LoRA, the reference model is not required, as the one without the adapters will be used to compute the logprobs, so you can remove the ref_model and the result will still be the same, but using even less resources.

Finally, as a tip, when using the DPOTrainer for full fine-tunes you can also specify precompute_ref_log_probs to compute those in advance before the actual fine-tune starts, so that the ref_model is not needed either.

中文版整理到这里了

将相关论文和博客的网址整理到一起，按照不同类别做了分组。

大语言模型学习路线图和关键技术

not able to quantize after fine tuning

i am not able to quantize and getting this error
FileNotFoundError: Could not find tokenizer.model in llama-2-7b-meditext or its parent; if it's in another directory, pass the directory as --vocab-dir
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5
what do i do?

error in fine tune LLM using axolotl

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/content/axolotl/src/axolotl/cli/train.py", line 59, in
fire.Fire(do_cli)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(varargs, **kwargs)
File "/content/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
return do_train(parsed_cfg, parsed_cli_args)
File "/content/axolotl/src/axolotl/cli/train.py", line 55, in do_train
return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
File "/content/axolotl/src/axolotl/train.py", line 104, in train
trainer = setup_trainer(
File "/content/axolotl/src/axolotl/utils/trainer.py", line 338, in setup_trainer
return trainer_builder.build(total_num_steps)
File "/content/axolotl/src/axolotl/core/trainer_builder.py", line 1245, in build
trainer = trainer_cls(
File "/content/axolotl/src/axolotl/core/trainer_builder.py", line 223, in init
super().init(_args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 539, in init
self.callback_handler = CallbackHandler(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer_callback.py", line 313, in init
self.add_callback(cb)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer_callback.py", line 330, in add_callback
cb = callback() if isinstance(callback, type) else callback
File "/usr/local/lib/python3.10/dist-packages/transformers/integrations/integration_utils.py", line 954, in init
raise RuntimeError("MLflowCallback requires mlflow to be installed. Run pip install mlflow.")
RuntimeError: MLflowCallback requires mlflow to be installed. Run pip install mlflow.
Exception ignored in: <function MLflowCallback.del at 0x7d8a76cbf400>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/transformers/integrations/integration_utils.py", line 1105, in del
self._auto_end_run
AttributeError: 'MLflowCallback' object has no attribute '_auto_end_run'
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1057, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 673, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'axolotl.cli.train', 'config.yaml']' returned non-zero exit status 1.`

plz delete this issue, sorry to bother

any reason why the finetuning llama notebook is running only on colab?

i tried running the same notebook on gcp A100 machine, and it failed on :

`File ~/.local/lib/python3.9/site-packages/transformers/utils/bitsandbytes.py:109, in set_module_quantized_tensor_to_device(module, tensor_name, device, value, fp16_statistics)
107 new_value = old_value.to(device)
108 elif isinstance(value, torch.Tensor):
--> 109 new_value = value.to(device)
110 else:
111 new_value = torch.tensor(value, device=device)

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.`

on colab it work perfectly.
any idea ?

LLM Course

All fine-tuned models should be available for inference with HF TGI

model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

All fine-tuned models should be available for inference with HF TGI
However showed NotSupportedError: Model fine-tuned mode is not available for inference with this client.
Is there any way to cope with?

RuntimeError: Expected to mark a variable ready only once... error while finetuning Llama 2

I am following along with the "Fine-tune Llama 2 in Google Colab" example notebook in Databricks, but I am receiving this error when I attempt to fine tune the model:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 127 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

And here is the final block of the traceback:

File /databricks/python/lib/python3.10/site-packages/torch/autograd/__init__.py:200, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    195     retain_graph = create_graph
    197 # The reason we repeat same the comment below is that
    198 # some Python versions print out the first line of a multi-line function
    199 # calls in the traceback and some print out the last line
--> 200 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    201     tensors, grad_tensors_, retain_graph, create_graph, inputs,
    202     allow_unreachable=True, accumulate_grad=True)

I have tried turning off gradient checkpointing but I received the same error. I am using a g4dn.4xl cluster. Is the problem due to my verion of torch? or cuda? I'm not sure how to set the environment variable, but from what I've seen online it's not very helpful when dealing with these higher level libraries (peft, transformers). Some solutions mention fiddling with find_unused_parameters and _set_static_graph(), but I believe that is on the pytorch level of things, and not a changeable parameter in the code as it stands.

Please specify a license

Hi, great articles, great Colabs, thanks!

My request: please specify a license for the repository so I would know if there are any limitations on the use of this code.

Cheers!

LazyMergeKit ERROR

mergekit-moe: command not found

mergekit-moe config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle --trust-remote-code
/bin/bash: line 1: mergekit-moe: command not found

Great work. Can I translate in Tamil.

Its Great work, and I am requesting your permission to translate in Tamil.

consider adding content to learn about agents

I think agents are also becoming a critical components of what an LLM engineer has to implement, consider adding some contents to this excellent resource covering that.

lazyaxolotl runpod not running

... because seemingly template not found anymore.
you can use image_name="winglian/axolotl-runpod:main-latest",
without #template_id="eul6o46pab",
but then get in the container: ... ServerApp] Bad config encountered during initialization: /workspace is outside root contents directory
currently have no time to look into this further

Could save the fine-tune model as saffetensors?

Hi!
Could save the fine-tune model as saffetensors?
Thanks

Error in mergeKit

File "/usr/local/bin/mergekit-moe", line 8, in
sys.exit(main())

File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in_call_
return self.main(*args, **kwargs)

File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)

File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)

File r/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return callback(*args, **kwargs)

File "/content/mergekit/mergekit/options.py", line 76, in wrapper
f(*args, **kwargs)

File "/content/mergekit/mergekit/scripts/mixtral_moe.py", line 452, in main
config = MistralMOEConfig.model_validate(yaml.safe_load(config_source))

File
usr/local/lib/python3.10/dist-packages/pydantic/main.py", line 503, in model_validate
return cls. pydantic_validator.validate_python(

pydantic_core._pydantic_core.ValidationError: 1 validation error for MistralMOEConfig
experts

Field required [type=missing, input_value={'slices': [{'sources': [...}]}, 'dtype': 'float16'}, input_type=dict] information visit https://errors.pydantic.dev/2.5/v/missing

Kernel is dying on Fine-tune Llama 2

Libraries & Versions:
Package Version

absl-py 1.4.0
accelerate 0.21.0
aiohttp 3.8.5
aiosignal 1.3.1
anyio 3.7.1
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.2.1
astunparse 1.6.3
async-lru 2.0.4
async-timeout 4.0.2
attrs 23.1.0
Babel 2.12.1
backcall 0.2.0
beautifulsoup4 4.12.2
bitsandbytes 0.40.2
bleach 6.0.0
cachetools 5.3.1
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 3.2.0
cmake 3.27.1
comm 0.1.4
datasets 2.14.3
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.7
exceptiongroup 1.1.2
executing 1.2.0
fastjsonschema 2.18.0
filelock 3.12.2
flatbuffers 23.5.26
frozenlist 1.4.0
fsspec 2023.6.0
gast 0.4.0
google-auth 2.22.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
grpcio 1.56.2
h5py 3.9.0
huggingface-hub 0.16.4
idna 3.4
importlib-metadata 6.8.0
importlib-resources 6.0.1
ipykernel 6.25.0
ipython 8.12.2
ipython-genutils 0.2.0
ipywidgets 8.1.0
jedi 0.19.0
Jinja2 3.1.2
json5 0.9.14
jsonschema 4.18.6
jsonschema-specifications 2023.7.1
jupyter 1.0.0
jupyter-client 8.3.0
jupyter-console 6.6.3
jupyter-core 5.3.1
jupyter-events 0.7.0
jupyter-lsp 2.2.0
jupyter-server 2.7.0
jupyter-server-terminals 0.4.4
jupyterlab 4.0.4
jupyterlab-pygments 0.2.2
jupyterlab-server 2.24.0
jupyterlab-widgets 3.0.8
keras 2.10.0
Keras-Preprocessing 1.1.2
libclang 16.0.6
lit 16.0.6
Markdown 3.4.4
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mistune 3.0.1
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.15
nbclient 0.8.0
nbconvert 7.7.3
nbformat 5.9.2
nest-asyncio 1.5.7
networkx 3.1
notebook 7.0.2
notebook-shim 0.2.3
numpy 1.24.3
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.2
opt-einsum 3.3.0
overrides 7.4.0
packaging 23.1
pandas 2.0.3
pandocfilters 1.5.0
parso 0.8.3
peft 0.4.0
pexpect 4.8.0
pickleshare 0.7.5
pip 20.0.2
pip-autoremove 0.10.0
pkg-resources 0.0.0
pkgutil-resolve-name 1.3.10
platformdirs 3.10.0
prometheus-client 0.17.1
prompt-toolkit 3.0.39
protobuf 3.19.6
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 12.0.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycparser 2.21
Pygments 2.16.1
pyspark 3.4.1
python-dateutil 2.8.2
python-json-logger 2.0.7
python-version 0.0.2
pytz 2023.3
PyYAML 6.0.1
pyzmq 25.1.0
qtconsole 5.4.3
QtPy 2.3.1
referencing 0.30.2
regex 2023.6.3
requests 2.31.0
requests-oauthlib 1.3.1
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rpds-py 0.9.2
rsa 4.9
safetensors 0.3.1
scipy 1.10.1
Send2Trash 1.8.2
setuptools 44.0.0
six 1.16.0
sniffio 1.3.0
soupsieve 2.4.1
stack-data 0.6.2
sympy 1.12
tensorboard 2.10.1
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.6.2
tensorflow-estimator 2.10.0
tensorflow-io-gcs-filesystem 0.33.0
termcolor 2.3.0
terminado 0.17.1
tinycss2 1.2.1
tokenizers 0.13.3
tomli 2.0.1
torch 2.0.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.31.0
triton 2.0.0
trl 0.4.7
typing-extensions 4.5.0
tzdata 2023.3
urllib3 1.26.16
wcwidth 0.2.6
webencodings 0.5.1
websocket-client 1.6.1
Werkzeug 2.3.6
wheel 0.34.2
widgetsnbextension 4.0.8
wrapt 1.15.0
xxhash 3.3.0
yarl 1.9.2
zipp 3.16.2

Script:
`import os
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
HfArgumentParser,
TrainingArguments,
pipeline,
logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

The model that you want to train from the Hugging Face hub

model_name = "NousResearch/Llama-2-7b-chat-hf"

The instruction dataset to use

dataset_name = "mlabonne/guanaco-llama2-1k"

Fine-tuned model name

new_model = "llama-2-7b-miniguanaco"

################################################################################

QLoRA parameters

################################################################################

LoRA attention dimension

lora_r = 64

Alpha parameter for LoRA scaling

lora_alpha = 16

Dropout probability for LoRA layers

lora_dropout = 0.1

################################################################################

bitsandbytes parameters

################################################################################

Activate 4-bit precision base model loading

use_4bit = True

Compute dtype for 4-bit base models

bnb_4bit_compute_dtype = "float16"

Quantization type (fp4 or nf4)

bnb_4bit_quant_type = "nf4"

Activate nested quantization for 4-bit base models (double quantization)

use_nested_quant = False

################################################################################

TrainingArguments parameters

################################################################################

Output directory where the model predictions and checkpoints will be stored

output_dir = "./results"

Number of training epochs

num_train_epochs = 1

Enable fp16/bf16 training (set bf16 to True with an A100)

fp16 = False
bf16 = False

Batch size per GPU for training

per_device_train_batch_size = 4

Batch size per GPU for evaluation

per_device_eval_batch_size = 4

Number of update steps to accumulate the gradients for

gradient_accumulation_steps = 1

Enable gradient checkpointing

gradient_checkpointing = True

Maximum gradient normal (gradient clipping)

max_grad_norm = 0.3

Initial learning rate (AdamW optimizer)

learning_rate = 2e-4

Weight decay to apply to all layers except bias/LayerNorm weights

weight_decay = 0.001

Optimizer to use

optim = "paged_adamw_32bit"

Learning rate schedule

lr_scheduler_type = "cosine"

Number of training steps (overrides num_train_epochs)

max_steps = -1

Ratio of steps for a linear warmup (from 0 to learning rate)

warmup_ratio = 0.03

Group sequences into batches with same length

Saves memory and speeds up training considerably

group_by_length = True

Save checkpoint every X updates steps

save_steps = 0

Log every X updates steps

logging_steps = 25

################################################################################

SFT parameters

################################################################################

Maximum sequence length to use

max_seq_length = None

Pack multiple short examples in the same input sequence to increase efficiency

packing = False

Load the entire model on the GPU 0

device_map = {"": 0}

Load dataset (you can process it here)

dataset = load_dataset(dataset_name, split="train")

Load tokenizer and model with QLoRA configuration

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
load_in_4bit=use_4bit,
bnb_4bit_quant_type=bnb_4bit_quant_type,
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=use_nested_quant,
)

Check GPU compatibility with bfloat16

if compute_dtype == torch.float16 and use_4bit:
major, _ = torch.cuda.get_device_capability()
if major >= 8:
print("=" * 80)
print("Your GPU supports bfloat16: accelerate training with bf16=True")
print("=" * 80)

Load base model

model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

Load LLaMA tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

Load LoRA configuration

peft_config = LoraConfig(
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
r=lora_r,
bias="none",
task_type="CAUSAL_LM",
)

Set training parameters

training_arguments = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=per_device_train_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
optim=optim,
save_steps=save_steps,
logging_steps=logging_steps,
learning_rate=learning_rate,
weight_decay=weight_decay,
fp16=fp16,
bf16=bf16,
max_grad_norm=max_grad_norm,
max_steps=max_steps,
warmup_ratio=warmup_ratio,
group_by_length=group_by_length,
lr_scheduler_type=lr_scheduler_type,
report_to="tensorboard"
)

Set supervised fine-tuning parameters

trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
dataset_text_field="text",
max_seq_length=max_seq_length,
tokenizer=tokenizer,
args=training_arguments,
packing=packing,
)

Train model

trainer.train()

Save trained model

trainer.model.save_pretrained(new_model)`

Error: At: trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Error operation not supported at line 351 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

A UTF-8 locale is required. Got ANSI_X3.4-1968

Cannot log into hugging face after fine tuning the model.

Can I translate it into chinese?

Hi Maxime
Thank you very much for writing such a tutorial.

Your tutorial is the most outstanding one I have seen, with comprehensive coverage and very complete explanations and experiments. May I translate it into Chinese?

My Favorite Course

RuntimeWarning in Fine-tune Llama 3 with ORPO.ipynb

Could you please explain the runtime warning in cell 3 in Fine-tune Llama 3 with ORPO.ipynb:

/usr/local/lib/python3.10/dist-packages/multiprocess/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock. self.pid = os.fork()

Is JAX being somewhere in the notebook? I'm afraid I cannot see why this warning occurs.

Add example on fine-tuning for function calling

Hey @mlabonne

Great work on this repo!

Would be amazing if you would want to add an example of fine-tuning for function calling. This would gain a lot of traction due to devs wanting OSS alternatives to GPT.

Prompt is getting repeated in response

I tried to retrain Llama-2 model. I just followed the steps you have mentioned. But when I am generating the text with the following code snippet -

prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

I am getting a weird response as below -

[INST] What is a large language model? [/INST]
[INST] What is a large language model? [/INST]
[INST] What is a large language model? [INST]
[INST] What is a large language model? [INST]
[INST] What is a large language model? [INST] [INST] What is a large language model? [/INST]
[INST] What is a large language model? [INST] [INST] What is a large language model? [/INST] [INST] What is a large language model? [INST] [INST] What is a large language model? [/INST] [INST] What is a large language model? [INST] [INST] What is a large language model? [/INST] [INST] What is a large language model? [INST] [INST] What is a large language model? [/INST] [INST] What is a

What could be the issue?

RAG

Hi sir again! Do you plan to add some content for RAG? If not I'd like to sum some content and push them here.
Best Regards,

Train Custom data set PDF

Hello, Super interesting. I would like to train the model on my own data in pdf format.
How to adapt the code. instead of using # The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k" I would like to replace it with
dataset= doc.pdf but it sucks.
Do you have an idea? THANKS

Mobile deploy of LLM project

Thanks your project~🚀

Perhaps deploying LLM (Language Learning Models) with a size of 1.8 to 2 billion parameters on mobile or edge devices will become the next hotspot. Here is a mention of work in this area.

https://github.com/wangzhaode/mnn-llm

4-bit LLM Quantization with GPTQ Tokenizer stuck

I'm trying to run the 4-bit LLM Quantization with GPTQ notebook with my own fine-tuned Llama2 7b model. However, it is getting stuck at the tokenizer step:

tokenized_data = tokenizer("\n\n".join(data['text']), return_tensors='pt')

I already tried using the tokenizer from the merged fine-tune model as well as the tokenizer from the llama2 repo. However, it still hangs on this step. Would appreciate any help or tips on how to fix this.

i-Quants in AutoQuant?

Would it be possible to support i-Quants in AutoQuant or are they more demanding to quantize?

LazyMergeKit - Tensor model.final_layernorm.weight required but not present in model ...

Hi there I'm trying to merge Phi-2 models using the following config:
`MODEL_NAME = "..."
yaml_config = """
models:

model: microsoft/phi-2
no parameters necessary for base model
model: rhysjones/phi-2-orange
parameters:
density: 0.5
weight: 0.5
model: cognitivecomputations/dolphin-2_6-phi-2
parameters:
density: 0.5
weight: 0.3
merge_method: ties
base_model: microsoft/phi-2
parameters:
normalize: true
dtype: float16

"""`

but i get the following error:
RuntimeError: Tensor model.final_layernorm.weight required but not present in model rhysjones/phi-2-orange

I tried with lxuechen/phi-2-dpo before insead of phi-2-orange but got the same error.

I'm executinhg on Google Collab with CPU Runtime with Remote_Code set to true.

Can someone help and tell me if I'm doing something wrong or if it just oesnt work with Phi?

Here is the full log:
mergekit-yaml config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle --trust-remote-code Warmup loader cache: 0% 0/3 [00:00<?, ?it/s] Fetching 10 files: 100% 10/10 [00:00<00:00, 9925.00it/s] Warmup loader cache: 33% 1/3 [00:00<00:00, 5.18it/s] Fetching 11 files: 100% 11/11 [00:00<00:00, 71977.14it/s] Warmup loader cache: 67% 2/3 [00:00<00:00, 5.58it/s] Fetching 10 files: 100% 10/10 [00:00<00:00, 31583.61it/s] Warmup loader cache: 100% 3/3 [00:00<00:00, 5.69it/s] 0% 1/2720 [00:00<00:02, 1276.42it/s] Traceback (most recent call last): File "/usr/local/bin/mergekit-yaml", line 8, in <module> sys.exit(main()) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "/content/mergekit/mergekit/options.py", line 76, in wrapper f(*args, **kwargs) File "/content/mergekit/mergekit/scripts/run_yaml.py", line 47, in main run_merge( File "/content/mergekit/mergekit/merge.py", line 90, in run_merge for _task, value in exec.run(): File "/content/mergekit/mergekit/graph.py", line 191, in run res = task.execute(**arguments) File "/content/mergekit/mergekit/io/tasks.py", line 73, in execute raise RuntimeError( RuntimeError: Tensor model.final_layernorm.weight required but not present in model rhysjones/phi-2-orange

In LazyMergekit.ipynb file Merged model cannot be uploaded in Huggingface

It is showing that secret does not exist. But I have tried different secret key.

mlabonne / llm-course Goto Github PK

llm-course's Introduction

🗣️ Large Language Model Course

📝 Notebooks

Tools

Fine-tuning

Quantization

Other

🧩 LLM Fundamentals

1. Mathematics for Machine Learning

2. Python for Machine Learning

3. Neural Networks

4. Natural Language Processing (NLP)

🧑‍🔬 The LLM Scientist

1. The LLM architecture

2. Building an instruction dataset

3. Pre-training models

4. Supervised Fine-Tuning

5. Reinforcement Learning from Human Feedback

6. Evaluation

7. Quantization

8. New Trends

👷 The LLM Engineer

1. Running LLMs

2. Building a Vector Storage

3. Retrieval Augmented Generation

4. Advanced RAG

5. Inference optimization

6. Deploying LLMs

7. Securing LLMs

Acknowledgements

llm-course's People

Contributors

Stargazers

Watchers

Forkers

llm-course's Issues

Libraries & Versions: Package Version

The model that you want to train from the Hugging Face hub

The instruction dataset to use

Fine-tuned model name

QLoRA parameters

LoRA attention dimension

Alpha parameter for LoRA scaling

Dropout probability for LoRA layers

bitsandbytes parameters

Activate 4-bit precision base model loading

Compute dtype for 4-bit base models

Quantization type (fp4 or nf4)

Activate nested quantization for 4-bit base models (double quantization)

TrainingArguments parameters

Output directory where the model predictions and checkpoints will be stored

Number of training epochs

Enable fp16/bf16 training (set bf16 to True with an A100)

Batch size per GPU for training

Batch size per GPU for evaluation

Number of update steps to accumulate the gradients for

Enable gradient checkpointing

Maximum gradient normal (gradient clipping)

Initial learning rate (AdamW optimizer)

Weight decay to apply to all layers except bias/LayerNorm weights

Optimizer to use

Learning rate schedule

Number of training steps (overrides num_train_epochs)

Ratio of steps for a linear warmup (from 0 to learning rate)

Group sequences into batches with same length

Saves memory and speeds up training considerably

Save checkpoint every X updates steps

Log every X updates steps

SFT parameters

Maximum sequence length to use

Pack multiple short examples in the same input sequence to increase efficiency

Load the entire model on the GPU 0

Load dataset (you can process it here)

Load tokenizer and model with QLoRA configuration

Check GPU compatibility with bfloat16

Load base model

Load LLaMA tokenizer

Load LoRA configuration

Set training parameters

Libraries & Versions:
Package Version