llama2.rs 🤗

This is a Rust implementation of Llama2 inference on CPU.

The goal is to be as fast as possible.

It has the following features:

  • Support for 4-bit GPTQ quantization
  • Batched prefill of prompt tokens
  • SIMD support for fast CPU inference
  • Memory mapping, loads 70B instantly.
  • Static size checks for safety
  • Support for Grouped Query Attention (needed for big Llamas)
  • Python calling API
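
As a rough illustration of the Grouped Query Attention item above: several query heads share a single key/value head, which shrinks the key/value cache. The sketch below is not taken from the llama2.rs source. The function name `kv_head_for` is hypothetical, and the head counts used in the usage note (64 query heads and 8 key/value heads for 70B) are the published Llama 2 70B configuration.

```rust
/// Map a query head to the key/value head it shares under grouped query
/// attention. Hypothetical helper for illustration only.
fn kv_head_for(q_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    // Query heads must divide evenly into key/value groups.
    assert!(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    assert!(q_head < n_heads);
    let heads_per_group = n_heads / n_kv_heads;
    q_head / heads_per_group
}
```

With 64 query heads and 8 key/value heads, `kv_head_for(8, 64, 8)` selects key/value head 1. When the two counts are equal, as in the 7B model, every query head keeps its own key/value head.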

It can run at up to 1 tok/s on 70B Llama2 and 9 tok/s on 7B Llama2 (on my Intel i9 desktop).

To build, you'll need the nightly toolchain, which is used by default:

> rustup toolchain install nightly # to get nightly
> ulimit -s 10000000 # Increase your stack memory limit. 

You can load models from the Hugging Face hub. For example, this creates a version of a 70B quantized model with 4-bit quantization and groups of size 64:

> pip install -r requirements.export.txt
> python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True
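
To give a feel for what "4 bit quant and 64 sized groups" means, here is a minimal sketch of group-wise 4-bit dequantization. It assumes a simplified layout (eight 4-bit values per `u32`, lowest nibble first, with one scale and one zero point per group). This is not the exact GPTQ file layout and not the llama2.rs implementation, and the names `unpack_nibbles` and `dequantize` are hypothetical.

```rust
/// Unpack eight 4-bit values from one u32, lowest nibble first.
/// (Assumed packing order, for illustration only.)
fn unpack_nibbles(packed: u32) -> [u8; 8] {
    let mut out = [0u8; 8];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((packed >> (4 * i)) & 0xF) as u8;
    }
    out
}

/// Dequantize packed 4-bit weights. Every `group_size` consecutive weights
/// share one scale and one zero point: w = scale * (q - zero).
fn dequantize(packed: &[u32], scales: &[f32], zeros: &[u8], group_size: usize) -> Vec<f32> {
    assert!(group_size > 0);
    let n = packed.len() * 8;
    let n_groups = (n + group_size - 1) / group_size;
    assert!(scales.len() >= n_groups && zeros.len() >= n_groups);

    let mut out = Vec::with_capacity(n);
    for (word_idx, &word) in packed.iter().enumerate() {
        for (j, &q) in unpack_nibbles(word).iter().enumerate() {
            let group = (word_idx * 8 + j) / group_size;
            out.push(scales[group] * (q as f32 - zeros[group] as f32));
        }
    }
    out
}
```

Smaller groups mean more scales and zero points to store, but each group tracks its weights more closely, which is the trade-off behind choosing between group sizes such as 64 and 128.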

The library needs to be recompiled to match the model. You can do this with cargo.

To run:

> cargo run --release --features 70B,group_64 -- -c llama2-70b-q.bin -t 0.0 -s 11 -p "The only thing"
The only thing that I can think of is that the
achieved tok/s: 0.89155835

Honestly, not so bad for running on my GPU machine, and significantly faster than llama2.c.

Here's a run of 13B quantized:

> cargo run --release --features 13B,group_128 -- -c l13orca.act.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
Hello to all the cool people out there who are reading this. I hope you are having a great day. I am here
achieved tok/s: 5.1588936

Here's a run of 7B quantized:

> cargo run --release --features 7B,group_128 -- -c l7.ack.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
Hello to all the cool people out there who are reading this. I am a newbie here and I am looking for some
achieved tok/s: 9.048136

Python

To run in Python, first compile from the main directory with the python feature enabled.

> cargo build --release --features 7B,group_128,python
> pip install .

You can then run the following code.

import llama2_rs

def test_llama2_13b_4_128act_can_generate():
    # The features used at compile time must match the model loaded here
    # (see Configuration).
    model = llama2_rs.LlamaModel("lorca13b.act132.bin", False)
    tokenizer = llama2_rs.Tokenizer("tokenizer.bin")
    random = llama2_rs.Random()
    response = llama2_rs.generate(
        model,
        tokenizer,
        "Tell me zero-cost abstractions in Rust ",
        50,
        random,
        0.0
    )

Todos

Configuration

To make the model as fast as possible, the model settings are fixed when the library is built, so you need to compile a new version to adapt to other Llama versions. The settings currently live in .cargo/config. The model will fail if these disagree with the binary model that is being loaded. To turn quantization off, set quant="no".
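
One way to picture compile-time settings that must agree with the loaded file is with const generics. The sketch below is an assumption about the general approach, not the actual llama2.rs code: `FileConfig`, `Model`, `compiled`, and `check` are hypothetical names, and the header fields are a guess at what a model file might carry. The shapes in the type aliases are the published Llama 2 7B and 70B configurations.

```rust
/// Shape fields as they might appear in a model file header (hypothetical).
#[derive(Debug, PartialEq)]
struct FileConfig {
    dim: usize,
    n_layers: usize,
    n_heads: usize,
    n_kv_heads: usize,
}

/// Model shape fixed at compile time through const generics.
struct Model<const DIM: usize, const N_LAYERS: usize, const N_HEADS: usize, const N_KV_HEADS: usize>;

impl<const DIM: usize, const N_LAYERS: usize, const N_HEADS: usize, const N_KV_HEADS: usize>
    Model<DIM, N_LAYERS, N_HEADS, N_KV_HEADS>
{
    /// The shape this binary was compiled for.
    fn compiled() -> FileConfig {
        FileConfig { dim: DIM, n_layers: N_LAYERS, n_heads: N_HEADS, n_kv_heads: N_KV_HEADS }
    }

    /// Reject a file whose header disagrees with the compiled shape.
    fn check(file: &FileConfig) -> Result<(), String> {
        let expected = Self::compiled();
        if *file == expected {
            Ok(())
        } else {
            Err(format!("compiled for {:?}, but file has {:?}", expected, file))
        }
    }
}

/// Published Llama 2 shapes: dim, layers, query heads, key/value heads.
type Llama7B = Model<4096, 32, 32, 32>;
type Llama70B = Model<8192, 80, 64, 8>;
```

Because the sizes are part of the type, buffers can be plain fixed-size arrays and the compiler can check their dimensions, while `check` catches the remaining mismatch between the binary and the file at load time.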

See Also

This was originally a Rust port of Karpathy's llama2.c, but it now has a bunch more features to make it scale to 70B.

Also check out:

How does it work?

It started as a port of the original code, with extra type information to make it easier to extend.

There are some dependencies:

  • memmap2 for memory mapping
  • rayon for parallel computation
  • clap for command-line args
  • PyO3 for Python calling
  • SIMD support with portable_simd

Authors

Llama2.rs is written by @srush and @rachtsingh.

Contributors

echosprint, guoqingbao, rachtsingh, srush
