
llm-sharp

🚧 Under very early development

Run and serve language models in C# with TorchSharp.

This project aims to implement most things natively in C#, except for some specialized CUDA kernels that must be written in C++. This offers the best developer experience for production-ready apps.

Features & TODOs

C#:

  • Python resources interop
    • Load safetensors in C#
    • Convert scripts
      • GPTQ & Awq convert scripts
    • streamlit web ui for api service
  • Introduce more models
    • Llama2 family
      • Llama2 (7b awq tested)
      • Qwen tested
      • Baichuan2 with alibi tested
    • Bert family
      • SentenceTransformer tested
  • Introduce more tokenizers
    • BPE
    • BPE (SentencePiece)
    • Tiktoken (BPE)
    • Unigram (SentencePiece)
    • WordPiece
    • Unit tests
  • Serve
    • OpenAI-compatible API with ASP.NET Core
    • Command line interface
    • Batched inference
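As an illustration of the BPE variants listed above, the core of a BPE encoder is a greedy merge loop. This is a pure-Python sketch for illustration only, not the project's C# implementation:

```python
def bpe_encode(word, merges):
    """Encode a word with BPE. `merges` is a list of (left, right)
    pairs ordered by priority (earlier entries merge first)."""
    tokens = list(word)
    while True:
        # find the highest-priority adjacent pair present in tokens
        best, best_rank = None, None
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in merges:
                rank = merges.index(pair)
                if best_rank is None or rank < best_rank:
                    best, best_rank = pair, rank
        if best is None:
            return tokens
        # merge all occurrences of the best pair
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

tokens = bpe_encode("hello", [("l", "l"), ("h", "e"), ("he", "ll")])
# → ['hell', 'o']
```

Real implementations (e.g. Tiktoken-style BPE) use ranked lookups instead of a linear scan, but the merge semantics are the same.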

Native cpp:

  • Model parallel
    • Layer parallel
    • Tensor parallel (needs NCCL support)
  • Specialized cuda ops
    • C# cuda op loading
    • GPTQ int4 ops (can be converted to AWQ format without loss if desc_act is not used)
    • AWQ int4 ops
    • Flash attention
    • Fused ops (RMS norm, Rotary embeddings)
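For reference, the RMS norm computed by one of the fused ops is simple enough to sketch in a few lines. This is a pure-Python illustration of the math, not the CUDA kernel:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

out = rms_norm([3.0, 4.0], [1.0, 1.0])
```

The fused kernel computes the same expression but in a single pass over the hidden dimension, avoiding intermediate tensors.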

Usage

Get a release from Releases.

Run llm-sharp test to verify that libtorch is correctly loaded. If you are starting from scratch, use llm-sharp download to fetch the required version of libtorch; by default it downloads to ~/.cache/llm-sharp. Alternatively, install the required version of PyTorch with pip or conda. The libtorch lookup order is: env LIBTORCH_PATH > ~/.cache/llm-sharp > python site-packages > OS fallback
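The lookup order can be modeled as a first-match search. This is an illustrative Python sketch of the documented order; `find_libtorch` and its parameters are hypothetical names, not the actual loader code:

```python
import os

def find_libtorch(cache_dir, site_packages_dirs, os_fallback):
    # Mirrors the documented order:
    # env LIBTORCH_PATH > ~/.cache/llm-sharp > python site-packages > OS fallback
    env_path = os.environ.get("LIBTORCH_PATH")
    if env_path:
        return env_path
    for candidate in [cache_dir, *site_packages_dirs]:
        if candidate and os.path.isdir(candidate):
            return candidate
    return os_fallback
```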

Convert your models using the scripts in python-scripts. They convert the original model and tokenizer into model_config.json, tokenizer_config.json and *.safetensors. SentencePiece tokenizers should first be converted to the Hugging Face fast tokenizer format.

Modify appsettings.json in the App project, or add an environment-aware config appsettings.[Development|Production].json with:

{
  "llm": {
    "models": [
      {
        "name": "llama2-7b-chat",
        "type": "LlamaAwq",
        "path": "path/to/llama2-7b-chat-awq",
        "dtype": "float16",
        "device": "cuda:0"
      }
    ]
  }
}

By default, an HTTP API service is started. The API is mostly compatible with the OpenAI v1 API, exposing /v1/chat/completions and /v1/embeddings. Visit http://<server_url>/swagger/index.html for the API docs. You can set "Bearer": { "Token": "your-secret-token", "Tokens": ["some-extra-tokens"] } in appsettings.json to enable endpoint authorization, and use --urls http://<host>:<port>,http://<host2>:<port2> to change the default listening URLs.
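A request body for the chat completions endpoint might be built like this (a sketch assuming the standard OpenAI chat payload shape; the model name must match a "name" entry from the config above, and the Authorization header is only needed if Bearer auth is enabled):

```python
import json

payload = {
    "model": "llama2-7b-chat",  # a "name" from the "llm.models" config
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}
headers = {
    "Authorization": "Bearer your-secret-token",  # only if Bearer auth is set
    "Content-Type": "application/json",
}
body = json.dumps(payload)
# POST body to http://<server_url>/v1/chat/completions with these headers
```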

After starting the API service, run streamlit run web-ui.py in python-scripts to start a simple web UI built with Streamlit.

For command line interface:

llm-sharp cli

Dev env setup

It's recommended to use conda to manage the build environment for the NativeOps package. The current TorchSharp depends on torch==2.1.0 and cuda==12.1:

conda install pytorch=2.1.0 pytorch-cuda=12.1 cuda -c pytorch -c nvidia

This automatically installs the nvcc required to build the NativeOps package. The build pipeline also requires the ninja Python package and an MSVC compiler (which is set up by installing Visual Studio).

Then you can build NativeOps with:

python NativeOps/build.py

This builds the native code to NativeOps/runtimes/[rid]/native/llm_sharp_ops.{dll,so}, which is automatically recognized by dotnet. For C++ development, use the following argument to print the include dirs for the build pipeline:

python NativeOps/build.py include

If you already have a built binary from the latest release, or you don't need any ops from NativeOps, you can also install torch via pip or directly download the required libs.

Performance

Single inference on Linux with a single RTX 3090 (Qwen-14B-Chat, AWQ int4):

Decoder perf:
  len: 670(prefix) + 326(gen)
 init: 0.3439 s
  sum: 6.5723 s
  gen: 49.6019 tok/s
  avg: 47.2804 tok/s
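The generation throughput follows directly from the figures above, assuming sum is pure generation time excluding init (an assumption; a quick check):

```python
# gen throughput = generated tokens / generation time
gen_tokens = 326
gen_time_s = 6.5723  # "sum" above, assumed to exclude the 0.3439 s init
gen_tok_per_s = gen_tokens / gen_time_s  # about 49.60 tok/s, as reported
```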

Acknowledgement
