fast-batch-inference

This is a WYSIWYG (what you see is what you get) guide.

It helps you do batch inference on GPUs in your VNet/VPC. It takes advantage of vLLM, which has custom throughput optimizations for batch inference. We run this on a Databricks single node or cluster.

There is no real configuration required; just make sure you use the right VM type for this to work.

For any model the size of Llama 70B or Mixtral 8x7B, make sure to use at least 2 A100s. Everything else can use a single A100 or smaller. A better table of model sizes and the number of GPUs needed to host one instance will be posted.

The reason to use vLLM is that it supports batching out of the box, so it is rare to hit OOM errors when passing in a larger payload: it figures out how much GPU memory is available and batches appropriately. It will, however, throw OOM if you don't have enough memory to load the model as well as room for the KV cache.
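
To make the workflow concrete, here is a minimal sketch of vLLM batch generation; the model name, prompts, and sampling values below are placeholders, and the notebooks in this repo cover the real setup:

# Minimal vLLM batch-inference sketch (model name, prompts, and values are placeholders)
from vllm import LLM, SamplingParams

# Load the weights once; vLLM manages KV-cache memory and batching internally
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=1)

prompts = [
    "[INST] Summarize the plot of Hamlet in one sentence. [/INST]",
    "[INST] Write a haiku about GPUs. [/INST]",
]
params = SamplingParams(temperature=0.1, top_p=0.95, max_tokens=256)

# Pass the whole payload at once; vLLM schedules it into batches that fit in memory
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)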

The plan is to have 3 notebooks.

  1. Batch scoring for single node (multi or single gpu). [DONE]
  2. Batch scoring for multi node (multi or single gpu). [TBD]
  3. Batch scoring by making api calls to provisioned throughput models hosted on model serving. [TBD]

Getting access to models

  1. Log in to your Databricks workspace
  2. Go to Marketplace (you may need your admin to do the next steps)
  3. Search for DBRX and get instant access
  4. Download the models
  5. Use the provided notebook if you need a provisioned throughput deployed model
  6. Otherwise follow this WYSIWYG guide to do batch inference on a job / interactive cluster

Notebooks

Currently all notebooks are for single-node, multi-GPU VMs.

  1. batch scoring with DBRX (4 x A100 GPUs): notebook
    • DBRX needs vLLM 0.4.0, which has a slight bug, so we are using the 0.4.0.post1 hotfix installed from the direct URL
  2. batch scoring with Llama or Mixtral: notebook
    • you need at least 2 x A100 GPUs for the 70B or Mixtral 8x7B models (see the sketch below)
    • the rest should fit on 1 x A100 GPU on the VM
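
For the larger models the key knob is tensor_parallel_size, which shards the weights across the GPUs on the VM. A rough sketch, with an illustrative model name:

# Llama-2-70B-class and Mixtral-8x7B-class models need the weights sharded across at least 2 A100s
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # illustrative; point this at the model you downloaded
    tensor_parallel_size=2,                  # number of GPUs to shard the model across
)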

Prompting & Performance

Most OSS models have specific instruction tokens and special tokens for prompting and sending instructions to the model. Using them correctly is extremely important for throughput and performance; otherwise the model will be very chatty and can loop completions until the max token limit has been reached. This is where these special tokens come into play.

For now, Mixtral- and Llama-based models use similar tokens that work much like XML/HTML tags, with subtle differences explained below:

  1. [INST] and [/INST] to indicate instruction blocks
  2. <<SYS>> and <</SYS>> to indicate system prompt
  3. <s> and </s> to indicate beginning of sequence (BOS) and end of sequence (EOS) respectively

Llama Models

The Llama 2 series requires the use of <<SYS>> for creating system prompts and [INST] tokens for giving specific instructions. Please note that <s> is not closed.

Prompt Template:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]

Example:

<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

There's a llama in my garden 😱 What should I do? [/INST]
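
If you want to assemble this template in code, a small helper along the following lines works; this is a hypothetical helper, not something shipped in the notebooks:

# Hypothetical helper that fills in the Llama 2 chat template shown above
def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = build_llama2_prompt(
    "You are a helpful, respectful and honest assistant.",
    "There's a llama in my garden 😱 What should I do?",
)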

Further details here

Mixtral and Mistral models

Both of these model families use only the <s> and [INST] tokens to create the prompt structure.

Prompt Template:

<s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]

Example:

[INST] You are a helpful code assistant. Your task is to generate a valid JSON object based on the given information:
name: John
lastname: Smith
address: #1 Samuel St.
Just generate the JSON object without explanations:
[/INST]
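
Assembled in code, the same structure looks roughly like this; the helper below is hypothetical and only mirrors the template above:

# Hypothetical helper for the Mistral/Mixtral template shown above;
# earlier turns close with </s>, and the leading <s> is usually added by the tokenizer
def build_mistral_prompt(history, instruction):
    # history: list of (instruction, model_answer) pairs from earlier turns
    prompt = ""
    for past_instruction, answer in history:
        prompt += f"[INST] {past_instruction} [/INST] {answer}</s>"
    return prompt + f"[INST] {instruction} [/INST]"

prompt = build_mistral_prompt([], "Just generate the JSON object without explanations: name: John, lastname: Smith")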

Further details here

Inference Sampling Parameters

When the model is predicting the next token, the following parameters impact the accuracy and consistency of the results.

  1. Temperature controls randomness: Lower values make responses more deterministic, higher values increase diversity. (Typically ranges from 0 - 1)

  2. P (top-p) controls the probability mass: Lower values focus on more likely tokens, cutting off the less likely ones.
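
In vLLM these map directly onto SamplingParams; below is a sketch with conservative values for batch scoring (the specific numbers are only a starting point):

# Conservative sampling settings for reproducible batch scoring (values are illustrative)
from vllm import SamplingParams

params = SamplingParams(
    temperature=0.1,   # low temperature -> near-deterministic outputs
    top_p=0.95,        # keep only the most likely 95% of the probability mass
    max_tokens=512,    # hard cap so a chatty model cannot loop forever
)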
