
DBRX

DBRX is a large language model trained by Databricks, and made available under an open license. This repository contains the minimal code and examples to run inference, as well as a collection of resources and links for using DBRX.

Reference model code can be found in this repository at modeling_dbrx.py.

Note: this model code is supplied for reference purposes only; please see the Hugging Face repository for the officially supported version.

Model details

DBRX is a Mixture-of-Experts (MoE) model with 132B total parameters and 36B live parameters. We use 16 experts, of which 4 are active during training or inference. DBRX was pre-trained on 12T tokens of text. DBRX has a context length of 32K tokens.

The following models are open-sourced:

Model           Description
DBRX Base       Pre-trained base model
DBRX Instruct   Finetuned model for instruction following

The model was trained using optimized versions of our open source libraries Composer, LLM Foundry, MegaBlocks and Streaming.

For the instruct model, we used the ChatML format. Please see the DBRX Instruct model card for more information on this.
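
For illustration, a ChatML-style conversation is laid out as in the minimal sketch below. The system prompt shown is a placeholder, not the one shipped with DBRX Instruct; in practice, build prompts with tokenizer.apply_chat_template rather than by hand so the official template is used.

# Illustrative ChatML layout only; the system prompt here is a placeholder.
chatml_prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is a Mixture-of-Experts model?<|im_end|>\n"
    "<|im_start|>assistant\n"
)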

Quick start

To download the weights and tokenizer, please first visit the DBRX Hugging Face page and accept the license. Note: access to the Base model requires manual approval.

We recommend having at least 320GB of memory to run the model.

Then, run:

pip install -r requirements.txt # Or requirements-gpu.txt to use flash attention on GPU(s)
huggingface-cli login           # Add your Hugging Face token in order to access the model
python generate.py              # See generate.py to change the prompt and other settings
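
As a minimal sketch of what generate.py does (assuming you have accepted the license, logged in, and have enough GPU memory), the model can also be called directly through transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Requires accepting the license on Hugging Face and roughly 320GB of memory in bf16.
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What does it take to build a great LLM?"}]
inputs = tokenizer.apply_chat_template(
    messages, return_dict=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))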

For more advanced usage, please see LLM Foundry (chat script, batch generation script)

If you have any package installation issues, we recommend using our Docker image: mosaicml/llm-foundry:2.2.1_cu121_flash2-latest

Inference

Both TensorRT-LLM and vLLM can be used to run optimized inference with DBRX. We have tested both libraries on NVIDIA A100 and H100 systems. To run inference with 16-bit precision, a multi-GPU system with at least 4 x 80GB GPUs is required (the 132B parameters alone occupy roughly 264GB in 16-bit, before KV cache and activations).

TensorRT-LLM

DBRX support is being added to the TensorRT-LLM library: Pending PR

Once the PR is merged, instructions to build and run DBRX TensorRT engines will be available at: README

vLLM

Please see the vLLM docs for instructions on how to run DBRX with the vLLM engine.
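
As a rough sketch (assuming a vLLM build with native DBRX support and a 4 x 80GB system), offline inference looks like this:

from vllm import LLM, SamplingParams

# Sketch only: assumes a vLLM version with native DBRX support and 4 x 80GB GPUs.
llm = LLM(model="databricks/dbrx-instruct", tensor_parallel_size=4)
sampling_params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.generate(["What does it take to build a great LLM?"], sampling_params)
print(outputs[0].outputs[0].text)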

MLX

If you have an Apple laptop with a sufficiently powerful M-series chip, a quantized version of DBRX can be run with MLX. See instructions for running DBRX on MLX here.
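
As a rough sketch (assuming the mlx-lm package and the community 4-bit conversion referenced in the issues below), generation on an M-series Mac looks like this:

from mlx_lm import load, generate

# Sketch only: mlx-community/dbrx-instruct-4bit is a community conversion, and
# even in 4-bit it needs on the order of 70GB of unified memory.
model, tokenizer = load("mlx-community/dbrx-instruct-4bit")
response = generate(model, tokenizer, prompt="What is DBRX?", max_tokens=200, verbose=True)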

llama.cpp

If you have an Apple M-series laptop with at least 64GB of RAM, you can run a quantized version of DBRX using llama.cpp.

  1. Compile llama.cpp
  2. Download a quantized ggml version of dbrx-instruct such as dranger003/dbrx-instruct-iMat.GGUF
  3. From the llama.cpp folder, run:
./main -ngl 41 -m ./models/ggml-dbrx-instruct-16x12b-iq1_s.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

Finetune

To finetune DBRX with our open source library LLM Foundry, please see the instructions in our training script (found here). We have finetuning support for both full-parameter finetuning and LoRA finetuning.

Note: LoRA support currently cannot finetune the experts, since the experts are fused. Stay tuned for more.
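
For illustration only, a generic PEFT-style LoRA setup that skips the fused experts might look like the sketch below. This is not the LLM Foundry recipe, and the target module names are assumptions about the DBRX attention layers rather than something confirmed here.

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Sketch only, not the LLM Foundry flow. "Wqkv" and "out_proj" are assumed
# attention module names; the fused expert weights are intentionally not
# targeted, per the note above.
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["Wqkv", "out_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()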

Model card

The model cards for DBRX Base and DBRX Instruct can be found on Hugging Face.

Integrations

DBRX is available on the Databricks platform through:

Other providers have recently added support for DBRX:

The same tools used to train high quality MoE models such as DBRX are available for Databricks customers. Please reach out to us at https://www.databricks.com/company/contact if you are interested in pre-training, finetuning, or deploying your own DBRX models!

Issues

For issues with model output, or community discussion, please use the Hugging Face community forum (instruct, base)

For issues with LLM Foundry, or any of the underlying training libraries, please open an issue on the relevant GitHub repository.

License

Our model weights and code are licensed for both researchers and commercial entities. The Databricks Open Source License can be found at LICENSE, and our Acceptable Use Policy can be found here.


dbrx's Issues

HumanEval

When evaluating on HumanEval, does DBRX use any special prompts to improve performance?

Silu or Glu activation?

According to the model card on Hugging Face:

DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA).

However, when I run

config = AutoConfig.from_pretrained('/models/dbrx-instruct/')
print(config.ffn_config)

It shows:

DbrxFFNConfig {
  "ffn_act_fn": {
    "name": "silu"
  },
  "ffn_hidden_size": 10752,
  "moe_jitter_eps": 0,
  "moe_loss_weight": 0.05,
  "moe_normalize_expert_weights": 1,
  "moe_num_experts": 16,
  "moe_top_k": 4,
  "transformers_version": "4.38.1",
  "uniform_expert_assignment": false
}

This is somewhat misleading and confusing.
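
(For context, the two statements are compatible if the FFN is a SiLU-gated GLU, i.e. SwiGLU-style, where "silu" is the activation applied inside the gate. A minimal illustrative sketch, not the actual DbrxFFN code:)

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """Illustrative SiLU-gated GLU block; names and shapes are not DBRX's."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, ffn_hidden_size, bias=False)  # gate branch
        self.v1 = nn.Linear(hidden_size, ffn_hidden_size, bias=False)  # value branch
        self.w2 = nn.Linear(ffn_hidden_size, hidden_size, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "silu" from the config is the activation inside the gated linear unit.
        return self.w2(F.silu(self.w1(x)) * self.v1(x))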

How to get hands-on experience as a newbie

My first option is to run quantized versions.

Quantized

I read this https://github.com/databricks/dbrx#mlx

and then went to https://huggingface.co/mlx-community/dbrx-instruct-4bit

I read this

On my MacBook Pro M2 with 96GB of unified memory, DBRX Instruct in 4-bit eats 70.2GB of RAM for the above prompt.

I am on a MacBook Pro M1 Max with 64GB of memory.

I guess that's not enough?

Computing

My next option is to figure out a cheap way to run the model, but the details confuse me.

Can you help?

How to use API?

Hi, thank you for your wonderful work! However, I need to learn how to use the API in our project, so please show how to use the API the next time you update README.md. Thank you so much~

Missing tokenizer when use vllm

  File "/home/paas/vllm/vllm/engine/llm_engine.py", line 222, in _init_tokenizer
    self.tokenizer: BaseTokenizerGroup = get_tokenizer_group(
  File "/home/paas/vllm/vllm/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
    return TokenizerGroup(**init_kwargs)
  File "/home/paas/vllm/vllm/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
    self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
  File "/home/paas/vllm/vllm/transformers_utils/tokenizer.py", line 66, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/paas/miniconda3/envs/naie/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 822, in from_pretrained
    return tokenizer_class.from_pretrained(
  File "/home/paas/miniconda3/envs/naie/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2086, in from_pretrained
    return cls._from_pretrained(
  File "/home/paas/miniconda3/envs/naie/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2327, in _from_pretrained
    raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

training slow

Hello, I find it slow to train the MoE model, because the DbrxExperts forward pass runs the experts serially in a for-loop (screenshot omitted).

Can the for-loop over experts be parallelized during training?
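
(Since the screenshot is omitted above, here is an illustrative sketch of the serial dispatch pattern being described: each expert is applied in turn to the tokens routed to it. Names and shapes are illustrative, not the actual DbrxExperts code.)

import torch

def moe_forward_serial(x, router_probs, experts, top_k=4):
    """Illustrative serial MoE dispatch: loop over the experts one at a time."""
    # x: (num_tokens, hidden), router_probs: (num_tokens, num_experts),
    # experts: a list of per-expert feed-forward modules.
    weights, selected = torch.topk(router_probs, top_k, dim=-1)
    out = torch.zeros_like(x)
    for expert_idx, expert in enumerate(experts):  # serial loop over all experts
        token_idx, slot = torch.where(selected == expert_idx)
        if token_idx.numel() == 0:
            continue
        out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
    return out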

Is DBRX, the most powerful open-source LLM yet?

In the rapidly evolving landscape of natural language processing (NLP), the emergence of DBRX marks a significant milestone. Developed by Databricks, DBRX represents a quantum leap in the realm of large language models (LLMs), boasting unparalleled performance and accuracy across a multitude of benchmarks. This article delves into the intricate details and remarkable statistics that underscore the prowess of DBRX, positioning it as a frontrunner in the field.

Performance on Composite Benchmarks:

DBRX's superiority becomes evident when evaluated against established open and closed models across composite benchmarks. Notably, on the Hugging Face Open LLM Leaderboard, DBRX achieves an exceptional score of 74.5%, surpassing its closest competitor by a significant margin of 1.8%. Similarly, on the Databricks Model Gauntlet, DBRX outshines its peers with a commanding score of 66.8%, underscoring its unrivaled proficiency in diverse tasks encompassing world knowledge, commonsense reasoning, and language understanding.

Dominance in Specialized Domains:

Where DBRX truly shines is in specialized domains such as programming and mathematics. On benchmarks tailored to assess programming prowess like HumanEval and GSM8k, DBRX demonstrates remarkable superiority. For instance, on HumanEval, DBRX achieves an impressive score of 70.1%, outperforming Grok-1 by 6.9%, Mixtral Instruct by 15.3%, and the best-performing LLaMA2-70B variant by 37.9%. Similarly, on GSM8k, DBRX secures a notable 66.9%, surpassing competitors by margins ranging from 4.0% to 12.8%.

Unprecedented Versatility:

DBRX's exceptional performance across diverse domains underscores its versatility and adaptability. Whether tackling complex programming challenges or unraveling intricate linguistic nuances, DBRX consistently delivers unparalleled results. This versatility positions DBRX as a formidable tool for a wide array of applications, ranging from natural language understanding to specialized tasks in programming and mathematics.

Efficiency in Training and Inference:

Beyond its remarkable performance, DBRX also excels in training and inference efficiency. Leveraging a fine-grained mixture-of-experts (MoE) architecture, DBRX achieves superior FLOP efficiency compared to dense models, enabling faster training and inference without compromising on quality. Additionally, DBRX's inference throughput surpasses that of its counterparts, offering up to 150 tokens per second on Mosaic AI Model Serving.

Conclusion:

In conclusion, DBRX represents a paradigm shift in the landscape of large language models. Its exceptional performance, unmatched versatility, and superior efficiency position it as a frontrunner in the field, setting new standards for accuracy and proficiency. As the pinnacle of Databricks' innovation in NLP, DBRX promises to empower enterprises and researchers alike, heralding a new era of breakthroughs in natural language understanding and AI-driven applications.

What do you think?

I have encountered a problem: LayerNorm.__init__() got an unexpected keyword argument 'bias'

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/home/roo/train/dbrx-instruct/generate.py", line 39, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/roo/anaconda3/envs/Meditron/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/roo/anaconda3/envs/Meditron/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3404, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 1261, in __init__
    self.transformer = DbrxModel(config)
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 1013, in __init__
    self.blocks = nn.ModuleList([
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 1014, in <listcomp>
    DbrxBlock(config, block_idx) for block_idx in range(config.n_layers)
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 856, in __init__
    self.norm_attn_norm = DbrxNormAttentionNorm(
  File "/home/roo/.cache/huggingface/modules/transformers_modules/model/modeling_dbrx.py", line 642, in __init__
    self.norm_1 = nn.LayerNorm(hidden_size, bias=False)
TypeError: LayerNorm.__init__() got an unexpected keyword argument 'bias'

`convert_ids_to_tokens` not working as expected.

from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained('/models/dbrx-instruct/')
t.encode('请问你是谁')
# [15225, 57107, 57668, 21043, 39013, 223]
t.decode([15225, 57107, 57668, 21043, 39013, 223])
# '请问你是谁'
print(t.convert_ids_to_tokens(15225))
# '请'

I suppose it should output token text like

Fine-tune dbrx-instruct on a single VM with 8 H100s

Reposting from the llm-foundry repo:

I'm trying to fine-tune DBRX on a single machine with 8 H100 GPUs. I keep getting OOM errors with different configurations; I wonder if this is even doable.

I see a note that suggests 64 x 80GB GPUs, but I wonder if there are ways to do it with 8 x 80GB GPUs.

Thanks

Fine Tuning?

Do you support fine-tuning this model, for example using LoRA, DeepSpeed, etc.?

What's the optimal parallel strategy using TensorRT-LLM?

First of all, thanks for your great efforts. I read the PR you opened in the TensorRT-LLM repo and noticed that EP + TP, PP + TP, and TP are supported during inference. May I ask which one is optimal? Specifically, for the MoE layer, does EP or TP yield better performance?

How inference efficiency is measured

The tech report described the methodology of the inference-efficiency measurement, but not in detail. It compared Llama2-70B and DBRX, and we have great interest in that comparison. So we also carried out some tests in which we spawned different numbers of synchronous clients in order to stress the service at different QPS levels. The performance we measured differs from the tech report: DBRX is faster than Llama2-70B when the traffic is below 0.35 QPS, but the latency-vs-QPS curve flips after that. By the way, we used the same prompt length and output length as in the tech report.

So I wonder if you could give more details about how the performance was tested.
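
(For reference, the "different numbers of synchronous clients" pattern described above usually looks like the closed-loop sketch below; the endpoint and payload are placeholders, not the actual test harness used.)

import threading
import time

import requests  # placeholder HTTP client; endpoint/payload below are illustrative

def closed_loop_client(endpoint, payload, duration_s, latencies):
    """One synchronous client: send a request, wait for the reply, repeat."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        start = time.time()
        requests.post(endpoint, json=payload, timeout=600)
        latencies.append(time.time() - start)

def run_load_test(endpoint, payload, num_clients, duration_s=300):
    latencies = []
    threads = [
        threading.Thread(
            target=closed_loop_client, args=(endpoint, payload, duration_s, latencies)
        )
        for _ in range(num_clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    qps = len(latencies) / duration_s
    mean_latency = sum(latencies) / max(len(latencies), 1)
    return qps, mean_latency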

Transformers Key Error : 'dbrx'

  File "/usr/lib/python3/dist-packages/llmfoundry/models/hf/hf_causal_lm.py", line 119, in __init__
    config = AutoConfig.from_pretrained(
        pretrained_model_name_or_path,
        trust_remote_code=trust_remote_code,
        use_auth_token=use_auth_token,
ValueError: The checkpoint you are trying to load has model type dbrx but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

This is the issue I faced when trying to run dbrx_lora_ft.yaml. Any suggestions would be appreciated!

Bad performance on PrOntoQA benchmark

PrOntoQA is a question-answering dataset that generates examples with chains-of-thought that describe the reasoning required to answer the questions correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing. It can be used to formally analyze the predicted chain-of-thought from large language models.

I have tested the performance of DBRX-Base on the GSM8k, AQuA, and StrategyQA datasets using 4-shot CoT, and its performance is satisfactory compared to other models (GPT-4, Claude Opus, LLaMA 70B, etc.).

Nevertheless, when I test the model's performance on PrOntoQA, the results are not as good: dbrx-instruct achieves 24.2% accuracy and dbrx-base is worse. Although there might be some output-processing errors when using dbrx-base, dbrx-instruct has no endless-generation problem yet still fails to achieve good performance.

Therefore, I want to know whether there is an official test result on PrOntoQA for others to take as a reference.

Thanks!

Stuck on the output "Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation." for more than 10 minutes

Hello, I just downloaded the dbrx-instruct files from Hugging Face, but when I run the example code it gets stuck at the message "Setting pad_token_id to eos_token_id:100257 for open-end generation.". Is that a memory problem?

The code is :
"""
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)

input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

print(input_ids)

outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

"""

The basic configuration of my local host is:
GeForce RTX 3090
Memory: 32GB
CPU: 24 cores

Please Help.

Loading over multiple GPUs in 8-bit and 4-bit with the transformers loader

I can load the instruct model using the transformers loader with 8-bit bitsandbytes, and I can get it to load evenly across multiple GPUs.

However, I cannot seem to load the model with 4-bit precision over multiple GPUs. I managed to get the model to load across one 24GB GPU and start loading onto a second GPU of equivalent size, but it will not move on to any of the remaining GPUs (7 in total). It OOMs on the second GPU with the others sitting empty.

I've loaded other transformers-based models in 4-bit and never experienced this heavily unbalanced loading before.
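
(For what it's worth, one commonly suggested workaround is to pass an explicit max_memory map so accelerate spreads the 4-bit weights across all cards. The sketch below is illustrative, and the per-device budgets are assumptions, not tested numbers.)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch only: nudge accelerate into using every GPU by capping each device's budget.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
max_memory = {i: "22GiB" for i in range(8)}  # leave headroom on each 24GB card
max_memory["cpu"] = "64GiB"

model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",
    quantization_config=quant_config,
    device_map="auto",
    max_memory=max_memory,
    trust_remote_code=True,
)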

generate.py: tiktoken.py throws an Encoding import error

After setting everything up locally, both generate.py from this repo and the minimal Python script on the Hugging Face page throw the same error.
I followed all the steps in this repo, and my brand-new venv was populated with "pip install -r requirements.txt".


Traceback (most recent call last):
  File "/media/models/dbrx/generate.py", line 34, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/models/dbrx/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 822, in from_pretrained
    return tokenizer_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/models/dbrx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2086, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/media/models/dbrx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2325, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/tiktoken.py", line 105, in __init__
    from tiktoken import Encoding  # type: ignore (thirdParty)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: cannot import name 'Encoding' from 'tiktoken' (/media/models/dbrx/tiktoken.py)

Real performance versus Llama-70B?

I have a question about the inference data posted in this blog:
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

For a MoE model with 36B activated parameters and 132B total parameters, inference performance should behave roughly like a 90B dense model at a 2000-token prompt and 256 output tokens. How can it always perform better than the Llama2-70B dense model? As the batch size increases, it should perform better than Llama2-70B at first, then worse from batch size 3 or 4 onward, because all 132B parameters end up being loaded as more and more experts are activated.

