
samba's Introduction

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

arXiv: https://arxiv.org/abs/2406.07522

Samba is a simple yet powerful hybrid model with an unlimited context length. Its architecture is frustratingly simple:

Samba = Mamba + MLP + Sliding Window Attention + MLP stacking at the layer level.

Our largest model, Samba-3.8B, is trained on 3.2 trillion tokens from the Phi3 dataset, outperforming Phi3-mini on major benchmarks (e.g. MMLU, GSM8K and HumanEval) by a large margin. Samba can also achieve perfect long-context retrieval ability with minimal instruction tuning, while still maintaining its linear complexity with respect to sequence length. This ability leads to the impressive performance of Samba-3.8B-instruct on downstream tasks such as long-context summarization.
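
To make the layer-level stacking concrete, below is a minimal, runnable sketch of one such layer group. It is not the released implementation: the Mamba layer is replaced by a GRU stand-in and the sliding window attention is a naive masked nn.MultiheadAttention, so only the Mamba + MLP + Sliding Window Attention + MLP ordering and the pre-norm residual wiring reflect the formula above.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_model, hidden_mult=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden_mult * d_model),
            nn.SiLU(),
            nn.Linear(hidden_mult * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SlidingWindowAttention(nn.Module):
    def __init__(self, d_model, n_heads=8, window=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window

    def forward(self, x):
        t = x.size(1)
        pos = torch.arange(t, device=x.device)
        # Mask out future tokens and tokens farther back than the window.
        mask = (pos[None, :] > pos[:, None]) | (pos[:, None] - pos[None, :] >= self.window)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class RecurrentStandIn(nn.Module):
    # Stand-in for the Mamba layer: any linear-time recurrent sequence mixer goes here.
    def __init__(self, d_model):
        super().__init__()
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out

class SambaLayerGroup(nn.Module):
    # One Mamba -> MLP -> sliding-window attention -> MLP group;
    # the full model stacks several of these groups.
    def __init__(self, d_model, window=2048):
        super().__init__()
        self.blocks = nn.ModuleList([
            RecurrentStandIn(d_model),
            MLP(d_model),
            SlidingWindowAttention(d_model, window=window),
            MLP(d_model),
        ])
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in self.blocks)

    def forward(self, x):
        for norm, block in zip(self.norms, self.blocks):
            x = x + block(norm(x))  # pre-norm residual around every sub-layer
        return x

x = torch.randn(2, 16, 256)            # (batch, sequence length, d_model)
print(SambaLayerGroup(256)(x).shape)   # torch.Size([2, 16, 256])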

Performance 🚀

Model                         | MMLU | GSM8K | HumanEval | GovReport | SQuALITY
Phi-3-mini-4K-instruct        | 68.8 | 82.5  | 58.5      | 14.4      | 21.6
Samba-3.8B-instruct (preview) | 71.9 | 87.6  | 62.8      | 18.9      | 21.2

We report 5-shot accuracy for MMLU, 8-shot CoT accuracy for GSM8K, 0-shot pass@1 for HumanEval and ROUGE-L for both GovReport and SQuALITY.

Updates

  • [June 11] Released the codebase for training Samba-421M and Samba-1.3B on SlimPajama.

Code Overview

Our infrastructure for training on SlimPajama is a modified version of TinyLlama and LitGPT. You can easily specify different architectural configurations by modifying model_name and the config file, which includes many of the baseline architectures mentioned in the paper. Our RetNet and GLA implementations are from the awesome Flash Linear Attention repository.

Pretraining Samba from scratch

Please follow the Dockerfile to set up the environment. Data preparation mainly follows TinyLlama, except that we use only the SlimPajama dataset.
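
For example (paths and the image tag below are placeholders; the --ipc and --ulimit flags follow the recommendation printed by NVIDIA's PyTorch containers):

docker build -t samba .
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /path/to/dataset:/path/to/dataset -it samba bash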

Data Preparation

Download the SlimPajama dataset to a directory of your choice.

cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B

The SlimPajama dataset takes 893 GB of disk space. Use the provided scripts to tokenize the dataset and divide it into chunks.

python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim --split validation --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim --split train --percentage 1.0
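
For a quick smoke test before tokenizing the full dataset, you can process only a fraction of each split by lowering --percentage (the value below is just an example):

python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim --split validation --percentage 0.01
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim --split train --percentage 0.01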

You are now ready to launch a job!

Training

The following script trains a default Samba-421M model on a single node of 8 GPUs with 20B tokens.

torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=samba-421M --rdzv_backend=c10d  --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} pretrain.py --train_data_dir data/slim --val_data_dir data/slim 

To train a Samba-1.3B model on 100B tokens, change model_name to "Samba_1.3B" and train_config to "tsz512x4k_100B". This configuration assumes 8 nodes with 8 GPUs each; you can adjust the number of nodes to train on fewer GPUs.
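
For multi-node training, launch the same entry point with torchrun on every node. A sketch (MASTER_ADDR and MASTER_PORT point to the rank-0 node, and --nnodes matches your cluster size):

torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=samba-1.3B --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} pretrain.py --train_data_dir data/slim --val_data_dir data/slim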

Citation

If you find our work useful, please consider citing:

@article{ren2024samba,
      title={Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling}, 
      author={Liliang Ren and Yang Liu and Yadong Lu and Yelong Shen and Chen Liang and Weizhu Chen},
      journal = {arXiv preprint},
      year={2024},
      url={https://arxiv.org/abs/2406.07522}
}

Contact

Liliang Ren ([email protected])

samba's Issues

Error when using Docker

Hello,

I have built a Docker image based on the Dockerfile provided, but I am seeing the following error when trying to prepare the data.

Docker command: docker run -d --name samba -v ./data:/app/data -v /data/hf-cache:/data/hf-cache --runtime nvidia samba python scripts/prepare_slimpajama.py --source_path /data/hf-cache/datasets--cerebras--SlimPajama-627B/snapshots/2d0accdd58c5d5511943ca1f5ff0e3eb5e293543/ --tokenizer_path data/llama --destination_path data/slim --split validation --percentage 1.0

Container logs:


=============
== PyTorch ==
=============

NVIDIA Release 23.07 (build 63867923)
PyTorch Version 2.1.0a0+b5021ba

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Traceback (most recent call last):
  File "/app/scripts/prepare_slimpajama.py", line 21, in <module>
    import lit_gpt.packed_dataset as packed_dataset
  File "/app/lit_gpt/__init__.py", line 7, in <module>
    from lit_gpt.model import GPT
  File "/app/lit_gpt/model.py", line 15, in <module>
    from xformers.ops import SwiGLU
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/__init__.py", line 8, in <module>
    from .fmha import (
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 10, in <module>
    from . import (
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/triton_splitk.py", line 548, in <module>
    _get_splitk_kernel(num_groups)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/triton_splitk.py", line 503, in _get_splitk_kernel
    _fwd_kernel_splitK_unrolled = unroll_varargs(_fwd_kernel_splitK, N=num_groups)
  File "/usr/local/lib/python3.10/dist-packages/xformers/triton/vararg_kernel.py", line 166, in unroll_varargs
    jitted_fn = triton.jit(fn)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 864, in jit
    return decorator(fn)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 853, in decorator
    return JITFunction(
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 699, in __init__
    self.src = self.src[re.search(r"^def\s+\w+\s*\(", self.src, re.MULTILINE).start():]
AttributeError: 'NoneType' object has no attribute 'start'

How to evaluate?

We tried to pretrain a 421M Samba, but after pretraining we found that the evaluation script has not been open-sourced.

Mamba2

First, thanks for this great contribution and the well written paper!

As Mamba2 was also released very recently, do you have any thoughts on the potential integration or impact of Mamba2 on the Samba architecture?

Would be much appreciated.

Typo on paper?

Hi, I was just wondering if there is a typo in the arXiv preprint https://arxiv.org/pdf/2406.07522?

The MMLU result for the Mamba-2.8B model seems a bit low (26, vs. 45.28 for the Mamba-1.8B model)?

Is it a typo, or is there some other reason why the Mamba-1.8B model outperforms its larger cousin on MMLU?
(Or am I misreading something?)

Please advise.
Thank you!

Support for Transformers library

Hi! Thank you for the great work on Samba! The hybrid model is very interesting.

Since Mamba-1 is now supported in the latest version of transformers, is there any plan for Samba to support the transformers library as well? This could benefit many researchers in the community.

Supporting transformers would allow for easier integration with existing workflows and tools, potentially increasing adoption and enabling more researchers to experiment with Samba.

inference code

I'm currently running training. Is there any code to run inference on the model after training? If not, when do you plan to upload the inference code? And is there any reference or URL to consult for inference?

Inference Code

Amazing work, team! Thank you sincerely for sharing.

I have trained a toy model but have completely failed at creating an inference script. Sharing one would be sincerely appreciated!
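
For reference, the rough shape of what I am after is something like the sketch below. The config name and checkpoint path are placeholders for my own run, and it assumes the lit_gpt GPT model used for training returns next-token logits from a plain forward pass:

import torch
from lit_gpt.config import Config
from lit_gpt.model import GPT

# Placeholder config name and checkpoint path; adjust to your own run.
config = Config.from_name("Samba_421M")
model = GPT(config)
state = torch.load("out/samba_421M/ckpt.pth", map_location="cpu")
model.load_state_dict(state.get("model", state), strict=False)
model.eval()

@torch.no_grad()
def greedy_decode(model, idx, max_new_tokens=64):
    # idx: LongTensor of shape (1, T) holding prompt token ids.
    for _ in range(max_new_tokens):
        logits = model(idx)                                # (1, T, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)
    return idx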

weight access

Great job! Will you be open-sourcing the model weights?

Too late for a name change?

This is seriously confusing with Microsoft SMB/Samba.

Samba is the standard Windows interoperability suite of programs for Linux and Unix. Samba is Free Software licensed under the GNU General Public License and the Samba project is a member of the Software Freedom Conservancy. Since 1992, Samba has provided secure, stable and fast file and print services for all clients using the SMB/CIFS protocol, such as all versions of DOS and Windows, OS/2, Linux and many others. Samba is an important component to seamlessly integrate Linux/Unix Servers and Desktops into Active Directory environments. It can function both as a domain controller or as a regular domain member.

https://github.com/samba-team/samba

Models

Hi, thanks for releasing Samba! Are there any plans to release the pretrained models? Thanks!

Vocab size mismatch between config.py and the paper

Hi,
Thanks for this very interesting paper !
Comparing Table 9 in the paper with the config.py file, it looks like there is a mismatch in the vocabulary size.
Which one did you actually use?

Best Regards,

LitGPT

Congrats on this research milestone 🙌! And it's nice to see that our LitGPT library has been useful for this project. However, note that LitGPT is an open-source project, and the Apache 2.0 open-source license requires including the original license and copyright notice. Right now, the files in Samba's lit_gpt subfolder only say

# Copyright (c) Microsoft Corporation ...

which needs to be updated to also include

Copyright Lightning AI. Licensed under the Apache License 2.0,
see LICENSE file at https://github.com/Lightning-AI/litgpt/blob/main/LICENSE

Could you please correct this mistake as soon as possible and include the correct LitGPT license reference?
