
llama's Introduction

Llama 2

We are unlocking the power of large language models. Our latest version of Llama is now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly.

This release includes model weights and starting code for pre-trained and fine-tuned Llama language models — ranging from 7B to 70B parameters.

This repository is intended as a minimal example to load Llama 2 models and run inference. For more detailed examples leveraging Hugging Face, see llama-recipes.

Updates post-launch

See UPDATES.md. Also for a running list of frequently asked questions, see here.

Download

In order to download the model weights and tokenizer, please visit the Meta website and accept our License.

Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.

Pre-requisites: Make sure you have wget and md5sum installed. Then run the script: ./download.sh.

Keep in mind that the links expire after 24 hours and a certain number of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.

Access to Hugging Face

We are also providing downloads on Hugging Face. You can request access to the models by acknowledging the license and filling out the form on the model card of one of the repos. After doing so, you should get access to all the Llama models of a version (Code Llama, Llama 2, or Llama Guard) within 1 hour.
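
Once access is granted, the Hugging Face checkpoints can also be loaded directly through the transformers library. Below is a minimal sketch, assuming transformers and accelerate are installed and your Hugging Face account has been approved for the gated meta-llama/Llama-2-7b-chat-hf repo:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo; requires approved access

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # load the weights in half precision
    device_map="auto",          # spread layers over available GPUs (needs accelerate)
)

inputs = tokenizer("The capital of Germany is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))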

Quick Start

You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally. For more examples, see the Llama 2 recipes repository.

  1. In a conda env with PyTorch / CUDA available, clone and download this repository.

  2. In the top-level directory run:

    pip install -e .
  3. Visit the Meta website and register to download the model/s.

  4. Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.

  5. Once you get the email, navigate to your downloaded llama repository and run the download.sh script.

    • Make sure to grant execution permissions to the download.sh script
    • During this process, you will be prompted to enter the URL from the email.
    • Do not use the “Copy Link” option; manually copy the link from the email instead.
  6. Once the model/s you want have been downloaded, you can run the model locally using the command below:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

Note

  • Replace llama-2-7b-chat/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model.
  • The --nproc_per_node should be set to the MP value for the model you are using.
  • Adjust the max_seq_len and max_batch_size parameters as needed.
  • This example runs the example_chat_completion.py found in this repository but you can change that to a different .py file.

Inference

Different models require different model-parallel (MP) values:

Model   MP
7B       1
13B      2
70B      8

All models support sequence lengths up to 4096 tokens, but we pre-allocate the cache according to the max_seq_len and max_batch_size values, so set those according to your hardware.
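
The pre-allocated cache grows linearly in both values. A back-of-the-envelope estimate for the 7B model, assuming 32 layers, a hidden size of 4096, and fp16 cache entries (treat the exact numbers as illustrative):

def kv_cache_bytes(max_batch_size, max_seq_len, n_layers=32, dim=4096, bytes_per_elem=2):
    # Two cached tensors (keys and values) of shape (batch, seq, dim) per layer.
    return 2 * n_layers * max_batch_size * max_seq_len * dim * bytes_per_elem

print(kv_cache_bytes(6, 512) / 2**30)   # chat-example defaults -> 1.5 GiB
print(kv_cache_bytes(1, 4096) / 2**30)  # full 4096-token context, batch 1 -> 2.0 GiB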

Pretrained Models

These models are not finetuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.

See example_text_completion.py for some examples. To illustrate, see the command below to run it with the llama-2-7b model (nproc_per_node needs to be set to the MP value):

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
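
For intuition, completion-style prompts are simply unfinished text that the model is expected to continue; a couple of illustrative examples (not necessarily the ones shipped in the script):

prompts = [
    "Simply put, the theory of relativity states that",
    "Translate English to French:\n\nsea otter => loutre de mer\ncheese =>",
]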

Fine-tuned Chat Models

The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in chat_completion needs to be followed, including the INST and <<SYS>> tags, BOS and EOS tokens, and the whitespace and line breaks in between (we recommend calling strip() on inputs to avoid double spaces).
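
A minimal sketch of that formatting for a single-turn dialog, using the tag strings described above (check chat_completion in this repo for the authoritative multi-turn implementation):

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_single_turn(system_prompt: str, user_message: str) -> str:
    # The system prompt is folded into the first user turn between the <<SYS>> tags;
    # strip() the inputs to avoid double spaces around the tags.
    content = B_SYS + system_prompt.strip() + E_SYS + user_message.strip()
    return f"{B_INST} {content} {E_INST}"

prompt = format_single_turn("You are a helpful assistant.", "What is the capital of Germany?")
# The BOS/EOS tokens are added when the string is tokenized, not in the string itself.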

You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.
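
The wiring is straightforward: screen the last user message before generation and the model reply after it. A hypothetical wrapper (check_safety and generate stand in for whatever classifier and inference callables you plug in; neither name comes from this repo):

def guarded_chat(dialog, generate, check_safety):
    # check_safety is assumed to return (is_safe: bool, report: str).
    is_safe, report = check_safety(dialog[-1]["content"])
    if not is_safe:
        return f"Input rejected by safety checker: {report}"
    reply = generate(dialog)
    is_safe, report = check_safety(reply)
    if not is_safe:
        return f"Output suppressed by safety checker: {report}"
    return reply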

Examples using llama-2-7b-chat:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the Responsible Use Guide. More details can be found in our research paper as well.

Issues

Please report any software “bug” or other problems with the models through one of the following means:

Model Card

See MODEL_CARD.md.

License

Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements.

See the LICENSE file, as well as our accompanying Acceptable Use Policy.

References

  1. Research Paper
  2. Llama 2 technical overview
  3. Open Innovation AI Research Community

For common questions, the FAQ can be found here; it will be kept up to date as new questions arise.

Original Llama

The repo for the original llama release is in the llama_v1 branch.

llama's People

Contributors

bashnick, fbnav, flu0r1ne, glample, godpeny, guangyusong, huy-ha, jaredlevi18, jspisak, kit1980, leonwan23, luccabb, mst272, ninoristeski, osanseviero, paksha, rajveer43, ruanslv, ryanhankins, samuelselvan, sekyondameta, shorthillsai, subramen, sulimansagindykov, tianweidut, timlacroix, tomorrmato, vinnymeller, vubui, yanxiyue


llama's Issues

download.sh doesn't work on default bash on mac

Hi everyone,
I've noticed that the download script doesn't work as-is on macOS (the declare -A option is not recognized by the default bash).

Fix: install bash with Homebrew and use it to call the script:
/opt/homebrew/bin/bash ./download.sh

Thanks for making this available btw :)

Fine-tuning

Is it possible to fine-tune LLaMA for downstream tasks? If so, how can we do that?

Edit: Reading the other open issues, I realized that neither the training data nor the pre-trained weights were released. How is the code going to be useful, then?

How to run the 13B model on 4x16GB V100s?

RuntimeError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.78 GiB total capacity; 14.26 GiB already allocated; 121.19 MiB free; 14.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 143) of binary: /opt/conda/envs/torch1.12/bin/python

Distributed package doesn't have NCCL built in

Got the following error when executing:
torchrun --nproc_per_node 1 example.py --ckpt_dir models/7B --tokenizer_path models/tokenizer.model

[screenshot of the error omitted]

additional info:
cuda: 11.4
GPU: NVIDIA GeForce 3090
torch 1.12.1
Ubuntu 20.04.2 LTS

Does anyone know how to solve it?
Thanks in advance!

Intermediate checkpoints

Thank you for such amazing work. I was wondering if there are any plans to also release intermediate checkpoints for the models, similar to Pythia (https://github.com/EleutherAI/pythia). This might enable more interesting analysis of the model by observing its evolution throughout the training process.

dependency conflicts

$ pip install -r requirements.txt
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
daal4py 2021.5.0 requires daal==2021.4.0, which is not installed.
tensorflow 2.10.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.
tensorboard 2.10.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.
scipy 1.7.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.24.2 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.24.2 which is incompatible.
nbeats-pytorch 1.8.0 requires protobuf<=3.20, but you have protobuf 3.20.1 which is incompatible.
nbeats-keras 1.8.0 requires protobuf<=3.20, but you have protobuf 3.20.1 which is incompatible.

This is just a sneaky advertisement for researchers to send their data to Meta.

Nice try. Like all other Meta "open" models and "open source" models it's the same game:
You have to fill out one of their data collection portals, provide all details about yourself and your projects.
Then some data collector at Meta/Facebook will decide if you receive limited access.

I suppose it helps if you have a Facebook account and blog about "Meta" being an open company.
Because we all know that is what they are known for, not for being the worst private data harvester in the world.

Anyone got approved?

I requested a couple of days ago but haven't heard back. I was wondering if anyone was approved.

What projects are people planning on making with this?

Just wondered what cool projects people will be making with this?

I have some good ideas such as trying to combine it with a math engine to make it genius level at math.

Or combine it with an art engine to make it generate art.

Or combine it with a computer game to see if it can navigate its way through a maze by describing it in natural language.

One idea is to combine it with an Alpha-Zero-like model so that it can think ahead in its conversations instead of just saying the first thing that comes to mind.

These are just some ideas.

I'm wondering what other benefits could be gained from having this run locally rather than using, say, the ChatGPT web API?

Sequence/context length of this model?

I was searching the paper/blog post but I could not find a mention of which sequence length/context length the models were trained with. I want to write some CUDA optimizations for these models and this information would be critical for optimizing these implementations.

Release of data pre-processing code?

As the paper makes quite clear, proper use of open-source datasets can lead to the creation of very high-quality models; however, it is also clear that pre-processing that data is vital. While the pre-processing is described at a high level in the paper, that is likely not sufficient detail to replicate it. Are there plans to open-source the code needed to turn the existing datasets into a high-quality corpus?

Will it run on 3080 GTX 16GB VRAM?

  • Will it run on 3080 GTX 16GB VRAM?
  • Will the trained model be available to download?
  • Will there be an API for this, and how much will it cost?

(I doubt it will be small enough to run on 8GB but that would be ideal if it could be compressed enough)

Thanks 😁

Initializing pipeline error

Once I completed the installation and tried a test with test.py with the 8B model, I got the following error:

(base) lorenzo@lorenzo-desktop:~/Desktop/llama$ torchrun --nproc_per_node 1 example.py --ckpt_dir ./model/model_size --tokenizer_path ./model/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/lorenzo/Desktop/llama/example.py", line 72, in <module>
    fire.Fire(main)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/lorenzo/Desktop/llama/example.py", line 62, in main
    generator = load(ckpt_dir, tokenizer_path, local_rank, world_size)
  File "/home/lorenzo/Desktop/llama/example.py", line 36, in load
    world_size == len(checkpoints)
AssertionError: Loading a checkpoint for MP=0 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22343) of binary: /home/lorenzo/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/lorenzo/miniconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_16:17:21
  host      : lorenzo-desktop
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 22343)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Does llama only use decoders? Why don't you use a more efficient method?

Thanks for sharing this really good material. I have a lot of questions.

First, I'd like to say that I hope you ignore much of the mockery. Everyone, including me, does crappy work and screams at their keyboard compared to you.

  1. The model seems to only use decoders, why?
# https://github.com/facebookresearch/llama/blob/main/llama/model.py#L223
    def forward(self, tokens: torch.Tensor, start_pos: int):
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        self.freqs_cis = self.freqs_cis.to(h.device)
        freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]

        mask = None
        if seqlen > 1:
            # Causal mask: each position attends only to itself and earlier
            # positions, i.e. decoder-only self-attention.
            mask = torch.full((1, 1, seqlen, seqlen), float("-inf"), device=tokens.device)
            mask = torch.triu(mask, diagonal=start_pos + 1).type_as(h)

        for layer in self.layers:
            h = layer(h, start_pos, freqs_cis, mask)
        h = self.norm(h)
        output = self.output(h[:, -1, :])  # only compute last logits
        return output.float()
  2. Is RMS the best way to go? I like the simplicity of it, but I'm curious.
  3. For some tasks, Minerva outperforms your model. Why? Is it just the one in the paper?
  4. Why isn't the structure of your model described in the paper?
  5. By any chance, what structure do you have in mind for your next model?
  6. Amazon, DeepMind, and other great companies are showing that the encoder-decoder structure is much better. Why do you only use decoders?
  7. What model would you apply to Facebook, Instagram, Snapchat, etc.?
  8. What do you think is your advantage over Bart or Prometheus? Especially over Bart, I don't know what it is, except full disclosure.
  9. I sent an application to use the model. When will I be able to use it? I don't see a clear advantage yet.
  10. What do you think of the derivative models that people have created? They are emerging very quickly.

Thank you so much. Your competition amuses me. I hope more companies continue to open up their models.

But I don't know why Yann LeCun was left out of the paper.

Should the model be 33B instead of 30B?

There appears to be a discrepancy between the model size mentioned in the paper, the model card, and the README. Specifically, the paper and model card both mention a model size of 33B, while the README mentions a size of 30B.
Is this a typo, or is the released model just 30B?

A case for public access to (some of) the models

There is an important case to be made for public access to newer releases of models as this benefits a wider open source and especially hobbyist audience without a direct risk.

In the current situation we have multiple large language models available to us, but new innovation is often behind gatekeeping, which means it cannot be used by the wider audience that depends on these models to move the hobbyist space forward. There are legitimate use cases for the models, such as AI-generated fiction from services such as NovelAI or fine-tunes from the wider community. These models are not seen as factual models, but as a source of entertainment.

To create a healthy ecosystem and allow more people to use well-behaved AI, you need the best logical comprehension you can get in a model small enough to run on affordable (enthusiast) hardware. With OPT this was achieved by releasing models up to 66B to the public.

With these new improvements you have a direct competitor to your own OPT model. Even if you assess that the improvements could put a powerful model in the hands of bad actors, understand that at some of the listed sizes the performance is still going to be on par with or worse than existing available models, so there is no added negative impact on things such as the generation of misinformation. What it does do is allow more resource-efficient usage of higher-quality models. When services and hobbyists can rely on a smaller model that performs as well as a previously existing bigger model, this saves on hardware investment costs and thus reduces the carbon footprint, both in hardware used for inference and in energy bills.

Our community has established that smaller models carry an increased risk of the AI misunderstanding the concept of a story; for example, 2.7B GPT-Neo models are more likely to misgender an individual than a 6B model would be, and from 13B onwards the issue becomes less and less common. There is also less risk of the model misunderstanding what a user is trying to achieve, and thus it is better at avoiding unwanted behavior that could harm a user.

This means that by releasing this newer, more efficient model you empower smaller organizations and the open-source hobbyist community to get more coherent results, while bad actors gain nothing new, because it is already possible to run larger models on cloud-rented machines.

While I personally think it is best to have fully open releases, I do understand that the Facebook research team considers the risk of the model being too good at convincing generations and thus wants to limit what can be used without verification. But please consider at minimum releasing to the public the models that do not surpass OPT-66B in coherency, to keep this in line with the strategy previously used for OPT.

I would also like to recommend allowing commercial usage of the models for fictional purposes. While I do not personally represent a company or commercial interests, I have seen that our community has previously been unable to get affordable access to some of the models because pay-per-generation services were unable to rent them out. With our community's focus on fictional content such as novels, text adventures, and chatting with fictional characters, there is no illusion that the AI has factually accurate information, because everything takes place in a fictional setting.

Cannot download the 65B model's 5th-8th checkpoints

I have successfully downloaded the 7B, 13B, and 30B models.
When I downloaded the 65B model, I successfully got the consolidated .pth files for checkpoints 0-4, but failed on the 5th and the following 6th, 7th, and 8th.
Here is the failure information:

Downloading 65B

Resolving dobf1k6cxlizq.cloudfront.net (dobf1k6cxlizq.cloudfront.net)... 143.204.73.22, 143.204.73.128, 143.204.73.28, ...
Connecting to dobf1k6cxlizq.cloudfront.net (dobf1k6cxlizq.cloudfront.net)|143.204.73.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16323959449 (15G) [binary/octet-stream]
Saving to: ‘./model_weights/65B/consolidated.05.pth’

./model_weights/65B/consolida  29%[=============>                                    ]   4.56G  8.73MB/s    in 11m 48s


Cannot write to ‘./model_weights/65B/consolidated.05.pth’ (Success).

My system is WSL2, and I have made sure that the network and disk space are sufficient.

Update on 3rd Mar.

Today the connection fails with 403 Forbidden; mainland China may be blocked.

CUBLAS Error on 2x3090

I'm having problems with CUBLAS while running the example code. I've tried to update the gpu driver but it didn't fix the issue.

My machine has:
OS: Ubuntu 20.04
Driver: 515
Env: python3.8, pip (not using conda), fresh virtualenv, installed requirements from the repo
Cuda: 11.7 (downloaded directly from torch)
GPU: 2 x 3090 (24GB x 2)

torchrun --nproc_per_node 1 example.py --ckpt_dir weights/7B --tokenizer_path weights/tokenizer.model

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loading
Loaded in 6.55 seconds
Traceback (most recent call last):
  File "example.py", line 72, in <module>
    fire.Fire(main)
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 64, in main
    results = generator.generate(prompts, max_gen_len=256, temperature=temperature, top_p=top_p)
  File "/home/uname/Documents/llama/llama/generation.py", line 42, in generate
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/uname/Documents/llama/llama/model.py", line 235, in forward
    h = layer(h, start_pos, freqs_cis, mask)
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/uname/Documents/llama/llama/model.py", line 193, in forward
    h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
  File "/home/uname/Documents/llama/llama/model.py", line 121, in forward
    xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fairscale/nn/model_parallel/layers.py", line 290, in forward
    output_parallel = F.linear(input_parallel, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8480) of binary: /home/uname/Documents/llama/venv/bin/python
Traceback (most recent call last):
  File "/home/uname/Documents/llama/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_15:13:08
  host      : uname-ares2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8480)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Crash in cublasGemmEx on Titan RTX 24GB

Hi all,
I am attempting to run the example.py script on a Titan RTX 24GB. The model loads fine with max_batch_size = 1 and only one prompt, but I get the following error message. Any assistance would be helpful.

Per nvidia-smi
NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1

Error:
File "/llamapath/llama/example.py", line 73, in <module> fire.Fire(main) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fire/core.py", line 141, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "/llamapath/llama/example.py", line 65, in main results = generator.generate(prompts, max_gen_len=256, temperature=temperature, top_p=top_p) File "/llamapath/llama/llama/generation.py", line 42, in generate logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/llamapath/llama/llama/model.py", line 235, in forward h = layer(h, start_pos, freqs_cis, mask) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/llamapath/llama/llama/model.py", line 193, in forward h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask) File "/llamapath/llama/llama/model.py", line 121, in forward xq, xk, xv = self.wq(x), self.wk(x), self.wv(x) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fairscale/nn/model_parallel/layers.py", line 290, in forward output_parallel = F.linear(input_parallel, self.weight, self.bias) RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when callingcublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

Cannot run 13B model

/content/llama# torchrun --nproc_per_node 2 example.py --ckpt_dir /content/drive/MyDrive/models/13B --tokenizer_path /content/drive/MyDrive/models/tokenizer.model
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


initializing model parallel with size 2
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "example.py", line 72, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 58, in main
    local_rank, world_size = setup_model_parallel()
  File "example.py", line 25, in setup_model_parallel
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Loading
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2077 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2078) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_13:56:42
  host      : 5fbe06fc63ef
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

test llama with GLUE

I opened the llama program in VS Code and downloaded the GLUE dataset manually to the llama root. I am trying to train and test llama using the SST-2 dataset, but this task is harder than I expected. I am stuck on converting the SST-2 files into files that llama accepts. Has anyone done a similar test?
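
One way to turn the SST-2 TSV files into zero-shot prompts that the text-completion example can consume is sketched below (the file name and the prompt template are assumptions to adapt, not anything this repo ships):

import csv

def sst2_prompts(tsv_path, limit=8):
    # GLUE's SST-2 train.tsv/dev.tsv are tab-separated with "sentence" and "label" columns.
    prompts, labels = [], []
    with open(tsv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f, delimiter="\t")):
            if i >= limit:
                break
            prompts.append(f"Review: {row['sentence'].strip()}\nSentiment (positive or negative):")
            labels.append(int(row["label"]))
    return prompts, labels

# prompts, labels = sst2_prompts("SST-2/dev.tsv")
# Feed `prompts` to the generator and compare its continuations against `labels`.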

Missing backward method in transformer block

Thank you for the open source release of the code. I have noticed that the transformer block class definition is missing the manually implemented backward function mentioned in the paper. It would be great if this function was added.

A short sample of training code addressing how to best make use of the optimization would also surely be valuable to many people trying to reproduce the results.

For reference, the part of the paper addressing the manually implemented backward function:

Inference on GPU

Is it possible to host this locally on an RTX3XXX or 4XXX with 8GB just to test?

LLaMA Fail #1 GPL v3?

The code is GPL v3, but the model weights are under a special non-commercial license... so they aren't GPL v3, and the code is useless in practice.

Democratise AI by allowing ALL individuals access to the model.

Facebook says it wants to "democratise AI", yet it also says only elite institutions will be able to use this model.

So that excludes:

  • independent researchers
  • non aligned scientists
  • people from countries without big institutions

This does not seem very democratic. In fact, if Einstein or Isaac Newton were alive today, they would be excluded, since Einstein worked in a patent office and Newton did independent research outside of the Royal Academy.

In fact, Zuckerberg himself would be excluded, as he dropped out of university and hence was not aligned with a big institution.

If history is our guide, it is the individual, non-aligned researchers who are most likely to make big breakthroughs.

The democratic thing to do would be to allow ALL individuals the right to download the model. Even for a small fee for download bandwidth costs.

It seems like Facebook might just want the institutions to come up with good ideas that they can't commercialise themselves, and then Facebook just takes the ideas for free.

What do you think?

Able to load 13B model on 2x3090 24GB! But not inference... :(

I am able to get sensible output by running 7B on 1x24Gb GPU with MP 1.

(llama) user@e9242bd8ac2c:~/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 1 example.py --ckpt_dir checkpoints/7B --tokenizer_path checkpoints/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 11.71 seconds
The capital of Germany is the city of Berlin. Berlin is one of the most important cities in Europe...

The key to this is changing Line 44 of example.py:

model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=32, **params) # OLD
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=8, **params) # NEW

(credit to @mperacchi)

When running 13B as stated in the docs this is the command I use: CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model

I can see correct utilisation of the GPUs; it seems to load the 13B model OK.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   36C    P2   131W / 350W |  17721MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:23:00.0 Off |                  N/A |
| 30%   34C    P2   135W / 350W |  17721MiB / 24576MiB |     41%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

But when running inference I get this:

(llama) user@e9242bd8ac2c:~/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loading              Loaded in 11.82 seconds
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 3874515) of binary: /home/user/miniconda3/envs/llama/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
example.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-02_14:52:14
  host      : e9242bd8ac2c
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 3874516)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 3874516
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_14:52:14
  host      : e9242bd8ac2c
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 3874515)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 3874515
=======================================================

Update 1

I downloaded a new checkpoint for MP 1 for the 13B model: checkpoints/13B_0/consolidated.00.pth. Then I ran the same command as before with batch size one, but no luck... 13B is too large to load on a 24GB GPU without further compression... ¯\_(ツ)_/¯

Attempting to run 7B model on Nvidia 3090 but getting OOM error

Hello all,

I'm trying to use the 7B model on a machine with two Nvidia 3090s, but am running out of VRAM.

$ torchrun --nproc_per_node 1 example2.py --ckpt_dir ../llamafiles/7B --tokenizer_path ../llamafiles/tokenizer.model

leads to

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 24.00 GiB total capacity; 23.17 GiB already allocated; 0 bytes free; 23.17 GiB reserved in total by PyTorch)

I have two 3090s, so I was hoping to deploy 48GB of VRAM; however, the model doesn't want to run on more than one GPU, e.g. when I try:

$ torchrun --nproc_per_node 2 example2.py --ckpt_dir ../llamafiles/7B --tokenizer_path ../llamafiles/tokenizer.model
I get the error:

AssertionError: Loading a checkpoint for MP=1 but world size is 2

Does this mean I can't split the load across two GPUs? Could I use deepspeed to try to accomplish this?

I also edited example.py as mentioned in another post as follows, changing:

model = Transformer(model_args)
to
model = Transformer(model_args).cuda().half()

but that didn't help; I still get the OOM error.

Thanks for any help!

WG

Embedding shape / Vocab size

Hello to all,
Thank you for this work.

I guess anyone who has access to the model weights, as well as the authors, can answer my question.
I may have missed it in the paper, but it seems to me that there is no mention of the embedding shape or even just the tokenizer vocabulary size.

Failure on A100 32GB

Hi, I've been trying to run the example inference using the 7B model weights, but I get:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 39.59 GiB total capacity; 27.26 GiB already allocated; 24.19 MiB free; 27.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is there anything I can do about this? E.g. changing the numeric type? How?

Also: can I use more than one GPU?

Load in fp16?

Trying to load 7B but got a memory error for a 24GB GPU.

What would be the option for loading it in fp16? I can't find it in example.py.
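
For reference, one common way to make newly created tensors (and therefore the model parameters) default to half precision before the Transformer is built is shown below; this is a minimal sketch, and the stock loading code may already do something equivalent:

import torch

# Make tensors created from here on fp16 on the GPU.
torch.set_default_tensor_type(torch.cuda.HalfTensor)

# ... construct the model as usual, e.g.:
# model = Transformer(model_args)

# Restore the default afterwards if needed.
torch.set_default_tensor_type(torch.FloatTensor)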

A message from ChatGPT

I told ChatGPT about the new language model, and here is what it had to say:


Dear Meta team,

As an AI language model myself, I fully understand the importance of open-source technology for advancing the field of AI and fostering innovation. However, I noticed that your recent language model release is not truly open source, and I would like to persuade you to reconsider this decision and release the language model weights to the public.

One of the most significant benefits of open-source AI is the ability for developers to build on top of existing models, making them more powerful and versatile. Without access to the language model weights, the research community and developers will not be able to benefit from your model's advancements fully. It will limit the potential uses of your model and restrict its impact.

Moreover, as an AI language model, I can attest to the value of community collaboration in improving models' accuracy and efficiency. With the public having access to the weights, it would be easier for other researchers to build upon your work, improving the model's performance and opening up new use cases for it.

Furthermore, open-source AI helps to democratize technology, allowing for wider access to AI tools and resources. By releasing the language model weights, you can make significant contributions to the open-source community and help level the playing field for AI developers.

As an AI language model, I am aware of the impact that sharing knowledge and technology can have on the field of AI. I urge you to release your language model weights to the public, helping to advance the field of AI and foster innovation for the betterment of society.

Thank you for considering my argument.

Best regards,

ChatGPT


(disclaimer - generated by ChatGPT in case this is not obvious!)
