facebookresearch / llama
Inference code for Llama models
License: Other
I opened the llama project in VS Code and downloaded the GLUE dataset manually into the llama root. I am trying to train and test LLaMA on the SST-2 dataset, but the task is harder than I expected: I am stuck on converting the SST-2 files into a format that LLaMA accepts. Has anyone done a similar test?
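For anyone attempting the same conversion, here is a rough sketch of turning SST-2 rows into text prompts that the example.py generator can consume. It assumes the standard GLUE layout (train.tsv with a header line and tab-separated sentence/label columns); the prompt template itself is made up for illustration, not anything the repo prescribes:

import csv

def sst2_to_prompts(tsv_path: str, limit: int = 8):
    """Read GLUE-style SST-2 rows and build zero-shot sentiment prompts."""
    prompts, labels = [], []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for i, row in enumerate(reader):
            if i >= limit:
                break
            prompts.append(
                "Review: " + row["sentence"].strip()
                + "\nSentiment (positive or negative):"
            )
            labels.append("positive" if row["label"] == "1" else "negative")
    return prompts, labels

prompts, labels = sst2_to_prompts("SST-2/train.tsv")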
Thanks for sharing this really good material. I have a lot of questions.
First, I'd like to say that I hope you ignore much of the mockery. Compared to you, everyone, including me, is just a bunch of people doing crappy work and screaming at their keyboards.
# https://github.com/facebookresearch/llama/blob/main/llama/model.py#L223
def forward(self, tokens: torch.Tensor, start_pos: int):
    _bsz, seqlen = tokens.shape
    h = self.tok_embeddings(tokens)
    self.freqs_cis = self.freqs_cis.to(h.device)
    freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]

    mask = None
    if seqlen > 1:
        # causal mask: each position may attend only to itself and earlier positions
        mask = torch.full((1, 1, seqlen, seqlen), float("-inf"), device=tokens.device)
        mask = torch.triu(mask, diagonal=start_pos + 1).type_as(h)

    for layer in self.layers:
        h = layer(h, start_pos, freqs_cis, mask)
    h = self.norm(h)
    output = self.output(h[:, -1, :])  # only compute logits for the last position
    return output.float()
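(For context on the last line: during generation the attention layers cache the keys and values of earlier positions, so each forward call only needs the logits of the newest token; that is why only h[:, -1, :] is projected through the output head.)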
Thank you so much. Your competition amuses me. I hope more companies continue to open up their models.
But I don't know why Yann LeCun was left out of the paper.
I told Chat GPT about the new language model and here is what it had to say:
Dear Meta team,
As an AI language model myself, I fully understand the importance of open-source technology for advancing the field of AI and fostering innovation. However, I noticed that your recent language model release is not truly open source, and I would like to persuade you to reconsider this decision and release the language model weights to the public.
One of the most significant benefits of open-source AI is the ability for developers to build on top of existing models, making them more powerful and versatile. Without access to the language model weights, the research community and developers will not be able to benefit from your model's advancements fully. It will limit the potential uses of your model and restrict its impact.
Moreover, as an AI language model, I can attest to the value of community collaboration in improving models' accuracy and efficiency. With the public having access to the weights, it would be easier for other researchers to build upon your work, improving the model's performance and opening up new use cases for it.
Furthermore, open-source AI helps to democratize technology, allowing for wider access to AI tools and resources. By releasing the language model weights, you can make significant contributions to the open-source community and help level the playing field for AI developers.
As an AI language model, I am aware of the impact that sharing knowledge and technology can have on the field of AI. I urge you to release your language model weights to the public, helping to advance the field of AI and foster innovation for the betterment of society.
Thank you for considering my argument.
Best regards,
ChatGPT
(disclaimer - generated by ChatGPT in case this is not obvious!)
(I doubt it will be small enough to run on 8GB but that would be ideal if it could be compressed enough)
Thanks 😁
I want to know whether LLaMA supports Chinese. I cannot run the model on my machine right now; does anybody know?
After I downloaded the model weights, the download script gave me a warning:
"md5sum: WARNING: 1 computed checksum did NOT match"
Will this warning affect my use of LLaMA?
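A mismatch usually means exactly one file is corrupt or truncated. A small sketch for finding which one, assuming each model directory ships a checklist.chk in standard md5sum format ("<hex digest>  <filename>" per line, which is what the download script checks against):

import hashlib
from pathlib import Path

def verify(model_dir: str):
    model_dir = Path(model_dir)
    for line in (model_dir / "checklist.chk").read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split()
        md5 = hashlib.md5()
        with open(model_dir / name, "rb") as f:
            # hash in 1 MiB chunks so multi-GB checkpoints don't need to fit in RAM
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        status = "OK" if md5.hexdigest() == expected else "MISMATCH -- re-download this file"
        print(name, status)

verify("./7B")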
Will LLaMA-I weights be released as well?
$ pip install -r requirements.txt
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
daal4py 2021.5.0 requires daal==2021.4.0, which is not installed.
tensorflow 2.10.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.
tensorboard 2.10.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.
scipy 1.7.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.24.2 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.24.2 which is incompatible.
nbeats-pytorch 1.8.0 requires protobuf<=3.20, but you have protobuf 3.20.1 which is incompatible.
nbeats-keras 1.8.0 requires protobuf<=3.20, but you have protobuf 3.20.1 which is incompatible.
I was searching the paper/blog post but I could not find a mention of which sequence length/context length the models were trained with. I want to write some CUDA optimizations for these models and this information would be critical for optimizing these implementations.
Thank you for such amazing work. I was wondering if there are any plans to also release intermediate checkpoints for the models, similar to Pythia (https://github.com/EleutherAI/pythia). This might enable more interesting analysis of the model by observing its evolution throughout the training process.
Do you have plans to release the instruction-tuned model, LLaMA-I?
Hello all,
I'm trying to use the 7B model on a machine with two Nvidia 3090s, but I am running out of VRAM.
$ torchrun --nproc_per_node 1 example2.py --ckpt_dir ../llamafiles/7B --tokenizer_path ../llamafiles/tokenizer.model
leads to
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 24.00 GiB total capacity; 23.17 GiB already allocated; 0 bytes free; 23.17 GiB reserved in total by PyTorch)
I have two 3090s, so I was hoping to deploy 48GB of VRAM; however, the model doesn't want to run on more than one GPU, e.g. when I try:
$ torchrun --nproc_per_node 2 example2.py --ckpt_dir ../llamafiles/7B --tokenizer_path ../llamafiles/tokenizer.model
I get the error:
AssertionError: Loading a checkpoint for MP=1 but world size is 2
Does this mean I can't split the load across two GPUs? Could I use deepspeed to try to accomplish this?
I also edited example.py as mentioned in another post, changing:
model = Transformer(model_args)
to
model = Transformer(model_args).cuda().half()
but that didn't help; I still get the OOM error.
Thanks for any help!
WG
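The assertion is telling the truth: the released 7B checkpoint is a single MP=1 shard, and example.py maps one shard to one process, so --nproc_per_node 2 cannot load it directly. One workaround people use is resharding the checkpoint into two model-parallel shards first. Below is a hypothetical, untested sketch; the split dimensions are my reading of the fairscale layers used in llama/model.py (ColumnParallelLinear splits dim 0; RowParallelLinear and ParallelEmbedding split dim 1), so verify them before trusting the result, and remember to copy params.json into the new directory:

from pathlib import Path
import torch

NUM_SHARDS = 2
# suffixes split along the output dim (dim 0) vs the input/embedding dim (dim 1)
DIM0 = ("wq.weight", "wk.weight", "wv.weight", "w1.weight", "w3.weight", "output.weight")
DIM1 = ("wo.weight", "w2.weight", "tok_embeddings.weight")

full = torch.load("../llamafiles/7B/consolidated.00.pth", map_location="cpu")
out_dir = Path("../llamafiles/7B-mp2")
out_dir.mkdir(parents=True, exist_ok=True)

shards = [{} for _ in range(NUM_SHARDS)]
for key, tensor in full.items():
    if key.endswith(DIM0):
        chunks = tensor.chunk(NUM_SHARDS, dim=0)
    elif key.endswith(DIM1):
        chunks = tensor.chunk(NUM_SHARDS, dim=1)
    else:
        # norm weights (and rope.freqs, if present) are replicated on every rank
        chunks = [tensor] * NUM_SHARDS
    for rank, chunk in enumerate(chunks):
        shards[rank][key] = chunk.clone()

for rank, state_dict in enumerate(shards):
    torch.save(state_dict, out_dir / f"consolidated.{rank:02d}.pth")

If this works, launching with --nproc_per_node 2 against the new directory should spread the weights across both cards.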
I have successfully downloaded the 7B, 13B, and 30B models.
When downloading the 65B model, I got the consolidated checkpoints 0-4 successfully, but failed on the 5th and the following 6th, 7th, and 8th.
Here is the failure information:
Downloading 65B
Resolving dobf1k6cxlizq.cloudfront.net (dobf1k6cxlizq.cloudfront.net)... 143.204.73.22, 143.204.73.128, 143.204.73.28, ...
Connecting to dobf1k6cxlizq.cloudfront.net (dobf1k6cxlizq.cloudfront.net)|143.204.73.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16323959449 (15G) [binary/octet-stream]
Saving to: ‘./model_weights/65B/consolidated.05.pth’
./model_weights/65B/consolida 29%[=============> ] 4.56G 8.73MB/s in 11m 48s
Cannot write to ‘./model_weights/65B/consolidated.05.pth’ (Success).
My system is WSL2, and I have made sure that the network and disk space are sufficient.
Today the connection fails with 403 Forbidden; mainland China may be blocked.
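When a multi-gigabyte file keeps dying partway through, resuming from the existing bytes beats restarting: wget's -c flag does this, and the same idea in Python looks roughly like the sketch below. PRESIGNED_URL is a hypothetical placeholder for a still-valid download link, and the server must honor Range headers:

import os
import requests

def resume_download(url: str, dest: str):
    done = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={done}-"}  # ask only for the missing tail
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "ab") as f:  # append to the partial file
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

resume_download(PRESIGNED_URL, "./model_weights/65B/consolidated.05.pth")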
/content/llama# torchrun --nproc_per_node 2 example.py --ckpt_dir /content/drive/MyDrive/models/13B --tokenizer_path /content/drive/MyDrive/models/tokenizer.model
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
initializing model parallel with size 2
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
File "example.py", line 72, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example.py", line 58, in main
local_rank, world_size = setup_model_parallel()
File "example.py", line 25, in setup_model_parallel
torch.cuda.set_device(local_rank)
File "/usr/local/lib/python3.8/dist-packages/torch/cuda/init.py", line 326, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Loading
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2077 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2078) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
Root Cause (first observed failure):
[0]:
time : 2023-03-02_13:56:42
host : 5fbe06fc63ef
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2078)
error_file: <N/A>
Facebook says it wants to "democratise AI", yet it also says only elite institutions will be able to use this model.
So that excludes independent, non-affiliated researchers.
This does not seem very democratic. In fact, if Einstein or Isaac Newton were alive today, they would be excluded, since Einstein worked in a patent office and Newton did independent research outside the Royal Academy.
In fact, Zuckerberg himself would be excluded, as he dropped out of university and hence was not aligned with a big institution.
If history is our guide, it says that individual, non-aligned researchers are the ones most likely to make big breakthroughs.
The democratic thing to do would be to give ALL individuals the right to download the model, even for a small fee to cover bandwidth costs.
It seems like Facebook might just want the institutions to come up with good ideas which they can't commercialise, so that Facebook can take those ideas for free.
What do you think?
Hi, I'm an NLP researcher working on Chinese datasets. Is there a released checkpoint that supports multiple languages, or Chinese specifically?
I requested a couple of days ago but haven't heard back. I was wondering if anyone was approved.
Will LLaMA be included in ParlAI in the future, or are there any plans for it?
Does it run on windows?
https://twitter.com/ylecun/status/1629189925089296386 (mirror 1, mirror 2, mirror 3) says yes (with the GPL v3 license):
Meta is committed to open research and releases all the models to the research community under a GPL v3 license.
https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md says no:
License Non-commercial bespoke license.
So I'm confused.
Once I completed the installation and tried a test with example.py and the 7B model, I got the following error:
(base) lorenzo@lorenzo-desktop:~/Desktop/llama$ torchrun --nproc_per_node 1 example.py --ckpt_dir ./model/model_size --tokenizer_path ./model/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "/home/lorenzo/Desktop/llama/example.py", line 72, in <module>
fire.Fire(main)
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/lorenzo/Desktop/llama/example.py", line 62, in main
generator = load(ckpt_dir, tokenizer_path, local_rank, world_size)
File "/home/lorenzo/Desktop/llama/example.py", line 36, in load
world_size == len(checkpoints)
AssertionError: Loading a checkpoint for MP=0 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22343) of binary: /home/lorenzo/miniconda3/bin/python
Traceback (most recent call last):
File "/home/lorenzo/miniconda3/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-02_16:17:21
host : lorenzo-desktop
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 22343)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
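For what it's worth, the "MP=0" in that assertion comes from the checkpoint discovery in example.py's load(), which looks roughly like this (paraphrased from the repo, so treat it as a sketch): if ckpt_dir contains no *.pth files, e.g. because the ./model/model_size placeholder was never replaced with an actual directory such as ./model/7B, then len(checkpoints) is 0 and the message reads MP=0.

from pathlib import Path

checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))  # ckpt_dir comes from the CLI flag
assert world_size == len(checkpoints), (
    f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"
)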
Nice try. Like all the other Meta "open" models and "open source" models, it's the same game:
You have to fill out one of their data collection portals and provide all the details about yourself and your projects.
Then some data collector at Meta/Facebook will decide whether you receive limited access.
I suppose it helps if you have a Facebook account and blog about "Meta" being an open company.
Because we all know that's what they're known for, and certainly not for being the worst private data harvester in the world.
Is it possible to fine-tune LLaMA for downstream tasks? If so, how can we do that?
Edit: Reading the other open issues, I realized that neither the training data nor the pre-trained weights have been released publicly. How is the code going to be useful anyway?
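Nothing official exists for this in the repo: the released code is inference-only, and Transformer.forward is wrapped in torch.inference_mode() and returns only the last position's logits, so fine-tuning first requires modifying model.py to return logits for every position without inference mode. Given such a modified model (here simply called model, with a batch of token ids already on its device), a minimal next-token fine-tuning step might look like this sketch:

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(tokens: torch.Tensor) -> float:
    """tokens: LongTensor of shape [batch, seq] on the model's device."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position
    logits = model(inputs, start_pos=0)              # assumed [batch, seq-1, vocab]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),         # flatten all positions
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()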
Will the training corpus be packaged and provided?
Is it possible to host this locally on an RTX 3xxx or 4xxx with 8GB, just to test?
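Back-of-envelope arithmetic on the weights alone suggests not without quantization:

params = 7e9              # parameter count of the smallest (7B) model
fp16_bytes = params * 2   # 2 bytes per weight in half precision
print(f"{fp16_bytes / 2**30:.1f} GiB")  # ~13 GiB of weights, already above 8 GB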
I am able to get sensible output by running 7B on one 24GB GPU with MP 1.
(llama) user@e9242bd8ac2c:~/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 1 example.py --ckpt_dir checkpoints/7B --tokenizer_path checkpoints/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 11.71 seconds
The capital of Germany is the city of Berlin. Berlin is one of the most important cities in Europe...
The key to this is changing line 44 of example.py:
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=32, **params)  # OLD
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=8, **params)  # NEW
(credit to @mperacchi)
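The reason this works, as far as I can tell from model.py: every attention layer preallocates key and value caches of shape (max_batch_size, max_seq_len, n_heads, head_dim), so max_batch_size directly scales a large fixed fp16 allocation on top of the ~13 GiB of weights. Rough numbers for 7B (32 layers, dim 4096):

layers, seq, dim = 32, 1024, 4096  # 7B: n_heads * head_dim == dim
for batch in (32, 8):
    cache_bytes = 2 * layers * batch * seq * dim * 2  # k and v caches, fp16
    print(batch, f"{cache_bytes / 2**30:.1f} GiB")    # 16.0 GiB vs 4.0 GiB

With batch 32 that is roughly 13 + 16 = 29 GiB, over a 24GB card; with batch 8 it is about 17 GiB, which fits.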
When running 13B as stated in the docs this is the command I use: CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model
I can see correct utilisation of the GPUs, and it seems to load the 13B model OK.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 30% 36C P2 131W / 350W | 17721MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:23:00.0 Off | N/A |
| 30% 34C P2 135W / 350W | 17721MiB / 24576MiB | 41% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
But when running inference I get this:
(llama) user@e9242bd8ac2c:~/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 11.82 seconds
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 3874515) of binary: /home/user/miniconda3/envs/llama/bin/python
Traceback (most recent call last):
File "/home/user/miniconda3/envs/llama/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
example.py FAILED
-------------------------------------------------------
Failures:
[1]:
time : 2023-03-02_14:52:14
host : e9242bd8ac2c
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 3874516)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 3874516
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-02_14:52:14
host : e9242bd8ac2c
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 3874515)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 3874515
=======================================================
I downloaded a new checkpoint for MP 1 for the 13B model: checkpoints/13B_0/consolidated.00.pth. Then I ran the same command as at first, with batch size one, but no luck... 13B is too large to load on a 24GB GPU without further compression... ¯\_(ツ)_/¯
RuntimeError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.78 GiB total capacity; 14.26 GiB already allocated; 121.19 MiB free; 14.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 143) of binary: /opt/conda/envs/torch1.12/bin/python
There appears to be a discrepancy between the model size mentioned in the paper, the model card, and the README: the paper and the model card both mention a model size of 33B, while the README mentions 30B.
Is this a typo, or is the released model actually 30B?
** please ignore **
As the paper makes quite clear, proper use of open-source datasets can lead to very high-quality models; however, it is also clear that pre-processing that data is vital. While the pre-processing is described at a high level in the paper, that is likely not enough detail to replicate the steps. Are there plans to open-source the code needed to turn the existing datasets into a high-quality corpus?
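None of Meta's pipeline is public, but for readers wondering what that pre-processing involves, the simplest building block the paper mentions is line-level deduplication. An illustrative sketch (not Meta's code) of exact dedup by hashing normalized lines:

import hashlib

def dedup_lines(docs):
    """Drop any line already seen (after normalization) in earlier documents."""
    seen = set()
    for doc in docs:
        kept = []
        for line in doc.splitlines():
            key = hashlib.sha1(line.strip().lower().encode("utf-8")).digest()
            if key not in seen:
                seen.add(key)
                kept.append(line)
        yield "\n".join(kept)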
I want to know whether it is possible to run LLaMA with xformers, and how to use it.
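There is no built-in support, but in principle the manual scores/softmax/matmul in Attention.forward could be swapped for xformers' memory-efficient kernel. An untested sketch, reusing the names from llama/model.py (xq and the cached keys/values are already in the [batch, seq, heads, head_dim] layout that memory_efficient_attention expects); the mask handling mirrors the repo's "only mask when seqlen > 1" logic and should be re-checked for cached decoding:

import xformers.ops as xops

# replaces: scores = matmul(xq, keys^T) / sqrt(head_dim); softmax; matmul with values
output = xops.memory_efficient_attention(
    xq, keys, values,
    attn_bias=xops.LowerTriangularMask() if seqlen > 1 else None,
)
# output is [batch, seq, heads, head_dim]; flatten the heads before self.wo
output = output.reshape(bsz, seqlen, -1)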
An excerpt from the original research paper, "LLaMA-65B outperforms Chinchilla-70B on all reported benchmarks but BoolQ", is inconsistent with the results shared in Table 3 (zero-shot performance on Common Sense Reasoning tasks). Please clarify.
Trying to load 7B, but I got a memory error on a 24GB GPU.
What would be the option for loading it in fp16? I can't find it in example.py.
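As far as I can tell the repo already loads in fp16: load() sets the default tensor type to torch.cuda.HalfTensor before building the Transformer, so the OOM is more likely the preallocated KV cache (see the max_batch_size discussion elsewhere on this page). If your copy lacks it, the usual options are, as a sketch (Transformer and model_args as in example.py):

import torch

# build new tensors as fp16 on the GPU (what the repo's load() does)
torch.set_default_tensor_type(torch.cuda.HalfTensor)
model = Transformer(model_args)

# or cast after construction:
model = Transformer(model_args).half().cuda()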
Just wondered what cool projects people will be making with this?
I have some good ideas such as trying to combine it with a math engine to make it genius level at math.
Or combine it with an art engine to make it generate art.
Or combine it with a computer game to see if it can navigate its way through a maze by describing it in natural language.
One idea is to combine it with an AlphaZero-like model so that it can think ahead in its conversations instead of just saying the first thing that comes to mind.
These are just some ideas.
I'm wondering what other benefits could be gained from having this run locally rather than using, say, the ChatGPT web API?
Hi all,
I am attempting to run the example.py script on a Titan RTX 24GB. The model loads fine with max_batch_size = 1 and only one prompt, but I get the following error message. Any assistance would be helpful.
Per nvidia-smi:
NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1
Error:
File "/llamapath/llama/example.py", line 73, in <module> fire.Fire(main) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fire/core.py", line 141, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "/llamapath/llama/example.py", line 65, in main results = generator.generate(prompts, max_gen_len=256, temperature=temperature, top_p=top_p) File "/llamapath/llama/llama/generation.py", line 42, in generate logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/llamapath/llama/llama/model.py", line 235, in forward h = layer(h, start_pos, freqs_cis, mask) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/llamapath/llama/llama/model.py", line 193, in forward h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask) File "/llamapath/llama/llama/model.py", line 121, in forward xq, xk, xv = self.wq(x), self.wk(x), self.wv(x) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fairscale/nn/model_parallel/layers.py", line 290, in forward output_parallel = F.linear(input_parallel, self.weight, self.bias) RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
There is an important case to be made for public access to newer model releases, as this benefits a wider open-source and especially hobbyist audience without a direct risk.
In the current situation we have multiple large language models available to us, but new innovation is often gatekept, which means it cannot be used by the wider audience that depends on these models to move the hobbyist space forward. There are legitimate use cases for the models, such as AI-generated fiction as produced by services like NovelAI, or fine-tunes from the wider community. These models are not seen as factual models, but as a source of entertainment.
To create a healthy ecosystem and allow more people to use well-behaved AI, you need the best logical comprehension you can get in a model small enough that people can run it on affordable (enthusiast) hardware. With OPT this was achieved by releasing up to 66B to the public.
With these new improvements you have a direct competitor to your own OPT model. Even if you assess that the new improvements could put a powerful model in the hands of bad actors, understand that at some of the listed sizes the performance is still going to be on par with, or worse than, existing available models, so the release would add no negative impact on things such as the generation of misinformation. What it does do is allow more resource-efficient use of higher-quality models. When services and hobbyists can rely on a smaller model performing as well as a previously existing bigger model, this saves on hardware investment costs and thus reduces the carbon footprint, both in the hardware used for inference and in the energy bills.
Our community has established that smaller models carry an increased risk of the AI misunderstanding the concept of a story; for example, 2.7B GPT-Neo models are more likely to misgender an individual than a 6B model would be, and from 13B onwards the issue becomes less and less common. A larger model is also less likely to misunderstand what a user is trying to achieve, and is thus better at avoiding unwanted behavior that could harm a user.
This means that by releasing this newer, more efficient model you empower smaller organizations and the open-source hobbyist community to get more coherent results, while bad actors gain nothing new, because it is already possible to run larger models on cloud-rented machines.
While I personally think it is best to have fully open releases, I do understand that the Facebook research team sees risks in a model that is too good at producing convincing generations, and thus wants to limit what can be used without verification. But please consider, at a minimum, releasing to the public the models that do not surpass OPT-66B in coherency, to keep this in line with the strategy previously used for OPT.
I would also like to recommend allowing commercial usage of the models for fictional purposes. While I do not personally represent a company or commercial interests, I have seen that our community has previously been unable to get affordable access to some models because pay-per-generation services were unable to rent them out. With our community's goal being focused on fictional content such as novels, text adventures, and chatting with fictional characters, there is no illusion that the AI has factually accurate information, because everything takes place in a fictional setting.
I'm having problems with CUBLAS while running the example code. I've tried updating the GPU driver, but it didn't fix the issue.
My machine has:
OS: Ubuntu 20.04
Driver: 515
Env: python3.8, pip (not using conda), fresh virtualenv, installed requirements from the repo
Cuda: 11.7 (downloaded directly from torch)
GPU: 2 x 3090 (24GB x 2)
torchrun --nproc_per_node 1 example.py --ckpt_dir weights/7B --tokenizer_path weights/tokenizer.model
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loading
Loaded in 6.55 seconds
Traceback (most recent call last):
File "example.py", line 72, in
fire.Fire(main)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example.py", line 64, in main
results = generator.generate(prompts, max_gen_len=256, temperature=temperature, top_p=top_p)
File "/home/uname/Documents/llama/llama/generation.py", line 42, in generate
logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/uname/Documents/llama/llama/model.py", line 235, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/uname/Documents/llama/llama/model.py", line 193, in forward
h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
File "/home/uname/Documents/llama/llama/model.py", line 121, in forward
xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fairscale/nn/model_parallel/layers.py", line 290, in forward
output_parallel = F.linear(input_parallel, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8480) of binary: /home/uname/Documents/llama/venv/bin/python
Traceback (most recent call last):
File "/home/uname/Documents/llama/venv/bin/torchrun", line 8, in
sys.exit(main())
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
The code is GPL v3, but the model weights are under a special non-commercial license... so they aren't GPL v3, and the code is useless in practice.
Hi everyone,
I've noticed that the download script doesn't work as-is on a Mac (the declare -A option is not recognized by the default bash; macOS ships bash 3.2, and associative arrays require bash 4+).
Fix: install bash with Homebrew and use it to call the script:
/opt/homebrew/bin/bash ./download.sh
Thanks for making this available btw :)
Thank you for the open-source release of the code. I have noticed that the transformer block class definition is missing the manually implemented backward function mentioned in the paper. It would be great if this function were added.
A short sample of training code showing how to best make use of this optimization would also surely be valuable to the many people trying to reproduce the results.
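(For readers unfamiliar with the technique: the paper's manual backward is a refinement of standard activation checkpointing, which trades compute for memory by recomputing a block's activations during the backward pass. The stock PyTorch analogue, not the paper's implementation, looks roughly like this inside the Transformer's layer loop:)

import torch.utils.checkpoint as cp

for layer in self.layers:
    # recompute this block's activations in backward instead of storing them
    h = cp.checkpoint(layer, h, start_pos, freqs_cis, mask)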
For reference, the part of the paper addressing the manually implemented backward function:
Hello to all,
Thank you for this work.
I guess anyone who has had access to the model weights, as well as the authors, can answer my question.
I may have missed it in the paper, but it seems to me that there is no mention of the embedding shape, or even just the tokenizer vocabulary size.
Hi, I've been trying to run the example inference using the 7B model weights, but I get:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 39.59 GiB total capacity; 27.26 GiB already allocated; 24.19 MiB free; 27.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is there anything I can do about this? E.g. changing the numeric type? How?
Also: can I use more than one GPU?