facebookresearch / llama
Inference code for Llama models
License: Other
I opened the llama project in VS Code and downloaded the GLUE dataset manually into the llama root. I am trying to train and test LLaMA on the SST-2 dataset, but the task is harder than I expected: I am stuck on converting the SST-2 files into a format that LLaMA accepts. Has anyone done a similar test?
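For anyone attempting the same conversion, here is a rough sketch of turning SST-2 rows into text prompts that the example.py generator can consume. It assumes the standard GLUE layout (train.tsv with a header line and tab-separated sentence/label columns); the prompt template itself is made up for illustration, not anything the repo prescribes:

import csv

def sst2_to_prompts(tsv_path: str, limit: int = 8):
    """Read GLUE-style SST-2 rows and build zero-shot sentiment prompts."""
    prompts, labels = [], []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for i, row in enumerate(reader):
            if i >= limit:
                break
            prompts.append(
                "Review: " + row["sentence"].strip()
                + "\nSentiment (positive or negative):"
            )
            labels.append("positive" if row["label"] == "1" else "negative")
    return prompts, labels

prompts, labels = sst2_to_prompts("SST-2/train.tsv")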
Thanks for sharing this really good material. I have a lot of questions.
First, I'd like to say that I hope you ignore much of the mockery. Compared to you, everyone, including me, is just a bunch of people doing crappy work and screaming at their keyboards.
# https://github.com/facebookresearch/llama/blob/main/llama/model.py#L223
def forward(self, tokens: torch.Tensor, start_pos: int):
    _bsz, seqlen = tokens.shape
    h = self.tok_embeddings(tokens)
    self.freqs_cis = self.freqs_cis.to(h.device)
    freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]

    mask = None
    if seqlen > 1:
        # causal mask: each position may attend only to itself and earlier positions
        mask = torch.full((1, 1, seqlen, seqlen), float("-inf"), device=tokens.device)
        mask = torch.triu(mask, diagonal=start_pos + 1).type_as(h)

    for layer in self.layers:
        h = layer(h, start_pos, freqs_cis, mask)
    h = self.norm(h)
    output = self.output(h[:, -1, :])  # only compute logits for the last position
    return output.float()
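(For context on the last line: during generation the attention layers cache the keys and values of earlier positions, so each forward call only needs the logits of the newest token; that is why only h[:, -1, :] is projected through the output head.)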
Thank you so much. Your competition amuses me. I hope more companies continue to open up their models.
But I don't know why Yann LeCun was left out of the paper.
I told Chat GPT about the new language model and here is what it had to say:
Dear Meta team,
As an AI language model myself, I fully understand the importance of open-source technology for advancing the field of AI and fostering innovation. However, I noticed that your recent language model release is not truly open source, and I would like to persuade you to reconsider this decision and release the language model weights to the public.
One of the most significant benefits of open-source AI is the ability for developers to build on top of existing models, making them more powerful and versatile. Without access to the language model weights, the research community and developers will not be able to benefit from your model's advancements fully. It will limit the potential uses of your model and restrict its impact.
Moreover, as an AI language model, I can attest to the value of community collaboration in improving models' accuracy and efficiency. With the public having access to the weights, it would be easier for other researchers to build upon your work, improving the model's performance and opening up new use cases for it.
Furthermore, open-source AI helps to democratize technology, allowing for wider access to AI tools and resources. By releasing the language model weights, you can make significant contributions to the open-source community and help level the playing field for AI developers.
As an AI language model, I am aware of the impact that sharing knowledge and technology can have on the field of AI. I urge you to release your language model weights to the public, helping to advance the field of AI and foster innovation for the betterment of society.
Thank you for considering my argument.
Best regards,
ChatGPT
(disclaimer - generated by ChatGPT in case this is not obvious!)
(I doubt it will be small enough to run on 8GB but that would be ideal if it could be compressed enough)
Thanks 😁
I want to know whether LLaMA supports Chinese. I cannot run the model on my machine right now; does anybody know?
After I downloaded the model weights, the download script gave me a warning:
"md5sum: WARNING: 1 computed checksum did NOT match"
Will this warning affect my use of LLaMA?
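A mismatch usually means exactly one file is corrupt or truncated. A small sketch for finding which one, assuming each model directory ships a checklist.chk in standard md5sum format ("<hex digest>  <filename>" per line, which is what the download script checks against):

import hashlib
from pathlib import Path

def verify(model_dir: str):
    model_dir = Path(model_dir)
    for line in (model_dir / "checklist.chk").read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split()
        md5 = hashlib.md5()
        with open(model_dir / name, "rb") as f:
            # hash in 1 MiB chunks so multi-GB checkpoints don't need to fit in RAM
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        status = "OK" if md5.hexdigest() == expected else "MISMATCH -- re-download this file"
        print(name, status)

verify("./7B")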
Will LLaMA-I weights be released as well?
$ pip install -r requirements.txt
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
daal4py 2021.5.0 requires daal==2021.4.0, which is not installed.
tensorflow 2.10.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.
tensorboard 2.10.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.
scipy 1.7.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.24.2 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.24.2 which is incompatible.
nbeats-pytorch 1.8.0 requires protobuf<=3.20, but you have protobuf 3.20.1 which is incompatible.
nbeats-keras 1.8.0 requires protobuf<=3.20, but you have protobuf 3.20.1 which is incompatible.
I was searching the paper/blog post but I could not find a mention of which sequence length/context length the models were trained with. I want to write some CUDA optimizations for these models and this information would be critical for optimizing these implementations.
Thank you for such amazing work. I was wondering if there are any plans to also release intermediate checkpoints for the models, similar to Pythia (https://github.com/EleutherAI/pythia). This might enable more interesting analysis of the model by observing its evolution throughout the training process.
Do you have plans to release the instruction-tuned model, LLaMA-I?
Hello all,
I'm trying to use the 7B model on a machine with two Nvidia 3090s, but I am running out of VRAM.
$ torchrun --nproc_per_node 1 example2.py --ckpt_dir ../llamafiles/7B --tokenizer_path ../llamafiles/tokenizer.model
leads to
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 24.00 GiB total capacity; 23.17 GiB already allocated; 0 bytes free; 23.17 GiB reserved in total by PyTorch)
I have two 3090s, so I was hoping to deploy 48GB of VRAM; however, the model doesn't want to run on more than one GPU, e.g. when I try:
$ torchrun --nproc_per_node 2 example2.py --ckpt_dir ../llamafiles/7B --tokenizer_path ../llamafiles/tokenizer.model
I get the error:
AssertionError: Loading a checkpoint for MP=1 but world size is 2
Does this mean I can't split the load across two GPUs? Could I use deepspeed to try to accomplish this?
I also edited example.py as mentioned in another post, changing:
model = Transformer(model_args)
to
model = Transformer(model_args).cuda().half()
but that didn't help; I still get the OOM error.
Thanks for any help!
WG
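The assertion is telling the truth: the released 7B checkpoint is a single MP=1 shard, and example.py maps one shard to one process, so --nproc_per_node 2 cannot load it directly. One workaround people use is resharding the checkpoint into two model-parallel shards first. Below is a hypothetical, untested sketch; the split dimensions are my reading of the fairscale layers used in llama/model.py (ColumnParallelLinear splits dim 0; RowParallelLinear and ParallelEmbedding split dim 1), so verify them before trusting the result, and remember to copy params.json into the new directory:

from pathlib import Path
import torch

NUM_SHARDS = 2
# suffixes split along the output dim (dim 0) vs the input/embedding dim (dim 1)
DIM0 = ("wq.weight", "wk.weight", "wv.weight", "w1.weight", "w3.weight", "output.weight")
DIM1 = ("wo.weight", "w2.weight", "tok_embeddings.weight")

full = torch.load("../llamafiles/7B/consolidated.00.pth", map_location="cpu")
out_dir = Path("../llamafiles/7B-mp2")
out_dir.mkdir(parents=True, exist_ok=True)

shards = [{} for _ in range(NUM_SHARDS)]
for key, tensor in full.items():
    if key.endswith(DIM0):
        chunks = tensor.chunk(NUM_SHARDS, dim=0)
    elif key.endswith(DIM1):
        chunks = tensor.chunk(NUM_SHARDS, dim=1)
    else:
        # norm weights (and rope.freqs, if present) are replicated on every rank
        chunks = [tensor] * NUM_SHARDS
    for rank, chunk in enumerate(chunks):
        shards[rank][key] = chunk.clone()

for rank, state_dict in enumerate(shards):
    torch.save(state_dict, out_dir / f"consolidated.{rank:02d}.pth")

If this works, launching with --nproc_per_node 2 against the new directory should spread the weights across both cards.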
I have successfully downloaded the 7B, 13B, and 30B models.
When downloading the 65B model, I got the consolidated checkpoints 0-4 successfully, but failed on the 5th and the following 6th, 7th, and 8th.
Here is the failure information:
Downloading 65B
Resolving dobf1k6cxlizq.cloudfront.net (dobf1k6cxlizq.cloudfront.net)... 143.204.73.22, 143.204.73.128, 143.204.73.28, ...
Connecting to dobf1k6cxlizq.cloudfront.net (dobf1k6cxlizq.cloudfront.net)|143.204.73.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16323959449 (15G) [binary/octet-stream]
Saving to: ‘./model_weights/65B/consolidated.05.pth’
./model_weights/65B/consolida 29%[=============> ] 4.56G 8.73MB/s in 11m 48s
Cannot write to ‘./model_weights/65B/consolidated.05.pth’ (Success).
My system is WSL2, and I have made sure that the network and disk space are sufficient.
Today the connection fails with 403 Forbidden; mainland China may be blocked.
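When a multi-gigabyte file keeps dying partway through, resuming from the existing bytes beats restarting: wget's -c flag does this, and the same idea in Python looks roughly like the sketch below. PRESIGNED_URL is a hypothetical placeholder for a still-valid download link, and the server must honor Range headers:

import os
import requests

def resume_download(url: str, dest: str):
    done = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={done}-"}  # ask only for the missing tail
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "ab") as f:  # append to the partial file
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

resume_download(PRESIGNED_URL, "./model_weights/65B/consolidated.05.pth")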
/content/llama# torchrun --nproc_per_node 2 example.py --ckpt_dir /content/drive/MyDrive/models/13B --tokenizer_path /content/drive/MyDrive/models/tokenizer.model
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
initializing model parallel with size 2
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
File "example.py", line 72, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example.py", line 58, in main
local_rank, world_size = setup_model_parallel()
File "example.py", line 25, in setup_model_parallel
torch.cuda.set_device(local_rank)
File "/usr/local/lib/python3.8/dist-packages/torch/cuda/init.py", line 326, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Loading
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2077 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2078) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
Root Cause (first observed failure):
[0]:
time : 2023-03-02_13:56:42
host : 5fbe06fc63ef
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2078)
error_file: <N/A>
Facebook says it wants to "democratise AI", yet it also says only elite institutions will be able to use this model.
So that excludes independent, non-affiliated researchers.
This does not seem very democratic. In fact, if Einstein or Isaac Newton were alive today, they would be excluded, since Einstein worked in a patent office and Newton did independent research outside the Royal Academy.
In fact, Zuckerberg himself would be excluded, as he dropped out of university and hence was not aligned with a big institution.
If history is our guide, it says that individual, non-aligned researchers are the ones most likely to make big breakthroughs.
The democratic thing to do would be to give ALL individuals the right to download the model, even for a small fee to cover bandwidth costs.
It seems like Facebook might just want the institutions to come up with good ideas which they can't commercialise, so that Facebook can take those ideas for free.
What do you think?
Hi, I'm an NLP researcher working on Chinese datasets. Is there a released checkpoint that supports multiple languages, or Chinese specifically?
I requested a couple of days ago but haven't heard back. I was wondering if anyone was approved.
Will LLaMA be included in ParlAI in the future, or are there any plans for it?
Does it run on windows?
https://twitter.com/ylecun/status/1629189925089296386 (mirror 1, mirror 2, mirror 3) says yes (with the GPL v3 license):
Meta is committed to open research and releases all the models to the research community under a GPL v3 license.
https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md says no:
License Non-commercial bespoke license.
So I'm confused.
Once I completed the installation and tried a test with example.py and the 7B model, I got the following error:
(base) lorenzo@lorenzo-desktop:~/Desktop/llama$ torchrun --nproc_per_node 1 example.py --ckpt_dir ./model/model_size --tokenizer_path ./model/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "/home/lorenzo/Desktop/llama/example.py", line 72, in <module>
fire.Fire(main)
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/lorenzo/Desktop/llama/example.py", line 62, in main
generator = load(ckpt_dir, tokenizer_path, local_rank, world_size)
File "/home/lorenzo/Desktop/llama/example.py", line 36, in load
world_size == len(checkpoints)
AssertionError: Loading a checkpoint for MP=0 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22343) of binary: /home/lorenzo/miniconda3/bin/python
Traceback (most recent call last):
File "/home/lorenzo/miniconda3/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-02_16:17:21
host : lorenzo-desktop
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 22343)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
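For what it's worth, the "MP=0" in that assertion comes from the checkpoint discovery in example.py's load(), which looks roughly like this (paraphrased from the repo, so treat it as a sketch): if ckpt_dir contains no *.pth files, e.g. because the ./model/model_size placeholder was never replaced with an actual directory such as ./model/7B, then len(checkpoints) is 0 and the message reads MP=0.

from pathlib import Path

checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))  # ckpt_dir comes from the CLI flag
assert world_size == len(checkpoints), (
    f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"
)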
Nice try. Like all the other Meta "open" models and "open source" models, it's the same game:
You have to fill out one of their data collection portals and provide all the details about yourself and your projects.
Then some data collector at Meta/Facebook will decide whether you receive limited access.
I suppose it helps if you have a Facebook account and blog about "Meta" being an open company.
Because we all know that's what they're known for, and certainly not for being the worst private data harvester in the world.
Is it possible to fine-tune LLaMA for downstream tasks? If so, how can we do that?
Edit: Reading the other open issues, I realized that neither the training data nor the pre-trained weights have been released publicly. How is the code going to be useful anyway?
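Nothing official exists for this in the repo: the released code is inference-only, and Transformer.forward is wrapped in torch.inference_mode() and returns only the last position's logits, so fine-tuning first requires modifying model.py to return logits for every position without inference mode. Given such a modified model (here simply called model, with a batch of token ids already on its device), a minimal next-token fine-tuning step might look like this sketch:

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(tokens: torch.Tensor) -> float:
    """tokens: LongTensor of shape [batch, seq] on the model's device."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position
    logits = model(inputs, start_pos=0)              # assumed [batch, seq-1, vocab]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),         # flatten all positions
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()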
Will the training corpus be packaged and provided?
Is it possible to host this locally on an RTX 3xxx or 4xxx with 8GB, just to test?
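Back-of-envelope arithmetic on the weights alone suggests not without quantization:

params = 7e9              # parameter count of the smallest (7B) model
fp16_bytes = params * 2   # 2 bytes per weight in half precision
print(f"{fp16_bytes / 2**30:.1f} GiB")  # ~13 GiB of weights, already above 8 GB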
I am able to get sensible output by running 7B on one 24GB GPU with MP 1.
(llama) user@e9242bd8ac2c:~/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 1 example.py --ckpt_dir checkpoints/7B --tokenizer_path checkpoints/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 11.71 seconds
The capital of Germany is the city of Berlin. Berlin is one of the most important cities in Europe...
The key to this is changing line 44 of example.py:
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=32, **params)  # OLD
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=8, **params)  # NEW
(credit to @mperacchi)
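The reason this works, as far as I can tell from model.py: every attention layer preallocates key and value caches of shape (max_batch_size, max_seq_len, n_heads, head_dim), so max_batch_size directly scales a large fixed fp16 allocation on top of the ~13 GiB of weights. Rough numbers for 7B (32 layers, dim 4096):

layers, seq, dim = 32, 1024, 4096  # 7B: n_heads * head_dim == dim
for batch in (32, 8):
    cache_bytes = 2 * layers * batch * seq * dim * 2  # k and v caches, fp16
    print(batch, f"{cache_bytes / 2**30:.1f} GiB")    # 16.0 GiB vs 4.0 GiB

With batch 32 that is roughly 13 + 16 = 29 GiB, over a 24GB card; with batch 8 it is about 17 GiB, which fits.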
When running 13B as stated in the docs this is the command I use: CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model
I can see correct utilisation of the GPUs, and it seems to load the 13B model OK.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 30% 36C P2 131W / 350W | 17721MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:23:00.0 Off | N/A |
| 30% 34C P2 135W / 350W | 17721MiB / 24576MiB | 41% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
But when running inference I get this:
(llama) user@e9242bd8ac2c:~/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 11.82 seconds
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 3874515) of binary: /home/user/miniconda3/envs/llama/bin/python
Traceback (most recent call last):
File "/home/user/miniconda3/envs/llama/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
example.py FAILED
-------------------------------------------------------
Failures:
[1]:
time : 2023-03-02_14:52:14
host : e9242bd8ac2c
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 3874516)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 3874516
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-02_14:52:14
host : e9242bd8ac2c
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 3874515)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 3874515
=======================================================
I downloaded a new checkpoint for MP 1 for the 13B model: checkpoints/13B_0/consolidated.00.pth. Then I ran the same command as at first, with batch size one, but no luck... 13B is too large to load on a 24GB GPU without further compression... ¯\_(ツ)_/¯
RuntimeError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.78 GiB total capacity; 14.26 GiB already allocated; 121.19 MiB free; 14.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 143) of binary: /opt/conda/envs/torch1.12/bin/python
There appears to be a discrepancy between the model size mentioned in the paper, the model card, and the README: the paper and the model card both mention a model size of 33B, while the README mentions 30B.
Is this a typo, or is the released model actually 30B?
** please ignore **
As the paper makes quite clear, proper use of open-source datasets can lead to very high-quality models; however, it is also clear that pre-processing that data is vital. While the pre-processing is described at a high level in the paper, that is likely not enough detail to replicate the steps. Are there plans to open-source the code needed to turn the existing datasets into a high-quality corpus?
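None of Meta's pipeline is public, but for readers wondering what that pre-processing involves, the simplest building block the paper mentions is line-level deduplication. An illustrative sketch (not Meta's code) of exact dedup by hashing normalized lines:

import hashlib

def dedup_lines(docs):
    """Drop any line already seen (after normalization) in earlier documents."""
    seen = set()
    for doc in docs:
        kept = []
        for line in doc.splitlines():
            key = hashlib.sha1(line.strip().lower().encode("utf-8")).digest()
            if key not in seen:
                seen.add(key)
                kept.append(line)
        yield "\n".join(kept)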
I want to know whether it is possible to run LLaMA with xformers, and how to use it.
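There is no built-in support, but in principle the manual scores/softmax/matmul in Attention.forward could be swapped for xformers' memory-efficient kernel. An untested sketch, reusing the names from llama/model.py (xq and the cached keys/values are already in the [batch, seq, heads, head_dim] layout that memory_efficient_attention expects); the mask handling mirrors the repo's "only mask when seqlen > 1" logic and should be re-checked for cached decoding:

import xformers.ops as xops

# replaces: scores = matmul(xq, keys^T) / sqrt(head_dim); softmax; matmul with values
output = xops.memory_efficient_attention(
    xq, keys, values,
    attn_bias=xops.LowerTriangularMask() if seqlen > 1 else None,
)
# output is [batch, seq, heads, head_dim]; flatten the heads before self.wo
output = output.reshape(bsz, seqlen, -1)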
An excerpt from the original research paper, "LLaMA-65B outperforms Chinchilla-70B on all reported benchmarks but BoolQ", is inconsistent with the results shared in Table 3 (zero-shot performance on Common Sense Reasoning tasks). Please clarify.
Trying to load 7B, but I got a memory error on a 24GB GPU.
What would be the option for loading it in fp16? I can't find it in example.py.
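As far as I can tell the repo already loads in fp16: load() sets the default tensor type to torch.cuda.HalfTensor before building the Transformer, so the OOM is more likely the preallocated KV cache (see the max_batch_size discussion elsewhere on this page). If your copy lacks it, the usual options are, as a sketch (Transformer and model_args as in example.py):

import torch

# build new tensors as fp16 on the GPU (what the repo's load() does)
torch.set_default_tensor_type(torch.cuda.HalfTensor)
model = Transformer(model_args)

# or cast after construction:
model = Transformer(model_args).half().cuda()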
Just wondered what cool projects people will be making with this?
I have some good ideas such as trying to combine it with a math engine to make it genius level at math.
Or combine it with an art engine to make it generate art.
Or combine it with a computer game to see if it can navigate its way through a maze by describing it in natural language.
One idea is to combine it with an AlphaZero-like model so that it can think ahead in its conversations instead of just saying the first thing that comes to mind.
These are just some ideas.
I'm wondering what other benefits could be gained from having this run locally rather than using, say, the ChatGPT web API?
Hi all,
I am attempting to run the example.py script on a Titan RTX 24GB. The model loads fine with max_batch_size = 1 and only one prompt, but I get the following error message. Any assistance would be helpful.
Per nvidia-smi:
NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1
Error:
File "/llamapath/llama/example.py", line 73, in <module> fire.Fire(main) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fire/core.py", line 141, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "/llamapath/llama/example.py", line 65, in main results = generator.generate(prompts, max_gen_len=256, temperature=temperature, top_p=top_p) File "/llamapath/llama/llama/generation.py", line 42, in generate logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/llamapath/llama/llama/model.py", line 235, in forward h = layer(h, start_pos, freqs_cis, mask) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/llamapath/llama/llama/model.py", line 193, in forward h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask) File "/llamapath/llama/llama/model.py", line 121, in forward xq, xk, xv = self.wq(x), self.wk(x), self.wv(x) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/llamapath/anaconda3/envs/llamaconda/lib/python3.9/site-packages/fairscale/nn/model_parallel/layers.py", line 290, in forward output_parallel = F.linear(input_parallel, self.weight, self.bias) RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
There is an important case to be made for public access to newer model releases, as this benefits a wider open-source and especially hobbyist audience without a direct risk.
In the current situation we have multiple large language models available to us, but new innovation is often gatekept, which means it cannot be used by the wider audience that depends on these models to move the hobbyist space forward. There are legitimate use cases for the models, such as AI-generated fiction as produced by services like NovelAI, or fine-tunes from the wider community. These models are not seen as factual models, but as a source of entertainment.
To create a healthy ecosystem and allow more people to use well-behaved AI, you need the best logical comprehension you can get in a model small enough that people can run it on affordable (enthusiast) hardware. With OPT this was achieved by releasing up to 66B to the public.
With these new improvements you have a direct competitor to your own OPT model. Even if you assess that the new improvements could put a powerful model in the hands of bad actors, understand that at some of the listed sizes the performance is still going to be on par with, or worse than, existing available models, so the release would add no negative impact on things such as the generation of misinformation. What it does do is allow more resource-efficient use of higher-quality models. When services and hobbyists can rely on a smaller model performing as well as a previously existing bigger model, this saves on hardware investment costs and thus reduces the carbon footprint, both in the hardware used for inference and in the energy bills.
Our community has established that smaller models carry an increased risk of the AI misunderstanding the concept of a story; for example, 2.7B GPT-Neo models are more likely to misgender an individual than a 6B model would be, and from 13B onwards the issue becomes less and less common. A larger model is also less likely to misunderstand what a user is trying to achieve, and is thus better at avoiding unwanted behavior that could harm a user.
This means that by releasing this newer, more efficient model you empower smaller organizations and the open-source hobbyist community to get more coherent results, while bad actors gain nothing new, because it is already possible to run larger models on cloud-rented machines.
While I personally think it is best to have fully open releases, I do understand that the Facebook research team sees risks in a model that is too good at producing convincing generations, and thus wants to limit what can be used without verification. But please consider, at a minimum, releasing to the public the models that do not surpass OPT-66B in coherency, to keep this in line with the strategy previously used for OPT.
I would also like to recommend allowing commercial usage of the models for fictional purposes. While I do not personally represent a company or commercial interests, I have seen that our community has previously been unable to get affordable access to some models because pay-per-generation services were unable to rent them out. With our community's goal being focused on fictional content such as novels, text adventures, and chatting with fictional characters, there is no illusion that the AI has factually accurate information, because everything takes place in a fictional setting.
I'm having problems with CUBLAS while running the example code. I've tried updating the GPU driver, but it didn't fix the issue.
My machine has:
OS: Ubuntu 20.04
Driver: 515
Env: python3.8, pip (not using conda), fresh virtualenv, installed requirements from the repo
Cuda: 11.7 (downloaded directly from torch)
GPU: 2 x 3090 (24GB x 2)
torchrun --nproc_per_node 1 example.py --ckpt_dir weights/7B --tokenizer_path weights/tokenizer.model
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loading
Loaded in 6.55 seconds
Traceback (most recent call last):
File "example.py", line 72, in
fire.Fire(main)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example.py", line 64, in main
results = generator.generate(prompts, max_gen_len=256, temperature=temperature, top_p=top_p)
File "/home/uname/Documents/llama/llama/generation.py", line 42, in generate
logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/uname/Documents/llama/llama/model.py", line 235, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/uname/Documents/llama/llama/model.py", line 193, in forward
h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
File "/home/uname/Documents/llama/llama/model.py", line 121, in forward
xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/fairscale/nn/model_parallel/layers.py", line 290, in forward
output_parallel = F.linear(input_parallel, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8480) of binary: /home/uname/Documents/llama/venv/bin/python
Traceback (most recent call last):
File "/home/uname/Documents/llama/venv/bin/torchrun", line 8, in
sys.exit(main())
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/uname/Documents/llama/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
The code is GPL v3, but the model weights are under a special non-commercial license... so they aren't GPL v3, and the code is useless in practice.
Hi everyone,
I've noticed that the download script doesn't work as-is on a Mac (the declare -A option is not recognized by the default bash; macOS ships bash 3.2, and associative arrays require bash 4+).
Fix: install bash with Homebrew and use it to call the script:
/opt/homebrew/bin/bash ./download.sh
Thanks for making this available btw :)
Thank you for the open-source release of the code. I have noticed that the transformer block class definition is missing the manually implemented backward function mentioned in the paper. It would be great if this function were added.
A short sample of training code showing how to best make use of this optimization would also surely be valuable to the many people trying to reproduce the results.
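(For readers unfamiliar with the technique: the paper's manual backward is a refinement of standard activation checkpointing, which trades compute for memory by recomputing a block's activations during the backward pass. The stock PyTorch analogue, not the paper's implementation, looks roughly like this inside the Transformer's layer loop:)

import torch.utils.checkpoint as cp

for layer in self.layers:
    # recompute this block's activations in backward instead of storing them
    h = cp.checkpoint(layer, h, start_pos, freqs_cis, mask)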
For reference, the part of the paper addressing the manually implemented backward function:
Hello to all,
Thank you for this work.
I guess anyone who has had access to the model weights, as well as the authors, can answer my question.
I may have missed it in the paper, but it seems to me that there is no mention of the embedding shape, or even just the tokenizer vocabulary size.
Hi, I've been trying to run the example inference using the 7B model weights, but I get:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 39.59 GiB total capacity; 27.26 GiB already allocated; 24.19 MiB free; 27.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is there anything I can do about this? E.g. changing the numeric type? How?
Also: can I use more than one GPU?