togethercomputer / openchatkit Goto Github PK
View Code? Open in Web Editor NEWLicense: Apache License 2.0
License: Apache License 2.0
Does it run on single nVidia RTX A4000 or do I need two or more?
Hi, it will be super nice if you provide LORA training, to reduce the computational cost. Because 8x80 A100 is too expensive
After viewing your code , I found that you haven't support RLHF training yet. Your code is mainly about distributed training using pipeline & data parallel.
Do you have the plan to support RLHF training?Do you think it is necessary?
When I try to creat the environment,it happens.
I run it on the Windows.
I'm trying to convert the weights as per the example but running into an issue.
After mkdir huggingface_models \ && python tools/convert_to_hf_gptneox.py \ --ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_5 --save-path /huggingface_models/GPT-NeoXT-Chat-Base-20B --n-stages 8 --n-layer-per-stage 6
I'm getting this error:
Traceback (most recent call last): File "/mnt/c/Users/name/OpenChatKit/tools/convert_to_hf_gptneox.py", line 102, in <module> assert args.save_path is not None AssertionError --save-path: command not found --n-stages: command not found --n-layer-per-stage: command not found
I'm using Windows 11 WSL Ubuntu 22.04.2 LTS
ChatGPT supports multi-language question answering and reasoning, although in most cases, English answers are generated first and then translated into other languages. So I want to ask whether OpenChatKit supports direct Chinese Q&A, or do I need to use Chinese data set for training before I can conduct Chinese Q&A?
Test
Test
Test!
git clone https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B
run python inference/bot.py --model GPT-NeoXT-Chat-Base-20B
Loading GPT-NeoXT-Chat-Base-20B to cuda:0...
Killed
run python inference/bot.py
OSError: Can't load the configuration of '/root/test/OpenChatKit-main/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/root/test/OpenChatKit-main/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B' is the correct path to a directory containing a config.json file
so cp -r GPT-NeoXT-Chat-Base-20B huggingface_models/
root@msi:~/test/OpenChatKit-main# python inference/bot.py
Loading /root/test/OpenChatKit-main/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:0...
Killed
I am confused, it is running in docker, is the gpu not enough video memory?
I noticed that OIG dataset adds human and bot tag in each sample. In your code, you directly pack samples to max seq length and calculate cross entropy on whole sentence. Will this make the model output human, bot tag and not knowing when to stop? Does only calculate the last bot response loss be more suitable?
I want to know the format of my documents if I want to fine-tune a model on my domain knowledge.
If my documents are many complete articles should I split them into many small questions :
: questions from articles :answers from articles
or can I feed the model with original article(how can I feed the model with my whole article?).
many thanks!
What kind of data to provide for finetuning ? What are the best practices for finetuning ?
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Smartphone (please complete the following information):
Additional context
Add any other context about the problem here.
Describe the bug
When trying to set up the conda environment, it is failing to install the nccl package.
(base) PS D:\OpenChatKit> conda env create -f environment.yml
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
- nccl=2.12.12.1
To Reproduce
Steps to reproduce the behavior:
conda env create -f environment.yml
Expected behavior
It should install all of the packages
Desktop (please complete the following information):
In theory the LLaMa 30b & 65b should be much more capable than the GPT-NeoX 20b.
Does OpenChatKit support LLaMa? If not, is it on the roadmap?
I appreciate that togethercomputer might not be able to release pretrained LLaMa weights due to the licence, but it'd be great if researches can at least play with it.
The bot commands for inference aren't very well documented. Add more documentation about what they do.
Is your feature request related to a problem? Please describe.
Looks like there are not clear on installation in Chinese
Describe the solution you'd like
I can help to translate it into Chinese
Describe alternatives you've considered
Additional context
Is your feature request related to a problem? Please describe.
A docker image might be easier for people to use.
Describe the solution you'd like
We could add a /docker folder or a simple dockerfile to the repo, so people could build the image by themselves. And maybe we could push the image to dockerhub so they could just pull and test.
I started a training process with 4*V100S(32GB VRAM each) at 18:00, and i got a "training starts..." prompt.
With nvidia-smi, i can see that 3 GPUs are running with utils 100%.
The next morning, the processes are still running, but nothing in output folder, neither the log message.
So, is there someway to see how the training job is going?
Can you introduce the computing resources needed for the experiment
TODO: Add in the training script for the moderation model
It is recommended to describe the operating environment required for installation (for example, macos is not recommended), cpu, memory, storage and other conditions in the readme.md file
I run the following command:
python prepare.py
The result is as follows:
error: RPC failed; curl 56 GnuTLS recv error (-110): The TLS connection was non-properly terminated.
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
Traceback (most recent call last):
File "prepare.py", line 18, in
process = subprocess.run(
File "/root/miniconda3/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'git clone https://huggingface.co/datasets/laion/OIG /www/wwwroot/OpenChatKit/data/OIG/files' returned non-zero exit status 128.
Add:
curl https://huggingface.co/datasets/laion/OIG is OK.
And Permission is 777 in /www/wwwroot/OpenChatKit/data/OIG/files
Why?
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Smartphone (please complete the following information):
Additional context
Add any other context about the problem here.
Feedback from a user: When running the training script, it's not clear that it's making progress. The only way to know that it's doing something is by looking at nvidia-smi.
Is it possible to reduce the amount of resources needed to run the system on Google Colab ?
because not everyone has the means to experiment with A100 80gb
Hi
What is minimum specification to replicate it on local machine.
Describe the bug
I've downloaded the corpus and the model weights, I ran the command bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
and I got the following:
https://gist.github.com/riatzukiza/0930307fc90bf940103364be2d3db5c1
To Reproduce
Steps to reproduce the behavior:
bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
Expected behavior
To fine tune the model, or get an out of memory error
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Additional context
Add any other context about the problem here.
胜多负少的
所得到的多多
点对点
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Smartphone (please complete the following information):
Additional context
Add any other context about the problem here.
What’s the roadmap for the project becoming a true open alternative to chatgpt?
While its capabilities are impressive on their own, stacked against ChatGPT there’s lot lacking.
For example…
To me it seems that it is good at generating coherent sentences, but massively lacks reasoning.
Hopefully this feedback doesn’t come across as harsh or critical. It seems this project is the closest there is to a ChatGPT alternative. Impressive work everyone who contributed so far. I’m rooting for this projects success and hope it will truly rival ChatGPT someday.
As mentioned above
Could you please tell me what's the meaning of 0.2? Can I add my own data to the DATASETS? If so, how should i do? Thanks so much!
Describe the bug
(base) samchen@Sams-MacBook-Pro OpenChatKit % conda env create -f environment.yml
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
.
Screenshots
Desktop (please complete the following information):
Smartphone (please complete the following information):
Additional context
Add any other context about the problem here.
While trying out python inference/bot.py --retrieval --model togethercomputer/GPT-NeoXT-Chat-Base-20B
I got this error on A100 GPU:
File "inference/bot.py", line 185, in <module>
main()
File "inference/bot.py", line 173, in main
OpenChatKitShell(
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/cmd.py", line 138, in cmdloop
stop = self.onecmd(line)
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/cmd.py", line 217, in onecmd
return func(arg)
File "inference/bot.py", line 87, in do_say
output = self._model.do_inference(
File "inference/bot.py", line 32, in do_inference
outputs = self._model.generate(
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/transformers/generation_utils.py", line 1326, in generate
return self.sample(
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/transformers/generation_utils.py", line 1944, in sample
outputs = self(
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 619, in forward
outputs = self.gpt_neox(
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 511, in forward
outputs = layer(
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 319, in forward
attention_layer_outputs = self.attention(
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 115, in forward
qkv = self.query_key_value(hidden_states)
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/admin/home/anaconda3/envs/openkit/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
I think the explanation of train and fine-tune process is much few, can Can you show some specific examples of ipynb documentation reference? Many thanks!
One of our users tried running inference and got an error saying that there was no package called retrieval.
Describe the bug
One user reported conda env create -f environment.yml
taking over 60 minutes. We need a better solution.
To Reproduce
Steps to reproduce the behavior:
conda env create -f environment.yml
from the root of the repo.Expected behavior
Should finish in a "reasonable" amount of time.
Describe the bug
pretrained/GPT-NeoX-20B/prepare.py
can take a long time to prepare the base model. It should print progress as it's converting.
To Reproduce
Steps to reproduce the behavior:
python pretrained/GPT-NeoX-20B/prepare.py
from the root of the repo.Expected behavior
The script should print progress.
Ubuntu Ubuntu 22.04.2 LTS
After downloading the model and now trying to convert:
(OpenChatKit) georgi@georgi-hackintosh:~/Documents/GitHub/OpenChatKit$ python3.10 tools/convert_to_hf_gptneox.py --ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_5 --save-path huggingface_models/GPT-NeoXT-Chat-Base-20B --n-stages 8 --n-layer-per-stage 6
loading stage 0
Traceback (most recent call last):
File "/home/georgi/Documents/GitHub/OpenChatKit/tools/convert_to_hf_gptneox.py", line 110, in <module>
load_decentralized_checkpoint(
File "/home/georgi/Documents/GitHub/OpenChatKit/tools/convert_to_hf_gptneox.py", line 43, in load_decentralized_checkpoint
checkpoint = torch.load(os.path.join(input_path, f'prank_{i}_checkpoint.pt'), map_location=torch.device("cpu"))
File "/home/georgi/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/serialization.py", line 771, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/georgi/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/serialization.py", line 270, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/georgi/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/serialization.py", line 251, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_5/prank_0_checkpoint.pt'
Any ideas?
(OpenChatKit) georgi@georgi-hackintosh:~/Documents/GitHub/OpenChatKit$ python pretrained/GPT-NeoX-20B/prepare.py
Downloading config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 272kB/s]
Downloading tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [00:00<00:00, 55.0kB/s]
Downloading vocab.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.03M/1.03M [00:01<00:00, 748kB/s]
Downloading merges.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 446k/446k [00:00<00:00, 555kB/s]
Downloading tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.02M/2.02M [00:01<00:00, 1.61MB/s]
Downloading special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90.0/90.0 [00:00<00:00, 39.5kB/s]
Downloading pytorch_model.bin.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 56.4k/56.4k [00:00<00:00, 3.44MB/s]
Downloading pytorch_model-00001-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 883M/883M [01:16<00:00, 12.1MB/s]
Downloading pytorch_model-00002-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.0MB/s]
Downloading pytorch_model-00003-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00004-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:16<00:00, 11.9MB/s]
Downloading pytorch_model-00005-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:14<00:00, 12.2MB/s]
Downloading pytorch_model-00006-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00007-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.0MB/s]
Downloading pytorch_model-00008-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:14<00:00, 12.2MB/s]
Downloading pytorch_model-00009-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:14<00:00, 12.2MB/s]
Downloading pytorch_model-00010-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:14<00:00, 12.1MB/s]
Downloading pytorch_model-00011-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.0MB/s]
Downloading pytorch_model-00012-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.0MB/s]
Downloading pytorch_model-00013-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00014-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00015-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:19<00:00, 11.5MB/s]
Downloading pytorch_model-00016-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:16<00:00, 11.9MB/s]
Downloading pytorch_model-00017-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:18<00:00, 11.6MB/s]
Downloading pytorch_model-00018-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.0MB/s]
Downloading pytorch_model-00019-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.0MB/s]
Downloading pytorch_model-00020-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.0MB/s]
Downloading pytorch_model-00021-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.0MB/s]
Downloading pytorch_model-00022-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:14<00:00, 12.1MB/s]
Downloading pytorch_model-00023-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:14<00:00, 12.2MB/s]
Downloading pytorch_model-00024-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00025-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00026-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00027-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:14<00:00, 12.2MB/s]
Downloading pytorch_model-00028-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00029-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00030-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:16<00:00, 12.0MB/s]
Downloading pytorch_model-00031-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:17<00:00, 11.7MB/s]
Downloading pytorch_model-00032-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00033-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:18<00:00, 11.7MB/s]
Downloading pytorch_model-00034-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:16<00:00, 11.9MB/s]
Downloading pytorch_model-00035-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:15<00:00, 12.1MB/s]
Downloading pytorch_model-00036-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:26<00:00, 10.5MB/s]
Downloading pytorch_model-00037-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:24<00:00, 10.7MB/s]
Downloading pytorch_model-00038-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:18<00:00, 11.6MB/s]
Downloading pytorch_model-00039-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:29<00:00, 10.2MB/s]
Downloading pytorch_model-00040-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:18<00:00, 11.7MB/s]
Downloading pytorch_model-00041-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:21<00:00, 11.2MB/s]
Downloading pytorch_model-00042-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:25<00:00, 10.6MB/s]
Downloading pytorch_model-00043-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:16<00:00, 11.9MB/s]
Downloading pytorch_model-00044-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 868M/868M [01:23<00:00, 10.9MB/s]
Downloading pytorch_model-00045-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 576M/576M [00:50<00:00, 11.9MB/s]
Downloading pytorch_model-00046-of-00046.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 591M/591M [00:54<00:00, 11.3MB/s]
Killed
(base) georgi@georgi-hackintosh:~/Documents/GitHub/OpenChatKit/pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b$ ls
config.json special_tokens_map.json tokenizer_config.json tokenizer.json
(base) georgi@georgi-hackintosh:~/Documents/GitHub/OpenChatKit/pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b$
However /home/georgi/.cache/huggingface/transformers is 41.3 GB. Any ideas what goes wrong?
python data/OIG/prepare.py
File "data/OIG/prepare.py", line 27
gzip.open(f, 'rb') as infile,
^
SyntaxError: invalid syntax
Tell me why, thank you!
Describe the bug
Using retrieval-augmented models, a sequence of prompts leads to a runtime error (size mismatch between two tensors).
To Reproduce
Steps to reproduce the behavior:
python inference/bot.py --retrieval
>>> Where is Bern?
...
>>> Where is Switzerland?
...
>>> Is Switzerland in Europe or in America?
Traceback
The queries lead to the following error:
Traceback (most recent call last):
File "/home/fsuser/OpenChatKit/inference/bot.py", line 185, in <module>
main()
File "/home/fsuser/OpenChatKit/inference/bot.py", line 181, in main
).cmdloop()
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/cmd.py", line 138, in cmdloop
stop = self.onecmd(line)
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/cmd.py", line 217, in onecmd
return func(arg)
File "/home/fsuser/OpenChatKit/inference/bot.py", line 87, in do_say
output = self._model.do_inference(
File "/home/fsuser/OpenChatKit/inference/bot.py", line 32, in do_inference
outputs = self._model.generate(
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/generation_utils.py", line 1326, in generate
return self.sample(
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/generation_utils.py", line 1944, in sample
outputs = self(
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 619, in forward
outputs = self.gpt_neox(
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 511, in forward
outputs = layer(
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 319, in forward
attention_layer_outputs = self.attention(
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 153, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/home/fsuser/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 220, in _attn
attn_scores = torch.where(causal_mask, attn_scores, mask_value)
RuntimeError: The size of tensor a (2048) must match the size of tensor b (2247) at non-singleton dimension 3
Environment
Setup using mamba in root dir: mamba env create -f environment.yml
Hardware:
Describe the bug
The bash script to train the model does not work because of a Cupy error:
(OpenChatKit-Test) user@pc:~/OpenChatKit$ bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
Traceback (most recent call last):
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
Traceback (most recent call last):
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
main()
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 275, in main
init_communicators(args)
File "/home/user/OpenChatKit/training/comm/comm_utils.py", line 103, in init_communicators
Traceback (most recent call last):
_PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
File "/home/user/OpenChatKit/training/comm/nccl_backend.py", line 31, in __init__
Traceback (most recent call last):
cupy.cuda.Device(cuda_id).use()
File "cupy/cuda/device.pyx", line 196, in cupy.cuda.device.Device.use
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
File "cupy/cuda/device.pyx", line 222, in cupy.cuda.device.Device.use
main()
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 275, in main
File "cupy_backends/cuda/api/runtime.pyx", line 365, in cupy_backends.cuda.api.runtime.setDevice
init_communicators(args)
File "/home/user/OpenChatKit/training/comm/comm_utils.py", line 103, in init_communicators
_PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
File "cupy_backends/cuda/api/runtime.pyx", line 142, in cupy_backends.cuda.api.runtime.check_status
File "/home/user/OpenChatKit/training/comm/nccl_backend.py", line 31, in __init__
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
cupy.cuda.Device(cuda_id).use()
File "cupy/cuda/device.pyx", line 196, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 222, in cupy.cuda.device.Device.use
File "cupy_backends/cuda/api/runtime.pyx", line 365, in cupy_backends.cuda.api.runtime.setDevice
File "cupy_backends/cuda/api/runtime.pyx", line 142, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
Traceback (most recent call last):
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
main()
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 275, in main
main()
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 275, in main
init_communicators(args)
File "/home/user/OpenChatKit/training/comm/comm_utils.py", line 103, in init_communicators
init_communicators(args)
File "/home/user/OpenChatKit/training/comm/comm_utils.py", line 103, in init_communicators
_PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
File "/home/user/OpenChatKit/training/comm/nccl_backend.py", line 31, in __init__
_PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
File "/home/user/OpenChatKit/training/comm/nccl_backend.py", line 31, in __init__
cupy.cuda.Device(cuda_id).use()
cupy.cuda.Device(cuda_id).use()
File "cupy/cuda/device.pyx", line 196, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 196, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 222, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 222, in cupy.cuda.device.Device.use
File "cupy_backends/cuda/api/runtime.pyx", line 365, in cupy_backends.cuda.api.runtime.setDevice
File "cupy_backends/cuda/api/runtime.pyx", line 142, in cupy_backends.cuda.api.runtime.check_status
File "cupy_backends/cuda/api/runtime.pyx", line 365, in cupy_backends.cuda.api.runtime.setDevice
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
File "cupy_backends/cuda/api/runtime.pyx", line 142, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
Traceback (most recent call last):
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
main()
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 275, in main
init_communicators(args)
File "/home/user/OpenChatKit/training/comm/comm_utils.py", line 103, in init_communicators
_PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
File "/home/user/OpenChatKit/training/comm/nccl_backend.py", line 31, in __init__
cupy.cuda.Device(cuda_id).use()
File "cupy/cuda/device.pyx", line 196, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 222, in cupy.cuda.device.Device.use
File "cupy_backends/cuda/api/runtime.pyx", line 365, in cupy_backends.cuda.api.runtime.setDevice
File "cupy_backends/cuda/api/runtime.pyx", line 142, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
main()
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 275, in main
init_communicators(args)
File "/home/user/OpenChatKit/training/comm/comm_utils.py", line 103, in init_communicators
_PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
File "/home/user/OpenChatKit/training/comm/nccl_backend.py", line 31, in __init__
cupy.cuda.Device(cuda_id).use()
File "cupy/cuda/device.pyx", line 196, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 222, in cupy.cuda.device.Device.use
File "cupy_backends/cuda/api/runtime.pyx", line 365, in cupy_backends.cuda.api.runtime.setDevice
File "cupy_backends/cuda/api/runtime.pyx", line 142, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0
Traceback (most recent call last):
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
main()
File "/home/user/OpenChatKit/training/dist_clm_train.py", line 275, in main
init_communicators(args)
File "/home/user/OpenChatKit/training/comm/comm_utils.py", line 103, in init_communicators
_PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
File "/home/user/OpenChatKit/training/comm/nccl_backend.py", line 31, in __init__
cupy.cuda.Device(cuda_id).use()
File "cupy/cuda/device.pyx", line 196, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 222, in cupy.cuda.device.Device.use
File "cupy_backends/cuda/api/runtime.pyx", line 365, in cupy_backends.cuda.api.runtime.setDevice
File "cupy_backends/cuda/api/runtime.pyx", line 142, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
To Reproduce
Steps to reproduce the behavior:
bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
Expected behavior
The code is supposed to execute.
Screenshots
NA
Desktop (please complete the following information):
Additional context
Also, the previous steps to download the data and weights also gave me errors. These steps:
python data/OIG/prepare.py
python pretrained/GPT-NeoX-20B/prepare.py
Ended after a couple minutes/hours with the error message "Killed". I was able to acquire the data sets with a simple wget command but I thought that was weird too.
Describe the bug
Followed the instructions but could not get
conda env create -f environment.yml
to work because of
ResolvePackageNotFound:
- cudatoolkit=11.6.0
- faiss-gpu=1.7.2
- nccl=2.12.12.1
- cupy=10.4.0
To Reproduce
Steps to reproduce the behavior:
Intall miniconda
run
conda env create -f environment.yml
Expected behavior
Create an environment called OpenChatKit but can't create
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Mac
Smartphone (please complete the following information):
Additional context
Add any other context about the problem here.
Hello
What is minimum specification to launch (but not train) it on local machine with normal speed?
Thank you
Describe the bug
(base) samchen@Sams-MacBook-Pro miniconda3 % conda env create -f environment.yml
EnvironmentFileNotFound: '/Users/samchen/miniconda3/environment.yml' file not found
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Should be move to next step
Screenshots
(base) samchen@Sams-MacBook-Pro miniconda3 % conda env create -f environment.yml
EnvironmentFileNotFound: '/Users/samchen/miniconda3/environment.yml' file not found
Desktop (please complete the following information):
Additional context
Add any other context about the problem here.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.