First of all, great work! I was following your video on converting the GPT-J weights to PyTorch weights on Colab, but while running the convert_model_to_torch.py
script, the process dies with the error shown in the log below. I suspect the failure is due to Colab running out of memory or disk space, but I'm not sure.
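
(For reference, this is a minimal snippet I run in a Colab cell before launching the script to check RAM and disk headroom; psutil should already be available in the standard Colab image, and shutil is part of the standard library:)

```python
# Quick resource check in a Colab cell before running the converter.
import shutil

import psutil

mem = psutil.virtual_memory()
disk = shutil.disk_usage("/")
print(f"RAM:  {mem.available / 2**30:.1f} GiB free of {mem.total / 2**30:.1f} GiB")
print(f"Disk: {disk.free / 2**30:.1f} GiB free of {disk.total / 2**30:.1f} GiB")
```

The full console output follows:
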
loading shards for part 0
read from checkpoint
< (8, 4096) to (4096,)
> transformer.wte.bias torch.Size([4096])
< (8, 6300, 4096) to (1, 50400, 4096)
> transformer.wte.weight torch.Size([4096, 50400])
< (8, 4096, 512) to (1, 4096, 4096)
convert_model_to_torch.py:147: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
params = torch.tensor(params.copy()).half()
> transformer.h.0.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.0.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.0.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.0.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.0.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.0.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.0.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.0.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.0.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.0.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.1.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.1.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.1.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.1.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.1.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.1.mlp.c_fc.weight torch.Size([16384, 4096])
loading shards for part 1
read from checkpoint
< (8, 4096) to (4096,)
> transformer.h.1.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.1.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.1.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.1.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.10.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.10.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.10.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.10.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.10.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.10.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.10.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.10.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.10.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.10.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.11.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.11.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.11.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.11.attn.attention.out_proj.weight torch.Size([4096, 4096])
loading shards for part 2
read from checkpoint
< (8, 2048) to (1, 16384)
> transformer.h.11.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.11.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.11.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.11.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.11.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.11.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.12.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.12.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.12.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.12.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.12.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.12.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.12.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.12.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.12.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.12.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.13.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.13.attn.attention.v_proj.weight torch.Size([4096, 4096])
loading shards for part 3
read from checkpoint
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.13.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.13.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.13.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.13.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.13.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.13.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.13.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.13.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.14.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.14.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.14.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.14.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.14.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.14.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.14.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.14.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.14.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.14.ln_1.weight torch.Size([4096])
loading shards for part 4
read from checkpoint
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.15.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.15.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.15.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.15.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.15.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.15.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.15.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.15.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.15.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.15.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.16.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.16.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.16.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.16.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.16.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.16.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.16.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.16.mlp.c_proj.weight torch.Size([4096, 16384])
loading shards for part 5
read from checkpoint
< (8, 4096) to (4096,)
> transformer.h.16.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.16.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.17.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.17.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.17.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.17.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.17.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.17.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.17.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.17.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.17.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.17.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.18.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.18.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.18.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.18.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.18.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.18.mlp.c_fc.weight torch.Size([16384, 4096])
loading shards for part 6
read from checkpoint
< (8, 4096) to (4096,)
> transformer.h.18.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.18.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.18.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.18.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.19.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.19.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.19.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.19.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.19.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.19.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.19.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.19.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.19.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.19.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.2.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.2.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.2.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.2.attn.attention.out_proj.weight torch.Size([4096, 4096])
loading shards for part 7
read from checkpoint
< (8, 2048) to (1, 16384)
> transformer.h.2.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.2.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.2.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.2.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.2.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.2.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.20.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.20.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.20.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.20.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.20.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.20.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.20.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.20.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.20.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.20.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.21.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.21.attn.attention.v_proj.weight torch.Size([4096, 4096])
loading shards for part 8
read from checkpoint
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.21.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.21.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.21.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.21.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.21.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.21.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.21.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.21.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.22.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.22.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.22.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.22.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.22.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.22.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.22.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.22.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.22.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.22.ln_1.weight torch.Size([4096])
loading shards for part 9
read from checkpoint
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.23.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.23.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.23.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.23.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.23.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.23.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.23.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.23.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.23.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.23.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.24.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.24.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.24.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.24.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.24.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.24.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.24.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.24.mlp.c_proj.weight torch.Size([4096, 16384])
loading shards for part 10
read from checkpoint
< (8, 4096) to (4096,)
> transformer.h.24.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.24.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.25.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.25.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.25.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.25.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.25.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.25.mlp.c_fc.weight torch.Size([16384, 4096])
< (8, 4096) to (4096,)
> transformer.h.25.mlp.c_proj.bias torch.Size([4096])
< (8, 2048, 4096) to (1, 16384, 4096)
> transformer.h.25.mlp.c_proj.weight torch.Size([4096, 16384])
< (8, 4096) to (4096,)
> transformer.h.25.ln_1.bias torch.Size([4096])
< (8, 4096) to (4096,)
> transformer.h.25.ln_1.weight torch.Size([4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.26.attn.attention.q_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.26.attn.attention.v_proj.weight torch.Size([4096, 4096])
< (8, 4096, 512) to (1, 4096, 4096)
> transformer.h.26.attn.attention.k_proj.weight torch.Size([4096, 4096])
< (8, 512, 4096) to (1, 4096, 4096)
> transformer.h.26.attn.attention.out_proj.weight torch.Size([4096, 4096])
< (8, 2048) to (1, 16384)
> transformer.h.26.mlp.c_fc.bias torch.Size([16384])
< (8, 4096, 2048) to (1, 4096, 16384)
> transformer.h.26.mlp.c_fc.weight torch.Size([16384, 4096])
loading shards for part 11
read from checkpoint
/bin/bash: line 1: 3502 Killed python convert_model_to_torch.py
real 6m31.465s
user 1m48.769s
sys 0m22.349s
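
Two observations: the UserWarning about the non-writeable NumPy array looks benign (it is just a warning and the conversion keeps going), and the actual failure is the bare "Killed" at the end. A "Killed" with no Python traceback usually means the Linux out-of-memory killer terminated the process, which fits the RAM theory. Something like the sketch below (not from the repo, just psutil sampling the converter from a second cell) could confirm it:

```python
# A minimal sketch (not part of the repo): launch the converter and sample
# its resident memory every few seconds, to see whether it climbs toward
# the Colab RAM limit right before the kill.
import subprocess
import time

import psutil

proc = subprocess.Popen(["python", "convert_model_to_torch.py"])
tracker = psutil.Process(proc.pid)
while proc.poll() is None:
    try:
        rss_gib = tracker.memory_info().rss / 2**30
    except psutil.NoSuchProcess:
        break  # converter exited between poll() and the sample
    print(f"converter RSS: {rss_gib:.1f} GiB")
    time.sleep(10)
print("exit code:", proc.returncode)
```

If the resident memory is near the roughly 12 GiB limit of a standard Colab runtime when the kill happens, that would confirm the memory theory rather than disk usage.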