Comments (3)
Then, when converting the weights to Hugging Face format, I hit an error: the dimension and the target size of the expand operation do not match.
(myconda) root@ZaodV6:/mnt/tet/OpenChatKit-main# mkdir huggingface_models \
    && python tools/convert_to_hf_gptneox.py \
        --config-name EleutherAI/pythia-6.9b-deduped \
        --ckpt-path model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10 \
        --save-path huggingface_models/Pythia-Chat-Base-3B \
        --n-stages 4 \
        --n-layer-per-stage 8 \
        --fp16
loading config...
loaded config.
loading tokenizer...
loaded tokenizer.
creating empty model...
created empty model.
loading model ckpt...
loading stage 0
Traceback (most recent call last):
  File "tools/convert_to_hf_gptneox.py", line 123, in <module>
    load_decentralized_checkpoint(
  File "tools/convert_to_hf_gptneox.py", line 48, in load_decentralized_checkpoint
    model.gpt_neox.embed_in.weight.data[:] = _tmp['embed_in.weight']
RuntimeError: The expanded size of the tensor (4096) must match the existing size (2560) at non-singleton dimension 1. Target sizes: [50432, 4096]. Tensor sizes: [50432, 2560]
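The two sizes in the error already point at the cause: `--config-name EleutherAI/pythia-6.9b-deduped` creates an empty model with hidden size 4096, while this checkpoint was trained with `--embedding-dim 2560`, the RedPajama-INCITE-Chat-3B width, so the conversion needs a matching config (and, presumably, `--n-stages`/`--n-layer-per-stage` values that match how the checkpoint was actually saved). A minimal sketch to confirm what the shard really contains; the glob pattern is an assumption about the checkpoint layout, not OpenChatKit's documented format:

```python
# Hedged sketch: print the embedding width stored in the stage-0 shard.
# The *.pt glob is an assumption; adjust to the files actually present.
import glob
import torch

ckpt_dir = "model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10"
shard = sorted(glob.glob(f"{ckpt_dir}/*.pt"))[0]

state = torch.load(shard, map_location="cpu")
# 'embed_in.weight' is the key named in the traceback above.
print(state["embed_in.weight"].shape)  # expect torch.Size([50432, 2560])
```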
bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh

My configuration changes are as follows:

--lr 1e-5
--seq-length 2048
--batch-size 8
--micro-batch-size 1
--gradient-accumulate-step 1
--num-layers 2
--embedding-dim 2560
--world-size 1
--pipeline-group-size 1
--data-group-size 1

(trap 'kill 0' SIGINT; python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0 &

My environment:
GPU: NVIDIA RTX A4000
GPU memory: 16 GB
CPUs available: 8
RAM: 60 GB
Free disk space: 200 GB
error log:
Rank 0 node forward pass 0/1 takes 1.84s
{'loss': 16.892578125, 'lr': 1e-05}
Rank 0 node backward pass 0/1 takes 1.09s
cuda:0 cuda:0 cuda:0
!!! Warning: find inf in fp16 optimizer-step() !!!
/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
after cuda sync 0
Rank 0 node optimizer step takes 0.07s
Rank 0 node whole iteration takes 3.00s
Rank 0 node forward pass 0/1 takes 0.36s
{'loss': 16.744140625, 'lr': 9e-06}

Despite this error log, a checkpoint still seems to be produced in model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10/. Is that model correct?
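For what it's worth, the two warnings are consistent with each other: a dynamic-loss-scaling fp16 optimizer skips `optimizer.step()` whenever an inf/nan gradient shows up (common in the first iterations, while the loss scale calibrates), and a skipped step is exactly what makes PyTorch warn that `lr_scheduler.step()` ran before `optimizer.step()`. OpenChatKit ships its own fp16 optimizer; the sketch below only illustrates the mechanism with stock `torch.cuda.amp`, it is not OpenChatKit's code:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

def train_step(model, optimizer, scheduler, loss_fn, inputs, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()  # backprop on the scaled loss
    scaler.step(optimizer)         # silently skipped if inf/nan grads found
    scaler.update()                # lowers the loss scale after a skip
    scheduler.step()               # runs even when the step was skipped,
                                   # which triggers the UserWarning above
    return loss.item()
```

As a rule of thumb, an occasional skip at step 0 is normal; only a loss scale that keeps shrinking over many consecutive iterations suggests a real problem.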
This model has 32 layers, so if you only have one GPU, --num-layers must be 32
https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1/blob/main/config.json#L16
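Expanding on that: under pipeline parallelism each stage holds `--num-layers` transformer blocks, so `--num-layers` × `--pipeline-group-size` has to cover the model's full depth, and the failing run above used 2 × 1 against a 32-layer model. A tiny sanity check under that assumption about the partitioning (the variable names are illustrative, not OpenChatKit's):

```python
# Hedged sanity check; assumes each pipeline stage owns --num-layers blocks.
total_layers = 32          # num_hidden_layers in the config.json linked above
num_layers = 2             # --num-layers from the failing run
pipeline_group_size = 1    # --pipeline-group-size from the failing run

covered = num_layers * pipeline_group_size
assert covered == total_layers, (
    f"covers {covered} of {total_layers} layers; "
    "on a single GPU set --num-layers 32"
)
```

With the failing values this assertion raises; with `--num-layers 32` on one stage it passes.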