
chatglm-finetune-lora's People

Contributors

lich99, paplorinc


chatglm-finetune-lora's Issues

Error

return {'input_ids': torch.tensor(input_ids).long(),
        'attention_mask': attention_mask,
        'labels': torch.stack(labels),
        'position_ids': torch.stack(position_ids)}
Why are both torch.tensor() and torch.stack() used here?
Without torch.stack() I sometimes get ValueError: expected sequence of length 50 at dim 1 (got 49). Could someone explain why?
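A minimal sketch of why both calls appear (toy tensors, not the repo's data): torch.stack needs equal-length tensors, and torch.tensor on a nested Python list raises exactly the ValueError quoted above when the rows differ in length, so variable-length fields have to be padded to a common length first.

import torch

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]           # lengths 3 and 2

# torch.stack(seqs)                 -> RuntimeError: stack expects each tensor to be equal size
# torch.tensor([[1, 2, 3], [4, 5]]) -> ValueError: expected sequence of length 3 at dim 1 (got 2)

pad_id = 0                                                        # hypothetical pad value
max_len = max(len(s) for s in seqs)
padded = torch.stack([torch.cat([s, torch.full((max_len - len(s),), pad_id)]) for s in seqs])
print(padded)                                                     # tensor([[1, 2, 3], [4, 5, 0]])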

[deepspeed] OVERFLOW!

During training I frequently see what looks like a gradient-overflow warning. Could someone explain the cause and how to fix it?
Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
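A hedged note rather than a confirmed fix: this message comes from fp16 dynamic loss scaling in DeepSpeed, which skips steps while it searches for a workable loss scale. A few skipped steps at the start are normal; if it keeps happening, switching mixed precision to bf16 (on GPUs that support it) removes loss scaling entirely. A sketch using accelerate's own arguments, with example values rather than the repo's exact ones:

from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=8)
accelerator = Accelerator(
    mixed_precision='bf16',            # 'fp16' uses dynamic loss scaling; 'bf16' does not
    gradient_accumulation_steps=8,
    deepspeed_plugin=deepspeed_plugin,
)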

VRAM question

Is it possible to train on a device with 22 GB or even 20 GB of VRAM?

Positional encoding

Does this project use 2D positional encoding?
But doesn't ChatGLM use rotary positional encoding?
@lich99

position_ids.append(torch.stack([
    torch.arange(0, _max_length, device=device),
    torch.concat([torch.zeros(context_length - 1, device=device),
                  torch.arange(0, _max_length - context_length + 1, device=device)])
]).long())
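A small illustration with toy numbers (not from the repo) of what that snippet builds: ChatGLM-6B applies rotary embeddings, but it feeds them a 2D position tensor, where row 0 is the absolute position and row 1 is the block position that only starts counting inside the generated span, so the two statements above are not in conflict.

import torch

_max_length, context_length = 8, 5                                   # hypothetical sizes
position_ids = torch.stack([
    torch.arange(0, _max_length),                                    # row 0: absolute positions 0..7
    torch.cat([torch.zeros(context_length - 1, dtype=torch.long),    # row 1: zeros over the prompt,
               torch.arange(0, _max_length - context_length + 1)]),  #        then 0,1,2,... afterwards
])
print(position_ids)
# tensor([[0, 1, 2, 3, 4, 5, 6, 7],
#         [0, 0, 0, 0, 0, 1, 2, 3]])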

Can the quantized chatGLM-6b-int4 model be used for finetuning?

As a student I wanted to try your code on Colab,
but the GPU only has 15 GB and cannot fit the full model, so I switched the model path to the quantized chatGLM-6b-int4 model.
However, running train.py fails at line 215, outputs = model(**batch), with: self and mat2 must have same dtype.
Is this caused by a mismatch between the quantized model's parameters and the inputs? Can finetuning be done on the int4-quantized model at all?
If convenient, could you write a demo for finetuning the quantized model? Many thanks! (A rough sketch of one workaround follows.)
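A rough sketch of one workaround, not an official recipe: the dtype error usually means the freshly inserted LoRA weights are still fp32 while the quantized backbone computes in fp16, so casting the trainable LoRA parameters to half often clears it (whether int4 finetuning then converges well is a separate question).

import torch
from transformers import AutoTokenizer, AutoModel
import lora_utils.insert_lora

checkpoint = "THUDM/chatglm-6b-int4"                       # assumed path to the int4 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).half().cuda()
model = lora_utils.insert_lora.get_lora_model(model, {'r': 32, 'lora_alpha': 32,
                                                      'lora_dropout': 0.1,
                                                      'enable_lora': [True, True, True]})
for name, p in model.named_parameters():
    if 'lora_' in name:                                    # only the LoRA adapters train
        p.data = p.data.half()                             # match the fp16 activations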

Question about ZeRO

The README says: Try ZeRO 2 and no offload first, unless you encounter OOM. ZeRO 2 (no offload) > ZeRO 2 (offload) > ZeRO 3 (no offload) > ZeRO 3 (offload).
Why is ZeRO 2 the recommendation? Is it because ZeRO 3 and ZeRO 3 with offload, while using GPU memory more efficiently, sacrifice computation precision and therefore hurt the trained model's quality?
Thanks for any reply!
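One hedged reading of that ranking (the argument names below are accelerate's DeepSpeedPlugin, not the repo's code): the ordering is about speed rather than precision. ZeRO 3 and CPU offload add extra parameter partitioning and host-device traffic, so they run slower, but they compute numerically equivalent updates.

from accelerate import DeepSpeedPlugin

plugin_fastest     = DeepSpeedPlugin(zero_stage=2)                                  # try first
plugin_if_oom      = DeepSpeedPlugin(zero_stage=2, offload_optimizer_device='cpu')  # then this
plugin_last_resort = DeepSpeedPlugin(zero_stage=3, offload_optimizer_device='cpu',  # only if still OOM
                                     offload_param_device='cpu')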

CUDA error: device-side assert triggered

The model loads normally; the error appears after sending a prompt.
Loading checkpoint shards: 100%|█████████████████████████| 7/7 [00:05<00:00, 1.18it/s]
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

binary_path: F:\Anaconda\envs\chatglm\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll
CUDA SETUP: Loading binary F:\Anaconda\envs\chatglm\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll...
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.

The error message is as follows:
┌─────────────────────────────── Traceback (most recent call last) ────────────── ──────────────────┐
│ F:\ChatGLM\chatglm3_6b_finetune\inference_hf.py:51 in main │
│ │
│ 48 │ │ prompt: Annotated[str, typer.Option(help='')], │
│ 49 ): │
│ 50 │ model, tokenizer = load_model_and_tokenizer(model_dir) │
│ > 51 │ response, _ = model.chat(tokenizer, prompt) │
│ 52 │ print(response) │
│ 53 │
│ 54 │
│ │
│ F:\Anaconda\envs\chatglm\lib\site-packages\torch\autograd\grad_mode.py:27 in decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ > 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ C:\Users\Administrator.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chat │
│ glm.py:1042 in chat │
│ │
│ 1039 │ │ inputs = inputs.to(self.device) │
│ 1040 │ │ eos_token_id = [tokenizer.eos_token_id, tokenizer.get_command("<|user|>"), │
│ 1041 │ │ │ │ │ │ tokenizer.get_command("<|observation|>")] │
│ > 1042 │ │ outputs = self.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id) │
│ 1043 │ │ outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):-1] │
│ 1044 │ │ response = tokenizer.decode(outputs) │
│ 1045 │ │ history.append({"role": role, "content": query}) │
│ │
│ F:\Anaconda\envs\chatglm\lib\site-packages\torch\autograd\grad_mode.py:27 in decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ > 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ F:\Anaconda\envs\chatglm\lib\site-packages\transformers\generation\utils.py:1575 in generate │
│ │
│ 1572 │ │ │ ) │
│ 1573 │ │ │ │
│ 1574 │ │ │ # 13. run sample │
│ > 1575 │ │ │ result = self._sample( │
│ 1576 │ │ │ │ input_ids, │
│ 1577 │ │ │ │ logits_processor=prepared_logits_processor, │
│ 1578 │ │ │ │ logits_warper=logits_warper, │
│ │
│ F:\Anaconda\envs\chatglm\lib\site-packages\transformers\generation\utils.py:2697 in _sample │
│ │
│ 2694 │ │ │ model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) │
│ 2695 │ │ │ │
│ 2696 │ │ │ # forward pass to get next token │
│ > 2697 │ │ │ outputs = self( │
│ 2698 │ │ │ │ **model_inputs, │
│ 2699 │ │ │ │ return_dict=True, │
│ 2700 │ │ │ │ output_attentions=output_attentions, │
│ │
│ F:\Anaconda\envs\chatglm\lib\site-packages\torch\nn\modules\module.py:1130 in _call_impl │
│ │
│ 1127 │ │ # this function, and just call forward. │
│ 1128 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1129 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ > 1130 │ │ │ return forward_call(*input, **kwargs) │
│ 1131 │ │ # Do not call functions when jit is used │
│ 1132 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1133 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ C:\Users\Administrator.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chat │
│ glm.py:941 in forward │
│ │
│ 938 │ │ use_cache = use_cache if use_cache is not None else self.config.use_cache │
│ 939 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │
│ 940 │ │ │
│ > 941 │ │ transformer_outputs = self.transformer( │
│ 942 │ │ │ input_ids=input_ids, │
│ 943 │ │ │ position_ids=position_ids, │
│ 944 │ │ │ attention_mask=attention_mask, │
│ │
│ F:\Anaconda\envs\chatglm\lib\site-packages\torch\nn\modules\module.py:1130 in _call_impl │
│ │
│ 1127 │ │ # this function, and just call forward. │
│ 1128 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1129 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ > 1130 │ │ │ return forward_call(*input, **kwargs) │
│ 1131 │ │ # Do not call functions when jit is used │
│ 1132 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1133 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ C:\Users\Administrator.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chat │
│ glm.py:822 in forward │
│ │
│ 819 │ │ │ │ │ │ │ │ │ │ │ attention_mask], dim=-1) │
│ 820 │ │ │
│ 821 │ │ if full_attention_mask is None: │
│ > 822 │ │ │ if (attention_mask is not None and not attention_mask.all()) or (past_key_va │
│ 823 │ │ │ │ full_attention_mask = self.get_masks(input_ids, past_key_values, padding │
│ 824 │ │ │
│ 825 │ │ # Rotary positional embeddings │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The environment is as follows:
absl-py==2.1.0
accelerate==0.27.2
aiofiles==23.2.1
aiohttp==3.9.3
aiosignal==1.3.1
altair==5.2.0
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
anyio==4.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arxiv==2.1.0
async-timeout==4.0.3
attrs==23.2.0
azure-core==1.30.1
azure-storage-blob==12.19.1
backoff==2.2.1
beautifulsoup4==4.12.3
bitsandbytes==0.37.1
bitsandbytes-windows==0.37.5
blinker==1.7.0
blis==0.7.11
Brotli==1.1.0
cachetools==5.3.3
catalogue==2.0.10
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
cloudpathlib==0.16.0
colorama==0.4.6
coloredlogs==15.0.1
confection==0.1.4
contourpy==1.2.0
cpm-kernels==1.0.11
cryptography==42.0.5
curl_cffi==0.6.2
cycler==0.12.1
cymem==2.0.8
dashscope==1.13.6
dataclasses-json==0.6.4
datasets==2.18.0
deepdiff==6.7.1
Deprecated==1.2.14
deprecation==2.1.0
dill==0.3.8
distro==1.9.0
duckduckgo_search==5.1.0
effdet==0.4.1
einops==0.7.0
emoji==2.10.1
environs==9.5.0
et-xmlfile==1.1.0
exceptiongroup==1.2.0
faiss-cpu==1.7.4
fake-useragent==1.5.1
fastapi==0.109.0
feedparser==6.0.10
ffmpy==0.3.2
filelock==3.13.1
filetype==1.2.0
flatbuffers==24.3.7
fonttools==4.49.0
frozenlist==1.4.1
fschat==0.2.35
fsspec==2024.2.0
gitdb==4.0.11
GitPython==3.1.42
google-auth==2.29.0
google-auth-oauthlib==0.4.6
gradio==3.50.0
gradio_client==0.6.1
greenlet==3.0.3
grpcio==1.60.0
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==1.0.4
httpx==0.27.0
httpx-sse==0.4.0
huggingface-hub==0.21.4
humanfriendly==10.0
hyperframe==6.0.1
idna==3.4
imageio==2.34.0
importlib_metadata==7.0.2
importlib_resources==6.3.1
iniconfig==2.0.0
iopath==0.1.10
isodate==0.6.1
jieba==0.42.1
Jinja2==3.1.3
joblib==1.3.2
jsonpatch==1.33
jsonpath-python==1.0.6
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
langchain==0.0.354
langchain-community==0.0.20
langchain-core==0.1.23
langchain-experimental==0.0.47
langcodes==3.3.0
langdetect==1.0.9
langsmith==0.0.87
latex2mathml==3.77.0
layoutparser==0.3.4
lazy_loader==0.3
llama-index==0.9.35
loguru==0.7.2
lxml==5.1.0
Markdown==3.5.2
markdown-it-py==3.0.0
markdown2==2.4.13
markdownify==0.11.6
MarkupSafe==2.1.5
marshmallow==3.21.1
matplotlib==3.8.3
mdtex2html==1.3.0
mdurl==0.1.2
metaphor-python==0.1.23
minio==7.2.5
mkl-fft==1.3.8
mkl-random==1.2.4
mkl-service==2.4.0
mpmath==1.3.0
msg-parser==1.2.0
multidict==6.0.5
multiprocess==0.70.16
murmurhash==1.0.10
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.2.1
nh3==0.2.15
nltk==3.8.1
numexpr==2.8.6
numpy==1.24.4
oauthlib==3.2.2
olefile==0.47
omegaconf==2.3.0
onnx==1.15.0
onnxruntime==1.15.1
openai==1.9.0
opencv-python==4.9.0.80
openpyxl==3.1.2
ordered-set==4.1.0
orjson==3.9.15
packaging==23.2
pandas==2.0.3
pathlib==1.0.1
pdf2image==1.17.0
pdfminer.six==20231228
pdfplumber==0.11.0
peft==0.9.0
pikepdf==8.4.1
Pillow==9.5.0
pillow_heif==0.15.0
pip==23.3.1
pluggy==1.4.0
portalocker==2.8.2
preshed==3.0.9
prompt-toolkit==3.0.43
protobuf==3.20.3
psutil==5.9.8
pyarrow==15.0.1
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pyclipper==1.3.0.post5
pycocotools==2.0.7
pycparser==2.21
pycryptodome==3.20.0
pydantic==1.10.13
pydantic_core==2.16.3
pydash==7.0.7
pydeck==0.8.1b0
pydub==0.25.1
PyExecJS==1.5.1
Pygments==2.17.2
PyJWT==2.8.0
pymilvus==2.4.0
PyMuPDF==1.23.16
PyMuPDFb==1.23.9
pypandoc==1.13
pyparsing==3.1.2
pypdf==4.1.0
pypdfium2==4.28.0
pyreadline3==3.4.1
pytesseract==0.3.10
pytest==7.4.3
python-dateutil==2.9.0.post0
python-decouple==3.8
python-docx==1.1.0
python-dotenv==1.0.1
python-iso639==2024.2.7
python-magic==0.4.27
python-magic-bin==0.4.14
python-multipart==0.0.9
python-pptx==0.6.23
pytz==2024.1
pywencai==0.12.2
pywin32==306
PyYAML==6.0.1
rapidfuzz==3.6.2
rapidocr-onnxruntime==1.3.8
referencing==0.33.0
regex==2023.12.25
requests==2.31.0
requests-oauthlib==2.0.0
rich==13.7.1
rouge-chinese==1.0.3
rpds-py==0.18.0
rsa==4.9
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
ruff==0.3.3
safetensors==0.4.2
scikit-image==0.22.0
scikit-learn==1.4.1.post1
scipy==1.12.0
semantic-version==2.10.0
sentence-transformers==2.2.2
sentencepiece==0.2.0
setuptools==68.2.2
sgmllib3k==1.0.0
shapely==2.0.3
shellingham==1.5.4
shortuuid==1.0.13
simplejson==3.19.2
six==1.16.0
smart-open==6.4.0
smmap==5.0.1
sniffio==1.3.1
socksio==1.0.0
soupsieve==2.5
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
SQLAlchemy==2.0.19
srsly==2.4.8
sse-starlette==1.8.2
starlette==0.35.0
streamlit==1.30.0
streamlit-aggrid==0.3.4.post3
streamlit-antd-components==0.3.1
streamlit-chatbox==1.1.11
streamlit-feedback==0.1.3
streamlit-modal==0.1.0
streamlit-option-menu==0.3.12
strsimpy==0.2.1
svgwrite==1.4.3
sympy==1.12
tabulate==0.9.0
tenacity==8.2.3
tensorboard==2.10.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
text2vec==1.2.9
thinc==8.2.3
threadpoolctl==3.3.0
tifffile==2024.2.12
tiktoken==0.5.2
timm==0.9.16
tokenizers==0.15.2
toml==0.10.2
tomli==2.0.1
tomlkit==0.12.0
toolz==0.12.1
torch==1.12.0+cu113
torchaudio==0.12.0+cu113
torchvision==0.13.0+cu113
tornado==6.4
tqdm==4.66.1
transformers==4.39.3
transformers-stream-generator==0.0.4
typer==0.9.0
typing_extensions==4.10.0
typing-inspect==0.9.0
tzdata==2024.1
tzlocal==5.2
ujson==5.9.0
unstructured==0.11.0
unstructured-client==0.22.0
unstructured-inference==0.7.15
unstructured.pytesseract==0.3.12
urllib3==2.1.0
uvicorn==0.28.0
validators==0.22.0
visdom==0.2.4
wasabi==1.1.2
watchdog==3.0.0
wavedrom==2.0.3.post3
wcwidth==0.2.13
weasel==0.3.4
websocket-client==1.7.0
websockets==12.0
Werkzeug==3.0.2
wheel==0.41.2
win32-setctime==1.1.0
wrapt==1.16.0
xformers==0.0.23.post1
xlrd==2.0.1
XlsxWriter==3.2.0
xxhash==3.4.1
yarl==1.9.4
youtube-search==2.1.2
zipp==3.18.0

About multi-GPU

Is there an example command and YAML config for launching train.py with accelerate?
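A hedged sketch, assuming the repo's config/default_config.yaml holds the DeepSpeed settings; launching happens from the shell rather than from Python, and the YAML field values below are examples, not the repo's exact configuration:

# Launch:
#   accelerate launch --config_file config/default_config.yaml train.py
#
# `accelerate config` can generate such a YAML interactively; for multi-GPU the fields
# to double-check are roughly:
#   distributed_type: DEEPSPEED     # or MULTI_GPU if you skip DeepSpeed
#   num_processes: 4                # must not exceed the visible GPU count
#   mixed_precision: bf16
#
# train.py itself then only needs the usual accelerate pattern:
from accelerate import Accelerator

accelerator = Accelerator()
# model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)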

Loss is NaN when running the training test in example.ipynb

The preceding code is unchanged;

model(**batch).loss

was changed to

for i in range(10):
    output = model(**batch)
    loss = output.loss
    loss.backward()
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    print(loss.detach().float())

Output

tensor(3.2207, device='cuda:0')
tensor(nan, device='cuda:0')
tensor(nan, device='cuda:0')
tensor(nan, device='cuda:0')
tensor(nan, device='cuda:0')
tensor(nan, device='cuda:0')
tensor(nan, device='cuda:0')
tensor(nan, device='cuda:0')
tensor(nan, device='cuda:0')
tensor(nan, device='cuda:0')

Only the first step computes a normal loss. Why is that?
The lr is set to 1e-8.
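A hedged guess at the cause rather than a confirmed one: if the notebook put the model in fp16 with .half(), a plain AdamW step overflows easily and the weights turn to NaN right after the first update, which matches the output above. One sketch of a safer loop (model, batch, optimizer and lr_scheduler as already defined in the notebook) keeps the weights in fp32 and uses autocast with dynamic loss scaling:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()                       # dynamic loss scaling for fp16 forward passes
for i in range(10):
    with autocast(dtype=torch.float16):     # weights stay fp32, activations run in fp16
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)                  # the step is skipped if gradients overflowed
    scaler.update()
    lr_scheduler.step()
    optimizer.zero_grad()
    print(loss.detach().float())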

NameError: name 'train_dataloader' is not defined

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=int(len(train_dataloader) / accumulate_step),
    num_training_steps=(int(len(train_dataloader) / accumulate_step) * NUM_EPOCHS),
)
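A hedged sketch of the fix: the scheduler is being built before train_dataloader exists, so the dataloader has to be constructed first. The module and function names below are the repo's own (from train.py / example.ipynb); the data path, and that tokenizer, batch_size and accumulate_step are already defined, are assumptions.

from torch.utils.data import DataLoader
import dataset.GLM as GLM_Data
import dataset.Alpaca as Alpaca_Data

pairs = Alpaca_Data.load('./data/alpaca_data.json')             # assumed data path
pairs_encoded = GLM_Data.encode_pairs(pairs, tokenizer)
train_dataset = GLM_Data.GLMDataset(pairs_encoded)
train_dataloader = DataLoader(dataset=train_dataset, collate_fn=GLM_Data.collate_fn,
                              shuffle=True, batch_size=batch_size)

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=int(len(train_dataloader) / accumulate_step),
    num_training_steps=int(len(train_dataloader) / accumulate_step) * NUM_EPOCHS,
)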

RuntimeError: CUDA error: invalid device ordinal

Running accelerate launch --config_file config/default_config.yaml train_new.py gives the following error:

Traceback (most recent call last):
  File "/home/searchgpt/yq/ChatGLM-finetune-LoRA/train_new.py", line 38, in <module>
    accelerator = Accelerator(mixed_precision=mixed_precision, gradient_accumulation_steps=accumulate_step, deepspeed_plugin=deepspeed_plugin)
  File "/home/searchgpt/anaconda3/envs/stanford_alpaca/lib/python3.10/site-packages/accelerate/accelerator.py", line 340, in __init__
    self.state = AcceleratorState(
  File "/home/searchgpt/anaconda3/envs/stanford_alpaca/lib/python3.10/site-packages/accelerate/state.py", line 539, in __init__
    PartialState(cpu, **kwargs)
  File "/home/searchgpt/anaconda3/envs/stanford_alpaca/lib/python3.10/site-packages/accelerate/state.py", line 123, in __init__
    torch.cuda.set_device(self.device)
  File "/home/searchgpt/anaconda3/envs/stanford_alpaca/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Environment: a single machine with 2 A100 GPUs.
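A hedged diagnosis: "invalid device ordinal" usually means the launcher asked for a GPU index that is not visible, for example num_processes in default_config.yaml is larger than the actual GPU count, or CUDA_VISIBLE_DEVICES hides some of the cards.

import torch
print(torch.cuda.device_count())    # should print 2 on this machine

# Then make sure the accelerate config matches, e.g. in config/default_config.yaml:
#   num_processes: 2
# or restrict visibility explicitly when launching:
#   CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file config/default_config.yaml train_new.py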

Training loss becomes NaN

After backpropagation, the loss becomes NaN.
accumulate_step is set to 16.
The system CUDA is 10.2, while the CUDA bundled with PyTorch is 11.6.
The code is as follows (single GPU; see the note after the code):

import os
import tqdm
import json
import torch
import loralib as lora
import lora_utils.insert_lora
import dataset.GLM as GLM_Data
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModel
# from accelerate import Accelerator, DeepSpeedPlugin
from transformers import get_linear_schedule_with_warmup

device = "cuda"
checkpoint = "THUDM/chatglm-6b"
# mixed_precision = 'bf16'
lora_config = {
    'r': 32,
    'lora_alpha':32,
    'lora_dropout':0.1,
    'enable_lora':[True, True, True],
}
max_length = 256
LR = 2e-5
NUM_EPOCHS = 2
batch_size = 4
accumulate_step = 8
warm_up_ratio = 0.1


tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, revision = 'main')
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True, revision = 'main')
model = lora_utils.insert_lora.get_lora_model(model, lora_config)


import dataset.Alpaca as Alpaca_Data

pairs = Alpaca_Data.load('./data/alpaca_data.json')
GLM_Data.device = device
pairs_encoded = GLM_Data.encode_pairs(pairs, tokenizer)
pairs_encoded = list(filter(lambda pair: len(pair['prompt'])+len(pair['completion']) <= max_length, pairs_encoded))
train_dataset = GLM_Data.GLMDataset(pairs_encoded)
train_dataloader = DataLoader(dataset=train_dataset, collate_fn = GLM_Data.collate_fn, shuffle=True, batch_size=batch_size)


optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=int(len(train_dataloader) / accumulate_step * warm_up_ratio),
    num_training_steps=(int(len(train_dataloader) / accumulate_step) * NUM_EPOCHS),
)

# model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
# model.to(device).train()
from torch.cuda.amp import autocast

LR = 2e-5
NUM_EPOCHS = 2
accumulate_step = 16
version = 'test'

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=int(len(train_dataloader) / accumulate_step),
    num_training_steps=(int(len(train_dataloader) / accumulate_step) * NUM_EPOCHS),
)


model.half().to(device).train()

for epoch in range(NUM_EPOCHS):
    epoch_loss_local = 0
    for step, batch in enumerate(t:=tqdm.tqdm(train_dataloader)):
        batch = {k: v.to('cuda') for k, v in batch.items()}
        outputs = model(**batch)
        loss_d = outputs.loss.detach()
        epoch_loss_local += loss_d
        t.set_description(f"loss: {epoch_loss_local.cpu().float() / step}")
        loss = outputs.loss / accumulate_step
        loss.backward()
        if (step+1) % accumulate_step == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
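A hedged observation on the script above rather than a confirmed fix: model.half() puts every weight in fp16 and then runs plain AdamW with no loss scaling, which very often yields NaN right after the first backward pass. On cards that support it, bfloat16 has the same memory footprint but a much wider exponent range, and is effectively what the repo's accelerate/DeepSpeed path uses with mixed_precision='bf16':

model.to(torch.bfloat16).to(device).train()      # instead of model.half(); requires a bf16-capable GPU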

LORAConfig报错:ValueError: Target modules ['q', 'k', 'v'] not found in the base model. Please check the target modules and try again.

I finished training with train.py and obtained ./saved/finetune_0.pt. Since there is no inference script, I used the final inference cell in LoRA_finetune_with_stanford_alpaca.ipynb and got this error:

Traceback (most recent call last):
  File "inference.py", line 49, in <module>
    module.query_key_value = peft.tuners.lora.LoraModel(config, module.query_key_value)
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python38\lib\site-packages\peft\tuners\lora.py", line 118, in __init__
    self._find_and_replace()
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python38\lib\site-packages\peft\tuners\lora.py", line 181, in _find_and_replace
    raise ValueError(
ValueError: Target modules ['q', 'k', 'v'] not found in the base model. Please check the target modules and try again.

The LoraConfig used is (the target modules are identical in the ipynb and train.py):

config = LoraConfig(
    peft_type="LORA",
    r=32,
    lora_alpha=32,
    target_modules=["q", "k", "v"],
    lora_dropout=0.1,
)

Why does this error occur? I hit it before but could not find anything by searching. Has anyone managed to run inference successfully?
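One hedged way around this (the module name is verified for the original chatglm-6b checkpoint; newer variants can differ, and model is assumed to be the loaded base model): peft matches target_modules against the names returned by model.named_modules(). ChatGLM-6B has a single fused attention projection called query_key_value, so ['q', 'k', 'v'] matches nothing when the config is applied to the whole model.

from peft import LoraConfig, get_peft_model

# Inspect the real module names first:
# print([n for n, _ in model.named_modules() if 'query_key_value' in n][:3])

config = LoraConfig(
    peft_type="LORA",
    r=32,
    lora_alpha=32,
    target_modules=["query_key_value"],     # target the fused attention projection by name
    lora_dropout=0.1,
)
model = get_peft_model(model, config)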

How to build a question-answer dataset

I am building my own dataset of question-answer pairs (screenshot omitted). Is there any problem with organizing it in that format?

Any advice would be appreciated, thanks! (A rough sketch of the expected format follows.)
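A rough sketch of the format the repo's Alpaca loader appears to expect (field names follow data/alpaca_data.json; the output path and the Chinese-free example text are assumptions):

import json

pairs = [
    {
        "instruction": "Who are you?",
        "input": "",
        "output": "I am an assistant fine-tuned with LoRA on ChatGLM-6B."
    },
]
with open("./data/my_qa_data.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=2)

# dataset.Alpaca.load('./data/my_qa_data.json') then turns these into prompt/completion pairs.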

Training runs out of GPU memory

What is the minimum GPU memory needed to train this model? I have a V100 32 GB.

Here is my situation: I have more than 4 GPUs to run train.py, but it still runs out of memory. Checking usage, one of the GPUs overflows and triggers the error (full log below, with a few mitigation ideas sketched after it). How can I solve this?

WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
File "", line 1, in
FileNotFoundError: [Errno 2] No such file or directory: '/home/cike/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/551a50efec3acc5a9b94de8ec46d33d0f81919f7/modeling_chatglm.py'
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:15<00:00, 1.92s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:15<00:00, 1.95s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:15<00:00, 1.96s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:15<00:00, 1.97s/it]
trainable_params:22020096 (0.35%), non_trainable_params:6255206400
trainable_params:22020096 (0.35%), non_trainable_params:6255206400
trainable_params:22020096 (0.35%), non_trainable_params:6255206400
trainable_params:22020096 (0.35%), non_trainable_params:6255206400
[2023-04-06 08:33:02,712] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-06 08:33:02,747] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-06 08:33:02,756] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-06 08:33:02,874] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-06 08:33:24,442] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-04-06 08:33:24,445] [INFO] [logging.py:93:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-04-06 08:33:24,445] [INFO] [logging.py:93:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-04-06 08:33:24,509] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-04-06 08:33:24,509] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-04-06 08:33:24,509] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2023-04-06 08:33:24,510] [INFO] [stage_1_and_2.py:144:init] Reduce bucket size 500,000,000
[2023-04-06 08:33:24,510] [INFO] [stage_1_and_2.py:145:init] Allgather bucket size 500,000,000
[2023-04-06 08:33:24,510] [INFO] [stage_1_and_2.py:146:init] CPU Offload: False
[2023-04-06 08:33:24,510] [INFO] [stage_1_and_2.py:147:init] Round robin gradient partitioning: False
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /home/cike/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.35059165954589844 seconds
Loading extension module utils...
Time to load utils op: 0.40517687797546387 seconds
Loading extension module utils...
Time to load utils op: 0.40523695945739746 seconds
Loading extension module utils...
Time to load utils op: 0.40430521965026855 seconds
Rank: 3 partition count [4] and sizes[(5505024, False)]
Rank: 2 partition count [4] and sizes[(5505024, False)]
Rank: 0 partition count [4] and sizes[(5505024, False)]
Rank: 1 partition count [4] and sizes[(5505024, False)]
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00042438507080078125 seconds
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Time to load utils op: 0.00040459632873535156 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00035071372985839844 seconds
0%| | 0/12241 [00:00<?, ?it/s][2023-04-06 08:33:25,838] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-04-06 08:33:25,839] [INFO] [utils.py:830:see_memory_usage] MA 11.71 GB Max_MA 11.72 GB CA 11.75 GB Max_CA 12 GB
[2023-04-06 08:33:25,839] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 12.0 GB, percent = 4.8%
[2023-04-06 08:33:26,025] [INFO] [utils.py:829:see_memory_usage] After initializing optimizer states
[2023-04-06 08:33:26,025] [INFO] [utils.py:830:see_memory_usage] MA 11.76 GB Max_MA 11.82 GB CA 11.85 GB Max_CA 12 GB
[2023-04-06 08:33:26,026] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 12.0 GB, percent = 4.8%
[2023-04-06 08:33:26,026] [INFO] [stage_1_and_2.py:520:init] optimizer state initialized
[2023-04-06 08:33:26,092] [INFO] [utils.py:829:see_memory_usage] After initializing ZeRO optimizer
[2023-04-06 08:33:26,093] [INFO] [utils.py:830:see_memory_usage] MA 11.76 GB Max_MA 11.76 GB CA 11.85 GB Max_CA 12 GB
[2023-04-06 08:33:26,093] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 12.0 GB, percent = 4.8%
[2023-04-06 08:33:26,094] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-04-06 08:33:26,095] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-04-06 08:33:26,095] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-04-06 08:33:26,095] [INFO] [logging.py:93:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)]
[2023-04-06 08:33:26,096] [INFO] [config.py:1018:print] DeepSpeedEngine configuration:
[2023-04-06 08:33:26,096] [INFO] [config.py:1022:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-04-06 08:33:26,096] [INFO] [config.py:1022:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-06 08:33:26,096] [INFO] [config.py:1022:print] amp_enabled .................. False
[2023-04-06 08:33:26,096] [INFO] [config.py:1022:print] amp_params ................... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] bfloat16_enabled ............. True
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] checkpoint_parallel_write_pipeline False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] checkpoint_tag_validation_enabled True
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] checkpoint_tag_validation_fail False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f15140266a0>
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] communication_data_type ...... None
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] curriculum_enabled_legacy .... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] curriculum_params_legacy ..... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] data_efficiency_enabled ...... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] dataloader_drop_last ......... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] disable_allgather ............ False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] dump_state ................... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] dynamic_loss_scale_args ...... None
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_enabled ........... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_gas_boundary_resolution 1
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_layer_num ......... 0
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_max_iter .......... 100
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_stability ......... 1e-06
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_tol ............... 0.01
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_verbose ........... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] elasticity_enabled ........... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] fp16_auto_cast ............... None
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] fp16_enabled ................. False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] fp16_master_weights_and_gradients False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] global_rank .................. 0
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] grad_accum_dtype ............. None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] gradient_accumulation_steps .. 8
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] gradient_clipping ............ 0.0
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] gradient_predivide_factor .... 1.0
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] initial_dynamic_scale ........ 1
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] load_universal_checkpoint .... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] loss_scale ................... 1.0
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] memory_breakdown ............. False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] optimizer_legacy_fusion ...... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] optimizer_name ............... None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] optimizer_params ............. None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] pld_enabled .................. False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] pld_params ................... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] prescale_gradients ........... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] scheduler_name ............... None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] scheduler_params ............. None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] sparse_attention ............. None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] sparse_gradients_enabled ..... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] steps_per_print .............. inf
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] train_batch_size ............. 32
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] train_micro_batch_size_per_gpu 1
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] use_node_local_storage ....... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] wall_clock_breakdown ......... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] world_size ................... 4
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_allow_untested_optimizer True
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_enabled ................. True
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_force_ds_cpu_optimizer .. True
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_optimization_stage ...... 2
[2023-04-06 08:33:26,099] [INFO] [config.py:1007:print_user_config] json = {
"train_batch_size": 32,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 8,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none"
},
"offload_param": {
"device": "none"
},
"stage3_gather_16bit_weights_on_model_save": false
},
"steps_per_print": inf,
"bf16": {
"enabled": true
},
"fp16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00032019615173339844 seconds
loss: 2.640625: 0%| | 1/12241 [00:04<14:19:20, 4.21s/it]
Traceback (most recent call last):
File "/home/cike/zzp/LoRA/ChatGLM-finetune-LoRA/train.py", line 220, in
accelerator.backward(loss)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1677, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2008, in backward
self.allreduce_gradients()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1918, in allreduce_gradients
self.optimizer.overlapping_partition_gradients_reduce_epilogue()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 834, in overlapping_partition_gradients_reduce_epilogue
self.independent_gradient_partition_epilogue()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 720, in independent_gradient_partition_epilogue
self.reduce_ipg_grads()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1287, in reduce_ipg_grads
self.average_tensor(self.ipg_buffer[self.ipg_index])
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1018, in average_tensor
tensor_to_reduce = tensor.to(self.communication_data_type)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 1; 15.90 GiB total capacity; 12.84 GiB already allocated; 927.75 MiB free; 14.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
loss: 2.75: 0%| | 1/12241 [00:04<15:35:54, 4.59s/it]
Traceback (most recent call last):
File "/home/cike/zzp/LoRA/ChatGLM-finetune-LoRA/train.py", line 220, in
accelerator.backward(loss)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1677, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2008, in backward
self.allreduce_gradients()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1918, in allreduce_gradients
self.optimizer.overlapping_partition_gradients_reduce_epilogue()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 834, in overlapping_partition_gradients_reduce_epilogue
self.independent_gradient_partition_epilogue()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 720, in independent_gradient_partition_epilogue
self.reduce_ipg_grads()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1287, in reduce_ipg_grads
self.average_tensor(self.ipg_buffer[self.ipg_index])
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1018, in average_tensor
tensor_to_reduce = tensor.to(self.communication_data_type)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 3; 15.90 GiB total capacity; 12.88 GiB already allocated; 903.75 MiB free; 14.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107844 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107846 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 107845) of binary: /home/cike/anaconda/envs/lora/bin/python
Traceback (most recent call last):
File "/home/cike/anaconda/envs/lora/bin/accelerate", line 8, in
sys.exit(main())
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/commands/launch.py", line 908, in launch_command
deepspeed_launcher(args)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/commands/launch.py", line 647, in deepspeed_launcher
distrib_run.run(args)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2023-04-06_08:33:33
host : 4d9275d5570f
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 107847)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-04-06_08:33:33
host : 4d9275d5570f
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 107845)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
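A hedged set of mitigations rather than a confirmed fix: the failure happens inside ZeRO-2's gradient all-reduce, where a roughly 954 MiB communication buffer is requested on a 16 GB card that is already nearly full.

# 1. Follow the error message and reduce allocator fragmentation when launching:
#      PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 accelerate launch ... train.py
# 2. Shrink the ZeRO reduce/allgather buckets (the log shows 500,000,000 elements), e.g. in a
#    DeepSpeed config passed through accelerate's hf_ds_config (values here are illustrative):
ds_config_fragment = {
    "zero_optimization": {
        "stage": 2,
        "reduce_bucket_size": 5e7,        # smaller buckets mean smaller transient buffers
        "allgather_bucket_size": 5e7,
    }
}
# 3. Lower max_length or the per-GPU micro batch size; activation memory dominates at 16 GB.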

About distributed GPU training

How does this training support distributed multi-GPU training? Is there a parameter that lets a single model be trained in parallel? As far as I can tell, changing num_processes only trains a separate copy of the model on each GPU. Can one model be trained across multiple GPUs? My cards only have 16 GB each.
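A hedged pointer rather than a verified recipe: plain data-parallel launches replicate the whole model on every GPU, but DeepSpeed ZeRO stage 3 partitions the parameters themselves across ranks, so each 16 GB card only holds a shard of the model.

from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(zero_stage=3,
                                   offload_optimizer_device='cpu')   # offload if memory is still tight
accelerator = Accelerator(mixed_precision='bf16', deepspeed_plugin=deepspeed_plugin)
# then launch with: accelerate launch --num_processes 2 train.py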

Finetuning has no effect

I finetuned on "Who are you?" data, but the output is still the original model's answer. Is the newly finetuned .pt model being loaded correctly? As follows:
tokenizer = AutoTokenizer.from_pretrained("ChatGLM-6B/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("ChatGLM-6B/chatglm-6b", trust_remote_code=True).half().cuda()

Loading the finetuned model

peft_path = "ChatGLM-finetune-LoRA/saved/finetune_test/finetune_test_epoch_2.pt"
model.load_state_dict(torch.load(peft_path), strict=False)
model.eval()
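A hedged check rather than a confirmed diagnosis: with strict=False a key mismatch is silently ignored, so the LoRA weights may never actually land in the model. Two things to verify on the model loaded above: the LoRA modules must be inserted into the base model before loading, and the load result should report no leftover LoRA keys.

import torch
import lora_utils.insert_lora

lora_config = {'r': 32, 'lora_alpha': 32, 'lora_dropout': 0.1,
               'enable_lora': [True, True, True]}
model = lora_utils.insert_lora.get_lora_model(model, lora_config)   # add the LoRA layers first

peft_path = "ChatGLM-finetune-LoRA/saved/finetune_test/finetune_test_epoch_2.pt"
result = model.load_state_dict(torch.load(peft_path), strict=False)
print([k for k in result.unexpected_keys if 'lora_' in k])   # should be empty
model.half().cuda().eval()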

Has anyone tried training with 8 GPUs?

I want to finetune the model on multiple GPUs with very long text (over 1024 tokens). Since 1024 is already barely feasible on 4 cards, I want to use 8, but running with more than 4 cards raises an error. Has anyone tried finetuning on 8 cards? What is the longest text length that can be supported?

Error

'SelfAttention' object has no attribute 'query_key_valu'. Does anyone know how to solve this?
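A hedged way to diagnose this (the naming difference between ChatGLM generations is an assumption to verify against your checkpoint, and model is assumed to be the loaded base model): the LoRA-insertion code walks the model and expects an attribute named query_key_value on each attention block, so if the checkpoint names its attention modules differently the lookup fails. Listing the real names shows what to target instead.

attn_modules = {name for name, _ in model.named_modules()
                if 'attention' in name.lower() or 'query_key_value' in name}
print(sorted(attn_modules)[:10])   # adjust the insertion code to the names printed here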
