huanglk / transpeeder
train llama on a single A100 80G node using 🤗 transformers and 🚀 DeepSpeed Pipeline Parallelism
License: Apache License 2.0
I tried fine-tuning LLaMA 30B on a node with 2 A100 80GB GPUs. The script finished running in about 5 minutes with no output generated, and I couldn't find any error either.
The command used to run the script:
deepspeed --include A1:0,1 --master_port 22384 train.py --output_dir output --init_ckpt /root/llama-30b-init-ckpt/ --data_path /root/alpaca_deepspeed.json --max_seq_len 1024 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 2 --model_parallel_size 1 --use_flash_attn true --deepspeed_config ./configs/ds_config_zero1.json
I reinstalled flash_attn:
(gh_llama-deepspeed) r730ub20@r730ub20-M0:/llm_dev/llama-deepspeed$ python3 scripts/convert2ckpt.py --model_name_or_path /data-ssd-1t/hf_model/llama-7b-hf/ --output_dir llama-7b-init-ckpt
Traceback (most recent call last):

/home/r730ub20/llm_dev/llama-deepspeed/scripts/convert2ckpt.py:11 in <module>

     8  import torch
     9  import transformers
    10
 ❱  11  from models.patching import (
    12      smart_tokenizer_and_embedding_resize,
    13  )
    14  from feeder import (

/home/r730ub20/llm_dev/llama-deepspeed/./models/patching.py:11 in <module>

     8  from transformers.models.llama.modeling_llama import apply_rotary_pos_emb
     9
    10  from einops import rearrange
 ❱  11  from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
    12  from flash_attn.bert_padding import unpad_input, pad_input

/home/r730ub20/.local/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py:5 in <module>

     2  import torch.nn as nn
     3  import torch.nn.functional as F
     4
 ❱   5  import flash_attn_cuda
     6
     7
     8  def _get_block_size(device, head_dim, is_dropout):
ImportError: /home/r730ub20/.local/lib/python3.8/site-packages/flash_attn_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl8GPUTrace13gpuTraceStateE
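The undefined c10 symbol above usually means the flash-attn wheel was compiled against a different torch build than the one installed. A minimal version check (a sketch, not the repo's tooling; it deliberately avoids importing flash_attn, since that import is what fails):

import torch
from importlib.metadata import version

# these must match the combination flash-attn was built for
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("flash-attn:", version("flash-attn"))  # distribution name assumed to be "flash-attn"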
In particular, the wandb initialization stage hung for more than ten minutes. Have you run into slow startup with multi-GPU training, and are there any possible ways to improve it? @HuangLK
Hello! Your project here is great. I have some questions from studying it and hope you can find time to answer them. The specific questions are as follows:
# pipeline model
model = get_model(model_config, ds_args, activation_checkpointing_config)
engine, _, _, _ = deepspeed.initialize(
    ds_args,
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad]
)
# use `convert2ckpt.py`
engine.load_checkpoint(model_args.init_ckpt, load_module_only=True)
Traceback (most recent call last):
  File "train.py", line 131, in <module>
    main()
  File "train.py", line 109, in main
    engine.load_checkpoint(model_args.init_ckpt, load_module_only=True)  # load_module_only=True
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2769, in load_checkpoint
    success = self._load_zero_checkpoint(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2948, in _load_zero_checkpoint
    zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 3042, in _get_all_zero_checkpoints
    return self._get_all_zero_checkpoint_state_dicts(zero_ckpt_names)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 3014, in _get_all_zero_checkpoint_state_dicts
    _state = self.checkpoint_engine.load(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in load
    partition = torch.load(path, map_location=map_location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './llama-7B-init-ckpt/global_step001/zero_pp_rank_0_mp_rank_01_optim_states.pt'
[2023-08-13 20:35:08,552] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from ./llama-7B-init-ckpt/global_step001/zero_pp_rank_0_mp_rank_02_optim_states.pt...
Hi, in config.json, does train_micro_batch_size_per_gpu denote the chunk (micro-batch) under the pipeline mechanism, while train_batch_size is the total batch size?
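For reference, DeepSpeed enforces train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × data-parallel world size, and under pipeline parallelism the micro-batch is the per-stage chunk. A minimal config sketch with assumed values (not the repo's defaults):

# illustrative ds_config fragment; the numbers are assumptions
ds_config = {
    "train_batch_size": 32,               # total samples per optimizer step
    "train_micro_batch_size_per_gpu": 4,  # the pipeline "chunk" size
    "gradient_accumulation_steps": 8,     # 4 * 8 * (dp world size 1) == 32
}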
.../transformers/models/llama/modeling_llama.py:134 in apply_rotary_pos_emb

    131
    132
    133  def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
  ❱ 134      gather_indices = position_ids[:, None, :, None]  # [bs, 1, seq_len
    135      gather_indices = gather_indices.repeat(1, cos.shape[1], 1, cos.sha
    136      cos = torch.gather(cos.repeat(gather_indices.shape[0], 1, 1, 1), 2
    137      sin = torch.gather(sin.repeat(gather_indices.shape[0], 1, 1, 1), 2
TypeError: 'NoneType' object is not subscriptable
[2023-04-13 11:32:44,508] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 197
[2023-04-13 11:32:44,508] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 198
[2023-04-13 11:32:47,255] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 199
[2023-04-13 11:32:49,894] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 200
@HuangLK, do you know how this happened and how to solve it?
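The TypeError shows that apply_rotary_pos_emb received position_ids=None. A hedged sketch of rebuilding default position ids before the call (an assumption about where the fix might go, not the repo's patch):

import torch

# sketch: default position ids for a batch when none are threaded through
# the pipeline stages
def default_position_ids(input_ids):
    bs, seq_len = input_ids.shape
    pos = torch.arange(seq_len, dtype=torch.long, device=input_ids.device)
    return pos.unsqueeze(0).expand(bs, seq_len)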
Hi,
I saw that your code uses the PipeModelDataParallelTopology API to specify the group sizes for pipeline parallelism and data parallelism. However, I didn't see DistributedSampler used for dataset sharding. May I know if you have explored this?
Thanks very much for your help!
Fangkai
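For context, a sketch of how a sampler could be sharded over the data-parallel group only, so that pipeline stages of the same replica see the same data (the grid method names are my assumption based on DeepSpeed's pipeline topology module; engine, train_dataset, and micro_batch_size are placeholder names):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# shard across data-parallel replicas, not across pipeline stages
sampler = DistributedSampler(
    train_dataset,
    num_replicas=engine.grid.get_data_parallel_world_size(),
    rank=engine.grid.get_data_parallel_rank(),
    shuffle=True,
)
train_dataloader = DataLoader(train_dataset, sampler=sampler, batch_size=micro_batch_size)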
How to support bf16?
I got a GPU OOM:
(gh_llama-deepspeed) amd00@asus00:/llm_dev/llama-deepspeed$ deepspeed --include localhost:0 --master_port 22384 train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:15:04,883] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-31 17:15:04,892] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=22384 --enable_each_rank_log=None train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:15:06,134] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-31 17:15:06,134] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-31 17:15:06,134] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-31 17:15:06,134] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-31 17:15:06,134] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-31 17:15:07,635] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
100%|██████████| 1/1 [00:00<00:00, 3358.13it/s]
total samples num: 50
Traceback (most recent call last):
  File "train.py", line 130, in <module>
    main()
  File "train.py", line 99, in main
    model = get_model(model_config, ds_args, activation_checkpointing_config)
  File "/home/amd00/llm_dev/llama-deepspeed/models/llama_pipeline_model.py", line 167, in get_model
    print("pp is %d, mp is %d, world_size is:", pp, mp, args.world_size)
UnboundLocalError: local variable 'pp' referenced before assignment
[2023-05-31 17:15:08,142] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 26374
[2023-05-31 17:15:08,143] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--output_dir', 'out_dir', '--init_ckpt', 'llama-7b-init-ckpt/', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '8', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '1', '--model_parallel_size', '1', '--use_flash_attn', 'false', '--deepspeed_config', './configs/ds_config_zero1.json'] exits with return code = 1
(gh_llama-deepspeed) amd00@asus00:/llm_dev/llama-deepspeed$ vim train.py
(gh_llama-deepspeed) amd00@asus00:/llm_dev/llama-deepspeed$ vim models/llama_pipeline_model.py
(gh_llama-deepspeed) amd00@asus00:/llm_dev/llama-deepspeed$ deepspeed --include localhost:0 --master_port 22384 train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:16:32,333] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-31 17:16:32,342] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=22384 --enable_each_rank_log=None train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:16:33,582] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-31 17:16:33,582] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-31 17:16:33,582] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-31 17:16:33,582] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-31 17:16:33,582] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-31 17:16:35,093] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
100%|██████████| 1/1 [00:00<00:00, 3368.92it/s]
total samples num: 50
pp is %d, mp is %d, world_size is: 1 1 1
SEED_LAYERS=False BASE_SEED=42 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0}
[2023-05-31 17:16:35,204] [INFO] [module.py:358:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=35
0: EmbeddingPipe
1: ParallelTransformerLayerPipe
2: ParallelTransformerLayerPipe
3: ParallelTransformerLayerPipe
4: ParallelTransformerLayerPipe
5: ParallelTransformerLayerPipe
6: ParallelTransformerLayerPipe
7: ParallelTransformerLayerPipe
8: ParallelTransformerLayerPipe
9: ParallelTransformerLayerPipe
10: ParallelTransformerLayerPipe
11: ParallelTransformerLayerPipe
12: ParallelTransformerLayerPipe
13: ParallelTransformerLayerPipe
14: ParallelTransformerLayerPipe
15: ParallelTransformerLayerPipe
16: ParallelTransformerLayerPipe
17: ParallelTransformerLayerPipe
18: ParallelTransformerLayerPipe
19: ParallelTransformerLayerPipe
20: ParallelTransformerLayerPipe
21: ParallelTransformerLayerPipe
22: ParallelTransformerLayerPipe
23: ParallelTransformerLayerPipe
24: ParallelTransformerLayerPipe
25: ParallelTransformerLayerPipe
26: ParallelTransformerLayerPipe
27: ParallelTransformerLayerPipe
28: ParallelTransformerLayerPipe
29: ParallelTransformerLayerPipe
30: ParallelTransformerLayerPipe
31: ParallelTransformerLayerPipe
32: ParallelTransformerLayerPipe
33: LayerNormPipe
34: LMLayerPipe
loss: loss_fn
Traceback (most recent call last):
  File "train.py", line 130, in <module>
    main()
  File "train.py", line 99, in main
    model = get_model(model_config, ds_args, activation_checkpointing_config)
  File "/home/amd00/llm_dev/llama-deepspeed/models/llama_pipeline_model.py", line 182, in get_model
    return GPT2ModelPipe(model_config,
  File "/home/amd00/llm_dev/llama-deepspeed/models/llama_pipeline_model.py", line 157, in __init__
    super().__init__(
  File "/home/amd00/.local/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 200, in __init__
    self.to(get_accelerator().device_name(self.local_rank))
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
    return self._apply(convert)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
    param_applied = fn(param)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.70 GiB total capacity; 22.83 GiB already allocated; 97.88 MiB free; 22.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-05-31 17:17:30,649] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 26532
[2023-05-31 17:17:30,650] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--output_dir', 'out_dir', '--init_ckpt', 'llama-7b-init-ckpt/', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '8', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '1', '--model_parallel_size', '1', '--use_flash_attn', 'false', '--deepspeed_config', './configs/ds_config_zero1.json'] exits with return code = 1
Hello, I see you added self.activation_checkpointing = activation_checkpointing in ParallelTransformerLayerPipe, but the LLaMA model does not have this parameter. Doesn't loading the LLaMA model raise an error?
I see that in the updated code, the HF-format checkpoint is first converted to the DeepSpeed format and then loaded with engine.load_checkpoint(model_args.init_ckpt, load_module_only=True). During this loading step, is anything skipped by default?
Hello @HuangLK,
During training, loss=nan. Have you ever run into this?
Line 108 of train.py:
engine.load_checkpoint(model_args.init_ckpt, load_module_only=True)
With or without this line, the initial training loss is the same. It seems the model parameters were not successfully loaded.
My code looks like this.
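One way to check whether load_checkpoint actually touched the weights is to fingerprint a parameter before and after loading. A debugging sketch (param_fingerprint is a hypothetical helper, not part of the repo):

import torch

def param_fingerprint(module):
    # sum of absolute values of the first parameter; a cheap change detector
    p = next(module.parameters())
    return p.detach().float().abs().sum().item()

before = param_fingerprint(engine.module)
engine.load_checkpoint(model_args.init_ckpt, load_module_only=True)
after = param_fingerprint(engine.module)
print("weights changed:", before != after)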
Hello,
when I try to use flash attention, I encounter the following problem:
/export/home2/fangkai/merit-v2/trainer_base_ds_mp.py:346 in main

    343             logger.info("Resuming training from the latest checkpoint:
    344             continue_from_global_step = int(checkpoint.split('-')[-1])
    345
  ❱ 346         global_step, tr_loss = train(cfg, model_pipe, tokenizer, conti
    347         logger.info(" global_step = %s, average loss = %s", global_ste
    348
    349

/export/home2/fangkai/merit-v2/trainer_base_ds_mp.py:236 in train

    233                     continue
    234
    235                 model.train()
  ❱ 236                 loss = model.train_batch(data_iter=sub_train_dataloade
    237                 global_step += 1
    238
    239                 tr_loss += loss.item()

/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py:336 in train_batch

    333         sched = schedule.TrainSchedule(micro_batches=self.micro_batch
    334                                        stages=self.num_stages,
    335                                        stage_id=self.stage_id)
  ❱ 336         self._exec_schedule(sched)
    337         self.agg_train_loss = self._aggregate_total_loss()
    338
    339         self.timers('train_batch').stop()

/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py:1307 in _exec_schedule

    1304
    1305                 # Equivalent to: self._exec_forward_pass(buffer_id=0)
    1306                 self._exec_instr = MethodType(self._INSTRUCTION_MAP[t
  ❱ 1307                 self._exec_instr(**cmd.kwargs)
    1308

/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py:996 in _exec_send_grads

    993                     if not buffer.is_floating_point():
    994                         assert buffer.grad is None
    995                         continue
  ❱ 996                     assert buffer.grad is not None
    997                     p2p.send(buffer.grad, self.prev_stage)
    998
    999         # We can free up the input buffer now
AssertionError
I also tested torch.nn.functional.scaled_dot_product_attention, which implements flash attention in torch 2.0, but I hit the same problem. May I know if you have encountered it?
Thanks very much for your help!
Best,
Fangkai
I used the default ds_config.json config file, only changing the wandb part to false (because it was slow). I then found that GPU memory was allocated but training never started; it hung at "Using /root/.cache/torch_extensions as PyTorch extensions root...". So I cleared /root/.cache and retrained, and then got an error. The error messages are as follows:
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu116/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] =/usr/local/cuda-11.6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
=/usr/local/cuda-11.6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
/bin/sh: 1: =/usr/local/cuda-11.6/bin/nvcc: not found
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
[2023-04-21 17:47:56,170] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown
[2023-04-21 17:47:56,315] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "train.py", line 143, in <module>
    main()
  File "train.py", line 109, in main
    engine, _, _, _ = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
Loading extension module fused_adam...
Traceback (most recent call last):
  File "train.py", line 143, in <module>
    main()
  File "train.py", line 109, in main
    engine, _, _, _ = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu116/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
Loading extension module fused_adam...
(The same ImportError traceback is repeated verbatim by the two remaining ranks.)
[2023-04-21 17:48:12,493] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105683
[2023-04-21 17:48:12,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105684
[2023-04-21 17:48:12,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105685
[2023-04-21 17:48:12,845] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105686
[2023-04-21 17:48:12,847] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--output_dir', '/root/nas-private/output', '--init_ckpt', '/root/nas-private/llama-7B-init-ckpt', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '1024', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '4', '--model_parallel_size', '1', '--use_flash_attn', 'true', '--deepspeed_config', './configs/ds_config.json'] exits with return code = 1
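Note that the compiler path in the FAILED line above begins with a literal '=' ('=/usr/local/cuda-11.6/bin/nvcc'), which suggests a CUDA-related environment variable was exported with a stray '='. A quick diagnostic sketch (my suggestion, not part of the repo):

import os

# print the raw values; a leading '=' here would explain the "not found" error
print(repr(os.environ.get("CUDA_HOME")))
print([p for p in os.environ.get("PATH", "").split(":") if "cuda" in p])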
Hello, when I ran python convert2ckpt.py --mp_world_size 4 --model_name_or_path /path/to/llama-7b-hf --output_dir /path/to/llama-7b-init-ckpt,
I got the following error:
ImportError: cannot import name 'flash_attn_unpadded_qkvpacked_func' from 'flash_attn.flash_attn_interface'
I looked at the flash_attn.flash_attn_interface module and it indeed has no flash_attn_unpadded_qkvpacked_func function. My environment is PyTorch 1.13, Python 3.10, flash-attn 2.0.8. Could you share your environment or a solution?
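flash-attn 2.x renamed the unpadded_* entry points to varlen_*; a hedged import shim (my suggestion, not the repo's fix) would keep the old name working:

try:
    from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
except ImportError:
    # flash-attn >= 2.0 renamed the function
    from flash_attn.flash_attn_interface import (
        flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func,
    )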
In feeder.py, the model is given a causal mask but no padding mask; this part seems to need improvement.
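For illustration, a minimal sketch of combining a causal mask with a padding mask built from attention_mask (an additive float-mask convention is assumed; this is not feeder.py's code):

import torch

def build_attn_mask(attention_mask, dtype=torch.float32):
    # attention_mask: (bs, seq_len); 1 for real tokens, 0 for padding
    bs, seq_len = attention_mask.shape
    causal = torch.triu(
        torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype),
        diagonal=1,
    )  # blocks attention to future positions
    pad = (1.0 - attention_mask[:, None, None, :].to(dtype)) * torch.finfo(dtype).min
    return causal[None, None, :, :] + pad  # (bs, 1, seq_len, seq_len)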
Hi Huang, nice work!
When I tried to train a 13B model, I got this error:
[Errno 2] No such file or directory: 'llama_13b_pp/global_step001/zero_pp_rank_0_mp_rank_03_optim_states.pt'
Any ideas on this? The convert2ckpt.py script does not generate files with the 'zero_pp_...' prefix.
Hi, wonderful work!
I didn't use your code directly, but I followed it to implement my own LLaMA pipeline parallelism. I'm encountering the following problem; may I know if you have run into something similar? I have no idea how to solve it.
Thanks very much for your help!
The error message:
Traceback (most recent call last):
  File "/home/fangkai/merit-v2/trainer_base_ds_mp.py", line 418, in <module>
    main()
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/main.py", line 90, in decorated_main
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/fangkai/merit-v2/trainer_base_ds_mp.py", line 352, in main
    global_step, tr_loss = train(cfg, model, tokenizer, continue_from_global_step)
  File "/home/fangkai/merit-v2/trainer_base_ds_mp.py", line 212, in train
    loss = model.train_batch(sub_train_dataloader)
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
    self._exec_schedule(sched)
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 733, in _exec_backward_pass
    torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)
  File "/home/fangkai/anaconda3/envs/py3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn
Here is a toy dataset:
class TestDataset(Dataset):
    def __init__(self, file_path, tokenizer):
        super().__init__()
        self.data = ["My name is Jiao Fangkai."]

    def __len__(self):
        return 100000000

    def __getitem__(self, index):
        return {"flan": {
            "inputs": self.data[0],
            "targets": self.data[0],
        }}
Here is the collator:
def vanilla_seq2seq_convertor(examples, tokenizer: PreTrainedTokenizer, max_seq_length, decoder_only: bool = False):
    inputs = []
    outputs = []
    for exp in examples:
        inputs.append(exp["inputs"])
        if decoder_only:
            outputs.append(exp["inputs"] + " " + exp["targets"] + tokenizer.eos_token)
        else:
            outputs.append(exp["targets"])

    model_inputs = tokenizer(inputs, text_target=outputs, max_length=max_seq_length, padding="longest",
                             truncation=True, return_tensors="pt")
    if decoder_only:
        input_lens = model_inputs["input_ids"].ne(tokenizer.pad_token_id).sum(dim=1)
        model_inputs = tokenizer(outputs, max_length=max_seq_length, padding="longest",
                                 truncation=True, return_tensors="pt")
        new_input_lens = model_inputs["input_ids"].ne(tokenizer.pad_token_id).sum(dim=1)
        input_lens = input_lens - input_lens.eq(new_input_lens).to(input_lens.dtype) * (input_lens // 2)
        input_lens = input_lens.to(torch.long)
        model_inputs["input_lens"] = input_lens

    return model_inputs
def get_lm_labels(input_lens, input_ids, pad_token_id):
    labels = input_ids.clone()
    label_mask = labels.ne(pad_token_id)  # drop padding positions
    lens_mask = torch.arange(labels.size(1))[None, :] >= input_lens[:, None]  # drop the prompt part
    label_mask = label_mask & lens_mask
    labels = labels.masked_fill(~label_mask, -100).contiguous()
    return labels
class FlanCollatorOverCollator:
    def __init__(self, tokenizer: str, max_seq_length: int, decoder_only: bool = False):
        self.tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(tokenizer, use_fast=False)
        expand_special_tokenizer(self.tokenizer)
        self.max_seq_length = max_seq_length
        self.decoder_only = decoder_only

    def __call__(self, batch):
        flan_batch = []
        for item in batch:
            flan_batch.append(item.pop("flan"))
        model_inputs = vanilla_seq2seq_convertor(flan_batch, self.tokenizer, self.max_seq_length, self.decoder_only)

        # Add suffix `input_ids` to tackle the deepspeed logic.
        seq_length = model_inputs["input_ids"].size(1)
        position_ids = torch.arange(0, seq_length, dtype=torch.long)
        position_ids = position_ids.unsqueeze(0).view(-1, seq_length)

        return (
            (
                model_inputs["input_ids"],
                model_inputs["attention_mask"],
                # position_ids,
                # model_inputs["input_lens"],
                # model_inputs["input_ids"].detach().clone()
            ),
            # model_inputs["input_ids"].detach().clone()
            get_lm_labels(model_inputs["input_lens"], model_inputs["input_ids"], self.tokenizer.pad_token_id)
        )
And the initialization:
topo = PipeModelDataParallelTopology(num_pp=4, num_mp=1, num_dp=1)
model = PipelineModule(layers=layers,
                       # num_stages=cfg.num_stages,
                       topology=topo,
                       loss_fn=models.llama_ds_mp_wrap.loss_fn,
                       activation_checkpoint_interval=getattr(cfg, "activation_checkpoint_interval", 0))
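From the engine code quoted in the flash-attention traceback above (non-floating-point buffers are skipped; floating-point buffers must carry gradients), a floating-point tensor passed between pipeline stages without requires_grad can trigger exactly this "does not require grad" error. A small diagnostic sketch (my inference from that code, not a confirmed fix):

# sketch: flag batch elements that the pipeline engine would treat as
# differentiable but that cannot produce a grad_fn
def check_pipe_inputs(tensors):
    for i, t in enumerate(tensors):
        if t.is_floating_point() and not t.requires_grad:
            print(f"element {i} is floating-point without requires_grad; "
                  f"consider an integer dtype (e.g. masks as long)")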