/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::Half]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:95: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
768 | cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::BFloat16]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:103: required from here
[same six ‘Tensor.data<T>() is deprecated’ warnings as above, all with T = float]
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::Half; U = float; V = c10::Half]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:127: required from here
[same six warnings, with T = c10::Half at 737:138 and 768:129, T = float elsewhere]
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::BFloat16; U = float; V = c10::BFloat16]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:138: required from here
[same six warnings, with T = c10::BFloat16 at 737:138 and 768:129, T = float elsewhere]
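The warnings name their own fix: the kernel launches at lines 737, 750, and 768 of layer_norm_cuda_kernel.cu obtain raw pointers via the deprecated Tensor.data<T>() accessor and should use Tensor.data_ptr<T>() instead. A minimal stand-alone sketch of the swap (illustrative tensor and names, not the kernel's actual arguments):

    #include <torch/torch.h>

    int main() {
      // Stand-in for one of the at::Tensor temporaries passed to the kernels.
      at::Tensor part_grad_gamma = torch::zeros({4096}, torch::kFloat);

      // Deprecated accessor; this is what trips -Wdeprecated-declarations:
      //   float* p = part_grad_gamma.data<float>();

      // Current accessor; returns the same raw pointer, without the warning:
      float* p = part_grad_gamma.data_ptr<float>();
      (void)p;
      return 0;
    }

The warnings are benign for now, since the deprecated accessor currently forwards to data_ptr<T>(), but the rename is the forward-compatible spelling.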
[3/3] c++ layer_norm_cuda.o layer_norm_cuda_kernel.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_mix_prec_layer_norm_cuda.so
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF fused_weight_gradient_dense.o.d -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cpp -o fused_weight_gradient_dense.o
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -gencode arch=compute_80,code=sm_80 -std=c++17 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cu -o fused_weight_gradient_dense.cuda.o
[3/3] c++ fused_weight_gradient_dense.o fused_weight_gradient_dense.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_dense_cuda.so
Loading extension module fused_dense_cuda...
Building model ...
/epfllm/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/epfllm/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
loading release checkpoint from ./model
checkpoint version 3.0
successfully loaded checkpoint from ./model at iteration 0
using world size: 4, data-parallel-size: 1, tensor-model-parallel size: 4, pipeline-model-parallel size: 1
setting global batch size to 1
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
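The reported world size is consistent with the three parallelism degrees: 1 (data) × 4 (tensor) × 1 (pipeline) = 4 ranks. A trivial stand-alone check of that factorization (illustrative only, not Megatron's own bookkeeping):

    #include <cassert>

    int main() {
      const int data_parallel = 1, tensor_parallel = 4, pipeline_parallel = 1;
      // Every rank occupies exactly one (DP, TP, PP) coordinate, so the
      // coordinate grid must cover the whole world.
      const int world_size = data_parallel * tensor_parallel * pipeline_parallel;
      assert(world_size == 4);
      return 0;
    }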
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. False
bias_gelu_fusion ................................ False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 32
encoder_seq_length .............................. 4096
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 100
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_signal_handler ............................. False
ffn_hidden_size ................................. 11008
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 1
glu_activation .................................. swiglu
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 4096
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lima_dropout .................................... False
load ............................................ None
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. None
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 4096
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
metrics ......................................... []
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_warmup ..................................... False
new_tokens ...................................... True
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... True
no_save_rng ..................................... True
num_attention_heads ............................. 32
num_attention_heads_kv .......................... 32
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 32
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
optimizer ....................................... adam
override_opt_param_scheduler .................... False
parallel_attn ................................... False
parallel_layernorm .............................. False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... False
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.rotary
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
rope_scaling_factor ............................. 1.0
rope_theta ...................................... 10000.0
sample_rate ..................................... 1.0
save ............................................ ./model_sharded
save_interval ................................... 1
scalar_loss_mask ................................ 0.0
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 4096
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_iters ...................................... []
split ........................................... 969, 30, 1
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
tie_embed_logits ................................ False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_type .................................. SentencePieceTokenizer
train_data_path ................................. None
train_iters ..................................... None
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
use_bias ........................................ False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... True
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_post_ln ..................................... False
use_ring_exchange_p2p ........................... False
use_rms_norm .................................... True
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_extra_ids_list ............................ None
vocab_file ...................................... None
wandb_api_key ................................... None
wandb_entity .................................... meditron
wandb_id ........................................ None
wandb_logger .................................... False
wandb_project ................................... None
wandb_resume .................................... False
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 4
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
Setting consumed_train_samples to 0 and consumed_valid_samples to 0
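The constant micro-batch count of 1 follows from the batch arithmetic in the argument dump above: global_batch_size 1 divided by (micro_batch_size 1 × data_parallel_size 1). A sketch of that relation (assuming the usual Megatron formula, not this repository's code):

    #include <cassert>

    int main() {
      const int global_batch_size = 1, micro_batch_size = 1, data_parallel_size = 1;
      // Each data-parallel replica consumes micro_batch_size samples per
      // micro-batch, so this many micro-batches make up one global batch.
      const int num_micro_batches =
          global_batch_size / (micro_batch_size * data_parallel_size);
      assert(num_micro_batches == 1);
      return 0;
    }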
sending embeddings
sending lm_head
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
sending transformer layer 0
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_dense_cuda...
Bus error (core dumped)