bigscience-workshop / megatron-deepspeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
License: Other
TODO: fill in details
as discussed elsewhere by @thomasw21 - we need to be able to test specific code paths.
One way to approach this is to add:
if foobar:
    # do the XYZ branch's work
    logger.debug("path XYZ was run")
and then in a test:
with CaptureStdout() as cs:
    # after activating DEBUG logging level
    execute_subprocess_async(cmd, env=self.get_env())
self.assertIn("path XYZ was run", cs.out)
So we need:
Once this is laid out we can then start using it for testing.
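For a quick feel of the idea without spawning a subprocess, here is a minimal in-process sketch. It is not the proposed test harness: the logger name, the process() function and the run_and_capture() helper are all made up for illustration, and it captures log records with a plain logging handler instead of CaptureStdout.

```python
import io
import logging

logger = logging.getLogger("code_path_demo")  # hypothetical logger name


def process(foobar: bool) -> str:
    """Toy function with a branch we want to prove was executed."""
    if foobar:
        logger.debug("path XYZ was run")
        return "xyz"
    return "default"


def run_and_capture(foobar: bool) -> str:
    """Run process() with DEBUG logging routed into an in-memory buffer."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    previous_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    try:
        process(foobar)
    finally:
        # Always restore the logger so other tests are not affected.
        logger.removeHandler(handler)
        logger.setLevel(previous_level)
    return buffer.getvalue()
```

A test would then assert that the marker string is present when the branch is taken and absent otherwise.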
A simple script that evaluates a Megatron checkpoint on a small list of zero-shot evaluation datasets. We will use this to iterate on the model design. Maybe use the evaluation dataset here https://github.com/kingoflolz/mesh-transformer-jax
This is different from the more comprehensive benchmark the evaluation team is working on. This should be a quick easy-to-run benchmark.
Might be good to either specify versions or, ideally, switch to a conda environment for this repo. Critically, the base requirements for Megatron-DeepSpeed in requirements.txt just require "torch", whereas you actually need torch > 1.7.0, notably with CUDA 11.1 or higher, for this repo to actually work!
I am having errors importing the compiled C++ helpers.cpp into Python in gpt_dataset.py:
# Use C++ implementation for speed.
# First compile and then import.
from megatron.data import helpers
assert doc_idx.dtype == np.int32
assert sizes.dtype == np.int32
sample_idx = helpers.build_sample_idx(sizes, doc_idx, seq_length,
num_epochs, tokens_per_epoch)
# sample_idx = _build_sample_idx(sizes, doc_idx, seq_length,
# num_epochs, tokens_per_epoch)
Leading to the following error
ImportError: cannot import name 'helpers' from 'megatron.data' (XXXX/Megatron-DeepSpeed/megatron/data/__init__.py)
Killing subprocess 28861
update:
I managed to isolate the problem by compiling helpers.cpp separately using gcc v9.3.1. Compilation works, but loading it with import helpers doesn't work.
(venv-megatron) bash-4.2$ pwd
XXX/Megatron-DeepSpeed/megatron/data
(venv-megatron) bash-4.2$ gcc --version | head -1
gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
(venv-megatron) bash-4.2$ make
make: python3-config: Command not found
make: python3-config: Command not found
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/nfs/core/python/3.9/include/python3.9 -I/xxxx/bigscience/venv-megatron/lib/python3.9/site-packages/pybind11/include helpers.cpp -o helpers
(venv-megatron) bash-4.2$ ipython
Python 3.9.4 (default, Apr 7 2021, 12:46:00)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import helpers
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-d249c8495052> in <module>
----> 1 import helpers
ModuleNotFoundError: No module named 'helpers'
In [2]:
I posted the same issue on the original repo: NVIDIA/Megatron-LM#143
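One possible reading of the log above (an assumption, not confirmed in this thread): the g++ command writes its output to a file literally named helpers, with no extension, and the two "python3-config: Command not found" lines suggest the Makefile could not work out the suffix to append. Python only imports compiled extension modules whose filename ends in one of the suffixes the interpreter recognizes. The sketch below just checks a filename against that list; has_extension_suffix is a hypothetical helper name.

```python
import importlib.machinery


def has_extension_suffix(filename: str) -> bool:
    """Return True if `filename` ends with a suffix that this interpreter
    recognizes for compiled extension modules."""
    return any(filename.endswith(suffix)
               for suffix in importlib.machinery.EXTENSION_SUFFIXES)


# The build above wrote a file literally named "helpers".
bare_name = "helpers"
# What the file would be called with this interpreter's first listed suffix.
suffixed_name = "helpers" + importlib.machinery.EXTENSION_SUFFIXES[0]
```

If this is indeed the cause, renaming the output (or making python3-config available so the Makefile can compute the suffix) would be the thing to try.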
Hello everyone,
There are several scripts that can be used for evaluation; they can be found here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tasks
However, running these scripts raises the following error:
$ python tasks/main.py --task WIKITEXT103 --num-layers 4 --hidden-size 256 --num-attention-heads 4 --seq-length 96 --max-position-embeddings 96 --fp16 --valid-data /gpfswork/rech/rcy/ulz63oj/datasets/oscar_shuff/en/en_shuf_part_1295.txt --tokenizer-type PretrainedFromHFTokenizers --tokenizer-name-or-path /gpfswork/rech/rcy/ulz63oj/tokenizers/bpe_en_5.0_30000.json --load /gpfsscratch/rech/rcy/ulz63oj/GPT_L4_A4_H256/bpe_en_5.0_30000 --micro-batch-size 8 --checkpoint-activations --log-interval 10 --no-load-optim --no-load-rng
Traceback (most recent call last):
File "tasks/main.py", line 70, in <module>
initialize_megatron(extra_args_provider=get_tasks_args)
File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/initialize.py", line 53, in initialize_megatron
set_global_variables(extra_args_provider=extra_args_provider,
File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/global_vars.py", line 88, in set_global_variables
args = _parse_args(extra_args_provider=extra_args_provider,
File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/global_vars.py", line 105, in _parse_args
_GLOBAL_ARGS = parse_args(extra_args_provider=extra_args_provider,
File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/arguments.py", line 35, in parse_args
parser = _add_network_size_args(parser)
File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/arguments.py", line 318, in _add_network_size_args
choices=megatron.model.glu_activations.GLU_ACTIVATIONS.keys(),
AttributeError: module 'megatron' has no attribute 'model'
This error does not appear when running pretrain_gpt.py.
@thomasw21 found a solution, which is to add from . import model in
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/__init__.py
and set up Megatron using pip install -e ., but there may be a better solution, and we do not understand why model is not an attribute of Megatron after setup.
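On the "why": in Python, a submodule only becomes an attribute of its parent package once something has actually imported it. Presumably pretrain_gpt.py imports megatron.model somewhere before arguments.py needs it, while tasks/main.py does not (that part is a guess, not verified). Here is a small self-contained demonstration using a throwaway package; demo_pkg_xyz is a made-up name standing in for megatron.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk: demo_pkg_xyz/__init__.py and
# demo_pkg_xyz/model.py (hypothetical stand-ins for megatron and megatron.model).
tmp_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(tmp_dir, "demo_pkg_xyz")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("")  # empty __init__, i.e. no `from . import model`
with open(os.path.join(pkg_dir, "model.py"), "w") as f:
    f.write("VALUE = 1\n")

sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

import demo_pkg_xyz

# Importing the package alone does not import its submodules.
before = hasattr(demo_pkg_xyz, "model")

import demo_pkg_xyz.model  # importing the submodule binds it on the parent

after = hasattr(demo_pkg_xyz, "model")
```

Adding from . import model to __init__.py works for the same reason: it forces the submodule import whenever the package is imported. An alternative would be an explicit import megatron.model in the file that uses it.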
To reproduce, find a pretrained checkpoint and some text data and run the script provided by NVIDIA:
TASK="WIKITEXT103"
VALID_DATA=/path/to/valid/data
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=/path/to/checkpoint
COMMON_TASK_ARGS=" \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--fp16 \
--vocab-file $VOCAB_FILE"
python tasks/main.py \
--task $TASK \
$COMMON_TASK_ARGS \
--valid-data $VALID_DATA \
--tokenizer-type GPT2BPETokenizer \
--merge-file $MERGE_FILE \
--load $CHECKPOINT_PATH \
--micro-batch-size 8 \
--checkpoint-activations \
--log-interval 10 \
--no-load-optim \
--no-load-rng
Stella pointed out how they do consistency calculations/checks in NeoX:
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/neox_arguments/arguments.py
It'd be good for someone to study what they did over the base Megatron-LM and replicate anything that can help our work, since some good checks can save days of running a model under a wrong setup thinking it's doing something else.
I haven't studied what they did, so I don't have any specific suggestions here.
Thank you.
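To make the idea concrete, here is a rough sketch of the kind of up-front consistency checks meant here. It is not taken from NeoX; the RunConfig class, its field names and the specific checks are illustrative assumptions based on the usual divisibility constraints between parallelism degrees, batch sizes and model dimensions.

```python
from dataclasses import dataclass, replace


@dataclass
class RunConfig:
    # Hypothetical container; field names are illustrative only.
    world_size: int
    tensor_parallel: int
    pipeline_parallel: int
    micro_batch_size: int
    global_batch_size: int
    hidden_size: int
    num_attention_heads: int
    num_layers: int


def check_config(cfg: RunConfig) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    model_parallel = cfg.tensor_parallel * cfg.pipeline_parallel
    if cfg.world_size % model_parallel != 0:
        problems.append("world_size is not divisible by TP * PP")
        return problems  # data-parallel size is undefined past this point
    data_parallel = cfg.world_size // model_parallel
    if cfg.global_batch_size % (cfg.micro_batch_size * data_parallel) != 0:
        problems.append("global_batch_size is not divisible by micro_batch_size * DP")
    if cfg.hidden_size % cfg.num_attention_heads != 0:
        problems.append("hidden_size is not divisible by num_attention_heads")
    if cfg.num_layers % cfg.pipeline_parallel != 0:
        problems.append("num_layers is not divisible by PP")
    return problems


good = RunConfig(world_size=16, tensor_parallel=2, pipeline_parallel=4,
                 micro_batch_size=2, global_batch_size=16,
                 hidden_size=1024, num_attention_heads=16, num_layers=24)
bad = replace(good, global_batch_size=18)
```

Running such a check once at startup and aborting on any problem is cheap compared to discovering a wrong setup days into a run.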
As can be seen from #121 we have a divergence between Meg and HF GPT2, while using the same weights under fp16.
So the proposed solution to enable users to use BigScience-pretrained models is to create a new architecture, which would be an identical clone of HF's GPT2, but with some changes.
Here are 3 changes:
import torch

def apply_overrides():

    # 1. layer norm needs to be done in fp32 and then cast back to fp16 to match meg.
    torch_layer_norm_orig = torch.layer_norm
    def torch_layer_norm_force_fp32(input, normalized_shape, weight, bias, eps, cudnn):
        out = torch_layer_norm_orig(input.float(), normalized_shape, weight.float(), bias.float(), eps, torch.backends.cudnn.enabled).half()
        return out
    torch.layer_norm = torch_layer_norm_force_fp32

    # 2. MLP uses a slightly different activation function with a custom bwd
    import transformers.activations

    @torch.jit.script
    def gelu_megatron_fwd(x):
        return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))

    @torch.jit.script
    def gelu_megatron_bwd(g, x):
        tanh_out = torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x))
        # sqrt(2/pi) * 3 * 0.044715 -> 0.1070322243
        ff = 0.5 * x * ((1 - tanh_out * tanh_out) * (0.79788456 + 0.1070322243 * x * x)) + 0.5 * (1 + tanh_out)
        return ff*g

    class GeLUFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, input):
            ctx.save_for_backward(input)
            return gelu_megatron_fwd(input)

        @staticmethod
        def backward(ctx, grad_output):
            # saved_tensors is a tuple; forward() takes a single input,
            # so backward() must return a single gradient
            input, = ctx.saved_tensors
            return gelu_megatron_bwd(grad_output, input)

    transformers.activations.gelu_fast = GeLUFunction.apply
    transformers.activations.ACT2FN["gelu_fast"] = transformers.activations.gelu_fast

    # 3. torch.baddbmm() (meg) produces slightly different results than torch.matmul, so override to use `torch.baddbmm`
    import transformers.models.gpt2.modeling_gpt2
    from torch import nn

    def new_attn(self, query, key, value, attention_mask=None, head_mask=None):
        output_size = (query.size(0), key.size(1), query.size(2), key.size(2))
        matmul_result = torch.empty(output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query.dtype, device=query.device)

        factor = float(value.size(-1)) ** 0.5
        matmul_result = torch.baddbmm(
            matmul_result,
            query.reshape(-1, query.shape[2], query.shape[3]),  # [b * np, sq, hn]
            key.reshape(-1, key.shape[2], key.shape[3]).transpose(1, 2),  # [b * np, hn, sk]
            beta=0.0,
            alpha=1.0 / factor
        )
        attn_weights = matmul_result.view(*output_size)

        # attn_weights = torch.matmul(query, key.transpose(-1, -2))
        #
        # if self.scale_attn_weights:
        #     attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)

        # Layer-wise attention scaling
        if self.scale_attn_by_inverse_layer_idx:
            attn_weights = attn_weights / float(self.layer_idx + 1)

        if not self.is_cross_attention:
            # if only "normal" attention layer implements causal mask
            query_length, key_length = query.size(-2), key.size(-2)
            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
            attn_weights = torch.where(causal_mask, attn_weights, self.masked_bias.to(attn_weights.dtype))

        if attention_mask is not None:
            # Apply the attention mask
            attn_weights = attn_weights + attention_mask

        attn_weights = nn.Softmax(dim=-1)(attn_weights)

        # Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op otherwise
        attn_weights = attn_weights.type(value.dtype)
        attn_weights = self.attn_dropout(attn_weights)

        # Mask heads if we want to
        if head_mask is not None:
            attn_weights = attn_weights * head_mask

        attn_output = torch.matmul(attn_weights, value)

        return attn_output, attn_weights

    transformers.models.gpt2.modeling_gpt2.GPT2Attention._attn = new_attn
Here is how we are going to tackle the activation function: huggingface/transformers#13997
So a PR will need to be filed with https://github.com/huggingface/transformers/
The paper NormFormer: Improved Transformer Pretraining with Extra Normalization https://arxiv.org/abs/2110.09456 suggests that under pre-LN:
gradients at earlier layers tend to be larger than gradients at later layers
so we want to verify that this is so in our case before acting on it and potentially integrating NormFormer.
So we need to expand the tensorboard and logs to log grad norm for individual layers, perhaps as in the paper we can log 5 layers: 0, 1, int(n_layers/2), -2, -1
For reference see p6 in the paper (graphs and discussion).
Currently only L2 average of all layer grad norms is calculated and logged (a single number).
Performance-wise this will require doing some extra calculations but not much - it's really just figuring out how to broadcast this info to TB so that multiple-points can be logged at once.
Additionally, we could activate this tool on demand - e.g. after encountering a spike we could roll back to the last good checkpoint and run a cycle with this debug feature enabled.
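A rough sketch of the layer selection and per-layer L2 norm described above. To keep it framework-free, gradients are represented as flat lists of floats per layer; in the real training loop these would come from the parameters' .grad tensors, and the "grad-norm/layer_N" tag names are just a suggestion.

```python
import math


def layers_to_log(n_layers: int) -> list:
    """Resolve 0, 1, int(n_layers/2), -2, -1 to absolute layer indices,
    dropping duplicates and out-of-range entries (relevant for tiny models)."""
    candidates = [0, 1, n_layers // 2, n_layers - 2, n_layers - 1]
    picked = []
    for idx in candidates:
        if 0 <= idx < n_layers and idx not in picked:
            picked.append(idx)
    return picked


def l2_norm(values) -> float:
    """L2 norm of a flat iterable of gradient values."""
    return math.sqrt(sum(v * v for v in values))


def per_layer_grad_norms(layer_grads: dict, n_layers: int) -> dict:
    """Map a TensorBoard-style tag to the grad norm of each selected layer.

    `layer_grads` maps layer index -> flat iterable of gradient values.
    """
    return {f"grad-norm/layer_{i}": l2_norm(layer_grads[i])
            for i in layers_to_log(n_layers)}
```

The resulting dict can be written in one loop of add_scalar calls per logging step, which keeps the extra cost to a handful of norms.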
In this issue, we discuss how viable/interesting it might be to implement DeBERTa like attention mechanism:
https://arxiv.org/abs/2006.03654
Things to take into account:
Motivation. As @sashavor suggested, the carbon footprint working group needs an experiment tracker to properly follow all runs being done. An experiment tracker could also be more broadly interesting to centralise in one place all experiments being done.
Proposed solution. Following a discussion with @thomwolf, the carbon WG has identified MLFlow as a promising open-source option: it supports having a dedicated server, and can interface with TensorBoard logs (which we already produce). This blog post shows how it integrates with Tensorflow. There is also documentation on how to interface with PyTorch models.
Implementation. It's not quite clear how nicely MLFlow will play with Megatron/DeepSpeed plus the limited networking on Jean Zay. The goal here is to first build a proof-of-concept showing MLFlow can integrate in our codebase and phone back to the centralised server from Jean Zay. We can then consider the finer details of reporting all metrics of interest to us.
TODO: add details
Follow appendix A.1 https://arxiv.org/pdf/1812.06162.pdf to implement monitoring of gradient noise scale and add it to the tensorboard log.
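As I understand that appendix, the estimate uses the squared gradient norm measured at two different batch sizes: the expected squared norm at batch size B is |G|^2 + S/B, so two measurements are enough to solve for the true squared norm |G|^2 and the noise term S, and the simple noise scale is S / |G|^2. A minimal sketch under that reading (the function and argument names are mine):

```python
def gradient_noise_scale(g_small_sq: float, g_big_sq: float,
                         b_small: int, b_big: int):
    """Estimate (|G|^2, S, B_simple) from squared gradient norms
    measured at a small and a big batch size.

    g_small_sq: squared norm of the gradient computed on b_small samples
    g_big_sq:   squared norm of the gradient computed on b_big samples
    """
    if b_small >= b_big:
        raise ValueError("b_small must be smaller than b_big")
    # Unbiased estimate of the true squared gradient norm.
    g_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    # Unbiased estimate of the per-sample gradient noise (trace of covariance).
    s = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
    return g_sq, s, s / g_sq
```

In practice the small batch could be the per-replica gradient and the big batch the all-reduced one. Both estimates are noisy, so smoothing them separately (e.g. with a moving average) before taking the ratio is probably needed before logging to TensorBoard.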
Languages: Arabic, Swahili (Bantu), Chinese, Catalan, English, French, Indic (Hindi,Urdu,Bangla), Indonesian, Portuguese, Spanish, Russian, Japanese, Amharic
Then upload to the Google storage bucket
https://console.cloud.google.com/storage/browser/bigscience/mc4_preprocessing
Add the ability to run the EleutherAI Evaluation Harness on Megatron checkpoints. Right now we are relying on converting Megatron checkpoints to Hugging Face checkpoints, which is an error-prone process. We also have to use Megatron anyway to run the 200B model.
Implementation details:
You will use this HF gpt2 model implementation here as your reference. Here are more details:
- _init_, create_from_arg_string: to load the Megatron checkpoints
- _model_call: function to call the Megatron model and read logits back
- loglikelihood, loglikelihoods, _loglikelihood_tokens: might (or might not) require a little tweaking
- greedy_until: unimplemented (raise an exception), we don't need it for now.
Pick and implement one of the variants of the GLU activation.
Check here for comparison and equations
https://arxiv.org/pdf/2002.05202.pdf
https://arxiv.org/pdf/2102.11972.pdf
and here for implementation of one of them
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/activations.py
We now have a script that converts Megatron-DeepSpeed checkpoints to HF-transformers checkpoints.
Project is here and the script is here.
However, the script doesn't have unit tests that confirm that the conversion is correct.
The goal of this issue is to add such tests. The idea is to run the forward pass of both models (before and after conversion) with any random input, then use torch.allclose to assert that the output loss and the logits of both models match.
As discussed in #128, maintain code style consistency by setting up some basic formatters.
Makefile
pyproject.toml
setup.cfg
Benchmark HF transformers as point of reference.
I don't know whether this is intended to work or not, but I found the following program:
from megatron.data.indexed_dataset import IndexedDatasetBuilder, best_fitting_dtype
best_dtype = best_fitting_dtype(10_000)
IndexedDatasetBuilder("testfile", dtype=best_dtype)
leads to an error like:
File "/path/to/Megatron-DeepSpeed.git/megatron/data/indexed_dataset.py", line 284, in __init__
self.element_size = self.element_sizes[self.dtype]
KeyError: <class 'numpy.uint16'>
This shows up because best_fitting_dtype will return numpy.uint16 for small vocabs:
Megatron-DeepSpeed/megatron/data/indexed_dataset.py
Lines 25 to 27 in c680954
but that particular type is missing from the element_sizes table:
Megatron-DeepSpeed/megatron/data/indexed_dataset.py
Lines 268 to 276 in c680954
which is used in the IndexedDatasetBuilder constructor here:
Should something like this work?
If so, it seems like either the best_fitting_dtype function should return a different type or uint16 should be added to the element_sizes table like f706108.
AFAIU, the current implementation always uses the first 50% of the sequence as the prefix, so it doesn't properly support prefix-lm. The T5 paper samples the prefix length randomly from 0 to max_sequence_length (which is 2048 in our case).
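A small sketch of what random prefix sampling could look like, with the matching attention mask. It assumes a uniform draw over 0..max_sequence_length inclusive (the exact distribution and bounds are an assumption here) and uses plain nested lists instead of tensors; both function names are hypothetical.

```python
import random


def sample_prefix_length(max_sequence_length: int, rng: random.Random) -> int:
    """Draw a prefix length uniformly from 0..max_sequence_length inclusive."""
    return rng.randint(0, max_sequence_length)


def prefix_lm_mask(seq_length: int, prefix_length: int) -> list:
    """mask[i][j] is True when position i may attend to position j.

    Prefix positions attend to the whole prefix (bidirectional);
    the remaining positions attend causally.
    """
    return [[j < prefix_length or j <= i for j in range(seq_length)]
            for i in range(seq_length)]
```

With prefix_length=0 this degenerates to the usual causal mask, and with prefix_length=seq_length to full bidirectional attention.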
The simplest model, with rotary embeddings. Don't necessarily train to 300B tokens to compare.
We need to have a diagnostic model size dumped during the framework init. We currently get a report per rank and not the total.
> number of parameters on (tensor, pipeline) model parallel rank (0, 1): 1745293312
> number of parameters on (tensor, pipeline) model parallel rank (2, 1): 1745293312
> number of parameters on (tensor, pipeline) model parallel rank (3, 0): 1986465792
> number of parameters on (tensor, pipeline) model parallel rank (3, 7): 1986498560
Later on, the ZeRO engine does dump the right thing, among multiple other numbers, and repeats it on each rank:
[2021-10-02 16:08:53,028] [INFO] [engine.py:134:__init__] RANK=0 STAGE=0 LAYERS=7 [0, 7) STAGE_PARAMS=1986465792 (1986.466M) TOTAL_PARAMS=57778896896 (57778.897M) UNIQUE_PARAMS=56814206976 (56814.207M)
But ideally we just want a print like:
Model size: 57B (57778896896 params)
Just on rank 0.
Thanks.
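The formatting side of this is simple; a possible sketch is below. The function names are hypothetical, the short form is truncated rather than rounded so that 57778896896 prints as 57B as in the example above, and how the total is obtained across ranks (and whether it should be TOTAL_PARAMS or UNIQUE_PARAMS) is left open.

```python
def format_model_size(total_params: int) -> str:
    """Render a one-line model size summary, e.g. 'Model size: 57B (... params)'."""
    if total_params >= 10**9:
        short = f"{total_params // 10**9}B"
    elif total_params >= 10**6:
        short = f"{total_params // 10**6}M"
    else:
        short = str(total_params)
    return f"Model size: {short} ({total_params} params)"


def print_model_size(rank: int, total_params: int) -> None:
    """Print the summary on rank 0 only."""
    if rank == 0:
        print(format_model_size(total_params))
```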
For our three baselines on different datasets (OSCAR, C4, The Pile), we would like to plot scaling laws and retrieve their coefficients. Specifically, we are looking to reproduce Figure 1 of Scaling Laws for Neural Language Models.
The TensorBoard data for the baseline runs can be retrieved on the Big Science space on HuggingFace: it's the tr3 runs with tensorboard in their name. The naming scheme (tr3b, tr3c, etc.) is explained here.
For C4, we have an XL, L, and M model (tr3, tr3c, tr3c) with short warm-up. For OSCAR and The Pile, we have an XL, L, M, and S model (tr3d, tr3g, tr3h, tr3i and tr3, tr3j, tr3k, tr3l). For OSCAR, we should also add the 13B run to see if the fits hold (that's tr1-13B).
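For the fit itself, the curves in that figure have the form loss = (x_c / x) ** alpha, which is a straight line in log-log space, so an ordinary least-squares fit on the logs recovers the coefficients. A minimal stdlib sketch, assuming the (x, loss) points have already been extracted from the TensorBoard logs; fit_power_law is a hypothetical name and the data below is synthetic.

```python
import math


def fit_power_law(xs, losses):
    """Least-squares fit of log(loss) = intercept + slope * log(x).

    Returns (alpha, x_c) such that loss ~= (x_c / x) ** alpha.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope
    x_c = math.exp(intercept / alpha)
    return alpha, x_c


# Synthetic check: points generated from a known power law.
true_alpha, true_x_c = 0.05, 1e12
xs = [1e6, 1e7, 1e8, 1e9]
losses = [(true_x_c / x) ** true_alpha for x in xs]
alpha, x_c = fit_power_law(xs, losses)
```

With only three or four model sizes per dataset the fit will be sensitive to each point, so it is probably worth plotting the residuals too.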
When using PretrainedFromHF it seems to require at least 2 GPUs and specifically TP=2; with 1 GPU it crashes.
To reproduce:
CHECKPOINT_PATH=checkpoints/gpt2
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt
DATA_PATH=data/meg-gpt2_oscar-combined_text_document
TENSORBOARD_PATH=output_dir/tensorboard
TP_SIZE=1
PP_SIZE=1
N_GPUS=1
NLAYERS=24
NHIDDEN=1024
NHEADS=16
FFN_HIDDEN_SIZE=4096
SEQ_LEN=2048
GPT_ARGS=" \
--exit-interval 10 \
--num-layers $NLAYERS \
--hidden-size $NHIDDEN \
--num-attention-heads $NHEADS \
--seq-length $SEQ_LEN \
--max-position-embeddings $SEQ_LEN \
--micro-batch-size 2 \
--rampup-batch-size 4 4 1_000 \
--global-batch-size 16 \
--train-samples 100 \
--optimizer adam \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-8 \
--lr 1e-4 \
--lr-warmup-samples 5 \
--clip-grad 1.0 \
--weight-decay 1e-1 \
--fp16 \
--tensor-model-parallel-size $TP_SIZE \
--pipeline-model-parallel-size $PP_SIZE \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path t5-small \
"
OUTPUT_ARGS=" \
--log-interval 10 \
--save-interval $SAVE_INTERVAL \
--eval-interval 100 \
--eval-iters 10 \
--checkpoint-activations \
"
DATA_ARGS=" \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--tensorboard-dir $TENSORBOARD_PATH \
--tensorboard-queue-size 5 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
"
ALL_ARGS="$GPT_ARGS $OUTPUT_ARGS $DATA_ARGS"
# if you can't stand pt-1.9 launcher noise
export LOGLEVEL=WARNING
LAUNCHER="deepspeed --num_gpus $N_GPUS"
CMD="$LAUNCHER pretrain_gpt.py $ALL_ARGS"
echo $CMD
#rm -rf $CHECKPOINT_PATH
$CMD
It crashes w/ or w/o deepspeed enabled, the above script is w/o deepspeed.
The log:
[before the start of training step] datetime: 2021-09-08 17:38:58
[2021-09-08 17:38:58,791] [INFO] [checkpointing.py:408:forward] Activation Checkpointing Information
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:409:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:412:forward] ----contiguous Memory Checkpointing False with 2 total layers
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:415:forward] ----Synchronization False
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:416:forward] ----Profiling time in checkpointing False
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [159,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [159,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [159,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[...] some 10000 of these [...]
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [339,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [339,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "pretrain_gpt.py", line 222, in <module>
pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/training.py", line 147, in pretrain
iteration = train(forward_step_func,
File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/training.py", line 686, in train
train_step(forward_step_func,
File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/training.py", line 387, in train_step
loss = model[0].train_batch(data_iter=data_iterator)
File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 291, in train_batch
self._exec_schedule(sched)
File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 1237, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 587, in _exec_forward_pass
outputs = super().forward(inputs)
File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/engine.py", line 1164, in forward
loss = self.module(*inputs, **kwargs)
File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 347, in forward
x = self.activation_checkpoint_func(
File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 690, in checkpoint
CheckpointFunction.apply(function, all_outputs, *args)
File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 494, in forward
outputs = run_function(*inputs_cuda)
File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 325, in exec_func
inputs = layer(inputs)
File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/model/transformer.py", line 585, in forward
return super().forward(hidden_states, attention_mask, **kwargs)
File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/model/transformer.py", line 477, in forward
self.self_attention(layernorm_output,
File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/model/transformer.py", line 215, in forward
mixed_x_layer, _ = self.query_key_value(hidden_states)
File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/mpu/layers.py", line 289, in forward
output_parallel = F.linear(input_parallel, self.weight, bias)
File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1623448278899/work/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f99385a7a22 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10ac3 (0x7f9938809ac3 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f993880b167 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f99385915a4 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa249ba (0x7f99b1e819ba in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa24a51 (0x7f99b1e81a51 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1932c6 (0x55f7dd66d2c6 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #7: <unknown function> + 0x1592ac (0x55f7dd6332ac in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #8: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #9: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #10: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #11: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #12: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #13: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #14: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #15: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #16: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #17: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #18: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #19: <unknown function> + 0x158e77 (0x55f7dd632e77 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #20: <unknown function> + 0x1590dc (0x55f7dd6330dc in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #21: <unknown function> + 0x1e3643 (0x55f7dd6bd643 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #22: <unknown function> + 0x159109 (0x55f7dd633109 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #23: <unknown function> + 0x1e3643 (0x55f7dd6bd643 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #24: <unknown function> + 0x176057 (0x55f7dd650057 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #25: _PyModule_ClearDict + 0x103 (0x55f7dd687d53 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #26: PyImport_Cleanup + 0x578 (0x55f7dd6aff88 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #27: Py_FinalizeEx + 0x79 (0x55f7dd6e1a49 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #28: Py_RunMain + 0x183 (0x55f7dd6e3893 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #29: Py_BytesMain + 0x39 (0x55f7dd6e3ca9 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #30: __libc_start_main + 0xf3 (0x7f99e84330b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #31: <unknown function> + 0x1e21c7 (0x55f7dd6bc1c7 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
It works if I use 2 gpus and TP=2.
Originally discovered by @tjruwase and I was able to reproduce it too.
We were trying to use the 350m checkpoint you gave us to test the checkpoint conversion tool - so we were unable to use it since the final checkpoint needs to be a single file - 1 gpu.
I also see PretrainedFromHF is not being tested, so a test is probably needed as well.
Thank you.
Following the training script here as a template:
I've trained some models using 2-way tensor parallelism and 4-way pipeline parallelism, which produces a number of checkpoints in directories like "global_step26000".
I'm now trying to use one of those trained checkpoints to do inference. In particular, I'm trying to work with modified versions of these scripts where I can provide new strings that are tokenized and processed on the fly:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/examples/generate_text.sh
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tools/generate_samples_gpt.py
I've tried a number of different approaches, but no luck so far. I also can't seem to find any instructions written up on that.
Anyone know the steps to load one of those checkpoints for inference on new data?
Update released model files to include config.tokenizer_class
HUB:
GCS:
This will be automated for the future by #126 - but for now do it manually since it's just 2 sets of files.
Implement rotary embeddings:
@stas00 suggested we might want to remove the whole json intermediary.
This issue should be used to discuss if that's a feature we want, and if so, define clearly what we want.
It has been proposed to group the tensorboard logs, by adding a group prefix to the log-key, e.g.:
tb.add_scalar("group a/batch size", batch_size, iteration)
Currently proposed groups:
Batch-size
- Batch-size
- Batch-size vs samples
Grad-norm
- Grad norm
- Grad norm vs samples
Learning rate
- Learning rate
- Learning rate vs samples
Lm loss train
- Lm loss
- Lm loss vs samples
Lm loss validation
- Lm loss
- Lm loss vs samples
- Lm loss ppl
- Lm loss ppl vs samples
Loss scale
- Loss scale
- Loss scale vs samples
Num zeros
- Num zeros
- Num zeros vs samples
The question is whether to hardcode these or to have a configurable map, in case different trainings want different groups.
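For the configurable-map option, something along these lines might be enough. The key names and the DEFAULT_GROUPS contents are illustrative (only a subset of the groups above), and grouped_key is a hypothetical helper that would wrap the key passed to tb.add_scalar.

```python
# Illustrative default mapping: log key -> group prefix.
DEFAULT_GROUPS = {
    "batch-size": "batch-size",
    "batch-size vs samples": "batch-size",
    "grad-norm": "grad-norm",
    "grad-norm vs samples": "grad-norm",
    "learning-rate": "learning-rate",
    "learning-rate vs samples": "learning-rate",
}


def grouped_key(key: str, groups=None) -> str:
    """Prefix `key` with its group, or return it unchanged if it has no group."""
    groups = DEFAULT_GROUPS if groups is None else groups
    group = groups.get(key)
    return f"{group}/{key}" if group else key
```

A training script could then call tb.add_scalar(grouped_key("batch-size"), batch_size, iteration) and pass its own map when it wants different groups, so the hardcoded defaults and the configurable variant are not mutually exclusive.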
I don't know how much time I will have to polish this up, but I have a prototype MPI-enabled "terashuf". This uses an algorithm similar to terashuf, where contiguous records from the source file are shuffled in segments, and then one randomly picks leading records from the shuffled segments to put together the final shuffled file.
This prototype currently stores the shuffled segments in memory (rather than files), and so it requires one to be able to load the full file into distributed memory. Currently each rank reads a portion of the source file into memory, shuffles that section, and then ranks exchange lines with each other in order to write out the file in contiguous chunks.
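For readers who have not seen the algorithm, here is a single-process sketch of the idea on an in-memory list. It is not the MPI code from the prototype: segment_shuffle is a hypothetical name, and picking the next segment with probability proportional to its remaining record count is my assumption about how to keep the overall shuffle uniform.

```python
import random


def segment_shuffle(records, segment_size: int, seed: int) -> list:
    """Shuffle contiguous segments independently, then interleave them by
    repeatedly drawing one record from a randomly chosen segment."""
    rng = random.Random(seed)
    segments = []
    for start in range(0, len(records), segment_size):
        segment = list(records[start:start + segment_size])
        rng.shuffle(segment)
        segments.append(segment)

    remaining = [len(segment) for segment in segments]
    total = sum(remaining)
    shuffled = []
    while total > 0:
        # Choose a segment with probability proportional to what it has left.
        pick = rng.randrange(total)
        for idx, count in enumerate(remaining):
            if pick < count:
                break
            pick -= count
        # Each segment is already shuffled, so taking from the end is
        # equivalent to taking its leading record.
        shuffled.append(segments[idx].pop())
        remaining[idx] -= 1
        total -= 1
    return shuffled
```

In the distributed version each rank would own one or more segments and the interleaving step becomes the line exchange between ranks.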
It can shuffle the oscar.jsonl file in about 10 minutes using 80 procs on 8 nodes on my system.
2021-09-01T12:19:15: 0: Wrote 1319979521428 of 1320971843503 bytes (99.92%) in 343 secs, 3668.675 MB/s
2021-09-01T12:19:20: 0: Waiting for ranks to finish ...
2021-09-01T12:19:20: Seconds to write file: 348.45524168014526
real 6m25.041s
Just posting this notice in case others need to shuffle a large JSON file in a hurry.
https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/tools/distshuf.py
It currently requires mpi4py and an mpi4py-enabled DistData class.
https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/megatron/data/distdata_mpi.py
I first attempted a torch.distributed version, but hit some problems. I haven't yet gone back to see if a torch.dist equivalent is easy.
For speed and correctness, both the input and output files must be on a parallel file system like Lustre/GPFS.
Example command:
srun -n 80 -N 8 python3 tools/distshuf.py \
--input /gpfs/path/to/oscar.jsonl \
--output /gpfs/path/to/oscarshuf.jsonl \
--seed 101
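The segment-based shuffle described above can be sketched in a single process as follows. This is not the code in distshuf.py: segment_shuffle is a hypothetical name, everything is held in one in-memory list, and picking the next segment with probability proportional to its remaining records is an assumption about how the "random pick" is weighted.

```python
import random


def segment_shuffle(records, segment_size, seed):
    """Shuffle records by shuffling contiguous segments, then interleaving them."""
    rng = random.Random(seed)
    # 1. Split the input into contiguous segments and shuffle each one.
    segments = [records[i:i + segment_size]
                for i in range(0, len(records), segment_size)]
    for segment in segments:
        rng.shuffle(segment)
    # 2. Repeatedly pick a segment at random (weighted by how many records
    #    it still holds) and emit its leading unread record.
    positions = [0] * len(segments)
    remaining = [len(segment) for segment in segments]
    indices = list(range(len(segments)))
    shuffled = []
    for _ in range(len(records)):
        idx = rng.choices(indices, weights=remaining)[0]
        shuffled.append(segments[idx][positions[idx]])
        positions[idx] += 1
        remaining[idx] -= 1
    return shuffled
```

In the distributed prototype each rank would hold its own segments, so the interleaving step becomes an exchange of lines between ranks rather than a local loop.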
After having a 3->8->3 spike in the loss value a few days ago, which luckily recovered after a few hours of training, we want to discuss possible ready to use strategies that we can quickly deploy should the spike not recover.
Notes from the slack so far:
Iz Beltagy:
In case the model gets stuck in one of the spikes and doesn’t come back, we can restart it from an earlier checkpoint but shuffle the data, reset the optimizer state, switch to fp32, lower lr, change optimizer params …
Stas:
Do you think we should be prepared and have a few of these options documented from the best choice to least, or deal with it if and when it happens?
e.g. I don't think we can shuffle the data, other than perhaps changing the seed?
Ryan Teehan:
I think that would be a good idea, both as a way to inform people about decisions but also for developing justifications and reasons for best practices
Iz Beltagy:
it would be great if we have these implemented and ready to be used. As for knowing which choices are more effective, that would be something we figure out empirically, and it would be one of the contributions of the project
Up until recently we've been using pytorch code in order to apply "scale -> mask -> softmax" in attention mechanism for prefix LM.
I've recently discovered that there exist two CUDA kernels that apply this to attention matrices:
- ScaledUpperTriangMaskedSoftmax, which is, as the naming suggests, specific to GPT-style models
- ScaledMaskedSoftmax
The latter one hasn't really been tested, and since I didn't have time to dive deep into the CUDA code, I decided not to use it for the initial prefix LM. However, #151 has removed the mechanism that forced prefix LM to use the pytorch route.
In this issue, we want to:
- ScaledMaskedSoftmax support prefix attention mask if fed with one.
@RezaYazdaniAminabadi, if you could provide us with your expertise on that one, maybe just checking the first item, that'd be great. Thank you!
If you have been following the training, we got a few spikes in lm loss and it'd be great to look at the input data from when this happened. We suspect that the input data was either garbled or in a very different language.
So we need to write a script that can receive a range of sample ids and generate the output in format:
sample id | normal text
Possible invocation syntax:
tools/sample-ids-to-text.py --seed 111 --sample-id-range 10728784-10769744 --whatever other args are needed, path to the indexed dataset, etc.
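The core of such a script might look like the sketch below. parse_sample_id_range and dump_samples are hypothetical names, treating the range as inclusive on both ends is an assumption, and detokenize stands in for whatever callable turns one dataset item back into text.

```python
def parse_sample_id_range(spec):
    """Parse 'START-END' into an inclusive (start, end) pair of ints."""
    start_str, end_str = spec.split("-")
    start, end = int(start_str), int(end_str)
    if start > end:
        raise ValueError(f"empty sample id range: {spec}")
    return start, end


def dump_samples(dataset, detokenize, spec):
    """Yield one 'sample id | normal text' line per sample id in the range."""
    start, end = parse_sample_id_range(spec)
    for sample_id in range(start, end + 1):
        yield f"{sample_id} | {detokenize(dataset[sample_id])}"
```

Here dataset only needs to support indexing by sample id, so a tiny list of token lists is enough to try it out before pointing it at the real indexed dataset.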
@mohammad Shoeybi shared the key pointers to the code to build upon:
With respect to getting input text from an iteration: we report the number of samples consumed, so you can instantiate the dataset (for example here) and get the sample number directly.
You can find the index files that were used for 13B and are now reused for 104B training here: https://huggingface.co/bigscience/tr1-13B-data/tree/main/indices but it'll probably be much easier to create your own tiny dataset, preprocess it, start a tiny training so that it creates the .npy files, and work with that, so that all the data is under your control and it's tiny, rather than working with 1TB files.
Replicate all the tensorboard logging in Meg-DS, plus logging hyperparams of choice.
So on the code level:
--mlflow-dir on/off toggle which will log all MLFlow events/data
example:
https://gist.github.com/tsaoyu/14e39a6d246cb29b107a2cc62a12f7a3
Blocking events:
Please edit the OP to add whatever fixes we applied to the core and which need to be propagated upstream into:
we want to do that to make it easier to sync upstream changes back to this repo.
Changes to send upstream:
- doc_ids
- --rampup-batch-size help entry
- tools/merge_preprocessed_data.py to support merging datasets - might be easier to just copy the new script.
to make it easy for new users to use our setup
megatron/model/__init__.py seems to lead to circular imports quite often when importing from other megatron libraries, so it's probably best to remove it rather than continually working around it.
This will require adapting all these to import from the corresponding module each of the imported symbols reside in:
./pretrain_gpt.py:from megatron.model import GPTModel, GPTModelPipe
./megatron/training.py:from megatron.model import Float16Module
./megatron/training.py:from megatron.model import DistributedDataParallel as LocalDDP
./megatron/optimizer/__init__.py:from megatron.model import LayerNorm
./megatron/schedules.py:from megatron.model import DistributedDataParallel as LocalDDP
./megatron/schedules.py:from megatron.model import Float16Module
./megatron/text_generation_utils.py:from megatron.model import DistributedDataParallel as LocalDDP
./megatron/text_generation_utils.py:from megatron.model import Float16Module
./megatron/model/realm_model.py:from megatron.model import BertModel
./megatron/model/transformer.py:from megatron.model import LayerNorm
./megatron/model/bert_model.py:from megatron.model import LayerNorm
./megatron/model/gpt_model.py:from megatron.model import LayerNorm
./tasks/zeroshot_gpt/evaluate.py:from megatron.model import GPTModel
./tasks/zeroshot_gpt/evaluate.py:from megatron.model import DistributedDataParallel as LocalDDP
./tasks/zeroshot_gpt/evaluate.py:from megatron.model import Float16Module
./checkpoint-analysis.ipynb: "from megatron.model import GPTModel\n",
./pretrain_bert.py:from megatron.model import BertModel
./pretrain_t5.py:from megatron.model import T5Model
./tools/generate_samples_gpt.py:from megatron.model import GPTModel
This is a very basic python task and requires no knowledge of Megatron or Deepspeed.
Steps:
git rm megatron/model/__init__.py
make test to ensure things still work.
Thank you.
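To see which symbols each file pulls from megatron.model before rewriting the imports, a small scanner along these lines may help. imported_symbols is a hypothetical helper; it only handles single-line "from megatron.model import ..." statements like the ones listed above, not parenthesized multi-line imports.

```python
import re

# Matches e.g. "from megatron.model import DistributedDataParallel as LocalDDP"
# but not "from megatron.model.transformer import ...".
IMPORT_RE = re.compile(r"from megatron\.model import (.+)")


def imported_symbols(source):
    """Return the sorted symbol names a source string imports from megatron.model."""
    symbols = set()
    for match in IMPORT_RE.finditer(source):
        for part in match.group(1).split(","):
            words = re.findall(r"\w+", part)
            if words:
                # The first word is the symbol; an optional "as alias" follows it.
                symbols.add(words[0])
    return sorted(symbols)
```

Running it over each file's contents gives the list of symbols whose defining module needs to be imported directly instead.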
pytorch/pytorch@8b87f9a#diff-f12c726e3e8cd2b4768f8984fef27059
I think we don't need to use apex fused layernorm anymore.
torch layernorm is better. what do you think about this?
cc. @stas00 @thomasw21
Multilingual models are trained on multiple datasets (one dataset per language) with a sampling probability for each dataset.
This feature (training on instances randomly sampled from multiple dataset files) seems to be supported in our Megatron code base (check here https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/run.sh#L9-L12, thanks to @sbmaruf for finding it), but we want to give it a try and make sure it is working with multiple gpus on multiple nodes. For testing, we can do the following:
Once we verify this feature is working, we will swap the small testing datasets with the preprocessed multilingual datasets for the real training.
Here are the instructions for how to get started with the code-base
- install things: https://github.com/bigscience-workshop/Megatron-DeepSpeed#setup
- get and preprocess data to work with: https://github.com/bigscience-workshop/Megatron-DeepSpeed#quick-pre-processing-to-start-training-with
- train: https://github.com/bigscience-workshop/Megatron-DeepSpeed#gpt-pretraining
[TODO: add details]
low priority
our conversion to HF script:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tools/convert_checkpoint/deepspeed_to_transformers.py
doesn't do anything about the tokenizer - until now we added the right tokenizer files manually, but this doesn't work at scale.
So this ticket is to extend the above script at the point of saving the checkpoint to lookup which tokenizer is used based on these 2 args:
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path t5-small \
for GPT2BPETokenizer it's only:
--tokenizer-type GPT2BPETokenizer \
and then fetch the right tokenizer files by loading the tokenizer and saving it into the folder of the checkpoint files.
So the required logic:
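from transformers import AutoTokenizer

# Map the Megatron tokenizer args to a Hugging Face model name.
if args.tokenizer_type == "GPT2BPETokenizer":
    mname = "gpt2"
elif args.tokenizer_type == "PretrainedFromHF":
    mname = args.tokenizer_name_or_path
else:
    raise ValueError(f"don't know how to handle args.tokenizer_type={args.tokenizer_type}")

# Fetch the tokenizer files and save them alongside the converted checkpoint.
# NOTE: save_pretrained() expects a directory path; output_state_dict is the
# name used in the original (untested) snippet.
tokenizer = AutoTokenizer.from_pretrained(mname)
tokenizer.save_pretrained(output_state_dict)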
mname in our case so far is either t5-small or gpt2
done.
I think this is all that's needed but I haven't tested any of this.
Extend the setup from this issue #27 by adding a few "zero-shot" multilingual datasets to the evaluation.
Check table 2 here https://arxiv.org/pdf/2010.11934.pdf for potential datasets.
I've been looking at the megatron code lately. When using the softmax kernel used in megatron, I observed that the results were different from the original torch softmax. How about adding a unit test for this kernel?
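Such a unit test needs a trusted reference to compare against. The sketch below is a pure-Python reference for scale -> upper-triangular (causal) mask -> softmax on one square score matrix; reference_causal_softmax and max_abs_diff are hypothetical names, and the kernel's output (converted to nested lists) would be the second argument of the comparison.

```python
import math


def reference_causal_softmax(scores, scale):
    """Scale, mask out positions j > i, then softmax each row."""
    probs = []
    for i, row in enumerate(scores):
        visible = [scale * x for x in row[: i + 1]]  # only keys j <= i
        row_max = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(v - row_max) for v in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        probs.append([e / total for e in exps] + masked_tail)
    return probs


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

A test would then assert that max_abs_diff(reference, kernel_output) stays below a tolerance chosen for the kernel's precision (fp16 needs a looser bound than fp32).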
Currently, parameter counts in utils.get_parameters_in_billions are inaccurate when PP > 1. Tied variables, in particular embedding layers, exist in several copies in the first and last PP stage, which causes double counts. For now the codebase only uses the count without embedding layers which is accurate, but it would be good for the count with embedding layers to also function, mostly for operation-counting.
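The double count can be avoided by counting each tied tensor once across stages, as in this toy sketch. It is not the code in utils.get_parameters_in_billions: count_unique_parameters is a hypothetical name, and each pipeline stage is modelled as a plain list of (numel, tie_key) pairs where tie_key is None for untied parameters.

```python
def count_unique_parameters(stages):
    """Sum parameter counts over pipeline stages, counting tied tensors once."""
    seen_ties = set()
    total = 0
    for stage in stages:
        for numel, tie_key in stage:
            if tie_key is not None:
                if tie_key in seen_ties:
                    continue  # another copy of a tied tensor, already counted
                seen_ties.add(tie_key)
            total += numel
    return total
```

With the embedding present in both the first and the last stage under the same tie_key, it contributes to the total only once.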
For background see: #40
We need an evaluation benchmark that we can use for iterating on the model design. It needs to satisfy the following requirements:
Update
Need to figure out why CI fails with 4 gpus, but works fine with 2.
I set it for now to 2 gpus, until we sort this out. 391930b
FAILED tests/test_model.py::MyTestCase::test_gpt - Failed: Timeout >300.0s
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_0_base - ...
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_1_cl - Ru...
FAILED tests/test_training.py::MegDSTestTraining::test_training_prefix_lm_all
the first one hangs on startup, the others fail with:
stderr: Traceback (most recent call last):
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_gpt.py", line 246, in <module>
stderr: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 165, in pretrain
stderr: Traceback (most recent call last):
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_gpt.py", line 246, in <module>
stderr: iteration = train(forward_step_func,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 734, in train
stderr: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 165, in pretrain
stderr: train_step(forward_step_func,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 405, in train_step
stderr: iteration = train(forward_step_func,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 734, in train
stderr: loss = model[0].train_batch(data_iter=data_iterator)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 329, in train_batch
stderr: self._exec_schedule(sched)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1313, in _exec_schedule
stderr: train_step(forward_step_func,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 405, in train_step
stderr: loss = model[0].train_batch(data_iter=data_iterator)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 329, in train_batch
stderr: self._exec_instr(**cmd.kwargs)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 723, in _exec_backward_pass
stderr: self._exec_schedule(sched)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1313, in _exec_schedule
stderr: local_part=self.grad_layer[1],
stderr: IndexError: list index out of range
As an extension of discussions from #47, start adding basic unit tests to ensure that new additions and implementations (e.g. GLU variants, prefix-LM masking schemes, etc.) work as intended. Possibly set up CI pipelines to ensure that tests pass on each push to main. This is likely mid to low priority.
Let's discuss which data is used in the test suite. After the discussion we can turn the outcome into guidelines for test writers.
Here is a very rough start:
Currently the main overhead in testing is the really slow startup of Meg-DS
Hopefully soon we will have a CI, so downloading large data files can slow things down.
The test suite ensures correctness, not speed, so having 1k records is more than enough, even for parallel processing.
That said, we can have extended slow tests that don't run by default and which can download large data, and then these can be used as well.
Comments, suggestions and ideas are super welcome.
After some OOM errors when preprocessing large files, I noticed that the caching mechanism can use unlimited memory because it is able to grow forever. This causes issues, in particular in multiprocessing cases, where this cache is multiplied by the number of workers. This prevents working with large datasets.
Note: I've tried switching to HF tokenizer, and the memory constraint becomes much lower since it doesn't cache.
In order to reduce the memory constraint, we can either:
cc @ontocord
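As an illustration of bounding such a cache, the sketch below wraps a per-word tokenization function in an LRU cache with a fixed maximum size. make_cached_tokenizer and tokenize_word are hypothetical names and the default maxsize is arbitrary; this is not how the existing tokenizer code is structured.

```python
from functools import lru_cache


def make_cached_tokenizer(tokenize_word, maxsize=65536):
    """Wrap a per-word tokenizer in an LRU cache that holds at most maxsize entries."""
    # Least-recently-used entries are evicted once the cache is full, so
    # memory per worker stays bounded no matter how large the input is.
    return lru_cache(maxsize=maxsize)(tokenize_word)
```

Since each worker process gets its own cache, the total bound is roughly maxsize entries times the number of workers.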