bigscience-workshop / megatron-deepspeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

License: Other

Python 90.10% Shell 1.31% Makefile 0.06% C++ 6.17% Cuda 2.14% C 0.23%


megatron-deepspeed's Issues

Train prefix-lm model

TODO: fill in details

[testing] add logging facility

as discussed elsewhere by @thomasw21 - we need to be able to test specific code paths.

One way to approach this is to add:

if foobar:
    do_some_xyz_work()  # placeholder for the XYZ branch's work
    logger.debug("path XYZ was run")

and then in a test:

        with CaptureStdout() as cs:
            # after activating DEBUG logging level
            execute_subprocess_async(cmd, env=self.get_env())

        self.assertIn("path XYZ was run", cs.out)

So we need:

to add a logging facility - Megatron doesn't have one - I recommend adapting https://github.com/huggingface/transformers/blob/master/src/transformers/utils/logging.py as it's greatly polished
a way to activate the desired log levels - again this can be borrowed from HF https://github.com/huggingface/transformers/blob/234cfefbb083d2614a55f6093b0badfb2efc3b45/src/transformers/training_args.py#L153-L161 - so adding these to arguments.py and then setting the desired log level via the command line, with the default being either INFO or WARNING.

Once this is laid out we can then start using it for testing.
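As a rough illustration of the two pieces listed above, here is a minimal sketch built only on the standard library. The --log-level flag, the LOG_LEVELS table and both function names are hypothetical; this is neither the HF implementation nor existing Megatron code.

```python
import argparse
import logging

# Hypothetical mapping from command-line names to stdlib logging levels.
LOG_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


def add_logging_args(parser):
    """Register a log-level option (assumed flag name: --log-level)."""
    parser.add_argument(
        "--log-level",
        choices=sorted(LOG_LEVELS),
        default="warning",
        help="Logger verbosity; defaults to warning.",
    )
    return parser


def get_logger(level_name, name="megatron_demo"):
    """Return a logger configured at the requested level."""
    logger = logging.getLogger(name)
    logger.setLevel(LOG_LEVELS[level_name])
    if not logger.handlers:
        # Attach a single stream handler so messages are actually emitted.
        logger.addHandler(logging.StreamHandler())
    return logger


if __name__ == "__main__":
    args = add_logging_args(argparse.ArgumentParser()).parse_args()
    get_logger(args.log_level).debug("path XYZ was run")
```

With something like this in place, a test could launch the script with the debug level enabled and assert on the captured output.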

Simple English-only evaluation benchmark

A simple script that evaluates a Megatron checkpoint on a small list of zero-shot evaluation datasets. We will use this to iterate on the model design. Maybe use the evaluation dataset here https://github.com/kingoflolz/mesh-transformer-jax
This is different from the more comprehensive benchmark the evaluation team is working on. This should be a quick easy-to-run benchmark.

Version Lock `requirements.txt`

Might be good to either specify versions, or ideally, switch to a conda environment for this repo. Critically, the base requirements for Megatron-DeepSpeed in requirements.txt just require "torch" whereas you actually need torch > 1.7.0, notably with CUDA 11.1 or higher for this repo to actually work!

Cannot import C++ compiled "helpers"

I am having errors importing the compiled C++ helpers.cpp into python in gpt_dataset.py

            # Use C++ implementation for speed.
            # First compile and then import.
             
            from megatron.data import helpers
            assert doc_idx.dtype == np.int32
            assert sizes.dtype == np.int32
            sample_idx = helpers.build_sample_idx(sizes, doc_idx, seq_length,
                                                  num_epochs, tokens_per_epoch)
            # sample_idx = _build_sample_idx(sizes, doc_idx, seq_length,
            #                               num_epochs, tokens_per_epoch)

Leading to the following error

ImportError: cannot import name 'helpers' from 'megatron.data' (XXXX/Megatron-DeepSpeed/megatron/data/__init__.py)
Killing subprocess 28861

Update:
I managed to isolate the problem by compiling helpers.cpp separately using gcc v9.3.1. Compilation works, but running import helpers still fails.

(venv-megatron) bash-4.2$ pwd
XXX/Megatron-DeepSpeed/megatron/data

(venv-megatron) bash-4.2$ gcc --version | head -1 
gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)

(venv-megatron) bash-4.2$ make
make: python3-config: Command not found
make: python3-config: Command not found
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/nfs/core/python/3.9/include/python3.9 -I/xxxx/bigscience/venv-megatron/lib/python3.9/site-packages/pybind11/include helpers.cpp -o helpers

(venv-megatron) bash-4.2$ ipython
Python 3.9.4 (default, Apr  7 2021, 12:46:00)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import helpers
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-d249c8495052> in <module>
----> 1 import helpers

ModuleNotFoundError: No module named 'helpers'

In [2]:

posted the same issue on the original repo : NVIDIA/Megatron-LM#143
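One possible reading of the make output above, offered as an assumption rather than a confirmed diagnosis: python3-config was not found, so anything the Makefile expected from it came back empty, and the library was written to a file named helpers with no extension. CPython only imports extension modules whose filename carries a recognised suffix. The sketch below uses only the standard library to show which filename the running interpreter would accept; the function names are hypothetical.

```python
import importlib.machinery
import sysconfig


def expected_extension_filename(module_name):
    """Filename the current interpreter expects for a compiled module."""
    suffix = sysconfig.get_config_var("EXT_SUFFIX")
    if not suffix:
        # Fall back to the first suffix the import system recognises.
        suffix = importlib.machinery.EXTENSION_SUFFIXES[0]
    return module_name + suffix


def is_importable_extension_name(filename):
    """True if the filename ends with a suffix the import system accepts."""
    return any(filename.endswith(s) for s in importlib.machinery.EXTENSION_SUFFIXES)


if __name__ == "__main__":
    print(expected_extension_filename("helpers"))
    print(is_importable_extension_name("helpers"))
```

If this reading is right, passing the printed filename to the -o option of the g++ command (or renaming the output file) would be the thing to try.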

Import issues when using evaluation scripts : `module 'megatron' has no attribute 'model'`

Hello everyone,

There are several evaluation scripts, which can be found here:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tasks

However, running these scripts raises the following error:

$ python tasks/main.py --task WIKITEXT103 --num-layers 4 --hidden-size 256 --num-attention-heads 4 --seq-length 96 --max-position-embeddings 96 --fp16 --valid-data /gpfswork/rech/rcy/ulz63oj/datasets/oscar_shuff/en/en_shuf_part_1295.txt --tokenizer-type PretrainedFromHFTokenizers --tokenizer-name-or-path /gpfswork/rech/rcy/ulz63oj/tokenizers/bpe_en_5.0_30000.json --load /gpfsscratch/rech/rcy/ulz63oj/GPT_L4_A4_H256/bpe_en_5.0_30000 --micro-batch-size 8 --checkpoint-activations --log-interval 10 --no-load-optim --no-load-rng
Traceback (most recent call last):
  File "tasks/main.py", line 70, in <module>
    initialize_megatron(extra_args_provider=get_tasks_args)
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/initialize.py", line 53, in initialize_megatron
    set_global_variables(extra_args_provider=extra_args_provider,
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/global_vars.py", line 88, in set_global_variables
    args = _parse_args(extra_args_provider=extra_args_provider,
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/global_vars.py", line 105, in _parse_args
    _GLOBAL_ARGS = parse_args(extra_args_provider=extra_args_provider,
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/arguments.py", line 35, in parse_args
    parser = _add_network_size_args(parser)
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/arguments.py", line 318, in _add_network_size_args
    choices=megatron.model.glu_activations.GLU_ACTIVATIONS.keys(),
AttributeError: module 'megatron' has no attribute 'model'

This error does not appear when running pretrain_gpt.py.

@thomasw21 found a workaround: add from . import model in
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/__init__.py
and install Megatron using pip install -e . There may be a better solution, though, and we do not understand why model is not an attribute of megatron after setup.
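A possible explanation, stated as general Python behaviour rather than a verified diagnosis of this repo: importing a package does not import its submodules, so package.submodule only exists as an attribute once something has imported that submodule. pretrain_gpt.py may simply import megatron.model earlier than tasks/main.py does. The sketch below reproduces the effect with a throwaway package; demo_pkg and the function name are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def submodule_visibility():
    """Return (visible_before, visible_after) importing the submodule."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical throwaway package
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")  # note: no "from . import model"
        (pkg / "model.py").write_text("VALUE = 1\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            demo = importlib.import_module("demo_pkg")
            before = hasattr(demo, "model")  # submodule not imported yet
            importlib.import_module("demo_pkg.model")
            after = hasattr(demo, "model")  # import binds it on the package
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg.model", None)
            sys.modules.pop("demo_pkg", None)
    return before, after


if __name__ == "__main__":
    print(submodule_visibility())
```

Adding from . import model to the package's __init__.py makes the attribute available as soon as the package is imported, which is consistent with the workaround above.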

To reproduce, find a pretrained checkpoint and some text data and run the script provided by NVIDIA:

TASK="WIKITEXT103"

VALID_DATA=/path/to/valid/data
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=/path/to/checkpoint

COMMON_TASK_ARGS=" \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --fp16 \
    --vocab-file $VOCAB_FILE"

python tasks/main.py \
    --task $TASK \
    $COMMON_TASK_ARGS \
    --valid-data $VALID_DATA \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file $MERGE_FILE \
    --load $CHECKPOINT_PATH \
    --micro-batch-size 8 \
    --checkpoint-activations \
    --log-interval 10 \
    --no-load-optim \
    --no-load-rng

adding consistency calculations/checks at init time

Stella pointed out how they do consistency calculations/checks with NeoX:
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/neox_arguments/arguments.py

It'd be good for someone to study what they did over the base Megatron-LM and replicate anything that can help our work, since some good checks can save days of running a model under a wrong setup thinking it's doing something else.

I haven't studied what they did, so I don't have any specific suggestions here.

Thank you.

clone HF's `GPT2` to create `GPTMeg` with a few tiny changes.

As can be seen from #121 we have a divergence between Meg and HF GPT2, while using the same weights under fp16.

So the proposed solution to enable users to use BigScience-pretrained models is to create a new architecture, which would be an identical clone of HF's GPT2, but with some changes.

Here are 3 changes:

import torch

def apply_overrides():

    # 1. layer norm needs to be done in fp32 and then cast back to fp16 to match meg.
    torch_layer_norm_orig = torch.layer_norm
    def torch_layer_norm_force_fp32(input, normalized_shape, weight, bias, eps, cudnn_enabled):
        # upcast inputs and parameters to fp32, normalize, then cast the result back to fp16
        out = torch_layer_norm_orig(input.float(), normalized_shape, weight.float(), bias.float(), eps, torch.backends.cudnn.enabled).half()
        return out
    torch.layer_norm = torch_layer_norm_force_fp32


    # 2. MLP uses a slightly different activation function with a custom bwd
    import transformers.activations
    @torch.jit.script
    def gelu_megatron_fwd(x):
        return  x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))

    @torch.jit.script
    def gelu_megatron_bwd(g, x):
        tanh_out = torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x))
        # sqrt(2/pi) * 3 * 0.044715 -> 0.1070322243
        ff = 0.5 * x * ((1 - tanh_out * tanh_out) * (0.79788456 + 0.1070322243 * x * x)) + 0.5 * (1 + tanh_out)
        return ff*g

    class GeLUFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, input):
            ctx.save_for_backward(input)
            return gelu_megatron_fwd(input)

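        @staticmethod
        def backward(ctx, grad_output):
            # saved_tensors is a tuple; unpack the single saved input
            (input,) = ctx.saved_tensors
            # forward() takes one tensor, so backward() must return exactly one gradient
            return gelu_megatron_bwd(grad_output, input)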

    transformers.activations.gelu_fast = GeLUFunction.apply
    transformers.activations.ACT2FN["gelu_fast"] = transformers.activations.gelu_fast


    # 3. torch.baddbmm() (meg) produces slightly different results than torch.matmul, so override to use `torch.baddbmm`
    import transformers.models.gpt2.modeling_gpt2
    from torch import nn
    def new_attn(self, query, key, value, attention_mask=None, head_mask=None):
        output_size = (query.size(0), key.size(1), query.size(2), key.size(2))
        matmul_result = torch.empty(output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query.dtype, device=query.device)

        factor = float(value.size(-1)) ** 0.5
        matmul_result = torch.baddbmm(
            matmul_result,
            query.reshape(-1, query.shape[2], query.shape[3]),  # [b * np, sq, hn]
            key.reshape(-1, key.shape[2], key.shape[3]).transpose(1, 2),  # [b * np, hn, sk]
            beta=0.0,
            alpha=1.0 / factor
        )
        attn_weights = matmul_result.view(*output_size)

        # attn_weights = torch.matmul(query, key.transpose(-1, -2))
        #
        # if self.scale_attn_weights:
        #     attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)

        # Layer-wise attention scaling
        if self.scale_attn_by_inverse_layer_idx:
            attn_weights = attn_weights / float(self.layer_idx + 1)

        if not self.is_cross_attention:
            # if only "normal" attention layer implements causal mask
            query_length, key_length = query.size(-2), key.size(-2)
            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
            attn_weights = torch.where(causal_mask, attn_weights, self.masked_bias.to(attn_weights.dtype))

        if attention_mask is not None:
            # Apply the attention mask
            attn_weights = attn_weights + attention_mask

        attn_weights = nn.Softmax(dim=-1)(attn_weights)

        # Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op otherwise
        attn_weights = attn_weights.type(value.dtype)
        attn_weights = self.attn_dropout(attn_weights)

        # Mask heads if we want to
        if head_mask is not None:
            attn_weights = attn_weights * head_mask

        attn_output = torch.matmul(attn_weights, value)

        return attn_output, attn_weights

    transformers.models.gpt2.modeling_gpt2.GPT2Attention._attn = new_attn

Here is how we are going to tackle the activation function: huggingface/transformers#13997

So a PR will need to be filed with https://github.com/huggingface/transformers/

extract and log grad norm for individual layers

The paper NormFormer: Improved Transformer Pretraining with Extra Normalization https://arxiv.org/abs/2110.09456 suggests that under pre-LN:

gradients at earlier layers tend to be larger than gradients at later layers

so we want to verify that this is so in our case before acting on it and potentially integrating NormFormer.

So we need to expand the tensorboard and logs to log grad norm for individual layers, perhaps as in the paper we can log 5 layers: 0, 1, int(n_layers/2), -2, -1

For reference see p6 in the paper (graphs and discussion).

Currently only L2 average of all layer grad norms is calculated and logged (a single number).

Performance-wise this will require doing some extra calculations but not much - it's really just figuring out how to broadcast this info to TB so that multiple-points can be logged at once.

Additionally, we could activate this tool on demand - e.g. after encountering a spike we could roll back to the last good checkpoint and run a cycle with this debug feature enabled.
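To make the layer selection concrete, here is a small dependency-free sketch. It resolves the five indices mentioned above for a given depth and computes one L2 norm per selected layer from flat lists of gradient values. All names are hypothetical, and real code would operate on parameter gradients and reduce across model-parallel ranks, which this sketch ignores.

```python
import math


def layers_to_log(n_layers):
    """Resolve 0, 1, int(n_layers/2), -2, -1 to unique non-negative indices."""
    candidates = [0, 1, int(n_layers / 2), -2, -1]
    # The modulo maps negative indices to positions from the end.
    return sorted({i % n_layers for i in candidates})


def l2_norm(values):
    """L2 norm of a flat iterable of gradient values."""
    return math.sqrt(sum(v * v for v in values))


def per_layer_grad_norms(grads_by_layer):
    """Map each selected layer index to the L2 norm of its gradients."""
    n_layers = len(grads_by_layer)
    return {i: l2_norm(grads_by_layer[i]) for i in layers_to_log(n_layers)}


if __name__ == "__main__":
    print(layers_to_log(24))
```

Each entry of the returned dictionary could then be written to TensorBoard under its own tag, for example one scalar per layer index.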

DeBERTa-like attention mechanism

In this issue, we discuss how viable/interesting it might be to implement a DeBERTa-like attention mechanism:

https://arxiv.org/abs/2006.03654

Things to take into account:

performance enhancements: Check with HF pretrained model to see first?
implementation cost: How much would someone need to spend on implementing that feature?
implementation feasibility: It might not work well with the Megatron-DeepSpeed setup; we need to check that.

Implement the ML Flow experiment tracker

Motivation. As @sashavor suggested, the carbon footprint working group needs an experiment tracker to properly follow all runs being done. An experiment tracker could also be more broadly interesting to centralise in one place all experiments being done.

Proposed solution. Following a discussion with @thomwolf, the carbon WG has identified MLFlow as a promising open-source option: it supports having a dedicated server, and can interface with TensorBoard logs (which we already produce). This blog post shows how it integrates with Tensorflow. There is also documentation on how to interface with PyTorch models.

Implementation. It's not quite clear how nicely MLFlow will play with Megatron/DeepSpeed and the limited networking on Jean Zay. The goal here is to first build a proof-of-concept showing that MLFlow can integrate into our codebase and phone back to the centralised server from Jean Zay. We can then consider the finer details of reporting all metrics of interest to us.

Train multilingual autoregressive baseline

TODO: add details

Implement Gradient Noise Scale monitoring

Follow appendix A.1 https://arxiv.org/pdf/1812.06162.pdf to implement monitoring of gradient noise scale and add it to the tensorboard log.
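As I read that appendix, the simple noise scale can be estimated from the squared gradient norm measured at two different batch sizes: one unbiased estimate for the true squared gradient norm, one for the trace of the per-example gradient covariance, and their ratio. The sketch below encodes that reading in plain Python. The function and argument names are hypothetical, and the formulas should be checked against the paper before relying on them.

```python
def gradient_noise_scale(b_small, b_big, sq_norm_small, sq_norm_big):
    """Estimate (|G|^2, S, S / |G|^2) from two batch sizes.

    sq_norm_small and sq_norm_big are the squared L2 norms of the gradient
    computed with batch sizes b_small and b_big respectively.
    """
    if b_small >= b_big:
        raise ValueError("b_small must be smaller than b_big")
    # Estimate of the true squared gradient norm |G|^2.
    g2 = (b_big * sq_norm_big - b_small * sq_norm_small) / (b_big - b_small)
    # Estimate of the gradient covariance trace S.
    s = (sq_norm_small - sq_norm_big) / (1.0 / b_small - 1.0 / b_big)
    return g2, s, s / g2


if __name__ == "__main__":
    # Synthetic check: E[|G_B|^2] = |G|^2 + S / B with |G|^2 = 4 and S = 8.
    print(gradient_noise_scale(2, 8, 4 + 8 / 2, 4 + 8 / 8))
```

Single-step estimates are likely to be noisy, so smoothing the two estimates separately over steps before taking the ratio is probably needed before logging to TensorBoard.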

preprocess mC4

Languages: Arabic, Swahili (Bantu), Chinese, Catalan, English, French, Indic (Hindi,Urdu,Bangla), Indonesian, Portuguese, Spanish, Russian, Japanese, Amharic

Then upload to the Google storage bucket
https://console.cloud.google.com/storage/browser/bigscience/mc4_preprocessing

Add Megatron support for the EleutherAI Evaluation Harness

Add the ability to run the EleutherAI Evaluation Harness on Megatron checkpoints. Right now we are relying on converting Megatron checkpoints to Hugging Face checkpoints, which is an error-prone process. We also have to use Megatron anyway to run the 200B model.

Implementation details:
You will use this HF gpt2 model implementation here as your reference. Here are more details:

Edit __init__, create_from_arg_string to load the Megatron checkpoints
Edit the _model_call function to call the Megatron model and read logits back
The functions loglikelihood, loglikelihoods, _loglikelihood_tokens might (or might not) require a little tweaking
Leave the function greedy_until unimplemented (raise an exception), we don't need it for now.
Check this test that shows how to load and call a Megatron checkpoint.
Here's one Megatron checkpoint that you can work with.
A relatively close implementation is already in the GPT-NeoX repo here and it might be helpful to check as well.

implement GLU activation function

Pick and implement one of the variants of the GLU activation.

Check here for comparison and equations
https://arxiv.org/pdf/2002.05202.pdf
https://arxiv.org/pdf/2102.11972.pdf

and here for implementation of one of them
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/activations.py

Add checks to confirm that the checkpoint conversion script works correctly

We now have a script that converts megatron-deepspeed checkpoints to HF-transformers checkpoints.
Project is here and the script is here.
However, the script doesn't have unit tests that confirm that the conversion is correct.

The goal of this issue is to add such tests. The idea is to run the forward pass of both models (before and after conversion) with any random input, then use torch.allclose to assert that the output loss and the logits of both models perfectly match.

Consistency in code convention

As discussed in #128, maintain code style consistency by setting up some basic formatters.

Makefile
pyproject.toml
setup.cfg

Benchmark HF transformers as point of reference.

Calling IndexedDatasetBuilder directly with a best_fit datatype fails

I don't know whether this is intended to work or not, but I found the following program:

from megatron.data.indexed_dataset import IndexedDatasetBuilder, best_fitting_dtype

best_dtype = best_fitting_dtype(10_000)
IndexedDatasetBuilder("testfile", dtype=best_dtype)

leads to an error like:

  File "/path/to/Megatron-DeepSpeed.git/megatron/data/indexed_dataset.py", line 284, in __init__
    self.element_size = self.element_sizes[self.dtype]
KeyError: <class 'numpy.uint16'>

This shows up because best_fitting_dtype will return numpy.uint16 for small vocabs:

Megatron-DeepSpeed/megatron/data/indexed_dataset.py

Lines 25 to 27 in c680954

    
def best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16

but that particular type is missing from the element_sizes table.

Megatron-DeepSpeed/megatron/data/indexed_dataset.py

Lines 268 to 276 in c680954

    
element_sizes = {
    np.uint8: 1,
    np.int8: 1,
    np.int16: 2,
    np.int32: 4,
    np.int64: 8,
    np.float: 4,
    np.double: 8
}

which is used in the IndexedDatasetBuilder constructor here:

Megatron-DeepSpeed/megatron/data/indexed_dataset.py

Line 284 in c680954

self.element_size = self.element_sizes[self.dtype]

Should something like this work?

If so, it seems like either the best_fitting_dtype function should return a different type or uint16 should be added to the element_sizes table like f706108.
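The failure mode here is a hand-maintained size table falling out of sync with the types a helper can return. Below is a minimal sketch of one alternative: fall back to the size reported by the type itself when the table has no entry. It uses standard-library ctypes types as stand-ins for the NumPy dtypes so that it runs without dependencies; ELEMENT_SIZES and element_size are hypothetical names, not the repository's code.

```python
import ctypes

# Hand-maintained table in the spirit of the one quoted above.
# ctypes types stand in for NumPy dtypes; c_uint16 is deliberately missing.
ELEMENT_SIZES = {
    ctypes.c_uint8: 1,
    ctypes.c_int8: 1,
    ctypes.c_int16: 2,
    ctypes.c_int32: 4,
    ctypes.c_int64: 8,
}


def element_size(ctype):
    """Look up the byte size, falling back to the size the type reports."""
    try:
        return ELEMENT_SIZES[ctype]
    except KeyError:
        # A type absent from the table no longer raises; ask the type itself.
        return ctypes.sizeof(ctype)


if __name__ == "__main__":
    print(element_size(ctypes.c_uint16))
```

With NumPy, the analogous self-reported size is np.dtype(dtype).itemsize, which would make the table unnecessary for size lookups.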

Implement prefix-lm as in the T5 paper

AFAIU, the current implementation doesn't support prefix-lm. The T5 paper samples the prefix length randomly from 0 to max_sequence_length (which is 2048 in our case).
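To illustrate the idea, here is a dependency-free sketch that samples a prefix length and builds the corresponding attention mask: positions inside the prefix are visible to every position, and the rest of the sequence is causal. Uniform sampling over the inclusive range is an assumption about what "randomly" means here, and the function names are hypothetical.

```python
import random


def sample_prefix_len(max_seq_len, rng):
    """Sample a prefix length in [0, max_seq_len] (uniform by assumption)."""
    return rng.randint(0, max_seq_len)


def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        # Full visibility of the prefix, causal visibility elsewhere.
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    prefix_len = sample_prefix_len(8, rng)
    for row in prefix_lm_mask(8, prefix_len):
        print("".join("x" if allowed else "." for allowed in row))
```

A prefix length of 0 reduces to an ordinary causal mask, and a prefix length equal to the sequence length gives fully bidirectional attention.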

Train baseline + rotary embeddings

The simplest model, with rotary embeddings. Don't necessarily train to 300B tokens to compare.

Need model size dumped at init

We need to have a diagnostic model size dumped during the framework init. We currently get a report per rank and not the total.

 > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (2, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 1986465792
 > number of parameters on (tensor, pipeline) model parallel rank (3, 7): 1986498560

Later on, the ZeRO engine does dump the right number, among multiple other numbers, and repeats it on each rank:

[2021-10-02 16:08:53,028] [INFO] [engine.py:134:__init__] RANK=0 STAGE=0 LAYERS=7 [0, 7) STAGE_PARAMS=1986465792 (1986.466M) TOTAL_PARAMS=57778896896 (57778.897M) UNIQUE_PARAMS=56814206976 (56814.207M)

But ideally we just want a print like:

Model size: 57B (57778896896 params)

Just on rank 0.

Thanks.
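A sketch of the requested line, assuming the total parameter count is already available (for example the TOTAL_PARAMS value in the engine log above). The billions figure is floored rather than rounded so that the output matches the example; format_model_size and maybe_print_model_size are hypothetical names, and the rank argument stands in for however the framework exposes the global rank.

```python
def format_model_size(total_params):
    """Format a parameter count as 'Model size: <N>B (<count> params)'."""
    billions = total_params // 10**9  # floor, so 57.78e9 prints as 57B
    return f"Model size: {billions}B ({total_params} params)"


def maybe_print_model_size(total_params, rank):
    """Print the summary once, on rank 0 only; return whether it printed."""
    if rank != 0:
        return False
    print(format_model_size(total_params))
    return True


if __name__ == "__main__":
    maybe_print_model_size(57778896896, rank=0)
```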

Plot scaling laws of our baseline models

For our three baselines on different datasets (OSCAR, C4, The Pile), we would like to plot scaling laws and retrieve their coefficients. Specifically, we are looking to reproduce Figure 1 of Scaling Laws for Neural Language Models.

The TensorBoard data for the baseline runs can be retrieved on the Big Science space on HuggingFace: it's the tr3 runs with tensorboard in their name. The naming scheme (tr3b, tr3c, etc.) is explained here.
For C4, we have an XL, L, and M model (tr3, tr3c, tr3c) with short warm-up. For OSCAR and The Pile, we have an XL, L, M, and S model (tr3d, tr3g, tr3h, tr3i and tr3, tr3j, tr3k, tr3l). For OSCAR, we should also add the 13B run to see if the fits hold (that's tr1-13B).
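Retrieving the coefficients could be sketched as an ordinary least-squares fit in log-log space, assuming each curve follows a power law of the form loss = a * x^(-alpha), where x is parameters, tokens or compute. The code below is a dependency-free illustration on synthetic data; fit_power_law is a hypothetical name, and the numbers in the usage example are made up rather than taken from any run.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by least squares on (log x, log y).

    Returns (a, alpha). Requires at least two distinct positive x values.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - alpha * log x, so alpha is the negated slope.
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic data only: a = 10 and alpha = 0.076 are made-up values.
    sizes = [1e6, 1e7, 1e8, 1e9]
    losses = [10.0 * n ** -0.076 for n in sizes]
    print(fit_power_law(sizes, losses))
```

For the real runs, the x and y values would come from the TensorBoard scalars of each model size, and the fitted line could be overlaid on a log-log plot to compare against the 13B run.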

`PretrainedFromHF` crashes on 1 gpu

When using PretrainedFromHF, it seems to require at least 2 gpus and specifically TP=2; with 1 gpu it crashes.

To reproduce:

CHECKPOINT_PATH=checkpoints/gpt2
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt
DATA_PATH=data/meg-gpt2_oscar-combined_text_document
TENSORBOARD_PATH=output_dir/tensorboard

TP_SIZE=1
PP_SIZE=1
N_GPUS=1

NLAYERS=24
NHIDDEN=1024
NHEADS=16
FFN_HIDDEN_SIZE=4096
SEQ_LEN=2048


GPT_ARGS=" \
    --exit-interval 10 \
    --num-layers $NLAYERS \
    --hidden-size $NHIDDEN \
    --num-attention-heads $NHEADS \
    --seq-length $SEQ_LEN \
    --max-position-embeddings $SEQ_LEN  \
    --micro-batch-size 2 \
    --rampup-batch-size 4 4 1_000 \
    --global-batch-size 16 \
    --train-samples 100 \
    --optimizer adam \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-8 \
    --lr 1e-4 \
    --lr-warmup-samples 5 \
    --clip-grad 1.0 \
    --weight-decay 1e-1 \
    --fp16 \
    --tensor-model-parallel-size $TP_SIZE \
    --pipeline-model-parallel-size $PP_SIZE \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path t5-small \
    "

OUTPUT_ARGS=" \
    --log-interval 10 \
    --save-interval $SAVE_INTERVAL \
    --eval-interval 100 \
    --eval-iters 10 \
    --checkpoint-activations \
    "

DATA_ARGS=" \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path $DATA_PATH \
    --tensorboard-dir $TENSORBOARD_PATH \
    --tensorboard-queue-size 5 \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    "

ALL_ARGS="$GPT_ARGS $OUTPUT_ARGS $DATA_ARGS"

# if you can't stand pt-1.9 launcher noise
export LOGLEVEL=WARNING

LAUNCHER="deepspeed --num_gpus $N_GPUS"

CMD="$LAUNCHER pretrain_gpt.py $ALL_ARGS"

echo $CMD

#rm -rf $CHECKPOINT_PATH
$CMD

It crashes w/ or w/o deepspeed enabled, the above script is w/o deepspeed.

The log:

[before the start of training step] datetime: 2021-09-08 17:38:58 
[2021-09-08 17:38:58,791] [INFO] [checkpointing.py:408:forward] Activation Checkpointing Information
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:409:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:412:forward] ----contiguous Memory Checkpointing False with 2 total layers
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:415:forward] ----Synchronization False
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:416:forward] ----Profiling time in checkpointing False
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [159,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [159,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [159,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

[...] some 10000 of these [...]

/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [339,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [339,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "pretrain_gpt.py", line 222, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/training.py", line 147, in pretrain
    iteration = train(forward_step_func,
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/training.py", line 686, in train
    train_step(forward_step_func,
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/training.py", line 387, in train_step
    loss = model[0].train_batch(data_iter=data_iterator)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 291, in train_batch
    self._exec_schedule(sched)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 1237, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 587, in _exec_forward_pass
    outputs = super().forward(inputs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/engine.py", line 1164, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 347, in forward
    x = self.activation_checkpoint_func(
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 690, in checkpoint
    CheckpointFunction.apply(function, all_outputs, *args)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 494, in forward
    outputs = run_function(*inputs_cuda)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 325, in exec_func
    inputs = layer(inputs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/model/transformer.py", line 585, in forward
    return super().forward(hidden_states, attention_mask, **kwargs)
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/model/transformer.py", line 477, in forward
    self.self_attention(layernorm_output,
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/model/transformer.py", line 215, in forward
    mixed_x_layer, _ = self.query_key_value(hidden_states)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/mpu/layers.py", line 289, in forward
    output_parallel = F.linear(input_parallel, self.weight, bias)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1623448278899/work/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f99385a7a22 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10ac3 (0x7f9938809ac3 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f993880b167 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f99385915a4 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa249ba (0x7f99b1e819ba in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa24a51 (0x7f99b1e81a51 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1932c6 (0x55f7dd66d2c6 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #7: <unknown function> + 0x1592ac (0x55f7dd6332ac in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #8: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #9: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #10: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #11: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #12: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #13: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #14: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #15: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #16: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #17: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #18: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #19: <unknown function> + 0x158e77 (0x55f7dd632e77 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #20: <unknown function> + 0x1590dc (0x55f7dd6330dc in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #21: <unknown function> + 0x1e3643 (0x55f7dd6bd643 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #22: <unknown function> + 0x159109 (0x55f7dd633109 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #23: <unknown function> + 0x1e3643 (0x55f7dd6bd643 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #24: <unknown function> + 0x176057 (0x55f7dd650057 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #25: _PyModule_ClearDict + 0x103 (0x55f7dd687d53 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #26: PyImport_Cleanup + 0x578 (0x55f7dd6aff88 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #27: Py_FinalizeEx + 0x79 (0x55f7dd6e1a49 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #28: Py_RunMain + 0x183 (0x55f7dd6e3893 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #29: Py_BytesMain + 0x39 (0x55f7dd6e3ca9 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #30: __libc_start_main + 0xf3 (0x7f99e84330b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #31: <unknown function> + 0x1e21c7 (0x55f7dd6bc1c7 in /home/stas/anaconda3/envs/py38-pt19/bin/python)

It works if I use 2 gpus and TP=2.

Originally discovered by @tjruwase and I was able to reproduce it too.

We were trying to use the 350m checkpoint you gave us to test the checkpoint conversion tool, but we couldn't, since the final checkpoint needs to be a single file, i.e. produced with 1 gpu.

I also see PretrainedFromHF is not being tested. So probably a test is needed as well.

Thank you.

@TevenLeScao

How to load a tensor+pipeline parallel checkpoint for inference tasks?

Following the training script here as a template:

https://github.com/bigscience-workshop/bigscience/blob/master/train/tr1-13B-base/tr1-13B-round1.slurm

I've trained some models using 2-way tensor parallelism and 4-way pipeline parallelism, which produces a number of checkpoints in directories like "global_step26000".

I'm now trying to use one of those trained checkpoints to do inference. In particular, I'm trying to work with modified versions of these scripts where I can provide new strings that are tokenized and processed on the fly:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/examples/generate_text.sh
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tools/generate_samples_gpt.py

I've tried a number of different approaches, but no luck so far. I also can't seem to find any instructions written up on that.

Anyone know the steps to load one of those checkpoints for inference on new data?

update converted models to include tokenizer files

Update released model files to:

- include the correct tokenizer files (t5-small or gpt2)
- fill out the config.tokenizer_class

HUB:

GCS:

gs://bigscience-backups/tr1-13B/

This will be automated for the future by #126 - but for now do it manually since it's just 2 sets of files.

Rotary embeddings

Description

Implement rotary embeddings:

paper: https://arxiv.org/pdf/2104.09864.pdf
code: https://github.com/ZhuiyiTechnology/roformer

Done

train a model using rotary embeddings

Remove json intermediary for preprocessing datasets.

@stas00 suggested we might want to remove the whole json intermediary.

#18 (comment)

This issue should be used to discuss if that's a feature we want, and if so, define clearly what we want.

group the tensorboard logs

It has been proposed to group the tensorboard logs, by adding a group prefix to the log-key, e.g.:

tb.add_scalar("group a/batch size", batch_size, iteration)

Currently proposed groups:

Batch-size
- Batch-size
- Batch-size vs samples
Grad-norm
- Grad norm
- Grad norm vs samples
Learning rate
- Learning rate
- Learning rate vs samples
Lm loss train
- Lm loss
- Lm loss vs samples
Lm loss validation
- Lm loss
- Lm loss vs samples
- Lm loss ppl
- Lm loss ppl vs samples
Loss scale
- Loss scale
- Loss scale vs samples
Num zeros
- Num zeros
- Num zeros vs samples

The question is whether to hardcode these or to have a configurable map, in case different trainings want different groups.
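The configurable-map option could be quite small. A minimal sketch, where `METRIC_GROUPS`, `grouped_tag` and the "misc" fallback group are hypothetical names used for illustration, not existing Meg-DS code:

```python
# Hypothetical map from metric name to its TensorBoard group.
# A training could override it to regroup metrics without touching call sites.
METRIC_GROUPS = {
    "batch-size": "batch-size",
    "batch-size vs samples": "batch-size",
    "grad-norm": "grad-norm",
    "grad-norm vs samples": "grad-norm",
    "lm loss": "lm-loss-training",
    "lm loss vs samples": "lm-loss-training",
}


def grouped_tag(name, groups=METRIC_GROUPS, default_group="misc"):
    """Return "<group>/<name>"; TensorBoard groups scalars by the part before "/"."""
    group = groups.get(name, default_group)
    return f"{group}/{name}"


# Usage with a SummaryWriter named `tb` (not created here):
#   tb.add_scalar(grouped_tag("batch-size"), batch_size, iteration)
```

Hardcoding would amount to keeping only the default map; passing a different `groups` dict is what a per-training override could look like.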

@VictorSanh, @slippylolo, @ibeltagy

Distributed terashuf

I don't know how much time I will have to polish this up, but I have a prototype MPI-enabled "terashuf". This uses an algorithm similar to terashuf, where contiguous records from the source file are shuffled in segments, and then one randomly picks leading records from the shuffled segments to put together the final shuffled file.

This prototype currently stores the shuffled segments in memory (rather than files), and so it requires one to be able to load the full file into distributed memory. Currently each rank reads a portion of the source file into memory, shuffles that section, and then ranks exchange lines with each other in order to write out the file in contiguous chunks.
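For readers unfamiliar with the approach, here is a single-process sketch of the segment-based shuffle described above. It is not the MPI prototype: there is no rank exchange and no file I/O, and the name `segment_shuffle` is made up for illustration. Picking a segment with probability proportional to its remaining record count is an assumption on my part; the description above only says segments are picked randomly.

```python
import random


def segment_shuffle(records, segment_size, seed=0):
    """Single-process sketch of a terashuf-style shuffle.

    1. Split the records into contiguous segments and shuffle each one.
    2. Repeatedly pick a segment with probability proportional to the number
       of records it still holds, and emit that segment's leading record.
    """
    rng = random.Random(seed)

    segments = []
    for start in range(0, len(records), segment_size):
        segment = list(records[start:start + segment_size])
        rng.shuffle(segment)
        segment.reverse()  # so that pop() takes the leading record in O(1)
        segments.append(segment)

    shuffled = []
    while segments:
        weights = [len(segment) for segment in segments]
        index = rng.choices(range(len(segments)), weights=weights)[0]
        shuffled.append(segments[index].pop())
        if not segments[index]:
            segments.pop(index)  # drop exhausted segments
    return shuffled
```

In the distributed version each rank would own some of the segments, which is why the whole file has to fit in distributed memory.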

It can shuffle the oscar.jsonl file in about 10 minutes using 80 procs on 8 nodes on my system.

2021-09-01T12:19:15: 0: Wrote 1319979521428 of 1320971843503 bytes (99.92%) in 343 secs, 3668.675 MB/s
2021-09-01T12:19:20: 0: Waiting for ranks to finish ...
2021-09-01T12:19:20: Seconds to write file: 348.45524168014526

real	6m25.041s

Just posting this notice in case others need to shuffle a large JSON file in a hurry.

https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/tools/distshuf.py

It currently requires mpi4py and an mpi4py enabled DistData class.

https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/megatron/data/distdata_mpi.py

I first attempted a torch.distributed version, but hit some problems. I haven't yet gone back to see if a torch.dist equivalent is easy.

For speed and correctness, both the input and output files must be on a parallel file system like Lustre/GPFS.

Example command:

srun -n 80 -N 8 python3 tools/distshuf.py \
       --input /gpfs/path/to/oscar.jsonl \
       --output /gpfs/path/to/oscarshuf.jsonl \
       --seed 101

recovering from loss spikes strategies

After having a 3->8->3 spike in the loss value a few days ago, which luckily recovered after a few hours of training, we want to discuss possible ready to use strategies that we can quickly deploy should the spike not recover.

Notes from the slack so far:

Iz Beltagy:

In case the model gets stuck in one of the spikes and doesn’t come back, we can restart it from an earlier checkpoint but shuffle the data, reset the optimizer state, switch to fp32, lower lr, change optimizer params …

Stas:

Do you think we should be prepared and have a few of these options documented from the best choice to least, or deal with it if and when it happens?
e.g. I don't think we can shuffle the data, other than perhaps changing the seed?

Ryan Teehan:

I think that would be a good idea, both as a way to inform people about decisions but also for developing justifications and reasons for best practices

Iz Beltagy:

it would be great if we have these implemented and ready to be used. As for knowing which choices are more effective, that would be something we figure out empirically, and it would be one of the contributions of the project

@ibeltagy

[prefixLM] Investigate cuda kernels

Up until recently we've been using pytorch code in order to apply "scale -> mask -> softmax" in attention mechanism for prefix LM.

I've recently discovered that there exist two CUDA kernels that apply this to attention matrices:

ScaledUpperTriangMaskedSoftmax, which as the name suggests is specific to GPT-style models
ScaledMaskedSoftmax

The latter hasn't really been tested, and since I didn't have time to deep-dive into the CUDA code, I decided not to use it for the initial prefix LM. However, #151 has removed the mechanism that forced prefix LM to use the PyTorch route.

In this issue, we want to:

make sure that ScaledMaskedSoftmax supports a prefix attention mask if fed with one.
if it doesn't, implement a prefix LM CUDA kernel.
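To check the first item, one needs a reference result to compare the kernel against. Below is a pure-Python sketch of what "scale -> mask -> softmax" with a prefix mask should produce for a single 2D score matrix. The function names are hypothetical, and the convention here is that True means "may attend", which may be the inverse of what the kernel expects, so check before comparing.

```python
import math


def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when query position i may attend to key position j.

    Prefix tokens see the whole prefix (bidirectional); later tokens are causal.
    """
    return [[j <= i or j < prefix_len for j in range(seq_len)]
            for i in range(seq_len)]


def scale_mask_softmax(scores, mask, scale):
    """Reference "scale -> mask -> softmax" over each row of a 2D score matrix."""
    probs = []
    for row, allowed in zip(scores, mask):
        scaled = [s * scale if ok else -math.inf for s, ok in zip(row, allowed)]
        top = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scaled]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs
```

A kernel test could feed the same scores and mask to ScaledMaskedSoftmax and compare against this reference within a tolerance suited to fp16.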

@RezaYazdaniAminabadi, if you could lend us your expertise on this one, even just by checking the first item, that'd be great. Thank you!

[feature request] Implement sample-ids-to-text extractor

If you have been following the training, we got a few spikes in lm loss, and it'd be great to look at the input data that was fed when this happened. We suspect that the input data was either garbled or in a very different language.

Spec

So we need to write a script that can receive a range of sample ids and generate the output in format:

sample id | normal text

Possible invocation syntax:

tools/sample-ids-to-text.py --seed 111 --sample-id-range 10728784-10769744 --whatever other args are needed, path to the indexed dataset, etc.
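The argument handling for such an invocation might look like the sketch below. Only `--seed` and `--sample-id-range` come from the proposal above; the helper names are hypothetical, and treating the range as inclusive on both ends is an assumption. Loading the indexed dataset and detokenizing are left out entirely.

```python
import argparse


def parse_sample_id_range(value):
    """Turn "10728784-10769744" into the inclusive pair (10728784, 10769744)."""
    first, sep, last = value.partition("-")
    if not sep or not first.isdigit() or not last.isdigit():
        raise argparse.ArgumentTypeError(f"expected START-END, got {value!r}")
    first, last = int(first), int(last)
    if first > last:
        raise argparse.ArgumentTypeError(f"range start {first} is after end {last}")
    return first, last


def format_sample(sample_id, text):
    """One output line in the proposed "sample id | normal text" format."""
    return f"{sample_id} | {text}"


def build_parser():
    parser = argparse.ArgumentParser(description="sample ids -> text (sketch)")
    parser.add_argument("--seed", type=int, required=True)
    parser.add_argument("--sample-id-range", type=parse_sample_id_range, required=True)
    return parser
```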

@mohammad Shoeybi shared the key pointers to the code to build upon:

With respect to getting input text from iteration, we report the number of samples consumed, You can instantiate dataset (for example here) and get the sample number directly.

You can find the index files that were used for 13B and are now reused for 104B training here: https://huggingface.co/bigscience/tr1-13B-data/tree/main/indices but it'll probably be much easier to create your own tiny dataset, preprocess it, and start a tiny training so that it creates the .npy files. Working with that, all the data is under your control and it's tiny, rather than working with 1TB files.

Set up a basic MLflow setup

Replicate all the tensorboard logging in Meg-DS, plus logging hyperparams of choice.
So on the code level:

repeat tensorboard 1:1 but log using mlflow api
find new places where to log new things (e.g. hyperparams)
WGs that want to log specific events/data will add those directly to Meg-DS code base
Currently the config is just --mlflow-dir on/off toggle which will log all MLFlow events/data

example:
https://gist.github.com/tsaoyu/14e39a6d246cb29b107a2cc62a12f7a3

Blocking events:

@JetRunner setting up the MLFlow server

Issue to gather Fixes + New features to send upstream

Please edit the OP to add whatever fixes we applied to the core and which need to be propagated upstream into:

we want to do that to make it easier to sync upstream changes back to this repo.

Changes to send upstream:

Bug fixes:

9189c4e fix bug when restarting with no eval in round 1
0125aaa Fix merge functions to take in account doc_ids
9e75429 Fix Tensorboard logging with correct rank condition as detailed in this issue on Megatron-LM repo. Relevant part of the PR: 9e75429#diff-e2b248f8c422a601bcb0b7d93f96c1dff070f2694737e2b69f1def64ab9c1844R589
56c2983 - fixes issues with --rampup-batch-size help entry
c680954 Fix document offset when merging documents, a bug introduced in 0125aaa. Thanks @adammoody !
648ee17 - check whether python3-config is available and clearly assert if it isn't when failing to build helpers

New functionality:

18201ce , e8fcbae, c680954 add tools/merge_preprocessed_data.py to support merging datasets - might be easier to just copy the new script.
7b99881 - new faster preprocessing script for when one has many cpu cores.
5069622 - new preprocessing script that uses HuggingFace Datasets as source
Curriculum learning: #132 + #133

Create AWS image with our tools and codebase

to make it easy for new users to use our setup

eliminate `megatron/model/__init__.py`

megatron/model/__init__.py seems to lead to circular imports quite often when importing from other megatron libraries, so it's probably best to remove it rather than continually work around it.

This will require adapting all of the following to import each symbol from the module it actually resides in:

./pretrain_gpt.py:from megatron.model import GPTModel, GPTModelPipe
./megatron/training.py:from megatron.model import Float16Module
./megatron/training.py:from megatron.model import DistributedDataParallel as LocalDDP
./megatron/optimizer/__init__.py:from megatron.model import LayerNorm
./megatron/schedules.py:from megatron.model import DistributedDataParallel as LocalDDP
./megatron/schedules.py:from megatron.model import Float16Module
./megatron/text_generation_utils.py:from megatron.model import DistributedDataParallel as LocalDDP
./megatron/text_generation_utils.py:from megatron.model import Float16Module
./megatron/model/realm_model.py:from megatron.model import BertModel
./megatron/model/transformer.py:from megatron.model import LayerNorm
./megatron/model/bert_model.py:from megatron.model import LayerNorm
./megatron/model/gpt_model.py:from megatron.model import LayerNorm
./tasks/zeroshot_gpt/evaluate.py:from megatron.model import GPTModel
./tasks/zeroshot_gpt/evaluate.py:from megatron.model import DistributedDataParallel as LocalDDP
./tasks/zeroshot_gpt/evaluate.py:from megatron.model import Float16Module
./checkpoint-analysis.ipynb:    "from megatron.model import GPTModel\n",
./pretrain_bert.py:from megatron.model import BertModel
./pretrain_t5.py:from megatron.model import T5Model
./tools/generate_samples_gpt.py:from megatron.model import GPTModel

This is a very basic python task and requires no knowledge of Megatron or Deepspeed.

Steps:

git rm megatron/model/__init__.py
adapt all the calls listed above
run make test to ensure things still work.
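To check that no package-level import was missed while adapting the calls, a small scan can help. This is a hypothetical helper, not part of the repo. It only handles single-line imports at the start of a line, so it would miss parenthesized multi-line imports and the quoted line inside the notebook's JSON.

```python
import re

# Matches "from megatron.model import A, B as C"
# but not "from megatron.model.transformer import ...".
PACKAGE_IMPORT = re.compile(
    r"^\s*from\s+megatron\.model\s+import\s+(.+)$", re.MULTILINE
)


def package_level_imports(source):
    """Return the symbols a source string imports from megatron.model itself."""
    symbols = []
    for match in PACKAGE_IMPORT.finditer(source):
        for part in match.group(1).split(","):
            # "DistributedDataParallel as LocalDDP" -> "DistributedDataParallel"
            symbols.append(part.split(" as ")[0].strip())
    return symbols
```

Running it over each .py file's contents should return an empty list once the migration is complete.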

Thank you.

is fused layernorm really better?

pytorch/pytorch@8b87f9a#diff-f12c726e3e8cd2b4768f8984fef27059

I think we don't need to use apex fused layernorm anymore.
torch layernorm is better. what do you think about this?

cc. @stas00 @thomasw21

Try the multi-dataset setup with sampling probabilities

Multilingual models are trained on multiple datasets (one dataset per language) with a sampling probability for each dataset.

This feature (training on instances randomly sampled from multiple dataset files) seems to be supported in our Megatron code base (check here https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/run.sh#L9-L12, thanks to @sbmaruf for finding it), but we want to give it a try and make sure it is working with multiple gpus on multiple nodes. For testing, we can do the following:

create a few small datasets. Don't preprocess a huge dataset because it takes forever. All datasets can be English if it is easier but make sure each dataset is recognizable from its data (for logging)
start training a multi-node model
add logging to verify that the sampling probabilities are respected
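The verification in the last step boils down to comparing observed per-dataset fractions against the configured probabilities. The sketch below only illustrates that check with random draws; it is not how the code base blends datasets, and both function names are hypothetical.

```python
import random
from collections import Counter


def sample_dataset_ids(weights, num_samples, seed=0):
    """Draw a dataset index for each sample according to the given weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


def observed_fractions(dataset_ids, num_datasets):
    """Fraction of drawn samples that came from each dataset."""
    counts = Counter(dataset_ids)
    total = len(dataset_ids)
    return [counts[i] / total for i in range(num_datasets)]
```

In the real test the dataset ids would come from the training logs, for instance by recognizing which small dataset each logged sample belongs to, and the fractions would be compared with the configured probabilities within some tolerance.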

Once we verify this feature is working, we will swap the small testing datasets with the preprocessed multilingual datasets for the real training.

Here are the instructions for how to get started with the code-base

- install things: https://github.com/bigscience-workshop/Megatron-DeepSpeed#setup
- get and preprocess data to work with: https://github.com/bigscience-workshop/Megatron-DeepSpeed#quick-pre-processing-to-start-training-with
- train: https://github.com/bigscience-workshop/Megatron-DeepSpeed#gpt-pretraining

Train English-only baseline

[TODO: add details]

add support for wandb monitoring

low priority

auto-add tokenizer files to the converted model

our conversion to HF script:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tools/convert_checkpoint/deepspeed_to_transformers.py
doesn't do anything about the tokenizer - until now we added the right tokenizer files manually, but this doesn't work at scale.

So this ticket is to extend the above script at the point of saving the checkpoint to lookup which tokenizer is used based on these 2 args:

    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path t5-small \

for GPT2BPETokenizer it's only:

    --tokenizer-type  GPT2BPETokenizer \

and then fetch the right tokenizer files by loading the tokenizer and saving it into the folder of the checkpoint files.

So the required logic:

derive the HF model name

if args.tokenizer_type == "GPT2BPETokenizer":
    mname = "gpt2"
elif args.tokenizer_type == "PretrainedFromHF":
    mname = args.tokenizer_name_or_path
else:
    raise ValueError(f"don't know how to handle args.tokenizer_type={args.tokenizer_type}")

fetch and save the right tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(mname)
tokenizer.save_pretrained(output_state_dict)

mname in our case so far is either t5-small or gpt2

done.

I think this is all that's needed but I haven't tested any of this.

Simple Multilingual evaluation benchmark

Extend the setup from this issue #27 by adding a few "zero-shot" multilingual datasets to the evaluation.
Check table 2 here https://arxiv.org/pdf/2010.11934.pdf for potential datasets.

unit test of fused softmax kernel

NVIDIA/Megatron-LM#132

I've been looking at the Megatron code lately. When using the fused softmax kernel from Megatron, I observed that the results differed from the original torch softmax. How about adding a unit test for this kernel?

Double counts in parameter count

Currently, parameter counts in utils.get_parameters_in_billions are inaccurate when PP > 1. Tied variables, in particular embedding layers, exist in several copies in the first and last PP stage, which causes double counts. For now the codebase only uses the count without embedding layers which is accurate, but it would be good for the count with embedding layers to also function, mostly for operation-counting.
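The fix amounts to counting each tied weight once across all PP stages. A minimal sketch of that idea follows, under a made-up data layout: in the real code the per-stage parameter list and the tie information would have to come from the model itself, and `count_parameters` is a hypothetical name.

```python
def count_parameters(stages):
    """Count parameters across pipeline stages, counting tied weights once.

    `stages` is a list (one entry per PP stage) of (tie_key, numel) pairs.
    Untied parameters use tie_key=None; copies of the same tied weight share a key.
    """
    total = 0
    seen_tied = set()
    for stage in stages:
        for tie_key, numel in stage:
            if tie_key is not None:
                if tie_key in seen_tied:
                    continue  # a copy of a weight that was already counted
                seen_tied.add(tie_key)
            total += numel
    return total
```

For example, an embedding of 1000 parameters present on both the first and the last stage contributes 1000 to the total, not 2000.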

For background see: #40

Setup a relatively simple evaluation benchmark

We need an evaluation benchmark that we can use for iterating on the model design. It needs to satisfy the following requirements:

Two versions: English-only and multilingual
Zero-shot, few-shot, and many-shots: it should include at least the first two (the third is debatable)
Works on Megatron checkpoints and huggingface/transformers checkpoints. The modeling group will make a Megatron checkpoint available soon. In the meantime, you can start using huggingface/transformers checkpoints, for example, this one https://huggingface.co/gpt2
Easy to run: ideally a single script that reads the checkpoint and outputs the evaluation results
Doesn't take too long to run: anything longer than 1 day on 1 gpu is too much
Representative: it covers the key tasks (LM, QA, classification, generation .. anything else?)

Update

Assume the model is autoregressive and it uses causal language modeling
Selected datasets need to make it possible to compare with existing pretrained models (e.g. GPT3, GPT-J, mT5 ... )

[ci] failing with 4 gpus, works with 2

Need to figure out why CI fails with 4 gpus, but works fine with 2.

I set it for now to 2 gpus, until we sort this out. 391930b

FAILED tests/test_model.py::MyTestCase::test_gpt - Failed: Timeout >300.0s
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_0_base - ...
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_1_cl - Ru...
FAILED tests/test_training.py::MegDSTestTraining::test_training_prefix_lm_all

the first one hangs on startup, the others fail with:

stderr: Traceback (most recent call last):
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_gpt.py", line 246, in <module>
stderr:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 165, in pretrain
stderr: Traceback (most recent call last):
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_gpt.py", line 246, in <module>
stderr:     iteration = train(forward_step_func,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 734, in train
stderr:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr:       File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 165, in pretrain
stderr: train_step(forward_step_func,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 405, in train_step
stderr:     iteration = train(forward_step_func,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 734, in train
stderr:     loss = model[0].train_batch(data_iter=data_iterator)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 329, in train_batch
stderr:     self._exec_schedule(sched)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1313, in _exec_schedule
stderr:     train_step(forward_step_func,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 405, in train_step
stderr:     loss = model[0].train_batch(data_iter=data_iterator)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 329, in train_batch
stderr:         self._exec_instr(**cmd.kwargs)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 723, in _exec_backward_pass
stderr: self._exec_schedule(sched)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1313, in _exec_schedule
stderr:     local_part=self.grad_layer[1],
stderr: IndexError: list index out of range

Add tests & setup CI

As an extension of discussions from #47, start adding basic unit tests to ensure that new additions and implementations (e.g. GLU variants, prefix-LM masking schemes, etc) work as intended. Possibly setup CI pipelines to ensure that tests pass on each push to main. This is likely mid to low priority.

[testing] data size / dynamic downloads - test speed and repo bloat

Let's discuss which data is used in the test suite and, after the discussion, turn the outcome into guidelines for test writers.

Here is a very rough start:

We want to have the basic test suite run really fast. I have started curating tiny datasets and tokenizers sufficient for testing here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tests/data/gpt2
So the normal tests ideally should complete fast.

Currently the main overhead in testing is the really slow startup of Meg-DS

We also don't want the repo to become slow to download because of large data files. My first attempt is to generate a synthetic input on the fly (incomplete). But having a small compressed file in the repo, and uncompressing it when the test suite starts, should not be a problem.

Hopefully soon we will have a CI, where downloading large data files can slow things down.

The test suite ensures correctness, not speed, so having 1k records is more than enough, even for parallel processing.

That said, we can have extended slow tests that don't run by default and which can download large data and use it.

Comments, suggestions and ideas are super welcome.

Limit cache for GPT2Tokenizer

After some OOM errors when preprocessing large files, I noticed that the caching mechanism can use unbounded memory, since the cache can grow forever. This causes issues in particular in multiprocessing cases, where this cache is multiplied by the number of workers. This prevents working with large datasets.

Megatron-DeepSpeed/megatron/tokenizer/gpt2_tokenization.py

Line 167 in 781676b

self.cache = {}

Note: I've tried switching to the HF tokenizer, and memory usage becomes much lower since it doesn't cache.

In order to reduce memory usage, we can either:

implement a limited memory cache (LRU or something like that)
shared cache between workers
remove cache mechanism
ignore the problem since we'll use another implementation of tokenizer
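The first option could be a small bounded mapping used in place of the plain dict. A sketch under the assumption that the tokenizer only uses membership tests, item reads and item writes on `self.cache`; `LRUCache` is a hypothetical class, and a suitable `max_size` would have to be found empirically.

```python
from collections import OrderedDict


class LRUCache:
    """Dictionary-like cache that evicts the least recently used entry when full."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._data = OrderedDict()

    def __contains__(self, key):
        return key in self._data

    def __getitem__(self, key):
        value = self._data[key]       # raises KeyError on a miss, like a dict
        self._data.move_to_end(key)   # mark as most recently used
        return value

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # drop the least recently used entry

    def __len__(self):
        return len(self._data)
```

With this, `self.cache = {}` would become something like `self.cache = LRUCache(max_size)`. Each worker still holds its own cache, so the total stays bounded by the number of workers times `max_size` entries.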

cc @ontocord

def best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16

element_sizes = {
    np.uint8: 1,
    np.int8: 1,
    np.int16: 2,
    np.int32: 4,
    np.int64: 8,
    np.float: 4,
    np.double: 8
}
