megatron-deepspeed's Issues

[testing] add logging facility

As discussed elsewhere by @thomasw21, we need to be able to test specific code paths.

One way to approach this is to add:

if foobar:
    do_xyz_branch_work()  # the code path we want to exercise
    logger.debug("path XYZ was run")

and then in a test:

        with CaptureStdout() as cs:
            # after activating DEBUG logging level
            execute_subprocess_async(cmd, env=self.get_env())

        self.assertIn("path XYZ was run", cs.out)

So we need:

  1. to add a logging facility - Megatron doesn't have one - I recommend adapting https://github.com/huggingface/transformers/blob/master/src/transformers/utils/logging.py as it's well polished
  2. a way to activate the desired log levels - again this can be borrowed from HF https://github.com/huggingface/transformers/blob/234cfefbb083d2614a55f6093b0badfb2efc3b45/src/transformers/training_args.py#L153-L161 - so add these to arguments.py and then set the desired log level via the command line, with the default being either INFO or WARNING.

Once this is laid out we can then start using it for testing.
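
A minimal sketch of what the facility could look like (module path, helper name and the env-var override are my assumptions; the real version would mirror the HF module linked above):

# megatron/logging.py - hedged sketch, not the HF implementation verbatim
import logging
import os
import sys

def get_logger(name="megatron"):
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
        # default level; a --log-level arg in arguments.py or an env var would override it
        logger.setLevel(os.environ.get("MEGATRON_LOG_LEVEL", "INFO").upper())
    return logger

A test would then activate DEBUG (e.g. via the command-line arg) before launching the subprocess and assert on the captured stdout, as in the snippet above.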

Version Lock `requirements.txt`

It might be good to either specify versions or, ideally, switch to a conda environment for this repo. Critically, the base requirements for Megatron-DeepSpeed in requirements.txt just require "torch", whereas you actually need torch > 1.7.0, built with CUDA 11.1 or higher, for this repo to actually work!

Cannot import C++ compiled "helpers"

I am getting errors importing the compiled C++ helpers module (helpers.cpp) into Python in gpt_dataset.py:

            # Use C++ implementation for speed.
            # First compile and then import.
             
            from megatron.data import helpers
            assert doc_idx.dtype == np.int32
            assert sizes.dtype == np.int32
            sample_idx = helpers.build_sample_idx(sizes, doc_idx, seq_length,
                                                  num_epochs, tokens_per_epoch)
            # sample_idx = _build_sample_idx(sizes, doc_idx, seq_length,
            #                               num_epochs, tokens_per_epoch)

Leading to the following error

ImportError: cannot import name 'helpers' from 'megatron.data' (XXXX/Megatron-DeepSpeed/megatron/data/__init__.py)
Killing subprocess 28861

Update:
I managed to isolate the problem by compiling helpers.cpp separately using gcc v9.3.1. Compilation works, but `import helpers` still fails.

(venv-megatron) bash-4.2$ pwd
XXX/Megatron-DeepSpeed/megatron/data

(venv-megatron) bash-4.2$ gcc --version | head -1 
gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)

(venv-megatron) bash-4.2$ make
make: python3-config: Command not found
make: python3-config: Command not found
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/nfs/core/python/3.9/include/python3.9 -I/xxxx/bigscience/venv-megatron/lib/python3.9/site-packages/pybind11/include helpers.cpp -o helpers

(venv-megatron) bash-4.2$ ipython
Python 3.9.4 (default, Apr  7 2021, 12:46:00)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import helpers
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-d249c8495052> in <module>
----> 1 import helpers

ModuleNotFoundError: No module named 'helpers'

In [2]:

Posted the same issue on the original repo: NVIDIA/Megatron-LM#143
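
The `python3-config: Command not found` lines in the make output above suggest the shared object was written out without the Python extension suffix (`-o helpers` rather than `helpers.cpython-39-....so`), which would explain why it can't be imported. A small check along those lines (this is an assumption about the cause, not a confirmed fix):

# run from megatron/data after `make`
import pathlib
import sysconfig

suffix = sysconfig.get_config_var("EXT_SUFFIX")  # e.g. '.cpython-39-x86_64-linux-gnu.so'
print("expected extension suffix:", suffix)
print("helpers files present:", [p.name for p in pathlib.Path(".").glob("helpers*")])
# if only a bare `helpers` file exists, rebuilding with python3-config on PATH
# (so the Makefile can append the suffix) should make `import helpers` work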

Import issues when using evaluation scripts: `module 'megatron' has no attribute 'model'`

Hello everyone,

There are several scripts that can be used for evaluation, which can be found here:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tasks

However, running these scripts raises the following error:

$ python tasks/main.py --task WIKITEXT103 --num-layers 4 --hidden-size 256 --num-attention-heads 4 --seq-length 96 --max-position-embeddings 96 --fp16 --valid-data /gpfswork/rech/rcy/ulz63oj/datasets/oscar_shuff/en/en_shuf_part_1295.txt --tokenizer-type PretrainedFromHFTokenizers --tokenizer-name-or-path /gpfswork/rech/rcy/ulz63oj/tokenizers/bpe_en_5.0_30000.json --load /gpfsscratch/rech/rcy/ulz63oj/GPT_L4_A4_H256/bpe_en_5.0_30000 --micro-batch-size 8 --checkpoint-activations --log-interval 10 --no-load-optim --no-load-rng
Traceback (most recent call last):
  File "tasks/main.py", line 70, in <module>
    initialize_megatron(extra_args_provider=get_tasks_args)
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/initialize.py", line 53, in initialize_megatron
    set_global_variables(extra_args_provider=extra_args_provider,
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/global_vars.py", line 88, in set_global_variables
    args = _parse_args(extra_args_provider=extra_args_provider,
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/global_vars.py", line 105, in _parse_args
    _GLOBAL_ARGS = parse_args(extra_args_provider=extra_args_provider,
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/arguments.py", line 35, in parse_args
    parser = _add_network_size_args(parser)
  File "/gpfsdswork/projects/rech/six/uty16tp/code/big_science/Megatron-DeepSpeed/megatron/arguments.py", line 318, in _add_network_size_args
    choices=megatron.model.glu_activations.GLU_ACTIVATIONS.keys(),
AttributeError: module 'megatron' has no attribute 'model'

This error does not appear when running pretrain_gpt.py.

@thomasw21 found a solution, which is to add `from . import model` in
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/__init__.py
and set up Megatron using `pip install -e .`, but there may be a better solution, and we do not understand why `model` is not an attribute of `megatron` after setup.
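
For reference, the workaround amounts to a one-line change (a sketch of what was suggested, not necessarily the final fix):

# megatron/__init__.py - append this so that `megatron.model` is resolvable
# as an attribute after a bare `import megatron`, as arguments.py relies on
from . import model  # noqa: F401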

To reproduce, find a pretrained checkpoint and some text data and run the script provided by NVIDIA :

TASK="WIKITEXT103"

VALID_DATA=/path/to/valid/data
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=/path/to/checkpoint

COMMON_TASK_ARGS=" \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --fp16 \
    --vocab-file $VOCAB_FILE"

python tasks/main.py \
    --task $TASK \
    $COMMON_TASK_ARGS \
    --valid-data $VALID_DATA \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file $MERGE_FILE \
    --load $CHECKPOINT_PATH \
    --micro-batch-size 8 \
    --checkpoint-activations \
    --log-interval 10 \
    --no-load-optim \
    --no-load-rng 

adding consistency calculations/checks at init time

Stella pointed out how they do consistency calculations/checks in NeoX:
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/neox_arguments/arguments.py

It'd be good for someone to study what they added on top of the base Megatron-LM and replicate anything that can help our work, since good checks can save days of running a model under a wrong setup while thinking it's doing something else.

I haven't studied what they did, so I don't have any specific suggestions here.
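
For illustration, the kind of check meant here looks something like this (these two examples are my own assumptions, not checks taken from NeoX):

def check_args(args):
    assert args.hidden_size % args.num_attention_heads == 0, \
        "hidden_size must be divisible by num_attention_heads"
    assert args.global_batch_size % (args.micro_batch_size * args.data_parallel_size) == 0, \
        "global_batch_size must be a multiple of micro_batch_size * data_parallel_size"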

Thank you.

clone HF's `GPT2` to create `GPTMeg` with a few tiny changes.

As can be seen from #121, we have a divergence between Meg and HF GPT2 even when using the same weights under fp16.

So the proposed solution to enable users to use BigScience-pretrained models is to create a new architecture, which would be an identical clone of HF's GPT2, but with some changes.

Here are 3 changes:

def apply_overrides():

    # 1. layer norm needs to be done in fp32 and then cast back to fp16 to match meg.
    torch_layer_norm_orig = torch.layer_norm
    def torch_layer_norm_force_fp32(input, normalized_shape, weight, bias, eps, cudnn_enable):
        return torch_layer_norm_orig(input.float(), normalized_shape, weight.float(), bias.float(), eps, torch.backends.cudnn.enabled).half()
    torch.layer_norm = torch_layer_norm_force_fp32


    # 2. MLP uses a slightly different activation function with a custom bwd
    import transformers.activations
    @torch.jit.script
    def gelu_megatron_fwd(x):
        return  x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))

    @torch.jit.script
    def gelu_megatron_bwd(g, x):
        tanh_out = torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x))
        # sqrt(2/pi) * 3 * 0.044715 -> 0.1070322243
        ff = 0.5 * x * ((1 - tanh_out * tanh_out) * (0.79788456 + 0.1070322243 * x * x)) + 0.5 * (1 + tanh_out)
        return ff*g

    class GeLUFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, input):
            ctx.save_for_backward(input)
            return gelu_megatron_fwd(input)

        @staticmethod
        def backward(ctx, grad_output):
            input, = ctx.saved_tensors  # saved_tensors is a tuple -- unpack the single tensor
            return gelu_megatron_bwd(grad_output, input)

    transformers.activations.gelu_fast = GeLUFunction.apply
    transformers.activations.ACT2FN["gelu_fast"] = transformers.activations.gelu_fast


    # 3. torch.baddbmm() (meg) produces slightly different results than torch.matmul, so override to use `torch.baddbmm`
    import transformers.models.gpt2.modeling_gpt2
    from torch import nn
    def new_attn(self, query, key, value, attention_mask=None, head_mask=None):
        output_size = (query.size(0), key.size(1), query.size(2), key.size(2))
        matmul_result = torch.empty(output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query.dtype, device=query.device)

        factor = float(value.size(-1)) ** 0.5
        matmul_result = torch.baddbmm(
            matmul_result,
            query.reshape(-1, query.shape[2], query.shape[3]),  # [b * np, sq, hn]
            key.reshape(-1, query.shape[2], query.shape[3]).transpose(1, 2),  # [b * np, hn, sk]
            beta=0.0,
            alpha=1.0 / factor
        )
        attn_weights = matmul_result.view(*output_size)

        # attn_weights = torch.matmul(query, key.transpose(-1, -2))
        #
        # if self.scale_attn_weights:
        #     attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)

        # Layer-wise attention scaling
        if self.scale_attn_by_inverse_layer_idx:
            attn_weights = attn_weights / float(self.layer_idx + 1)

        if not self.is_cross_attention:
            # if only "normal" attention layer implements causal mask
            query_length, key_length = query.size(-2), key.size(-2)
            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
            attn_weights = torch.where(causal_mask, attn_weights, self.masked_bias.to(attn_weights.dtype))

        if attention_mask is not None:
            # Apply the attention mask
            attn_weights = attn_weights + attention_mask

        attn_weights = nn.Softmax(dim=-1)(attn_weights)

        # Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op otherwise
        attn_weights = attn_weights.type(value.dtype)
        attn_weights = self.attn_dropout(attn_weights)

        # Mask heads if we want to
        if head_mask is not None:
            attn_weights = attn_weights * head_mask

        attn_output = torch.matmul(attn_weights, value)

        return attn_output, attn_weights

    transformers.models.gpt2.modeling_gpt2.GPT2Attention._attn = new_attn
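
Usage would look roughly like this (a sketch; apply_overrides() is the function above and the model name is just an example):

from transformers import GPT2LMHeadModel

apply_overrides()  # patch layer norm, GeLU and attention before the model is instantiated
model = GPT2LMHeadModel.from_pretrained("gpt2").half().cuda().eval()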

Here is how we are going to tackle the activation function: huggingface/transformers#13997

So a PR will need to be filed with https://github.com/huggingface/transformers/.

extract and log grad norm for individual layers

the paper NormFormer: Improved Transformer Pretraining with Extra Normalization https://arxiv.org/abs/2110.09456 suggests that under preLN:

gradients at earlier layers tend to be larger than gradients at later layers

We want to verify that this holds in our case before acting on it and potentially integrating NormFormer.

So we need to expand the tensorboard and text logs to log the grad norm of individual layers; as in the paper, we could log 5 layers: 0, 1, int(n_layers/2), -2, -1.

For reference see p6 in the paper (graphs and discussion).

Currently only the aggregate L2 grad norm across all layers is calculated and logged (a single number).


Performance-wise this will require doing some extra calculations, but not much - it's really just figuring out how to broadcast this info to TB so that multiple points can be logged at once.
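
A sketch of what this could look like at the logging call site (writer is the existing TensorBoard writer; restricting it to just the 5 layers above is left out):

def log_layer_grad_norms(writer, model, iteration):
    # log the L2 grad norm of each parameter under its own tag
    for name, param in model.named_parameters():
        if param.grad is not None:
            writer.add_scalar(f"grad-norm/{name}", param.grad.data.norm(2).item(), iteration)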

Additionally, we could activate this tool on demand - e.g. after encountering a spike we could roll back to the last good checkpoint and run a cycle with this debug feature enabled.

DeBERTa-like attention mechanism

In this issue, we discuss how viable/interesting it might be to implement a DeBERTa-like attention mechanism:

https://arxiv.org/abs/2006.03654

Things to take in account:

  • performance gains: check with an HF pretrained model first?
  • implementation cost: how much time would someone need to spend implementing that feature?
  • implementation feasibility: it might not work well with the Megatron-DeepSpeed setup; we need to check that.

Implement the ML Flow experiment tracker

Motivation. As @sashavor suggested, the carbon footprint working group needs an experiment tracker to properly follow all runs being done. An experiment tracker could also be more broadly useful to centralise all experiments in one place.

Proposed solution. Following a discussion with @thomwolf, the carbon WG has identified MLFlow as a promising open-source option: it supports having a dedicated server, and can interface with TensorBoard logs (which we already produce). This blog post shows how it integrates with Tensorflow. There is also documentation on how to interface with PyTorch models.

Implementation. It's not quite clear how nicely MLFlow will play with Megatron/DeepSpeed plus the limited networking on Jean Zay. The goal here is to first build a proof of concept showing MLFlow can integrate into our codebase and report back to the centralised server from Jean Zay. We can then consider the finer details of reporting all metrics of interest to us.

Add Megatron support for the EleutherAI Evaluation Harness

Add the ability to run the EleutherAI Evaluation Harness on Megatron checkpoints. Right now we are relying on converting Megatron checkpoints to Hugging Face checkpoints, which is an error-prone process. We also have to use Megatron anyway to run the 200B model.

Implementation details:
You will use this HF gpt2 model implementation here as your reference (a rough skeleton follows the list below). Here are more details:

  • Edit __init__ and create_from_arg_string to load the Megatron checkpoints
  • Edit the _model_call function to call the Megatron model and read logits back
  • The functions loglikelihood, loglikelihoods, _loglikelihood_tokens might (or might not) require a little tweaking
  • Leave the function greedy_until unimplemented (raise an exception), we don't need it for now.
  • Check this test that shows how to load and call a Megatron checkpoint.
  • Here's one Megatron checkpoint that you can work with.
  • A relatively close implementation is already in the GPT-NeoX repo here and it might be helpful to check as well.
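
A rough skeleton (the harness-facing names follow the gpt2 adapter referenced above; everything Megatron-specific, such as load_megatron_checkpoint, is a placeholder, and the other abstract methods of the base class are omitted):

from lm_eval.base import BaseLM

class MegatronLM(BaseLM):
    def __init__(self, checkpoint_path, batch_size=8):
        super().__init__()
        self.model = load_megatron_checkpoint(checkpoint_path)  # hypothetical helper
        self._batch_size = batch_size

    def _model_call(self, inps):
        # inps: [batch, seq] token ids -> [batch, seq, vocab] logits from the Megatron model
        return self.model(inps)

    def greedy_until(self, requests):
        raise NotImplementedError("not needed for now")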

Add checks to confirm that the checkpoint conversion script works perfectly correct

We now have a script that converts Megatron-DeepSpeed checkpoints to HF transformers checkpoints.
The project is here and the script is here.
However, the script doesn't have unit tests confirming that the conversion is correct.

The goal of this issue is to add such tests. The idea is to run the forward pass of both models (before and after conversion) with the same random input, then use torch.allclose to assert that the output loss and the logits of both models match.
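
A hedged sketch of such a test (how the two models are loaded is elided, and the tolerance may need loosening for fp16 checkpoints):

import torch

def test_conversion_matches(meg_model, hf_model, vocab_size, seq_len=16):
    meg_model.eval()
    hf_model.eval()
    input_ids = torch.randint(0, vocab_size, (2, seq_len))
    with torch.no_grad():
        meg_logits = meg_model(input_ids)          # assumption: returns [batch, seq, vocab] logits
        hf_logits = hf_model(input_ids).logits
    assert torch.allclose(meg_logits, hf_logits, atol=1e-5)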

Consistency in code convention

As discussed in #128, maintain code style consistency by setting up some basic formatters.

  • Makefile
  • pyproject.toml
  • setup.cfg

Benchmark HF transformers as a point of reference.

Calling IndexedDatasetBuilder directly with a best_fit datatype fails

I don't know whether this is intended to work or not, but I found that the following program:

from megatron.data.indexed_dataset import IndexedDatasetBuilder, best_fitting_dtype

best_dtype = best_fitting_dtype(10_000)
IndexedDatasetBuilder("testfile", dtype=best_dtype)

leads to an error like:

  File "/path/to/Megatron-DeepSpeed.git/megatron/data/indexed_dataset.py", line 284, in __init__
    self.element_size = self.element_sizes[self.dtype]
KeyError: <class 'numpy.uint16'>

This shows up because best_fitting_dtype will return numpy.uint16 for small vocabs:

def best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16

but that particular type is missing from the element_sizes table.

element_sizes = {
    np.uint8: 1,
    np.int8: 1,
    np.int16: 2,
    np.int32: 4,
    np.int64: 8,
    np.float: 4,
    np.double: 8
}

which is used in the IndexedDatasetBuilder constructor here:

self.element_size = self.element_sizes[self.dtype]

Should something like this work?

If so, it seems like either the best_fitting_dtype function should return a different type, or uint16 should be added to the element_sizes table, as in f706108.
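
A sketch of the latter option (mirroring what f706108 appears to do):

element_sizes = {
    np.uint8: 1,
    np.int8: 1,
    np.uint16: 2,   # added so the dtype returned by best_fitting_dtype() is accepted
    np.int16: 2,
    np.int32: 4,
    np.int64: 8,
    np.float: 4,
    np.double: 8
}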

Implement prefix-lm as in the T5 paper

AFAIU, the current implementation, which uses the first 50% of the sequence as the prefix, doesn't implement prefix-lm as in T5. The T5 paper samples the prefix length randomly from 0 to max_sequence_length (which is 2048 in our case).
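
A minimal sketch of the T5-style sampling (where exactly the attention-mask code would consume this is left out):

import torch

def sample_prefix_length(seq_length, generator=None):
    # prefix length drawn uniformly from [0, seq_length], per the T5 paper
    return int(torch.randint(0, seq_length + 1, (1,), generator=generator).item())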

Need model size dumped at init

We need the total model size dumped as a diagnostic during framework init. Currently we get a report per rank but not the total:

 > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (2, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 1986465792
 > number of parameters on (tensor, pipeline) model parallel rank (3, 7): 1986498560

Later on, the ZeRO engine does dump the right number, buried among multiple other numbers and repeated on each rank:

[2021-10-02 16:08:53,028] [INFO] [engine.py:134:__init__] RANK=0 STAGE=0 LAYERS=7 [0, 7) STAGE_PARAMS=1986465792 (1986.466M) TOTAL_PARAMS=57778896896 (57778.897M) UNIQUE_PARAMS=56814206976 (56814.207M)

But ideally we just want a print like:

Model size: 57B (57778896896 params)

Just on rank 0.
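
A hedged sketch of producing that line (naively summing per-rank counts double-counts tied parameters across pipeline stages, hence the caveat inline):

import torch
import torch.distributed as dist

def report_model_size(model):
    n = torch.tensor(sum(p.numel() for p in model.parameters()),
                     dtype=torch.long, device="cuda")
    dist.all_reduce(n, op=dist.ReduceOp.SUM)  # caveat: tied embedding copies are counted twice
    if dist.get_rank() == 0:
        total = n.item()
        print(f"Model size: {total / 1e9:.0f}B ({total} params)")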

Thanks.

Plot scaling laws of our baseline models

For our three baselines on different datasets (OSCAR, C4, The Pile), we would like to plot scaling laws and retrieve their coefficients. Specifically, we are looking to reproduce Figure 1 of Scaling Laws for Neural Language Models.

The TensorBoard data for the baseline runs can be retrieved from the Big Science space on HuggingFace: it's the tr3 runs with tensorboard in their name. The naming scheme (tr3b, tr3c, etc.) is explained here.
For C4, we have an XL, L, and M model (tr3, tr3c, tr3c) with short warm-up. For OSCAR and The Pile, we have an XL, L, M, and S model (tr3d, tr3g, tr3h, tr3i and tr3, tr3j, tr3k, tr3l). For OSCAR, we should also add the 13B run to see if the fits hold (that's tr1-13B).
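
A hedged sketch of the fit itself (the compute/loss arrays are placeholder values; the real ones would come from the TensorBoard exports):

import numpy as np

compute = np.array([1e-3, 3e-3, 1e-2, 3e-2])  # placeholder: compute per run, e.g. PF-days
loss = np.array([5.2, 4.8, 4.5, 4.2])         # placeholder: final validation lm loss

# L(C) = (C_c / C) ** alpha_C  <=>  log L = alpha_C * log C_c - alpha_C * log C
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha_C = -slope
C_c = np.exp(intercept / alpha_C)
print(f"alpha_C = {alpha_C:.3g}, C_c = {C_c:.3g}")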

`PretrainedFromHF` crashes on 1 gpu

When using PretrainedFromHF it seems to require at least 2 gpus, and specifically TP=2; with 1 gpu it crashes.

To reproduce:

CHECKPOINT_PATH=checkpoints/gpt2
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt
DATA_PATH=data/meg-gpt2_oscar-combined_text_document
TENSORBOARD_PATH=output_dir/tensorboard

TP_SIZE=1
PP_SIZE=1
N_GPUS=1

NLAYERS=24
NHIDDEN=1024
NHEADS=16
FFN_HIDDEN_SIZE=4096
SEQ_LEN=2048


GPT_ARGS=" \
    --exit-interval 10 \
    --num-layers $NLAYERS \
    --hidden-size $NHIDDEN \
    --num-attention-heads $NHEADS \
    --seq-length $SEQ_LEN \
    --max-position-embeddings $SEQ_LEN  \
    --micro-batch-size 2 \
    --rampup-batch-size 4 4 1_000 \
    --global-batch-size 16 \
    --train-samples 100 \
    --optimizer adam \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-8 \
    --lr 1e-4 \
    --lr-warmup-samples 5 \
    --clip-grad 1.0 \
    --weight-decay 1e-1 \
    --fp16 \
    --tensor-model-parallel-size $TP_SIZE \
    --pipeline-model-parallel-size $PP_SIZE \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path t5-small \
    "

OUTPUT_ARGS=" \
    --log-interval 10 \
    --save-interval $SAVE_INTERVAL \
    --eval-interval 100 \
    --eval-iters 10 \
    --checkpoint-activations \
    "

DATA_ARGS=" \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path $DATA_PATH \
    --tensorboard-dir $TENSORBOARD_PATH \
    --tensorboard-queue-size 5 \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    "

ALL_ARGS="$GPT_ARGS $OUTPUT_ARGS $DATA_ARGS"

# if you can't stand pt-1.9 launcher noise
export LOGLEVEL=WARNING

LAUNCHER="deepspeed --num_gpus $N_GPUS"

CMD="$LAUNCHER pretrain_gpt.py $ALL_ARGS"

echo $CMD

#rm -rf $CHECKPOINT_PATH
$CMD

It crashes w/ or w/o deepspeed enabled; the above script is w/o deepspeed.

The log:

[before the start of training step] datetime: 2021-09-08 17:38:58 
[2021-09-08 17:38:58,791] [INFO] [checkpointing.py:408:forward] Activation Checkpointing Information
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:409:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:412:forward] ----contiguous Memory Checkpointing False with 2 total layers
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:415:forward] ----Synchronization False
[2021-09-08 17:38:58,792] [INFO] [checkpointing.py:416:forward] ----Profiling time in checkpointing False
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [159,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [159,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [159,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

[...] some 10000 of these [...]

/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [339,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [339,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "pretrain_gpt.py", line 222, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/training.py", line 147, in pretrain
    iteration = train(forward_step_func,
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/training.py", line 686, in train
    train_step(forward_step_func,
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/training.py", line 387, in train_step
    loss = model[0].train_batch(data_iter=data_iterator)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 291, in train_batch
    self._exec_schedule(sched)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 1237, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 587, in _exec_forward_pass
    outputs = super().forward(inputs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/engine.py", line 1164, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 347, in forward
    x = self.activation_checkpoint_func(
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 690, in checkpoint
    CheckpointFunction.apply(function, all_outputs, *args)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 494, in forward
    outputs = run_function(*inputs_cuda)
  File "/mnt/nvme1/code/github/00optimize/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 325, in exec_func
    inputs = layer(inputs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/model/transformer.py", line 585, in forward
    return super().forward(hidden_states, attention_mask, **kwargs)
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/model/transformer.py", line 477, in forward
    self.self_attention(layernorm_output,
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/model/transformer.py", line 215, in forward
    mixed_x_layer, _ = self.query_key_value(hidden_states)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/Megatron-DeepSpeed-master/megatron/mpu/layers.py", line 289, in forward
    output_parallel = F.linear(input_parallel, self.weight, bias)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1623448278899/work/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f99385a7a22 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10ac3 (0x7f9938809ac3 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f993880b167 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f99385915a4 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa249ba (0x7f99b1e819ba in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa24a51 (0x7f99b1e81a51 in /home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1932c6 (0x55f7dd66d2c6 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #7: <unknown function> + 0x1592ac (0x55f7dd6332ac in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #8: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #9: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #10: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #11: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #12: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #13: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #14: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #15: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #16: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #17: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #18: <unknown function> + 0x1593ab (0x55f7dd6333ab in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #19: <unknown function> + 0x158e77 (0x55f7dd632e77 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #20: <unknown function> + 0x1590dc (0x55f7dd6330dc in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #21: <unknown function> + 0x1e3643 (0x55f7dd6bd643 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #22: <unknown function> + 0x159109 (0x55f7dd633109 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #23: <unknown function> + 0x1e3643 (0x55f7dd6bd643 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #24: <unknown function> + 0x176057 (0x55f7dd650057 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #25: _PyModule_ClearDict + 0x103 (0x55f7dd687d53 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #26: PyImport_Cleanup + 0x578 (0x55f7dd6aff88 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #27: Py_FinalizeEx + 0x79 (0x55f7dd6e1a49 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #28: Py_RunMain + 0x183 (0x55f7dd6e3893 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #29: Py_BytesMain + 0x39 (0x55f7dd6e3ca9 in /home/stas/anaconda3/envs/py38-pt19/bin/python)
frame #30: __libc_start_main + 0xf3 (0x7f99e84330b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #31: <unknown function> + 0x1e21c7 (0x55f7dd6bc1c7 in /home/stas/anaconda3/envs/py38-pt19/bin/python)

It works if I use 2 gpus and TP=2.

Originally discovered by @tjruwase and I was able to reproduce it too.

We were trying to use the 350m checkpoint you gave us to test the checkpoint conversion tool, but we were unable to, since the final checkpoint needs to be a single file - i.e. produced on 1 gpu.

I also see PretrainedFromHF is not being tested. So probably a test is needed as well.

Thank you.

@TevenLeScao

How to load a tensor+pipeline parallel checkpoint for inference tasks?

Following the training script here as a template:

https://github.com/bigscience-workshop/bigscience/blob/master/train/tr1-13B-base/tr1-13B-round1.slurm

I've trained some models using 2-way tensor parallelism and 4-way pipeline parallelism, which produces a number of checkpoints in directories like "global_step26000".

I'm now trying to use one of those trained checkpoints to do inference. In particular, I'm trying to work with modified versions of these scripts where I can provide new strings that are tokenized and processed on the fly:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/examples/generate_text.sh
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tools/generate_samples_gpt.py

I've tried a number of different approaches, but no luck so far. I also can't seem to find any instructions written up on that.

Anyone know the steps to load one of those checkpoints for inference on new data?

update converted models to include tokenizer files

Update the released model files to include:

  1. the correct tokenizer files (t5-small or gpt2)
  2. fill out the config.tokenizer_class

HUB:

GCS:

  • gs://bigscience-backups/tr1-13B/

This will be automated for the future by #126 - but for now do it manually since it's just 2 sets of files.

group the tensorboard logs

It has been proposed to group the tensorboard logs by adding a group prefix to the log key, e.g.:

tb.add_scalar("group a/batch size", batch_size, iteration)

Currently proposed groups:

Batch-size
- Batch-size
- Batch-size vs samples
Grad-norm
- Grad norm
- Grad norm vs samples
Learning rate
- Learning rate
- Learning rate vs samples
Lm loss train
- Lm loss
- Lm loss vs samples
Lm loss validation
- Lm loss
- Lm loss vs samples
- Lm loss ppl
- Lm loss ppl vs samples
Loss scale
- Loss scale
- Loss scale vs samples
Num zeros
- Num zeros
- Num zeros vs samples

The question is whether to hardcode these, or have a configurable map, should different trainings want different groups?
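
A configurable map could be as simple as this (a sketch, with the group names as proposed above):

TB_GROUPS = {
    "batch size": "Batch-size",
    "grad norm": "Grad-norm",
    "learning rate": "Learning rate",
    "lm loss": "Lm loss train",
}

def add_scalar_grouped(writer, key, value, iteration, groups=TB_GROUPS):
    group = groups.get(key)
    tag = f"{group}/{key}" if group else key
    writer.add_scalar(tag, value, iteration)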

@VictorSanh, @slippylolo, @ibeltagy

Distributed terashuf

I don't know how much time I will have to polish this up, but I have a prototype MPI-enabled "terashuf". This uses an algorithm similar to terashuf, where contiguous records from the source file are shuffled in segments, and then one randomly picks leading records from the shuffled segments to put together the final shuffled file.

This prototype currently stores the shuffled segments in memory (rather than files), and so it requires one to be able to load the full file into distributed memory. Currently each rank reads a portion of the source file into memory, shuffles that section, and then ranks exchange lines with each other in order to write out the file in contiguous chunks.

It can shuffle the oscar.jsonl file in about 10 minutes using 80 procs on 8 nodes on my system.

2021-09-01T12:19:15: 0: Wrote 1319979521428 of 1320971843503 bytes (99.92%) in 343 secs, 3668.675 MB/s
2021-09-01T12:19:20: 0: Waiting for ranks to finish ...
2021-09-01T12:19:20: Seconds to write file: 348.45524168014526

real	6m25.041s

Just posting this notice in case others need to shuffle a large JSON file in a hurry.

https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/tools/distshuf.py

It currently requires mpi4py and an mpi4py enabled DistData class.

https://github.com/adammoody/Megatron-DeepSpeed/blob/distshuf/megatron/data/distdata_mpi.py

I first attempted a torch.distributed version, but hit some problems. I haven't yet gone back to see if a torch.dist equivalent is easy.

For speed and correctness, both the input and output files must be on a parallel file system like Lustre/GPFS.

Example command:

srun -n 80 -N 8 python3 tools/distshuf.py \
       --input /gpfs/path/to/oscar.jsonl \
       --output /gpfs/path/to/oscarshuf.jsonl \
       --seed 101

recovering from loss spikes strategies

After having a 3->8->3 spike in the loss value a few days ago, which luckily recovered after a few hours of training, we want to discuss possible ready-to-use strategies that we can quickly deploy should a spike not recover.

Notes from the slack so far:

Iz Beltagy:

In case the model gets stuck in one of the spikes and doesn’t come back, we can restart it from an earlier checkpoint but shuffle the data, reset the optimizer state, switch to fp32, lower lr, change optimizer params …

Stas:

Do you think we should be prepared and have a few of these options documented from the best choice to least, or deal with it if and when it happens?
e.g. I don't think we can shuffle the data, other than perhaps changing the seed?

Ryan Teehan:

I think that would be a good idea, both as a way to inform people about decisions but also for developing justifications and reasons for best practices

Iz Beltagy:

it would be great if we have these implemented and ready to be used. As for knowing which choices are more effective, that would be something we figure out empirically, and it would be one of the contributions of the project

@ibeltagy

[prefixLM] Investigate cuda kernels

Up until recently we've been using pytorch code to apply "scale -> mask -> softmax" in the attention mechanism for prefix LM.

I've recently discovered that there exist two cuda kernels for this on attention matrices:

  • ScaledUpperTriangMaskedSoftmax, which, as the naming suggests, is specific to GPT-style (causal) models
  • ScaledMaskedSoftmax

The latter one hasn't really been tested, and since I didn't have time to deep-dive into the cuda code, I decided not to use it for the initial prefix lm. However, #151 has removed the mechanism to force prefix lm to use the pytorch route.

In this issue, we want to:

  • make sure that ScaledMaskedSoftmax supports a prefix attention mask if fed one.
  • if it doesn't, implement a prefix lm cuda kernel.

@RezaYazdaniAminabadi, if you could provide us with your expertise on that one - maybe just checking the first item - that'd be great. Thank you!

[feature request] Implement sample-ids-to-text extractor

If you have been following the training, you know we got a few spikes in lm loss, and it'd be great to look at the input data at the iterations where this happened. We suspect the input data was either garbled or in a very different language.


Spec

So we need to write a script that can receive a range of sample ids and generate output in the format:

sample id | normal text 

Possible invocation syntax:

tools/sample-ids-to-text.py --seed 111 --sample-id-range 10728784-10769744 --whatever other args are needed, path to the indexed dataset, etc.

@mohammad Shoeybi shared the key pointers to the code to build upon:

With respect to getting the input text for an iteration: we report the number of samples consumed. You can instantiate the dataset (for example here) and get the sample by its number directly.


You can find the index files that were used for 13B and are now reused for the 104B training here: https://huggingface.co/bigscience/tr1-13B-data/tree/main/indices but it'll probably be much easier to create your own tiny dataset, preprocess it, start a tiny training so that it creates the .npy files, and work with that, so that all the data is under your control and it's tiny, rather than working with 1TB files.
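
A hedged sketch of the core loop (building the dataset and tokenizer is elided and would follow the pretraining code referenced above):

def dump_samples(dataset, tokenizer, start_id, end_id):
    # assumption: the GPT dataset's __getitem__ returns {'text': token ids}
    for sample_id in range(start_id, end_id + 1):
        tokens = dataset[sample_id]["text"]
        text = tokenizer.detokenize(tokens.tolist())  # Megatron tokenizers expose detokenize()
        print(f"{sample_id} | {text}")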

Set up a basic MLflow setup

Replicate all the tensorboard logging in Meg-DS, plus logging hyperparams of choice.
So on the code level:

  1. repeat tensorboard 1:1 but log using mlflow api
  2. find new places to log new things (e.g. hyperparams)
  3. WGs that want to log specific events/data will add those directly to the Meg-DS code base
  4. currently the config is just a --mlflow-dir on/off toggle which will log all MLFlow events/data

example:
https://gist.github.com/tsaoyu/14e39a6d246cb29b107a2cc62a12f7a3
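
A minimal sketch of steps 1-2 (assumes mlflow is installed and that --mlflow-dir from step 4 holds the tracking location; which hyperparams to log is just an example):

import mlflow

def mlflow_init(args):
    mlflow.set_tracking_uri(args.mlflow_dir)
    mlflow.start_run()
    # step 2: log hyperparameters of choice once at startup
    mlflow.log_params({"global_batch_size": args.global_batch_size,
                       "lr": args.lr,
                       "seq_length": args.seq_length})

def mlflow_log_scalar(key, value, iteration):
    # step 1: mirror every writer.add_scalar(key, value, iteration) call site
    mlflow.log_metric(key, value, step=iteration)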

Blocking events:

Issue to gather Fixes + New features to send upstream

Please edit the OP to add whatever fixes we have applied to the core that need to be propagated upstream into:

  1. https://github.com/microsoft/Megatron-DeepSpeed
  2. https://github.com/NVIDIA/Megatron-LM

We want to do that to make it easier to sync upstream changes back into this repo.

Changes to send upstream:

Bug fixes:

New functionality:

  • 18201ce , e8fcbae, c680954 add tools/merge_preprocessed_data.py to support merging datasets - might be easier to just copy the new script.
  • 7b99881 - new faster preprocessing script for when one has many cpu cores.
  • 5069622 - new preprocessing script that uses HuggingFace Datasets as source
  • Curriculum learning: #132 + #133

eliminate `megatron/model/__init__.py`

megatron/model/__init__.py seems to lead to circular imports quite often when importing from other megatron modules, so it's probably best to remove it rather than continually working around it.

This will require adapting all of these to import each symbol from the module it resides in (see the example after the list):

./pretrain_gpt.py:from megatron.model import GPTModel, GPTModelPipe
./megatron/training.py:from megatron.model import Float16Module
./megatron/training.py:from megatron.model import DistributedDataParallel as LocalDDP
./megatron/optimizer/__init__.py:from megatron.model import LayerNorm
./megatron/schedules.py:from megatron.model import DistributedDataParallel as LocalDDP
./megatron/schedules.py:from megatron.model import Float16Module
./megatron/text_generation_utils.py:from megatron.model import DistributedDataParallel as LocalDDP
./megatron/text_generation_utils.py:from megatron.model import Float16Module
./megatron/model/realm_model.py:from megatron.model import BertModel
./megatron/model/transformer.py:from megatron.model import LayerNorm
./megatron/model/bert_model.py:from megatron.model import LayerNorm
./megatron/model/gpt_model.py:from megatron.model import LayerNorm
./tasks/zeroshot_gpt/evaluate.py:from megatron.model import GPTModel
./tasks/zeroshot_gpt/evaluate.py:from megatron.model import DistributedDataParallel as LocalDDP
./tasks/zeroshot_gpt/evaluate.py:from megatron.model import Float16Module
./checkpoint-analysis.ipynb:    "from megatron.model import GPTModel\n",
./pretrain_bert.py:from megatron.model import BertModel
./pretrain_t5.py:from megatron.model import T5Model
./tools/generate_samples_gpt.py:from megatron.model import GPTModel
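
For example, the first entry becomes (a sketch; the target module is where the symbols are defined upstream):

# before
from megatron.model import GPTModel, GPTModelPipe
# after
from megatron.model.gpt_model import GPTModel, GPTModelPipe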

This is a very basic python task and requires no knowledge of Megatron or Deepspeed.

Steps:

  1. git rm megatron/model/__init__.py
  2. adapt all the calls listed above
  3. run make test to ensure things still work.

Thank you.

Try the multi-dataset setup with sampling probabilities

Multilingual models are trained on multiple datasets (one dataset per language) with a sampling probability for each dataset.

This feature (training on instances randomly sampled from multiple dataset files) seems to be supported in our Megatron code base (check here https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/run.sh#L9-L12, thanks to @sbmaruf for finding it), but we want to give it a try and make sure it is working with multiple gpus on multiple nodes. For testing, we can do the following:

  • create a few small datasets. Don't preprocess a huge dataset because it takes forever. All datasets can be English if it is easier but make sure each dataset is recognizable from its data (for logging)
  • start training a multi-node model
  • add logging to verify that the sampling probabilities are respected

Once we verify this feature is working, we will swap the small testing datasets with the preprocessed multilingual datasets for the real training.

Here are the instructions for how to get started with the code-base

- install things: https://github.com/bigscience-workshop/Megatron-DeepSpeed#setup
- get and preprocess data to work with: https://github.com/bigscience-workshop/Megatron-DeepSpeed#quick-pre-processing-to-start-training-with
- train: https://github.com/bigscience-workshop/Megatron-DeepSpeed#gpt-pretraining

auto-add tokenizer files to the converted model

our conversion to HF script:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tools/convert_checkpoint/deepspeed_to_transformers.py
doesn't do anything about the tokenizer - until now we added the right tokenizer files manually, but this doesn't work at scale.

So this ticket is to extend the above script, at the point of saving the checkpoint, to look up which tokenizer is used based on these 2 args:

    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path t5-small \

for GPT2BPETokenizer it's only:

    --tokenizer-type  GPT2BPETokenizer \

and then fetch the right tokenizer files by loading the tokenizer and saving it into the folder with the checkpoint files.

So the required logic:

  1. derive the HF model name:

if args.tokenizer_type == "GPT2BPETokenizer":
    mname = "gpt2"
elif args.tokenizer_type == "PretrainedFromHF":
    mname = args.tokenizer_name_or_path
else:
    raise ValueError(f"don't know how to handle args.tokenizer_type={args.tokenizer_type}")

  2. fetch and save the right tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(mname)
tokenizer.save_pretrained(output_state_dict)

mname in our case so far is either t5-small or gpt2

done.

I think this is all that's needed but I haven't tested any of this.

unit test of fused softmax kernel

NVIDIA/Megatron-LM#132

I've been looking at the megatron code recently. When using the fused softmax kernel in megatron, I observed that the results differ from the original torch softmax. How about adding a unit test for this kernel?
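
Such a test could compare the fused kernel's output against a pure-pytorch reference of the same "scale -> mask -> softmax" contract, something like this (a sketch; tolerances would need to be chosen for fp16):

import torch

def reference_scale_mask_softmax(scores, mask, scale):
    # scores: [b, np, sq, sk] attention scores; mask: bool tensor, True = masked out
    scores = scores * scale
    scores = scores.masked_fill(mask, torch.finfo(scores.dtype).min)
    return torch.nn.functional.softmax(scores, dim=-1)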

Double counts in parameter count

Currently, parameter counts in utils.get_parameters_in_billions are inaccurate when PP > 1. Tied variables, in particular embedding layers, exist in several copies in the first and last PP stages, which causes double counts. For now the codebase only uses the count without embedding layers, which is accurate, but it would be good for the count with embedding layers to be correct as well, mostly for operation counting.

For background see: #40

Setup a relatively simple evaluation benchmark

We need an evaluation benchmark that we can use for iterating on the model design. It needs to satisfy the following requirements:

  • Two versions: English-only and multilingual
  • Zero-shot, few-shot, and many-shots: it should include at least the first two (the third is debatable)
  • Works on Megatron checkpoints and huggingface/transformers checkpoints. The modeling group will make a Megatron checkpoint available soon. In the meantime, you can start using huggingface/transformers checkpoints, for example, this one https://huggingface.co/gpt2
  • Easy to run: ideally a single script that reads the checkpoint and outputs the evaluation results
  • Doesn't take too long to run: anything longer than 1 day on 1 gpu is too much
  • Representative: it covers the key tasks (LM, QA, classification, generation .. anything else?)

Update

  • Assume the model is autoregressive and it uses causal language modeling
  • Selected datasets need to make it possible to compare with existing pretrained models (e.g. GPT3, GPT-J, mT5 ... )

[ci] failing with 4 gpus, works with 2

Need to figure out why CI fails with 4 gpus, but works fine with 2.

I set it to 2 gpus for now, until we sort this out: 391930b

FAILED tests/test_model.py::MyTestCase::test_gpt - Failed: Timeout >300.0s
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_0_base - ...
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_1_cl - Ru...
FAILED tests/test_training.py::MegDSTestTraining::test_training_prefix_lm_all

The first one hangs on startup; the others fail with:

stderr: Traceback (most recent call last):
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_gpt.py", line 246, in <module>
stderr:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 165, in pretrain
stderr: Traceback (most recent call last):
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_gpt.py", line 246, in <module>
stderr:     iteration = train(forward_step_func,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 734, in train
stderr:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr:       File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 165, in pretrain
stderr: train_step(forward_step_func,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 405, in train_step
stderr:     iteration = train(forward_step_func,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 734, in train
stderr:     loss = model[0].train_batch(data_iter=data_iterator)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 329, in train_batch
stderr:     self._exec_schedule(sched)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1313, in _exec_schedule
stderr:     train_step(forward_step_func,
stderr:   File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 405, in train_step
stderr:     loss = model[0].train_batch(data_iter=data_iterator)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 329, in train_batch
stderr:         self._exec_instr(**cmd.kwargs)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 723, in _exec_backward_pass
stderr: self._exec_schedule(sched)
stderr:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1313, in _exec_schedule
stderr:     local_part=self.grad_layer[1],
stderr: IndexError: list index out of range

Add tests & setup CI

As an extension of discussions from #47, start adding basic unit tests to ensure that new additions and implementations (e.g. GLU variants, prefix-LM masking schemes, etc.) work as intended. Possibly set up CI pipelines to ensure that tests pass on each push to main. This is likely mid to low priority.

[testing] data size / dynamic downloads - test speed and repo bloat

Let's discuss which data should be used in the test suite, and after the discussion turn this into guidelines for test writers.

Here is a very rough start:

Currently the main overhead in testing is the really slow startup of Meg-DS.

  • We also don't want the repo to become slow to download because of large data files. My first attempt is to generate a synthetic input on the fly (incomplete). But perhaps having a small compressed file in the repo would not be a problem, uncompressing it when the test suite starts.
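
For the synthetic route, a sketch of generating a tiny jsonl corpus on the fly (record count and text shape are just examples):

import json
import random
import string

def write_synthetic_jsonl(path, n_records=1000, seed=42):
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n_records):
            words = ("".join(rng.choices(string.ascii_lowercase, k=rng.randint(2, 8)))
                     for _ in range(rng.randint(20, 200)))
            f.write(json.dumps({"text": " ".join(words)}) + "\n")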

Hopefully soon we will have a CI, and downloading large data files there could slow things down.

The test suite ensures correctness, not speed, so having 1k records is more than enough, even for parallel processing.

That said, we can also have extended slow tests that don't run by default and are allowed to download large data.

Comments, suggestions and ideas are super welcome.

Limit cache for GPT2Tokenizer

After some OOM errors when preprocessing large files, I noticed that the caching mechanism can use unlimited memory because it grows forever. This causes issues, in particular in multiprocessing, where the cache is multiplied by the number of workers, and prevents working with large datasets.

Note: I've tried switching to the HF tokenizer, and the memory usage becomes much lower since it doesn't cache.

In order to reduce the memory usage, we can either:

  • implement a limited-memory cache (LRU or something like that; see the sketch after this list)
  • shared cache between workers
  • remove cache mechanism
  • ignore the problem since we'll use another implementation of tokenizer
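
A sketch of the first option, a bounded LRU replacement for the tokenizer's dict cache (the size is illustrative):

from collections import OrderedDict

class LRUCache(OrderedDict):
    """Bounded cache: evicts the least recently used entry once maxsize is exceeded."""
    def __init__(self, maxsize=500_000):
        super().__init__()
        self.maxsize = maxsize

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)          # mark as most recently used
        return value

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.move_to_end(key)
        if len(self) > self.maxsize:
            self.popitem(last=False)   # drop the least recently used entry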

cc @ontocord
