piotrnawrot / nanot5

Fast & Simple repository for pre-training and fine-tuning T5-style models

License: Apache License 2.0

Python 100.00%

nanot5's Issues

Pre-trained nanoT5 model on C4 corpus

Hi Piotr,

Thank you for your great work on the nanoT5 repository.
I'm going to pre-train the T5-base variant on the C4 corpus using a V100 GPU and then evaluate the model on GLUE tasks. I'm using your nanoT5 repo as a reference, and it is very helpful. Thanks a lot, again.
Before I pre-train the model myself, and to save time, I would like to ask whether there is a pre-trained model you could share.

Best regards,
SungHo

Error encountered during multi-GPU training with torch compile enabled

Has anyone encountered the below error?

RuntimeError:

torch._dynamo.optimize is called on a non function object.
If this is a callable class, please wrap the relevant code into a function and optimize the
wrapper function.

class CallableClass:
    def __init__(self):
        super().__init__()
        self.relu = torch.nn.ReLU()

    def __call__(self, x):
        return self.relu(torch.sin(x))

    def print_hello(self):
        print("Hello world")

mod = CallableClass()

If you want to optimize the __call__ function and other code, wrap that up in a function

def wrapper_fn(x):
    y = mod(x)
    return y.sum()

and then optimize the wrapper_fn

opt_wrapper_fn = torch._dynamo.optimize(wrapper_fn)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 60801) of binary: /root/miniconda3/envs/nanoT5/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/nanoT5/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

The command I use is CUDA_VISIBLE_DEIVCES=0,1 accelerate launch --multi_gpu --mixed_precision=bf16 --num_processes=2 -m nanoT5.main model.compile=true.
My environment works fine with single-GPU training, i.e. with the following command:
python -m nanoT5.main model.compile=true

Citing Repo

Thanks for sharing your code and optimizations for training T5. What's the best way to cite this repo?

Continued pretraining from official models.

I would like to start off by saying that this is absolutely great work. I do have a minor question, though: would it be possible to continue pre-training from the original weights on HF? Or would we have to manually retrain on C4 in conjunction with our personal dataset?

Learning rate for multi-GPUs training

Hi, if you keep batch_size = 128 and train on multiple GPUs, e.g. 8 GPUs, the effective batch size is 128 * 8 = 1024. Do you have any idea how to set the learning rate in that case?

I have tried changing the learning rate, for example scaling it linearly with the number of GPUs (lr = single-GPU lr * 8), but the trained model only ended up with a worse negative log-likelihood.

So far, I have only considered optim = adafactor and scheduler = legacy, though.
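For reference, the two heuristics most often tried when the effective batch size grows are linear scaling and square-root scaling of the learning rate. The sketch below only computes both candidates so they can be compared side by side. The base learning rate of 2e-2 and the helper names are illustrative assumptions, not values or functions taken from nanoT5, and neither rule is guaranteed to help with Adafactor-style optimizers, as the experience described above suggests.

```python
import math


def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_acc_steps: int = 1) -> int:
    """Number of examples contributing to one optimizer step."""
    return per_gpu_batch * num_gpus * grad_acc_steps


def scaled_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "sqrt") -> float:
    """Candidate learning rate for a new batch size under a given heuristic."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


if __name__ == "__main__":
    base_lr = 2e-2  # assumed single-GPU learning rate, for illustration only
    new_batch = effective_batch_size(per_gpu_batch=128, num_gpus=8)
    for rule in ("linear", "sqrt"):
        print(rule, scaled_lr(base_lr, base_batch=128, new_batch=new_batch, rule=rule))
```

Treat both outputs as starting points for a small sweep rather than as final settings.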

Pre-train on different Dataset than C4

Is there any option to pre-train T5-based models on a dataset other than C4 in a self-supervised manner?

pre-training on local C4 dataset?

Sorry to bother you. Due to network problems, I have downloaded the C4 dataset locally. How can I use the local dataset for pre-training? Please help me with this.

pre-train on long context.

I need to fine-tune on long context (16K). Is nanoT5 easily adaptable to the LongT5 model? If it is, I could pre-train on C4 plus my custom data with long context and then fine-tune on my custom targets.

How to run on CPU

Do you have any code example showing how to run it on CPUs?

Silly question: Why do you need to re-implement T5 model?

Hi, thank you a lot for this helpful github.

Can I ask why you need to re-implement the T5 model instead of using the one from Hugging Face and pre-training that model with mixed precision directly?

Have you tried any benchmark other than SNI?

Computing Rouge score during training

Hi Piotr,

Thank you for your amazing work on the nanoT5 repository.

I am a beginner in NLP, so I have been learning how to run T5 thanks to your repo. I have a question regarding the computation of the ROUGE score. I am trying to compute it during the training of my T5 to monitor the performance of my model.

When I set return_attention_mask=True in your code (https://github.com/PiotrNawrot/nanoT5/blob/main/nanoT5/utils/copied_utils.py#L361), I get a size mismatch in "position_bias = position_bias + mask": position_bias has size 512 (the input length) and the mask has size 568 (which is before_mask_input_length).

Question: Do you have any advice on how to compute the ROUGE score every X steps during training with your repo?

Thanks a lot for your help !
Samy
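As a rough illustration of periodic evaluation, the sketch below computes a simplified ROUGE-1 F1 (clipped unigram overlap on whitespace tokens, no stemming) and a step check for "every X steps". This is a stand-in written for this example, not the metric implementation used by nanoT5, and its scores will differ from those of a full ROUGE package. The names rouge1_f1 and should_evaluate are hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def should_evaluate(step: int, eval_every: int) -> bool:
    """True on every eval_every-th training step (steps counted from 1)."""
    return step > 0 and step % eval_every == 0


if __name__ == "__main__":
    for step in range(1, 7):
        if should_evaluate(step, eval_every=3):
            # In a real loop the prediction would come from decoding model output.
            print(step, rouge1_f1("the cat", "the cat sat on the mat"))
```

In a training loop, the predictions would come from generating text on a held-out batch and decoding it with the tokenizer before scoring.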

Why isn't the lr warm up from 0?

Dear authors:

nanoT5/nanoT5/utils/model_utils.py

Line 259 in 5a66107

start_factor=0.5,

It appears that the learning-rate warm-up is configured to start at half of the target learning rate (start_factor=0.5) rather than at 0. Is there a specific reason for this decision?
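To make the effect of start_factor concrete, here is a small sketch of a linear warm-up multiplier. It assumes the quoted argument is passed to a scheduler that behaves like PyTorch's LinearLR, which interpolates the multiplier linearly from start_factor to end_factor over total_iters steps; the quoted line alone does not show the call. The base learning rate and the total_iters value below are made up for illustration.

```python
def linear_warmup_factor(step: int, total_iters: int,
                         start_factor: float = 0.5, end_factor: float = 1.0) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    progress = min(step, total_iters) / total_iters
    return start_factor + (end_factor - start_factor) * progress


if __name__ == "__main__":
    base_lr = 2e-2      # illustrative value only
    warmup_steps = 100  # illustrative value only
    for step in (0, 50, 100, 200):
        print(step, base_lr * linear_warmup_factor(step, warmup_steps))
```

With start_factor=0.5 the first step already uses half of the base learning rate, whereas a start_factor close to 0 would ramp up from almost nothing.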

Flash attention

Firstly, thank you so much for this repo! I'm a huge fan of T5, and these results are extremely impressive.

I saw that you experimented with different positional embeddings like ALiBi in order to facilitate FA down the line. Was that attempt due to the fact that FA doesn't support bias? If so, there is a PR to add it that is making progress:

Dao-AILab/flash-attention#617

It would be fun to see this repo get even faster.

Query regarding multi-GPU

I have tested the code on the ('wikitext', 'wikitext-2-v1') dataset and on my local dataset; pre-training runs on a single machine with the default config. Once I switch to a multi-GPU setup, I notice that only one of the GPUs is loaded and the rest are idle. Does the code support multi-GPU training out of the box?
cc @PiotrNawrot

About Pre-training objectives

Hi, thanks for giving us this implementation. I really appreciate it.
I'm a bit new to training encoder-decoder models, so I was wondering if you could answer one question.

If my understanding is correct, the regular pre-training objective of T5 is very similar to MLM: you mask some tokens and have the model learn to predict them. Suppose that, instead of masking tokens, I corrupt my whole dataset (20% of each row) by replacing tokens with other tokens (no fancy generator-discriminator, just corrupting the data during the pre-processing step) and treat it as a grammar/typo-correction task where the labels are the original, clean text. Could that be a viable objective?

input: "the katt jamped over the fense"
label: "the cat jumped over the fence"

May I ask what you think of this?
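The pre-processing step described above could be sketched as follows. This is a word-level toy version with a tiny made-up vocabulary; a real pipeline would operate on subword token IDs. The function name corrupt_tokens and its parameters are hypothetical, and the sketch says nothing about whether the objective trains well.

```python
import random


def corrupt_tokens(tokens, vocab, corruption_rate=0.2, seed=None):
    """Replace a fixed fraction of tokens with different tokens drawn from vocab.

    Returns (corrupted_input, labels), where labels is the clean sequence.
    vocab must contain at least two distinct tokens.
    """
    rng = random.Random(seed)
    n_corrupt = round(len(tokens) * corruption_rate)
    positions = rng.sample(range(len(tokens)), n_corrupt)
    corrupted = list(tokens)
    for pos in positions:
        # Exclude the original token so every chosen position really changes.
        candidates = [t for t in vocab if t != tokens[pos]]
        corrupted[pos] = rng.choice(candidates)
    return corrupted, list(tokens)


if __name__ == "__main__":
    vocab = ["the", "cat", "dog", "jumped", "ran", "over", "under", "fence", "wall", "a"]
    sentence = "the cat jumped over the fence and the dog ran".split()
    model_input, labels = corrupt_tokens(sentence, vocab, corruption_rate=0.2, seed=0)
    print("input:", " ".join(model_input))
    print("label:", " ".join(labels))
```

Unlike span masking, the model here has to detect which positions were corrupted as well as restore them, and the target is the full sequence rather than only the masked spans.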

RMS scaling issues

Could you shed some light on how the RMS scaling code used in AdamWScale is supposed to work? Seems like it has a bug, but maybe there's some magic I'm missing there...

# /Adapt Step from Adafactor
    step_size = step_size * max(1e-3, self._rms(p.data))
# /Adapt Step from Adafactor

This seems extremely different from what Adafactor does, which first calculates the expected parameter update and then scales it like update = step_size * max(1, rms(update) / clipping).

The nanoT5 code appears to scale each parameter's update by _rms(its own parameter value), rather than _rms(the parameter's desired update). It then uses a magic constant, which is maybe meant to somehow refer back to the default learning rate?
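For comparison, the Adafactor paper describes two separate RMS-based mechanisms: update clipping, which divides the update by max(1, RMS(update) / d), and a relative step size, which multiplies the step size by max(eps2, RMS(parameter)) with eps2 = 1e-3. The quoted line has the same form as the second one. The sketch below writes both out on plain Python lists so the difference is visible; it is an illustration under that reading of the paper, not the nanoT5 or any library implementation, and the function names are hypothetical.

```python
import math


def rms(values):
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def clip_update(update, clip_threshold=1.0):
    """Update clipping: shrink the update when its RMS exceeds clip_threshold."""
    denom = max(1.0, rms(update) / clip_threshold)
    return [u / denom for u in update]


def relative_step_size(step_size, params, eps2=1e-3):
    """Parameter scaling: step size proportional to the RMS of the parameter itself."""
    return step_size * max(eps2, rms(params))


if __name__ == "__main__":
    print(clip_update([3.0, 4.0]))                # large update is scaled down to RMS 1
    print(relative_step_size(0.01, [2.0, 2.0]))   # step grows with parameter magnitude
    print(relative_step_size(0.01, [0.0, 0.0]))   # eps2 keeps the step from vanishing
```

Under this reading, the constant 1e-3 is a floor that keeps near-zero parameters trainable rather than a reference to the default learning rate.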

Shape mismatch warning

I trained T5 Base from scratch on a different dataset (Wikipedia) and used checkpoint #60,000 to fine-tune on a downstream Seq2Seq task. However, when I load the model from a local folder with the command model = T5ForConditionalGeneration.from_pretrained(model_name_or_path, local_files_only=True), I get the warning message below. Questions: Is that normal? Does it affect the performance of the final fine-tuned model?
WARNING:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at /home/.../checkpoint-pt-60001/ were not used when initializing T5ForConditionalGeneration: ['encoder.block.8.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.1.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.5.layer.1.DenseReluDense.wi_0.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.11.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.1.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.7.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.7.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.5.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.4.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.8.layer.1.DenseReluDense.wi_0.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.2.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.3.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.11.layer.1.DenseReluDense.wi_0.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.6.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.0.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.6.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.3.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.9.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.0.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.10.layer.1.DenseReluDense.wi_0.weight', 
'decoder.block.1.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.2.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.9.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.10.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.4.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.1.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_1.weight']

This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at /home/.../checkpoint-pt-60001/ and are newly initialized: ['encoder.block.8.layer.1.DenseReluDense.wi.weight', 'decoder.block.11.layer.2.DenseReluDense.wi.weight', 'decoder.block.4.layer.2.DenseReluDense.wi.weight', 'encoder.block.2.layer.1.DenseReluDense.wi.weight', 'encoder.block.10.layer.1.DenseReluDense.wi.weight', 'decoder.block.1.layer.2.DenseReluDense.wi.weight', 'encoder.block.11.layer.1.DenseReluDense.wi.weight', 'decoder.block.3.layer.2.DenseReluDense.wi.weight', 'encoder.block.0.layer.1.DenseReluDense.wi.weight', 'encoder.block.3.layer.1.DenseReluDense.wi.weight', 'decoder.block.6.layer.2.DenseReluDense.wi.weight', 'encoder.block.6.layer.1.DenseReluDense.wi.weight', 'decoder.block.10.layer.2.DenseReluDense.wi.weight', 'decoder.block.9.layer.2.DenseReluDense.wi.weight', 'encoder.block.5.layer.1.DenseReluDense.wi.weight', 'decoder.block.8.layer.2.DenseReluDense.wi.weight', 'encoder.block.7.layer.1.DenseReluDense.wi.weight', 'decoder.block.0.layer.2.DenseReluDense.wi.weight', 'decoder.block.2.layer.2.DenseReluDense.wi.weight', 'encoder.block.9.layer.1.DenseReluDense.wi.weight', 'decoder.block.7.layer.2.DenseReluDense.wi.weight', 'encoder.block.1.layer.1.DenseReluDense.wi.weight', 'decoder.block.5.layer.2.DenseReluDense.wi.weight', 'encoder.block.4.layer.1.DenseReluDense.wi.weight']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at /home/.../checkpoint-pt-60001/ and are newly initialized because the shapes did not match:

shared.weight: found shape torch.Size([32128, 768]) in the checkpoint and torch.Size([32102, 768]) in the model instantiated
encoder.embed_tokens.weight: found shape torch.Size([32128, 768]) in the checkpoint and torch.Size([32102, 768]) in the model instantiated

Transformation to HF model

Where can I find the script for converting the checkpoint to Hugging Face format?

self-defined loss function failed to work (torch._dynamo.exc.InternalTorchDynamoError: ln_encoder)

I am trying to add my own loss function using the encoder's hidden states, so I added a new linear layer, similar to your self.lm_head layer, to obtain the corresponding logits. However, training fails every time. It seems I am not using the linear layer correctly, but I do not know why...
Here is my modified part of MyT5 module:

class MyT5(nn.Module):
    def __init__(self, config: T5Config):
        ...
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        self.ln_encode = nn.Linear(config.d_model, 3, bias=False)
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        decoder_input_ids: Optional[torch.LongTensor] = None,
        decoder_attention_mask: Optional[torch.BoolTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        encoder_outputs = None,
        myencode_attention_mask: Optional[torch.BoolTensor] = None,
        seq_order: Optional[torch.LongTensor] = None,
    ) -> Seq2SeqLMOutput:
        if encoder_outputs is None:
            encoder_outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
            )

        hidden_states = encoder_outputs.hidden_states
        myencode_attention_mask_extended = myencode_attention_mask.unsqueeze(-1).expand_as(encoder_outputs[0])
        myencode_hidden_states = encoder_outputs[0][myencode_attention_mask_extended].view(-1, self.model_dim)
        myencode_loss_ft = CrossEntropyLoss(ignore_index=-100)
        ln_encode_logits = self.ln_encoder(myencode_hidden_states)
        seq_loss = seq_loss_ft(ln_encode_logits.view(-1, ln_encode_logits.size(-1)), seq_order[seq_order != -100].view(-1))
        ...
        loss += seq_loss
        return ...
    def _init_weights(self, module):
        factor = self.config.initializer_factor  # Used for testing weights initialization
        ...
        elif isinstance(module, (MyT5)):
            module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)
            if hasattr(module, "lm_head") and not self.config.tie_word_embeddings:
                module.lm_head.weight.data.normal_(mean=0.0, std=factor * 1.0)
                print("lm initialized")
            if hasattr(module, "ln_encode"):
                module.ln_encode.weight.data.normal_(mean=0.0, std=factor * 1.0)
            ...

And here is the error message:

Error executing job with overrides: []
Traceback (most recent call last):
  File "/data2/usr/projects/nanoT5/nanoT5/main.py", line 85, in main
    train(model, train_dataloader, test_dataloader, accelerator,
  File "/data2/usr/projects/nanoT5/nanoT5/utils/train_utils.py", line 237, in train
    loss, stats = forward(model, batch)
  File "/data2/usr/projects/nanoT5/nanoT5/utils/train_utils.py", line 90, in forward
    outputs = model(**batch)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 333, in _fn
    return fn(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1521, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1357, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/data2/usr/projects/nanoT5/nanoT5/utils/t5_model.py", line 477, in forward
    myencode_hidden_states = encoder_outputs[0][myencode_attention_mask_extended].view(-1, self.model_dim)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return hijacked_callback(frame, cache_size, hooks, frame_state)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 637, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 371, in _convert_frame_assert
    return _compile(
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 584, in _compile
    raise InternalTorchDynamoError(str(e)).with_traceback(e.__traceback__) from None
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 567, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 181, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 466, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 433, in transform
    tracer.run()
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2071, in run
    super().run()
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1191, in LOAD_ATTR
    result = BuiltinVariable(getattr).call_function(
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 608, in call_function
    result = handler(tx, *args, **kwargs)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 1074, in call_getattr
    return obj.var_getattr(tx, name).add_options(options)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/variables/nn_module.py", line 192, in var_getattr
    subobj = inspect.getattr_static(base, name)
  File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/inspect.py", line 1769, in getattr_static
    raise AttributeError(attr)
torch._dynamo.exc.InternalTorchDynamoError: ln_encoder

from user code:
   File "/data2/usr/projects/nanoT5/nanoT5/utils/t5_model.py", line 479, in <resume in forward>
    ln_encode_logits = self.ln_encoder(myencode_hidden_states)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Looking forward to your assistance :)
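One observation from the material above: the last line of the traceback names ln_encoder, while the __init__ shown in the snippet registers the layer as ln_encode (likewise, the loss object is created as myencode_loss_ft but called as seq_loss_ft). A small standard-library helper like the one below can flag such near-miss attribute names. Head is a hypothetical stand-in for the module, with placeholder strings instead of real layers.

```python
import difflib


class Head:
    """Hypothetical stand-in for the module in the issue; attributes are placeholders."""

    def __init__(self):
        self.lm_head = "lm_head layer"
        self.ln_encode = "ln_encode layer"


def suggest_attribute(obj, missing_name, cutoff=0.6):
    """Return existing instance attribute names that closely resemble missing_name."""
    return difflib.get_close_matches(missing_name, list(vars(obj)), n=3, cutoff=cutoff)


if __name__ == "__main__":
    head = Head()
    print(hasattr(head, "ln_encoder"))            # False: the name used in forward
    print(suggest_attribute(head, "ln_encoder"))  # lists similarly named attributes
```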

How to create pytorch_model.bin file?

Hello, thank you for creating nanoT5. I am trying to fine-tune a nanoT5 pre-trained model. After training the model for 2^16 steps with default settings, the script checkpoints and saves a few files such as model.safetensors, optimizer.bin, random_states_0.pkl and scheduler.bin. How do I create the pytorch_model.bin file for fine-tuning?

fine-tuning error: No module named adaptive.moe

Hi Piotr,

Your work on the nanoT5 repository is amazing; we like it and want to reproduce it.

I started studying NLP recently, so I have been trying to learn how to run T5 the way you do. Pre-training succeeded, but fine-tuning failed:
I couldn't execute the fine-tuning Python command because the module adaptive.moe doesn't exist anywhere.
I also googled it and couldn't find the module, so could you help me figure it out?

Thanks a lot for your help !
fan

AttributeError: Can't pickle local object 'IterableDataset.map.<locals>.<lambda>'

First of all, thank you for this rigorous work. As another low-budget researcher, I applaud you.

I was curious to see how this would perform on Apple Metal (MPS). However, even in pure CPU mode I can't get it to run on OSX.

Any tips?

(nanoT5) joseph@JosephsacStudio nanoT5 % python3 -m nanoT5.main device=cpu model.compile=False precision=no
[2023-07-27 22:23:37,823][Main][INFO] - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu

Mixed precision type: no

[2023-07-27 22:23:37,823][Main][INFO] - Working directory is /Users/joseph/dev/nanoT5/logs/2023-07-27/22-23-37-
loading configuration file config.json from cache at /Users/joseph/.cache/huggingface/hub/models--google--t5-v1_1-base/snapshots/b5fc947a416ea3cb079532cb3c2bbadeb7f800fc/config.json
Model config T5Config {
  "_name_or_path": "google/t5-v1_1-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 32128
}

loading file spiece.model from cache at /Users/joseph/.cache/huggingface/hub/models--google--t5-v1_1-base/snapshots/b5fc947a416ea3cb079532cb3c2bbadeb7f800fc/spiece.model
loading file tokenizer.json from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /Users/joseph/.cache/huggingface/hub/models--google--t5-v1_1-base/snapshots/b5fc947a416ea3cb079532cb3c2bbadeb7f800fc/special_tokens_map.json
loading file tokenizer_config.json from cache at /Users/joseph/.cache/huggingface/hub/models--google--t5-v1_1-base/snapshots/b5fc947a416ea3cb079532cb3c2bbadeb7f800fc/tokenizer_config.json
You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565

Error executing job with overrides: ['device=cpu', 'model.compile=False', 'precision=no']
Traceback (most recent call last):
  File "/Users/joseph/dev/nanoT5/nanoT5/main.py", line 65, in main
    train(model, train_dataloader, test_dataloader, accelerator,
  File "/Users/joseph/dev/nanoT5/nanoT5/utils/train_utils.py", line 186, in train
    for batch_id, batch in enumerate(train_dataloader, start=1):
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/site-packages/accelerate/data_loader.py", line 550, in __iter__
    main_iterator = super().__iter__()
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'IterableDataset.map.<locals>.<lambda>'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
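
The final AttributeError in this traceback is raised by Python's pickle module. The frames in popen_spawn_posix.py show that the DataLoader worker processes are started with the spawn method, which has to pickle the dataset object, and a lambda defined inside a function cannot be pickled. The standard-library sketch below reproduces that behaviour in isolation; the function names are hypothetical and are not part of nanoT5 or the datasets library. One likely workaround, assuming the number of DataLoader workers is configurable in your setup, is to run with zero workers so that nothing has to be pickled.

```python
import pickle


def make_local_transform():
    # Mimics a lambda created inside a method, like the one named in the
    # error message ('IterableDataset.map.<locals>.<lambda>').
    return lambda example: example


def can_pickle(obj):
    """Return True if pickle can serialise obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # Older Python versions raise AttributeError for local objects,
        # newer ones may raise PicklingError instead.
        return False


if __name__ == "__main__":
    # Plain data pickles without trouble.
    print("list picklable:        ", can_pickle([1, 2, 3]))
    # A lambda defined inside a function has no importable name,
    # so pickle cannot store a reference to it.
    print("local lambda picklable:", can_pickle(make_local_transform()))
```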

Beginner question: would it be wise to use this as a backbone for custom seq2seq modeling of fMRI data with a custom encoder?

Hello,

This is my first time using transformers, and I wanted to ask a few questions about how I can implement a custom transformer quickly and with minimal effort.

Do you think I should use your code as a starting point to create a custom model that:

Uses a completely different dataset with a custom tokenization scheme (e.g., tokenizing brain signals with a custom brain-signal embedding model)
Can optionally swap out either the encoder or the decoder (for example, I am thinking of using ContiFormer in place of the regular encoder to better capture the temporal aspect of the data)
Is very small (1 or 2 encoder/decoder layers at most), since brain data is very limited

Or should I just copy the T5 model code from Hugging Face and customize it from there using PyTorch? (I am already familiar with PyTorch, but not with transformers or Hugging Face, as I have only used CNNs.)

Any advice would be greatly appreciated :)

Resume the pre-training process

Hi:

Thank you for your good work!

I want to know whether nanoT5 supports resuming training from a saved checkpoint, including the model (pytorch_model.bin), the optimizer state (optimizer.bin), and the LR scheduler (scheduler.bin).
I would really appreciate it if you could give me a simple example of this.

Thanks a lot!

Question about implementing whole word masking in nanoT5

First off, I want to thank you for the amazing work on creating nanoT5! This repo has helped me continue pre-training CodeT5 on my own data corpus. Thanks a lot! I have some questions about the current masking implementation in nanoT5:

It seems that nanoT5 currently uses random span masking. If I want to use whole word masking (WWM) instead, how could I implement it?
I think I should modify the DataCollatorForT5MLM class, but I am not sure I am on the right track. Could you give me some hints on where to start?

Any insights or pointers you can provide would be much appreciated!
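
One possible starting point, sketched below in plain Python, is to group sub-word tokens into words first and then pick whole words to mask, so that a word is never split between masked and unmasked tokens. This only illustrates the grouping idea and is not nanoT5's DataCollatorForT5MLM. The helper names and the word-start marker (here the SentencePiece-style "▁" prefix) are assumptions, and a different tokenizer would need a different boundary test.

```python
import random


def group_tokens_into_words(tokens, word_start_marker="\u2581"):
    """Return a list of words, each given as a list of token indices.

    Assumption: a token that begins with word_start_marker ("▁") opens a
    new word; any other token continues the previous word.
    """
    words = []
    for index, token in enumerate(tokens):
        if token.startswith(word_start_marker) or not words:
            words.append([index])
        else:
            words[-1].append(index)
    return words


def whole_word_mask(tokens, mask_ratio=0.15, seed=0):
    """Return a boolean mask over tokens that covers only complete words."""
    rng = random.Random(seed)
    words = group_tokens_into_words(tokens)
    # Aim for roughly mask_ratio of the tokens, but always at least one.
    target = max(1, round(len(tokens) * mask_ratio))
    mask = [False] * len(tokens)
    masked = 0
    for word in rng.sample(words, len(words)):  # visit words in random order
        if masked >= target:
            break
        for index in word:
            mask[index] = True
        masked += len(word)
    return mask


if __name__ == "__main__":
    tokens = ["\u2581pre", "train", "ing", "\u2581a", "\u2581small", "\u2581model"]
    print(group_tokens_into_words(tokens))
    print(whole_word_mask(tokens, mask_ratio=0.5))
```

A mask of this kind would still have to be converted into sentinel-token inputs and targets, which this sketch does not attempt.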

Pre-training fails at step 30155 out of 32768 steps every time

I have run into the error "Cannot call sizes() on tensor with symbolic sizes/strides" during pre-training. Whenever I pre-train nanoT5 with the google/t5-v1_1-small config, the program fails at step 30155 of 32768. I only modified get_tokenizer and load_dataset_splits to load a custom dataset and tokenizer; the rest of the program is unchanged except for an added wandb logger. However, when I resume training from checkpoint-30000, the problem does not occur, whether I set args.current_train_step=1 or args.current_train_step=30000. Below is the full traceback:

/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  torch.has_cuda,
/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/overrides.py:112: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
  torch.has_cudnn,
/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  torch.has_mps,
/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/overrides.py:119: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
  torch.has_mkldnn,
Traceback (most recent call last):
  File "/data/user/projects/nanoT5/nanoT5/main.py", line 89, in main
    train(model, train_dataloader, test_dataloader, accelerator,
  File "/data/user/projects/nanoT5/nanoT5/utils/train_utils.py", line 190, in train
    loss, stats = forward(model, batch)
  File "/data/user/projects/nanoT5/nanoT5/utils/train_utils.py", line 88, in forward
    outputs = model(**batch)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 333, in _fn
    return fn(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1521, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1357, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return hijacked_callback(frame, cache_size, hooks, frame_state)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 637, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 371, in _convert_frame_assert
    return _compile(
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 567, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 181, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 466, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 433, in transform
    tracer.run()
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2071, in run
    super().run()
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2159, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 853, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 953, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 181, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1020, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1005, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 436, in compile_fn
    submod_compiler.run(*example_inputs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/fx/interpreter.py", line 138, in run
    self.env[node] = self.run_node(node)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 430, in run_node
    return curr_submod(*new_args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/fx/graph_module.py", line 678, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/fx/graph_module.py", line 284, in __call__
    raise e
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/fx/graph_module.py", line 274, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.251", line 25, in forward
    l__self___encoder_block_0_layer_0_self_attention_q = self.L__self___encoder_block_0_layer_0_SelfAttention_q(type_as)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides

This is the command I use to run the code:

accelerate launch --multi_gpu --num_processes=2 -m nanoT5.main model.compile=true

About pre-training on another dataset

Hi there, I'm about to pre-train a custom mT5 on my own dataset, but I'm not sure how to do that. (I'm still confused about how to change the script for my dataset and about what a dataset sample fed into the model should look like.)

Also, should I mix the texts of the different languages into one corpus, or pre-train on each language separately?

Pre-training on my own dataset

Thank you for the great work. I have a domain-specific dataset (large enough), and I am trying to pre-train, or continue pre-training, on it. Could you offer any help with that?
cc @PiotrNawrot

nanoT5 initializes lm_head weights with 768x too much variance, probably

it looks like your lm_head weight init is the same as HF's, which has a known problem: it doesn't match the original Mesh TensorFlow weight init:
huggingface/transformers#26441

I believe that when training T5 with an untied lm_head, you would want to initialize the lm_head weights with std=hidden_dim**-0.5, so about std=0.036.

currently the lm_head is initialized with std=1, which is 27.7x too much std, or 768x too much variance.
https://github.com/PiotrNawrot/nanoT5/blob/1c82d67bf8dea635be68a3b2a68a43b68b665193/nanoT5/utils/t5_model.py#L505C26-L505C26
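
The figures quoted above can be checked with a few lines of arithmetic. The sketch assumes hidden_dim = 768, which is what the 768x variance figure implies; the variable names are illustrative only and are not taken from the nanoT5 code.

```python
import math

hidden_dim = 768  # assumed model width, implied by the 768x figure above

proposed_std = 1.0 / math.sqrt(hidden_dim)  # hidden_dim ** -0.5
current_std = 1.0                           # std reported for the current init

std_ratio = current_std / proposed_std  # how many times larger the std is
variance_ratio = std_ratio ** 2         # variance scales with the square of std

print(f"proposed std:   {proposed_std:.4f}")     # 0.0361
print(f"std ratio:      {std_ratio:.1f}x")       # 27.7x
print(f"variance ratio: {variance_ratio:.0f}x")  # 768x
```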

Difficulty applying nanoT5 to a different model and dataset

We weren't able to find specific documentation on training a different model on a different dataset with your code. We want to train an mT5 model, which we have restricted to Hebrew and English, on a Hebrew Wikipedia dataset using your implementation of T5. Could you help us do this?

Larger models and training on the Pile

After seeing the excitement around TinyLlama, I want to pre-train some T5 models in a similar fashion. If you are able to achieve equivalent results in a fraction of the time on C4, it seems like throwing some modern datasets and more compute at it should yield even better results. Do you see any reason why this wouldn't be the case? Or does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?

Just a quick question about pre-training Flan-T5

First of all, great work! Thanks for sharing.

Just wondering whether it is possible to train Flan-T5 from scratch. Any thoughts or ideas on this?

Thanks!
