piotrnawrot / nanot5
Fast & Simple repository for pre-training and fine-tuning T5-style models
License: Apache License 2.0
Hi Piotr,
Thank you for your great work on the nanoT5 repository.
I'm planning to pre-train a T5-base variant on the C4 corpus using a V100 GPU and then evaluate the model on GLUE tasks. I'm referring to your nanoT5 repo, and it is very helpful to me. Thanks a lot, again.
Before I pre-train the model myself, and to save time, I would like to ask whether there is a pre-trained model you could share.
Best regards,
SungHo
Has anyone encountered the below error?
RuntimeError:
torch._dynamo.optimize is called on a non function object.
If this is a callable class, please wrap the relevant code into a function and optimize the
wrapper function.
    class CallableClass:
        def __init__(self):
            super().__init__()
            self.relu = torch.nn.ReLU()

        def __call__(self, x):
            return self.relu(torch.sin(x))

        def print_hello(self):
            print("Hello world")

    mod = CallableClass()

If you want to optimize the __call__ function and other code, wrap that up in a function

    def wrapper_fn(x):
        y = mod(x)
        return y.sum()

and then optimize the wrapper_fn

    opt_wrapper_fn = torch._dynamo.optimize(wrapper_fn)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 60801) of binary: /root/miniconda3/envs/nanoT5/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/nanoT5/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
The command I use is CUDA_VISIBLE_DEIVCES=0,1 accelerate launch --multi_gpu --mixed_precision=bf16 --num_processes=2 -m nanoT5.main model.compile=true.
My environment works fine with single-GPU training, i.e. with the following command:
python -m nanoT5.main model.compile=true
Thanks for sharing your code and optimizations for training T5. What's the best way to cite this repo?
I would like to start off by saying that this is absolutely great work. I do have a minor question, though: would it be possible to continue pre-training from the original weights on HF? Or would we have to manually re-train on C4 in conjunction with our personal dataset?
Hi, if you keep batch_size = 128 and train with multiple GPUs, e.g. 8 GPUs, the effective batch size is 128*8 = 1024. Do you have any idea how to set the learning rate in that case?
I have tried changing the learning rate, for example scaling it linearly with the number of GPUs (lr = single-GPU lr * 8), but the trained model only ended up with a worse negative log-likelihood.
So far, I have only considered optim = adafactor and scheduler = legacy.
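For reference, the arithmetic in this question can be written out as a small sketch. The per-GPU batch size of 128 and the 8 GPUs come from the question; the base learning rate of 2e-2 is only a placeholder, the function names are hypothetical, and the square-root rule is a commonly tried alternative to linear scaling rather than something this repo prescribes. Neither rule is guaranteed to help with Adafactor-style optimizers, as the result reported above suggests.

```python
import math


def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Total number of examples consumed per optimizer step under data parallelism."""
    return per_gpu_batch * num_gpus


def scaled_lr(base_lr: float, num_gpus: int, rule: str = "linear") -> float:
    """Scale a single-GPU learning rate for a larger effective batch.

    "linear" multiplies by the number of GPUs (the rule tried in the question);
    "sqrt" multiplies by its square root (a gentler heuristic).
    """
    if rule == "linear":
        return base_lr * num_gpus
    if rule == "sqrt":
        return base_lr * math.sqrt(num_gpus)
    raise ValueError(f"unknown scaling rule: {rule}")


base_lr = 2e-2  # placeholder value, not taken from the question
print(effective_batch_size(128, 8))
print(scaled_lr(base_lr, 8, "linear"), scaled_lr(base_lr, 8, "sqrt"))
```

Printing both candidates side by side makes it easy to sweep a few values between them instead of committing to one rule.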
Is there any option to pre-train T5-based models on a dataset other than C4 in a self-supervised manner?
Sorry to bother you. Due to network problems, I have downloaded the C4 dataset locally. How can I use the local dataset for pre-training? Please help me with this.
I need to fine-tune on long context (16K). Is nanoT5 easily adaptable to the LongT5 model? If it is, I could possibly pre-train on C4 plus my custom data with long context and then fine-tune on my custom targets.
Do you have any code example showing how to run it on CPUs?
Hi, thank you a lot for this helpful GitHub repo.
Can I ask why you needed to re-implement the T5 model instead of using the one from Hugging Face and pre-training the Hugging Face model with mixed precision directly?
Hi Piotr,
Thank you for your amazing work on the nanoT5 repository.
I am a beginner in NLP, so I have been learning how to run T5 thanks to your repo. I have a question regarding the computation of the ROUGE score. I am trying to compute it during the training of my T5 model to monitor its performance.
When I set return_attention_mask=True in your code (https://github.com/PiotrNawrot/nanoT5/blob/main/nanoT5/utils/copied_utils.py#L361), I get a size mismatch in "position_bias = position_bias + mask", with position_bias being of size 512 (the input length) and the mask of size 568 (which is the before_mask_input_length).
Question: Do you have any advice on how to compute the ROUGE score every X steps during training using your repo?
Thanks a lot for your help!
Samy
Dear authors:
nanoT5/nanoT5/utils/model_utils.py
Line 259 in 5a66107
It appears that you have configured the learning rate to warm up halfway through the process. Are there any specific reasons for this decision?
Firstly, thank you so much for this repo! I'm a huge fan of T5, and these results are extremely impressive.
I saw that you experimented with different positional embeddings like ALiBi in order to facilitate FA down the line. Was that attempt due to the fact that FA doesn't support bias? If so, there is a PR to add it that is making progress:
It would be fun to see this repo get even faster.
I have tested the code with the ('wikitext', 'wikitext-2-v1') dataset and with my local dataset; pre-training runs on a single machine with the default config. Once I switch to a multi-GPU setup, I notice that only one of the GPUs is loaded and the rest are idle. Does the code support multi-GPU training out of the box?
cc @PiotrNawrot
Hi, thanks for giving us this implementation. I really appreciate it.
I'm a bit new to training encoder-decoder models, so I was wondering if you could answer one question.
If my understanding is correct, the regular T5 pre-training objective is very similar to MLM: you mask some tokens and have the model learn to predict them. Suppose that, instead of masking tokens, I corrupt my whole dataset (20% of each row) by replacing tokens with other tokens (no fancy generator-discriminator, just corrupting the data during pre-processing) and treat it like a grammar/typo correction task where the labels are the original, clean text. Could that be a viable objective?
input: "the katt jamped over the fense"
label: "the cat jumped over the fence"
May I ask what you think of this?
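The pre-processing step described in this question could be sketched roughly as follows. This is only an illustration under stated assumptions: it works on whitespace-split words rather than tokenizer IDs, the helper name `corrupt_tokens` and the toy vocabulary are hypothetical, and replacements are drawn uniformly at random, which is cruder than realistic typos such as "katt" for "cat".

```python
import random
from typing import List, Sequence, Tuple


def corrupt_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    rate: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Replace roughly `rate` of the tokens with different vocabulary items.

    Returns (corrupted_input, clean_label); the label is the untouched text.
    """
    rng = random.Random(seed)
    n_corrupt = max(1, round(rate * len(tokens))) if tokens else 0
    positions = rng.sample(range(len(tokens)), n_corrupt)
    corrupted = list(tokens)
    for pos in positions:
        # Only consider replacements that actually change the token.
        choices = [t for t in vocab if t != tokens[pos]]
        if choices:
            corrupted[pos] = rng.choice(choices)
    return corrupted, list(tokens)


clean = "the cat jumped over the fence".split()
toy_vocab = ["the", "cat", "dog", "jumped", "sat", "over", "fence"]  # hypothetical
model_input, label = corrupt_tokens(clean, toy_vocab, rate=0.2)
print(" ".join(model_input))
print(" ".join(label))
```

Because the label is the full clean sequence, the decoder has to reproduce every token, not just the corrupted ones, which differs from span corruption where only the masked spans are predicted.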
Could you shed some light on how the RMS scaling code used in AdamWScale is supposed to work? Seems like it has a bug, but maybe there's some magic I'm missing there...
# /Adapt Step from Adafactor
step_size = step_size * max(1e-3, self._rms(p.data))
# /Adapt Step from Adafactor
This seems extremely different than what Adafactor does, which first calculates the expected parameter update, and then scales it like update = step_size * max(1, rms(update) / clipping)
The nanoT5 code appears to scale each parameter's update by _rms(its own parameter value), rather than _rms(the parameter's desired update), and then uses a magic constant which is maybe meant to somehow refer back to the default learning rate?
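To make the two quantities in this question concrete, here is a minimal pure-Python sketch. The function names are hypothetical and this is not nanoT5's optimizer code. It follows the Adafactor paper's description, in which two separate steps exist: a relative step size scaled by max(eps2, RMS(parameter)), and update clipping that divides the update by max(1, RMS(update) / d). The 1e-3 floor in the first function plays the role of eps2.

```python
import math
from typing import List, Sequence


def rms(values: Sequence[float]) -> float:
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def param_scaled_step_size(
    step_size: float, param: Sequence[float], eps2: float = 1e-3
) -> float:
    """Scale the step size by the parameter's own RMS, floored at eps2."""
    return step_size * max(eps2, rms(param))


def clip_update(update: Sequence[float], d: float = 1.0) -> List[float]:
    """Shrink the update so that its RMS does not exceed the threshold d."""
    scale = max(1.0, rms(update) / d)
    return [u / scale for u in update]


print(param_scaled_step_size(0.01, [0.0, 0.0]))  # floor applies for a zero parameter
print(param_scaled_step_size(0.01, [2.0, 2.0]))  # larger parameters take larger steps
print(rms(clip_update([3.0, 4.0], d=1.0)))       # clipped update has RMS at most d
```

Seen this way, scaling by the RMS of the parameter value and clipping by the RMS of the update are different mechanisms, so a line that does only the first is not necessarily a mistaken version of the second.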
I trained T5-base from scratch on a different dataset (Wikipedia) and used checkpoint #60,000 to fine-tune on a downstream Seq2Seq task. However, when I load the model from a local folder with the command model = T5ForConditionalGeneration.from_pretrained(model_name_or_path, local_files_only=True), I get the warning message below. Questions: Is that normal? Does it affect the performance of the final fine-tuned model?
WARNING:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at /home/.../checkpoint-pt-60001/ were not used when initializing T5ForConditionalGeneration: ['encoder.block.8.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.1.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.5.layer.1.DenseReluDense.wi_0.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.11.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.1.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.7.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.7.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.5.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.4.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.8.layer.1.DenseReluDense.wi_0.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.2.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.3.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.11.layer.1.DenseReluDense.wi_0.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.6.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.0.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_0.weight', 'encoder.block.6.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.3.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.9.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.0.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.10.layer.1.DenseReluDense.wi_0.weight', 
'decoder.block.1.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.2.layer.1.DenseReluDense.wi_0.weight', 'encoder.block.9.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.10.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_1.weight', 'encoder.block.4.layer.1.DenseReluDense.wi_1.weight', 'encoder.block.1.layer.1.DenseReluDense.wi_1.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_1.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at /home/.../checkpoint-pt-60001/ and are newly initialized because the shapes did not match:
Where can I find the script for converting the checkpoint to the Hugging Face format?
I am trying to add my own loss function that uses the encoder's hidden states, so I added a new linear layer, similar to your self.lm_head layer, to obtain the corresponding logits. However, the training process fails every time. It seems that I did not use the linear layer correctly, but I do not know why...
Here is the modified part of my MyT5 module:
class MyT5(nn.Module):
    def __init__(self, config: T5Config):
        ...
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        self.ln_encode = nn.Linear(config.d_model, 3, bias=False)

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        decoder_input_ids: Optional[torch.LongTensor] = None,
        decoder_attention_mask: Optional[torch.BoolTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        encoder_outputs = None,
        myencode_attention_mask: Optional[torch.BoolTensor] = None,
        seq_order: Optional[torch.LongTensor] = None,
    ) -> Seq2SeqLMOutput:
        if encoder_outputs is None:
            encoder_outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
            )
        hidden_states = encoder_outputs.hidden_states
        myencode_attention_mask_extended = myencode_attention_mask.unsqueeze(-1).expand_as(encoder_outputs[0])
        myencode_hidden_states = encoder_outputs[0][myencode_attention_mask_extended].view(-1, self.model_dim)
        myencode_loss_ft = CrossEntropyLoss(ignore_index=-100)
        ln_encode_logits = self.ln_encoder(myencode_hidden_states)
        seq_loss = seq_loss_ft(ln_encode_logits.view(-1, ln_encode_logits.size(-1)), seq_order[seq_order != -100].view(-1))
        ...
        loss += seq_loss
        return ...

    def _init_weights(self, module):
        factor = self.config.initializer_factor  # Used for testing weights initialization
        ...
        elif isinstance(module, (MyT5)):
            module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)
            if hasattr(module, "lm_head") and not self.config.tie_word_embeddings:
                module.lm_head.weight.data.normal_(mean=0.0, std=factor * 1.0)
                print("lm initialized")
            if hasattr(module, "ln_encode"):
                module.ln_encode.weight.data.normal_(mean=0.0, std=factor * 1.0)
        ...
And here is the error message:
Error executing job with overrides: []
Traceback (most recent call last):
File "/data2/usr/projects/nanoT5/nanoT5/main.py", line 85, in main
train(model, train_dataloader, test_dataloader, accelerator,
File "/data2/usr/projects/nanoT5/nanoT5/utils/train_utils.py", line 237, in train
loss, stats = forward(model, batch)
File "/data2/usr/projects/nanoT5/nanoT5/utils/train_utils.py", line 90, in forward
outputs = model(**batch)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 333, in _fn
return fn(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1521, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1357, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/data2/usr/projects/nanoT5/nanoT5/utils/t5_model.py", line 477, in forward
myencode_hidden_states = encoder_outputs[0][myencode_attention_mask_extended].view(-1, self.model_dim)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
return hijacked_callback(frame, cache_size, hooks, frame_state)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 637, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
return fn(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 371, in _convert_frame_assert
return _compile(
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 584, in _compile
raise InternalTorchDynamoError(str(e)).with_traceback(e.__traceback__) from None
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 567, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 181, in time_wrapper
r = func(*args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 466, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 433, in transform
tracer.run()
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2071, in run
super().run()
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1191, in LOAD_ATTR
result = BuiltinVariable(getattr).call_function(
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 608, in call_function
result = handler(tx, *args, **kwargs)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 1074, in call_getattr
return obj.var_getattr(tx, name).add_options(options)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/variables/nn_module.py", line 192, in var_getattr
subobj = inspect.getattr_static(base, name)
File "/home/usr/miniconda3/envs/nanoT5/lib/python3.10/inspect.py", line 1769, in getattr_static
raise AttributeError(attr)
torch._dynamo.exc.InternalTorchDynamoError: ln_encoder
from user code:
File "/data2/usr/projects/nanoT5/nanoT5/utils/t5_model.py", line 479, in <resume in forward>
ln_encode_logits = self.ln_encoder(myencode_hidden_states)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
Looking forward to your assistance :)
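One thing that stands out when the posted code is read next to the traceback: the constructor registers the layer as self.ln_encode, while forward() calls self.ln_encoder, which is exactly the name in the final InternalTorchDynamoError line. The same pattern appears with myencode_loss_ft being defined and seq_loss_ft being called. Whether these are the only problems cannot be confirmed from the traceback alone. The sketch below reproduces just the naming mismatch with a hypothetical stand-in class that needs no torch.

```python
class TinyModule:
    """Hypothetical stand-in for the posted MyT5 class; plain Python only."""

    def __init__(self):
        # The layer is stored under the name "ln_encode" ...
        self.ln_encode = lambda x: x

    def forward_as_posted(self, x):
        # ... but looked up as "ln_encoder" (extra trailing "r").
        return self.ln_encoder(x)

    def forward_matching(self, x):
        # Using the same name as in __init__ resolves the lookup.
        return self.ln_encode(x)


module = TinyModule()
try:
    module.forward_as_posted(1.0)
except AttributeError as err:
    print("lookup failed:", err)

print(module.forward_matching(1.0))
```

With torch.compile enabled, an ordinary AttributeError like this can surface wrapped in a Dynamo error, which makes it harder to spot; running once without compilation usually gives a shorter trace.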
Hello, thank you for creating nanoT5. I am trying to fine-tune a nanoT5 pre-trained model. After training the model for 2^16 steps with default settings, the script checkpoints and saves a few files: model.safetensors, optimizer.bin, random_states_0.pkl, and scheduler.bin. How do I create the pytorch_model.bin file for fine-tuning?
Hi Piotr,
Your work on the nanoT5 repository is amazing; we like it and want to reproduce it.
I have started studying NLP recently, so I have been learning how to run T5 the way you do. I successfully pre-trained the model but failed at fine-tuning.
I couldn't execute the fine-tuning Python command because the module adaptive.moe doesn't exist anywhere.
I also googled it and couldn't find the module, so could you help me figure it out?
Thanks a lot for your help!
fan
First of all, thank you for this rigorous work. As another low-budget researcher, I applaud you.
I was curious to see how this would perform on Apple Metal (MPS). However, even in pure CPU mode I can't get it to run on OSX.
Any tips?
(nanoT5) joseph@JosephsacStudio nanoT5 % python3 -m nanoT5.main device=cpu model.compile=False precision=no
[2023-07-27 22:23:37,823][Main][INFO] - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu
Mixed precision type: no
[2023-07-27 22:23:37,823][Main][INFO] - Working directory is /Users/joseph/dev/nanoT5/logs/2023-07-27/22-23-37-
loading configuration file config.json from cache at /Users/joseph/.cache/huggingface/hub/models--google--t5-v1_1-base/snapshots/b5fc947a416ea3cb079532cb3c2bbadeb7f800fc/config.json
Model config T5Config {
"_name_or_path": "google/t5-v1_1-base",
"architectures": [
"T5ForConditionalGeneration"
],
"d_ff": 2048,
"d_kv": 64,
"d_model": 768,
"decoder_start_token_id": 0,
"dense_act_fn": "gelu_new",
"dropout_rate": 0.1,
"eos_token_id": 1,
"feed_forward_proj": "gated-gelu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"is_gated_act": true,
"layer_norm_epsilon": 1e-06,
"model_type": "t5",
"num_decoder_layers": 12,
"num_heads": 12,
"num_layers": 12,
"output_past": true,
"pad_token_id": 0,
"relative_attention_max_distance": 128,
"relative_attention_num_buckets": 32,
"tie_word_embeddings": false,
"transformers_version": "4.31.0",
"use_cache": true,
"vocab_size": 32128
}
loading file spiece.model from cache at /Users/joseph/.cache/huggingface/hub/models--google--t5-v1_1-base/snapshots/b5fc947a416ea3cb079532cb3c2bbadeb7f800fc/spiece.model
loading file tokenizer.json from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /Users/joseph/.cache/huggingface/hub/models--google--t5-v1_1-base/snapshots/b5fc947a416ea3cb079532cb3c2bbadeb7f800fc/special_tokens_map.json
loading file tokenizer_config.json from cache at /Users/joseph/.cache/huggingface/hub/models--google--t5-v1_1-base/snapshots/b5fc947a416ea3cb079532cb3c2bbadeb7f800fc/tokenizer_config.json
You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Error executing job with overrides: ['device=cpu', 'model.compile=False', 'precision=no']
Traceback (most recent call last):
File "/Users/joseph/dev/nanoT5/nanoT5/main.py", line 65, in main
train(model, train_dataloader, test_dataloader, accelerator,
File "/Users/joseph/dev/nanoT5/nanoT5/utils/train_utils.py", line 186, in train
for batch_id, batch in enumerate(train_dataloader, start=1):
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/site-packages/accelerate/data_loader.py", line 550, in __iter__
main_iterator = super().__iter__()
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
return self._get_iterator()
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
w.start()
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/homebrew/anaconda3/envs/nanoT5/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'IterableDataset.map.<locals>.<lambda>'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
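The last frames of this traceback go through popen_spawn_posix.py, so the DataLoader worker is being started with the "spawn" method, which pickles the dataset before handing it to the new process. A lambda defined inside another function cannot be pickled, which matches the final AttributeError. The sketch below reproduces that behaviour with the standard library only; the helper names are hypothetical and it does not touch nanoT5 or datasets.

```python
import operator
import pickle
from functools import partial


def make_local_mapper():
    """Returns a lambda defined inside a function, like the one named in the error."""
    return lambda x: x * 2


def is_picklable(obj) -> bool:
    """True if the object survives pickle.dumps, as spawn-based workers require."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError, TypeError):
        return False


local_mapper = make_local_mapper()
named_mapper = partial(operator.mul, 2)  # same behaviour, built from importable parts

print(is_picklable(local_mapper))
print(is_picklable(named_mapper))
```

Since the lambda here lives inside the library's IterableDataset.map, one possible workaround, offered as an assumption rather than a tested fix, is to run the DataLoader without worker processes (num_workers=0 in torch.utils.data.DataLoader), which skips the pickling step entirely.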
Hello,
It is my first time using transformers, and I wanted to ask a few questions about how I can implement a custom transformer quickly and with minimal effort.
Do you think I should use your code as a starting point to create a custom model that :
Or should I just copy the T5 model code from Hugging Face and try to customize it from there using PyTorch? (I am already familiar with PyTorch, but not with transformers or Hugging Face, as I have only used CNNs.)
Any advice would be greatly appreciated :)
Hi:
Thank you for your good work!
I want to know whether nanoT5 supports resuming the training process from a saved checkpoint, including the model (pytorch_model.bin), optimizer state (optimizer.bin), and LR scheduler (scheduler.bin).
I would really appreciate it if you could give me a simple example of this.
Thanks a lot!
First off, I want to thank you for the amazing work on creating nanoT5! This repo has helped me continue pre-training CodeT5 on my own data corpus. Thanks a lot! I have some questions about the current masking implementation in nanoT5:
It seems that nanoT5 currently uses random span masking. If I want to use the whole word masking (WWM) trick, how could I implement it?
I think I should modify the DataCollatorForT5MLM class implementation, but I am not sure whether I am on the right track. Could you give me some hints on where to start?
Any insights or pointers you can provide would be much appreciated!
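As a starting point for thinking about whole word masking, the core idea can be sketched independently of DataCollatorForT5MLM. The sketch assumes SentencePiece-style pieces, where the first piece of each word starts with "▁" (U+2581). The function names, the 0.15 ratio, and the boolean-mask output are hypothetical; turning such a mask into T5 sentinel spans is left out.

```python
import random
from typing import List, Sequence, Tuple

WORD_START = "\u2581"  # SentencePiece prefix marking the first piece of a word


def word_spans(pieces: Sequence[str]) -> List[Tuple[int, int]]:
    """Group piece indices into [start, end) spans, one span per word."""
    spans: List[Tuple[int, int]] = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not spans:
            spans.append((i, i + 1))          # a new word begins here
        else:
            start, _ = spans[-1]
            spans[-1] = (start, i + 1)        # continuation of the previous word
    return spans


def whole_word_mask(
    pieces: Sequence[str], mask_ratio: float = 0.15, seed: int = 0
) -> List[bool]:
    """Mark whole words for masking until about mask_ratio of the pieces is covered."""
    rng = random.Random(seed)
    spans = word_spans(pieces)
    budget = max(1, round(mask_ratio * len(pieces)))
    mask = [False] * len(pieces)
    covered = 0
    for start, end in rng.sample(spans, len(spans)):  # visit words in random order
        if covered >= budget:
            break
        for i in range(start, end):
            mask[i] = True
        covered += end - start
    return mask


pieces = [WORD_START + "the", WORD_START + "jump", "ed", WORD_START + "over"]
print(word_spans(pieces))
print(whole_word_mask(pieces, mask_ratio=0.3))
```

The key property is that a word's pieces are always masked together, so "▁jump" and "ed" never end up on opposite sides of a mask boundary.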
I have encountered a problem, "Cannot call sizes() on tensor with symbolic sizes/strides", during pre-training. Whenever I try to pre-train nanoT5 using the google/t5-v1_1-small config, the program fails at step 30155 out of the total 32768 steps. I only modified the modules get_tokenizer and load_dataset_splits to load a customized dataset and tokenizer. The rest of the program remains unchanged except for an added wandb logger. However, when I resume training from checkpoint-30000, the problem does not occur (neither when I set args.current_train_step=1 nor when I set args.current_train_step=30000). Below is the detailed traceback:
/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
torch.has_cuda,
/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/overrides.py:112: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
torch.has_cudnn,
/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
torch.has_mps,
/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/overrides.py:119: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
torch.has_mkldnn,
Traceback (most recent call last):
File "/data/user/projects/nanoT5/nanoT5/main.py", line 89, in main
train(model, train_dataloader, test_dataloader, accelerator,
File "/data/user/projects/nanoT5/nanoT5/utils/train_utils.py", line 190, in train
loss, stats = forward(model, batch)
File "/data/user/projects/nanoT5/nanoT5/utils/train_utils.py", line 88, in forward
outputs = model(**batch)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 333, in _fn
return fn(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1521, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1357, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
return hijacked_callback(frame, cache_size, hooks, frame_state)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 637, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
return fn(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 371, in _convert_frame_assert
return _compile(
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 567, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 181, in time_wrapper
r = func(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 466, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 433, in transform
tracer.run()
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2071, in run
super().run()
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2159, in RETURN_VALUE
self.output.compile_subgraph(
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 853, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 953, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 181, in time_wrapper
r = func(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1020, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1005, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 436, in compile_fn
submod_compiler.run(*example_inputs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/fx/interpreter.py", line 138, in run
self.env[node] = self.run_node(node)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 430, in run_node
return curr_submod(*new_args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/fx/graph_module.py", line 678, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/fx/graph_module.py", line 284, in __call__
raise e
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/fx/graph_module.py", line 274, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.251", line 25, in forward
l__self___encoder_block_0_layer_0_self_attention_q = self.L__self___encoder_block_0_layer_0_SelfAttention_q(type_as)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/miniconda3/envs/nanoT5/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides
And this is my command that runs the code:
accelerate launch --multi_gpu --num_processes=2 -m nanoT5.main model.compile=true
Hi there, I'm about to pre-train a custom mT5 on my own dataset, but how do I do that? (I'm still confused about how to change the script for my dataset, and about what a sample of the dataset fed into the model should look like.)
Also, should I mix the texts from the different languages into one corpus, or pre-train on each language separately?
Thank you for the great work. I have a domain-specific dataset (large enough), and I have been trying to do pre-training / continued pre-training on it. Any help with that?
cc @PiotrNawrot
It looks like your lm_head weight init is the same as HF's, which has a known problem: it doesn't match the original Mesh TensorFlow weight init:
huggingface/transformers#26441
I believe that when training T5 with an untied lm_head, you would want to initialize the lm_head weights with std=hidden_dim**-0.5, so about std=0.036 for hidden_dim=768.
Currently the lm_head is initialized with std=1, which is 27.7x too much std, or 768x too much variance.
https://github.com/PiotrNawrot/nanoT5/blob/1c82d67bf8dea635be68a3b2a68a43b68b665193/nanoT5/utils/t5_model.py#L505C26-L505C26
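The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: `d_model = 768` is an assumption (the T5-base hidden size, implied by the 768x variance figure), and the variable names are made up for illustration rather than taken from the nanoT5 code.

```python
# Sanity check of the lm_head init numbers discussed above.
# d_model = 768 is an assumption (T5-base hidden size), inferred from the
# "768x too much variance" figure; it is not read from the nanoT5 config.
d_model = 768

target_std = d_model ** -0.5   # std proposed above for an untied lm_head
current_std = 1.0              # std the lm_head is reported to use now

std_ratio = current_std / target_std   # how many times too large the std is
var_ratio = std_ratio ** 2             # variance scales with std squared

print(f"target std     = {target_std:.4f}")   # 0.0361
print(f"std ratio      = {std_ratio:.1f}x")   # 27.7x
print(f"variance ratio = {var_ratio:.0f}x")   # 768x
```

If the proposed value is correct, applying it in PyTorch would presumably be a call along the lines of `torch.nn.init.normal_(lm_head.weight, mean=0.0, std=d_model ** -0.5)`, where `lm_head` stands for whatever the output projection is called in the model.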
We weren't able to find specific documentation on training a different model on a different dataset with your code. We want to train an mT5 model, which we have restricted to Hebrew and English, on a Hebrew Wikipedia dataset using your implementation of T5. Is there a way to do this with your help?
After seeing the excitement around TinyLlama, I want to pre-train some T5 models in a similar fashion. If you are able to achieve equivalent results in a fraction of the time on C4, it seems like throwing some modern datasets and more compute at it should yield even better results. Do you see any reason why this wouldn't be the case? Or does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?
First of all, great work! Thanks for sharing.
Just wondering if it is possible to train Flan-T5 from scratch. Any thoughts or ideas on this?
Thanks!