tag-llm's Issues
unusable augmented embedder: values are nan
Task: Language (domain_tag.yaml)
Model: llama7b (huggyllama)
Python 3.10 (same on 3.9)
GPU: RTX A6000 Ada
When training the model (I tried many training runs and different datasets), the embeddings always turn out NaN (not a number). The trained model then only outputs garbage: \x04 tokens (token id 7) instead of text, with a single \x02 token (token id 5) at the end. I'm using TagLLamaForCausalLM.from_pretrained to load the model, but I also see the NaNs directly in the exp/Language/[...]/embedding_weights.npy file.
Update:
Interestingly, this seems to happen gradually:
Loss step 0: tensor(9.0469, device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)
Loss step 1: tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)
When registering a backward hook on the augmented embedder, I can see that no input gradients reach it; I'm not sure whether that's what's supposed to happen.
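For reference, a loss that is finite at step 0 and NaN at step 1 under torch.float16, as in the log above, often points to an fp16 overflow in the backward pass. Below is a minimal sketch of the kind of hook-based check described above; the attribute path model.get_input_embeddings() is an assumption standing in for the actual TagLLM embedder module:
import numpy as np
import torch

def nan_hook(module, grad_input, grad_output):
    # Report NaNs in the gradients flowing through the embedder.
    for i, g in enumerate(grad_output):
        if g is not None and torch.isnan(g).any():
            print(f"NaN in grad_output[{i}] of {module.__class__.__name__}")
    if all(g is None for g in grad_input):
        print("no input gradients reach this module")

embedder = model.get_input_embeddings()  # assumed path to the augmented embedder
embedder.register_full_backward_hook(nan_hook)

# Direct check of the saved weights (path elided as in the report):
w = np.load("exp/Language/[...]/embedding_weights.npy")
print("NaN fraction:", np.isnan(w).mean())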
TypeError: compute_metrics() got an unexpected keyword argument 'special_tokens'
In metrics.py, get_compute_metrics_fn does a partial application with a keyword argument called special_tokens. However, the compute_metrics function has no such argument; it only has one called gist_token, and even that is never used: it is deleted on the first line of the function body.
def get_compute_metrics_fn(
special_tokens: int, tokenizer: PreTrainedTokenizer, args: Arguments
) -> Callable:
return functools.partial(
compute_metrics, special_tokens=special_tokens, tokenizer=tokenizer, args=args
)
def compute_metrics(
eval_preds: EvalPrediction,
gist_token: int,
tokenizer: PreTrainedTokenizer,
args: Arguments,
output_file: Optional[str] = None,
) -> Metrics:
del gist_token
...
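A minimal fix, assuming get_compute_metrics_fn is the only place the partial is built, is to pass the value under the keyword that compute_metrics actually declares (the value is discarded anyway):
def get_compute_metrics_fn(
    special_tokens: int, tokenizer: PreTrainedTokenizer, args: Arguments
) -> Callable:
    # compute_metrics declares gist_token, not special_tokens, so bind it
    # under that name; compute_metrics deletes it immediately anyway.
    return functools.partial(
        compute_metrics, gist_token=special_tokens, tokenizer=tokenizer, args=args
    )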
TypeError: first argument must be callable or None
I got the following error during training:
Error executing job with overrides: []
Traceback (most recent call last):
File "/Users/kevin/Tag-LLM/src/train.py", line 376, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2638, in training_step
inputs = self._prepare_inputs(inputs)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2583, in _prepare_inputs
inputs = self._prepare_input(inputs)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2565, in _prepare_input
return type(data)({k: self._prepare_input(v) for k, v in data.items()})
TypeError: first argument must be callable or None
The error is triggered in transformers/trainer.py when type(data) is used to reconstruct a new mapping of the same type after mapping its values through _prepare_input. Here data is a defaultdict, and the defaultdict constructor cannot be used this way:
def _prepare_input(self, data: Union[torch.Tensor, Any]) -> Union[torch.Tensor, Any]:
"""
Prepares one `data` before feeding it to the model, be it a tensor or a nested list/dictionary of tensors.
"""
if isinstance(data, Mapping):
return type(data)({k: self._prepare_input(v) for k, v in data.items()})
elif isinstance(data, (tuple, list)):
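The failure is easy to reproduce in isolation: defaultdict takes a default_factory callable as its first positional argument, so rebuilding one via type(data)(mapping) raises exactly this TypeError:
from collections import defaultdict

data = defaultdict(list, {"input_ids": [1, 2, 3]})
# type(data) is defaultdict, whose first positional argument must be callable:
type(data)({k: v for k, v in data.items()})
# TypeError: first argument must be callable or None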
A fix is to convert the defaultdict to a plain dict at the end of DataCollatorForTagLLM in collator.py:
...
if len(reg_idx) > 0:
model_inputs["reg_idx"] = reg_idx
model_inputs["reg_dim"] = reg_dim
model_inputs["reg_pred_idx"] = reg_pred_idx
if len(clf_idx) > 0:
model_inputs["clf_idx"] = clf_idx
return dict(model_inputs)
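After the conversion, type(data) inside _prepare_input is plain dict, whose constructor happily accepts a mapping, so the Trainer's reconstruction works unchanged.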
Typo in training file leads to import error
Line 28 in d68886e
This should be "AutoConfig"; the import fails otherwise.
Getting an error while running the train.py file
After installing Git LFS, installing the dependencies from requirements.txt, and downloading the datasets, I have been trying to run train.py from the src folder. Line 160 of train.py is:
train_dataset, eval_dataset, tag_name_dict, num_new_tokens, tags_to_update, domain_tags = get_dataset(task_name, num_existing_tokens, tag_name_dict, args.model.num_token_per_tag, args.model.use_domain_tag, args.model.use_function_tag, args.model.regression, freeze, True)
I am getting an error pointing at the line idx = tag_name_dict[tname].find(">"), which comes from line 358 of get_dataset.py:
if task_name == "BA":
existing_tokens = ["<Protein>", "<SMILES>"] if use_domain_tag else []
for tname in existing_tokens:
idx = tag_name_dict[tname].find(">")
domain_tags.append(int(tag_name_dict[tname][5:idx]))
tags_to_update = ["<BA>"]
for tname in tags_to_update:
tag_name_dict[tname] = "".join(["<TAG " + str(i) + ">" for i in range(num_existing_tokens, num_existing_tokens + num_token_per_tag)])
num_existing_tokens += num_token_per_tag
My doubt is: when I initialise tag_name_dict = {} during training of the domain tag, why are we afterwards still trying to look up the <Protein> and <SMILES> tags in that dictionary?
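One way to sidestep the lookup during stage-1 domain-tag training, sketched here as an assumption about the intended control flow rather than a confirmed upstream fix, is to guard the existing-token loop:
if task_name == "BA":
    existing_tokens = ["<Protein>", "<SMILES>"] if use_domain_tag else []
    for tname in existing_tokens:
        if tname not in tag_name_dict:
            # Stage-1 training starts from tag_name_dict = {}, so the
            # domain tags have not been registered yet; skip them.
            continue
        idx = tag_name_dict[tname].find(">")
        domain_tags.append(int(tag_name_dict[tname][5:idx]))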
saving checkpoints errors: AttributeError: 'str' object has no attribute 'numel'
Domain Task: Language
Model: llama7b (huggyllama)
Python 3.10 (same on 3.9)
GPU: RTX A6000 Ada
This happens on every checkpoint save and breaks the training.
I can only work around it by patching _save_checkpoint in transformers/trainer.py into a try/except block; doing so shows that the layer causing this is col_ampere, while all other layers save with no issue.
AttributeError: 'str' object has no attribute 'numel'
Traceback (most recent call last):
File "/workspace/Tag-LLM/src/train.py", line 378, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1979, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2240, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2297, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2775, in save_model
self._save(output_dir)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2827, in _save
self.model.save_pretrained(output_dir, state_dict=state_dict)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 1729, in save_pretrained
shards, index = shard_checkpoint(state_dict, max_shard_size=max_shard_size, weights_name=weights_name)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 302, in shard_checkpoint
weight_size = weight.numel() * dtype_byte_size(weight.dtype)
AttributeError: 'str' object has no attribute 'numel'
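A quick way to locate the offending entry, sketched under the assumption that the stray string lives in the state_dict handed to save_pretrained (which is what the shard_checkpoint frame suggests), is to scan for non-tensor values before saving:
import torch

for name, value in model.state_dict().items():
    if not isinstance(value, torch.Tensor):
        print(f"non-tensor state_dict entry: {name} -> {type(value).__name__}: {value!r}")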
Clarify Stage 2 & 3
Hi! Thanks for sharing your code. I'd like to evaluate Tag-LLM according to the paper. However, I cannot see how the code runs all 3 stages: looking at e.g. Translate, there is a Domain Training and a Cross-Domain Function Training, so it seems as if stage 2 is skipped.
Would I still train stage 2 or is supplying cross-domain data in stage 3 (autoregressive=False) achieving the same goal?
There are 3 config files, but only two of them cover e.g. the Translate/Language training.
Would love a quick clarification. Thanks!
Dataset Sources
From where can I download the dataset?
ValueError: Attention mask should be of size (1, 1, 1, 25), but is torch.Size([1, 1, 25, 25])
Training works fine but during evaluation, I got the following attention mask dimension mismatch:
[2024-03-15 08:24:03,106][src.trainer_seq2seq][INFO] - ***** Running Evaluation *****
[2024-03-15 08:24:03,107][src.trainer_seq2seq][INFO] - Num examples = 3
[2024-03-15 08:24:03,107][src.trainer_seq2seq][INFO] - Batch size = 1
eval step 0
[2024-03-15 08:24:03,110][src.trainer_seq2seq][WARNING] - Overwriting existing generation config due to DeepSpeed bug. If model is not LLAMA, check this.
Error executing job with overrides: []
Traceback (most recent call last):
File "/Users/kevin/Tag-LLM/src/train.py", line 376, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 282, in evaluate
output = eval_loop(
File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 417, in evaluation_loop
loss, logits, labels = self.prediction_step(
File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 158, in prediction_step
generated_tokens = self.model.generate(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/generation/utils.py", line 1406, in generate
return self.greedy_search(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2201, in greedy_search
outputs = self(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 738, in forward
outputs = self.model(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 627, in forward
layer_outputs = decoder_layer(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 312, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 228, in forward
raise ValueError(
ValueError: Attention mask should be of size (1, 1, 1, 25), but is torch.Size([1, 1, 25, 25])
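For context, an interpretation rather than a confirmed diagnosis: during cached greedy decoding only the newest token is fed through the model, so the attention layer expects a mask of shape (bsz, 1, q_len, kv_seq_len) = (1, 1, 1, 25), while the (1, 1, 25, 25) mask looks like the full square causal mask from the prefill step being reused unsliced. A minimal sketch of the slice that would reconcile the two shapes:
# Hypothetical reconciliation: keep only the mask rows for the current
# query tokens (one per greedy step) while retaining all key positions.
q_len = 1
attention_mask = attention_mask[:, :, -q_len:, :]  # (1, 1, 25, 25) -> (1, 1, 1, 25)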