tag-llm's Issues
unusable augmented embedder: values are nan
Task: Language (domain_tag.yaml)
Model: llama7b (huggyllama)
Python 3.10 (same on 3.9)
GPU: RTX A6000 Ada
When training the model (I tried many training runs and different datasets), the embeddings always turn out NaN (not a number). The trained model then only outputs garbage: \x04 tokens (token id 7) instead of text, with a single \x02 token (token id 5) at the end. I'm using TagLLamaForCausalLM.from_pretrained to load the model, but I also see the NaNs directly in the exp/Language/[...]/embedding_weights.npy file.
Update:
Interestingly, this seems to happen gradually:
Loss step 0: tensor(9.0469, device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)
Loss step 1: tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)
When registering a backward hook on the augmented embedder, I can see that no input gradients reach it; I'm not sure whether that's what's supposed to happen.
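For reference, a loss that is finite at step 0 and NaN at step 1 under torch.float16, as in the log above, often points to an fp16 overflow in the backward pass. Below is a minimal sketch of the kind of hook-based check described above; the attribute path model.get_input_embeddings() is an assumption standing in for the actual TagLLM embedder module:
import numpy as np
import torch

def nan_hook(module, grad_input, grad_output):
    # Report NaNs in the gradients flowing through the embedder.
    for i, g in enumerate(grad_output):
        if g is not None and torch.isnan(g).any():
            print(f"NaN in grad_output[{i}] of {module.__class__.__name__}")
    if all(g is None for g in grad_input):
        print("no input gradients reach this module")

embedder = model.get_input_embeddings()  # assumed path to the augmented embedder
embedder.register_full_backward_hook(nan_hook)

# Direct check of the saved weights (path elided as in the report):
w = np.load("exp/Language/[...]/embedding_weights.npy")
print("NaN fraction:", np.isnan(w).mean())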
TypeError: compute_metrics() got an unexpected keyword argument 'special_tokens'
In metrics.py, get_compute_metrics_fn does a partial application with a keyword argument called special_tokens. However, the compute_metrics function has no such argument; it only has one called gist_token, and even that is never used: it is deleted on the first line of the function body.
def get_compute_metrics_fn(
special_tokens: int, tokenizer: PreTrainedTokenizer, args: Arguments
) -> Callable:
return functools.partial(
compute_metrics, special_tokens=special_tokens, tokenizer=tokenizer, args=args
)
def compute_metrics(
eval_preds: EvalPrediction,
gist_token: int,
tokenizer: PreTrainedTokenizer,
args: Arguments,
output_file: Optional[str] = None,
) -> Metrics:
del gist_token
...
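A minimal fix, assuming get_compute_metrics_fn is the only place the partial is built, is to pass the value under the keyword that compute_metrics actually declares (the value is discarded anyway):
def get_compute_metrics_fn(
    special_tokens: int, tokenizer: PreTrainedTokenizer, args: Arguments
) -> Callable:
    # compute_metrics declares gist_token, not special_tokens, so bind it
    # under that name; compute_metrics deletes it immediately anyway.
    return functools.partial(
        compute_metrics, gist_token=special_tokens, tokenizer=tokenizer, args=args
    )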
TypeError: first argument must be callable or None
I got the following error during training:
Error executing job with overrides: []
Traceback (most recent call last):
File "/Users/kevin/Tag-LLM/src/train.py", line 376, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2638, in training_step
inputs = self._prepare_inputs(inputs)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2583, in _prepare_inputs
inputs = self._prepare_input(inputs)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2565, in _prepare_input
return type(data)({k: self._prepare_input(v) for k, v in data.items()})
TypeError: first argument must be callable or None
The error is triggered in transformers/trainer.py when type(data) is used to reconstruct a new mapping of the same type after mapping its values through _prepare_input. Here data is a defaultdict, and the defaultdict constructor cannot be used this way:
def _prepare_input(self, data: Union[torch.Tensor, Any]) -> Union[torch.Tensor, Any]:
"""
Prepares one `data` before feeding it to the model, be it a tensor or a nested list/dictionary of tensors.
"""
if isinstance(data, Mapping):
return type(data)({k: self._prepare_input(v) for k, v in data.items()})
elif isinstance(data, (tuple, list)):
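The failure is easy to reproduce in isolation: defaultdict takes a default_factory callable as its first positional argument, so rebuilding one via type(data)(mapping) raises exactly this TypeError:
from collections import defaultdict

data = defaultdict(list, {"input_ids": [1, 2, 3]})
# type(data) is defaultdict, whose first positional argument must be callable:
type(data)({k: v for k, v in data.items()})
# TypeError: first argument must be callable or None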
A fix is to convert the defaultdict to a plain dict at the end of DataCollatorForTagLLM in collator.py:
...
if len(reg_idx) > 0:
model_inputs["reg_idx"] = reg_idx
model_inputs["reg_dim"] = reg_dim
model_inputs["reg_pred_idx"] = reg_pred_idx
if len(clf_idx) > 0:
model_inputs["clf_idx"] = clf_idx
return dict(model_inputs)
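After the conversion, type(data) inside _prepare_input is plain dict, whose constructor happily accepts a mapping, so the Trainer's reconstruction works unchanged.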
Typo in training file leads to import error
Line 28 in d68886e
This should be "AutoConfig"; the import fails otherwise.
Getting an error while running the train.py file
After installing Git LFS, installing the dependencies from requirements.txt, and downloading the datasets, I have been trying to run train.py from the src folder. Line 160 of train.py is:
train_dataset, eval_dataset, tag_name_dict, num_new_tokens, tags_to_update, domain_tags = get_dataset(task_name, num_existing_tokens, tag_name_dict, args.model.num_token_per_tag, args.model.use_domain_tag, args.model.use_function_tag, args.model.regression, freeze, True)
I am getting an error pointing at the line idx = tag_name_dict[tname].find(">"), which comes from line 358 of get_dataset.py:
if task_name == "BA":
existing_tokens = ["<Protein>", "<SMILES>"] if use_domain_tag else []
for tname in existing_tokens:
idx = tag_name_dict[tname].find(">")
domain_tags.append(int(tag_name_dict[tname][5:idx]))
tags_to_update = ["<BA>"]
for tname in tags_to_update:
tag_name_dict[tname] = "".join(["<TAG " + str(i) + ">" for i in range(num_existing_tokens, num_existing_tokens + num_token_per_tag)])
num_existing_tokens += num_token_per_tag
My doubt is: when I initialise tag_name_dict = {} during training of the domain tag, why are we afterwards still trying to look up the <Protein> and <SMILES> tags in that dictionary?
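One way to sidestep the lookup during stage-1 domain-tag training, sketched here as an assumption about the intended control flow rather than a confirmed upstream fix, is to guard the existing-token loop:
if task_name == "BA":
    existing_tokens = ["<Protein>", "<SMILES>"] if use_domain_tag else []
    for tname in existing_tokens:
        if tname not in tag_name_dict:
            # Stage-1 training starts from tag_name_dict = {}, so the
            # domain tags have not been registered yet; skip them.
            continue
        idx = tag_name_dict[tname].find(">")
        domain_tags.append(int(tag_name_dict[tname][5:idx]))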
saving checkpoints errors: AttributeError: 'str' object has no attribute 'numel'
Domain Task: Language
Model: llama7b (huggyllama)
Python 3.10 (same on 3.9)
GPU: RTX A6000 Ada
This happens on every checkpoint save and breaks the training.
I can only work around it by patching _save_checkpoint in transformers/trainer.py into a try/except block; doing so shows that the layer causing this is col_ampere, while all other layers save with no issue.
AttributeError: 'str' object has no attribute 'numel'
Traceback (most recent call last):
File "/workspace/Tag-LLM/src/train.py", line 378, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1979, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2240, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2297, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2775, in save_model
self._save(output_dir)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2827, in _save
self.model.save_pretrained(output_dir, state_dict=state_dict)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 1729, in save_pretrained
shards, index = shard_checkpoint(state_dict, max_shard_size=max_shard_size, weights_name=weights_name)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 302, in shard_checkpoint
weight_size = weight.numel() * dtype_byte_size(weight.dtype)
AttributeError: 'str' object has no attribute 'numel'
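A quick way to locate the offending entry, sketched under the assumption that the stray string lives in the state_dict handed to save_pretrained (which is what the shard_checkpoint frame suggests), is to scan for non-tensor values before saving:
import torch

for name, value in model.state_dict().items():
    if not isinstance(value, torch.Tensor):
        print(f"non-tensor state_dict entry: {name} -> {type(value).__name__}: {value!r}")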
Clarify Stage 2 & 3
Hi! Thanks for sharing your code. I'd like to evaluate Tag-LLM according to the paper. However, I cannot see how the code runs all 3 stages: looking at e.g. Translate, there is a Domain Training and a Cross-Domain Function Training, so it seems as if stage 2 is skipped.
Would I still train stage 2 or is supplying cross-domain data in stage 3 (autoregressive=False) achieving the same goal?
There are 3 config files, but only two of them cover e.g. the Translate/Language training.
Would love a quick clarification. Thanks!
Dataset Sources
From where can I download the dataset?
ValueError: Attention mask should be of size (1, 1, 1, 25), but is torch.Size([1, 1, 25, 25])
Training works fine but during evaluation, I got the following attention mask dimension mismatch:
[2024-03-15 08:24:03,106][src.trainer_seq2seq][INFO] - ***** Running Evaluation *****
[2024-03-15 08:24:03,107][src.trainer_seq2seq][INFO] - Num examples = 3
[2024-03-15 08:24:03,107][src.trainer_seq2seq][INFO] - Batch size = 1
eval step 0
[2024-03-15 08:24:03,110][src.trainer_seq2seq][WARNING] - Overwriting existing generation config due to DeepSpeed bug. If model is not LLAMA, check this.
Error executing job with overrides: []
Traceback (most recent call last):
File "/Users/kevin/Tag-LLM/src/train.py", line 376, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 282, in evaluate
output = eval_loop(
File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 417, in evaluation_loop
loss, logits, labels = self.prediction_step(
File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 158, in prediction_step
generated_tokens = self.model.generate(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/generation/utils.py", line 1406, in generate
return self.greedy_search(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2201, in greedy_search
outputs = self(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 738, in forward
outputs = self.model(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 627, in forward
layer_outputs = decoder_layer(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 312, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 228, in forward
raise ValueError(
ValueError: Attention mask should be of size (1, 1, 1, 25), but is torch.Size([1, 1, 25, 25])
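For context, an interpretation rather than a confirmed diagnosis: during cached greedy decoding only the newest token is fed through the model, so the attention layer expects a mask of shape (bsz, 1, q_len, kv_seq_len) = (1, 1, 1, 25), while the (1, 1, 25, 25) mask looks like the full square causal mask from the prefill step being reused unsliced. A minimal sketch of the slice that would reconcile the two shapes:
# Hypothetical reconciliation: keep only the mask rows for the current
# query tokens (one per greedy step) while retaining all key positions.
q_len = 1
attention_mask = attention_mask[:, :, -q_len:, :]  # (1, 1, 25, 25) -> (1, 1, 1, 25)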