tag-llm's People

Contributors

sjunhongshen


Forkers

monoboard1

tag-llm's Issues

unusable augmented embedder: values are nan

Task: Language (domain_tag.yaml)
Model: llama7b (huggyllama)
Python 3.10 (same on 3.9)
GPU: RTX A6000 Ada

When training the model (I tried many training runs and different datasets), the embeddings always turn out NaN (not a number). The trained model only outputs garbage: \x04 tokens (token id 7) instead of text, with a single \x02 token (token id 5) at the end. I'm using TagLLamaForCausalLM.from_pretrained to load the model, but I also see the NaNs in the exp/Language/[...]/embedding_weights.npy file:

[screenshot: embedding_weights.npy values are NaN]
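A quick way to confirm the NaNs on disk (a sketch; the run directory is elided above, so the path here is a placeholder):

import numpy as np

# Substitute the actual run directory under exp/Language/.
weights = np.load("exp/Language/<run_dir>/embedding_weights.npy")
print(np.isnan(weights).any())  # True in my runs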

Update:
This is interesting: it seems to happen gradually.

Step 0 (initial weights) to 2
[screenshot: embedding weights from step 0 to step 2]

Loss step 0: tensor(9.0469, device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)
Loss step 1: tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)

When I register a backward hook on the augmented embedder, I can see that no input gradients reach it. I'm unsure whether that's what's supposed to happen.
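Roughly how I checked, as a sketch (augmented_embedder is a placeholder for the actual module attribute, and model is the TagLLamaForCausalLM loaded above):

import torch

def report_grads(module, grad_input, grad_output):
    # grad_input holds the gradients w.r.t. the module's inputs;
    # in my runs it only ever contains None entries.
    print("grad_input:", grad_input)
    print("grad_output has NaN:",
          any(torch.isnan(g).any().item() for g in grad_output if g is not None))

# Placeholder attribute name for the augmented embedding module.
model.augmented_embedder.register_full_backward_hook(report_grads)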

TypeError: compute_metrics() got an unexpected keyword argument 'special_tokens'

In metrics.py, get_compute_metrics_fn does a partial application with a keyword argument called special_tokens. However, compute_metrics has no such parameter; it only has one called gist_token, which is not even used and is deleted immediately on the first line.

def get_compute_metrics_fn(
    special_tokens: int, tokenizer: PreTrainedTokenizer, args: Arguments
) -> Callable:
    return functools.partial(
        compute_metrics, special_tokens=special_tokens, tokenizer=tokenizer, args=args
    )

def compute_metrics(
    eval_preds: EvalPrediction,
    gist_token: int,
    tokenizer: PreTrainedTokenizer,
    args: Arguments,
    output_file: Optional[str] = None,
) -> Metrics:
    del gist_token
    ...
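Assuming the keyword is simply stale, a minimal fix is to pass the value under the parameter name compute_metrics actually declares (or, equivalently, to rename gist_token to special_tokens):

    return functools.partial(
        compute_metrics, gist_token=special_tokens, tokenizer=tokenizer, args=args
    )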

TypeError: first argument must be callable or None

I got the following error during training:

Error executing job with overrides: []
Traceback (most recent call last):
  File "/Users/kevin/Tag-LLM/src/train.py", line 376, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2638, in training_step
    inputs = self._prepare_inputs(inputs)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2583, in _prepare_inputs
    inputs = self._prepare_input(inputs)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2565, in _prepare_input
    return type(data)({k: self._prepare_input(v) for k, v in data.items()})
TypeError: first argument must be callable or None

The error is triggered in transformers/trainer.py: type(data) is defaultdict for our inputs, so the trainer tries to reconstruct one by calling defaultdict with the mapped dictionary. However, defaultdict's first positional argument must be the default factory (a callable or None), so the constructor cannot be used this way:

def _prepare_input(self, data: Union[torch.Tensor, Any]) -> Union[torch.Tensor, Any]:
    """
    Prepares one `data` before feeding it to the model, be it a tensor or a nested list/dictionary of tensors.
    """
    if isinstance(data, Mapping):
        return type(data)({k: self._prepare_input(v) for k, v in data.items()})
    elif isinstance(data, (tuple, list)):

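For context, a minimal repro of the constructor misuse, independent of transformers:

from collections import defaultdict

d = defaultdict(list)
d["a"].append(1)

# Rebuilding "the same type" from a plain mapping is exactly what the
# trainer attempts; it fails because defaultdict's first positional
# argument must be the default factory.
defaultdict({k: v for k, v in d.items()})
# TypeError: first argument must be callable or None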
A fix is to convert the defaultdict to a plain dict at the end of DataCollatorForTagLLM in collator.py:

...
        if len(reg_idx) > 0:
            model_inputs["reg_idx"] = reg_idx
            model_inputs["reg_dim"] = reg_dim
            model_inputs["reg_pred_idx"] = reg_pred_idx
        if len(clf_idx) > 0:
            model_inputs["clf_idx"] = clf_idx

        return dict(model_inputs)
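Building model_inputs as a plain dict from the start would work just as well; the key point is that the trainer must never receive a defaultdict.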

Getting an error while running the train.py file

After installing Git LFS, downloading all the required dependencies from requirements.txt, and downloading the datasets, I have been trying to run train.py from the src folder. On line 160 of train.py, the code is:

train_dataset, eval_dataset, tag_name_dict, num_new_tokens, tags_to_update, domain_tags = get_dataset(task_name, num_existing_tokens, tag_name_dict, args.model.num_token_per_tag, args.model.use_domain_tag, args.model.use_function_tag, args.model.regression, freeze, True)

I am getting an error at idx = tag_name_dict[tname].find(">"), which comes from line 358 of get_dataset.py, where the code is:
if task_name == "BA":
    existing_tokens = ["<Protein>", "<SMILES>"] if use_domain_tag else []

    for tname in existing_tokens:
        idx = tag_name_dict[tname].find(">")
        domain_tags.append(int(tag_name_dict[tname][5:idx]))

    tags_to_update = ["<BA>"]
    for tname in tags_to_update:
        tag_name_dict[tname] = "".join(["<TAG " + str(i) + ">" for i in range(num_existing_tokens, num_existing_tokens + num_token_per_tag)])
        num_existing_tokens += num_token_per_tag

My doubt is: when I initialize tag_name_dict = {} during the training of the domain tag, why do we still try to find the <Protein> and <SMILES> entries in that particular dictionary?
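A minimal illustration of the failure, assuming the domain tags are the paper's binding-affinity tags <Protein> and <SMILES> (the exact names are my assumption):

# tag names assumed from the paper's binding-affinity setup
tag_name_dict = {}  # freshly initialized when training the domain tags

# get_dataset.py then looks the domain tags up before anything was inserted:
tag_name_dict["<Protein>"]  # KeyError: '<Protein>'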

saving checkpoints errors: AttributeError: 'str' object has no attribute 'numel'

Domain Task: Language
Model: llama7b (huggyllama)
Python 3.10 (same on 3.9)
GPU: RTX A6000 Ada

This happens on every save checkpoint and breaks the training.

I can only get around this by patching _save_checkpoint in transformers/trainer.py into a try/except block; doing so shows that the layer causing this is col_ampere, while all other layers save with no issue.

AttributeError: 'str' object has no attribute 'numel'

Traceback (most recent call last):
  File "/workspace/Tag-LLM/src/train.py", line 378, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1979, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2240, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2297, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2775, in save_model
    self._save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2827, in _save
    self.model.save_pretrained(output_dir, state_dict=state_dict)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 1729, in save_pretrained
    shards, index = shard_checkpoint(state_dict, max_shard_size=max_shard_size, weights_name=weights_name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 302, in shard_checkpoint
    weight_size = weight.numel() * dtype_byte_size(weight.dtype)
AttributeError: 'str' object has no attribute 'numel'
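A lighter-weight way to find the offending entry than patching the trainer (a sketch; model is the model from this run):

import torch

# shard_checkpoint assumes every state_dict value is a tensor with
# .numel(); any non-tensor value (here a str) crashes the save.
for key, value in model.state_dict().items():
    if not torch.is_tensor(value):
        print(key, type(value))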

Clarify Stage 2 & 3

Hi! Thanks for sharing your code. I'd like to evaluate Tag-LLM according to the paper, but I cannot see how the code runs the three stages: looking at e.g. Translate, there is a Domain Training and a Cross-Domain Function Training, so it seems as if stage 2 is skipped.

Would I still train stage 2, or does supplying cross-domain data in stage 3 (autoregressive=False) achieve the same goal?
There are three config files, but only two of them show e.g. the Translate/Language training.

Would love a quick clarification. Thanks!

ValueError: Attention mask should be of size (1, 1, 1, 25), but is torch.Size([1, 1, 25, 25])

Training works fine, but during evaluation I got the following attention-mask dimension mismatch:

[2024-03-15 08:24:03,106][src.trainer_seq2seq][INFO] - ***** Running Evaluation *****
[2024-03-15 08:24:03,107][src.trainer_seq2seq][INFO] -   Num examples = 3
[2024-03-15 08:24:03,107][src.trainer_seq2seq][INFO] -   Batch size = 1
eval step 0
[2024-03-15 08:24:03,110][src.trainer_seq2seq][WARNING] - Overwriting existing generation config due to DeepSpeed bug. If model is not LLAMA, check this.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/Users/kevin/Tag-LLM/src/train.py", line 376, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 1979, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/trainer.py", line 2236, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 282, in evaluate
    output = eval_loop(
  File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 417, in evaluation_loop
    loss, logits, labels = self.prediction_step(
  File "/Users/kevin/Tag-LLM/src/trainer_seq2seq.py", line 158, in prediction_step
    generated_tokens = self.model.generate(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/generation/utils.py", line 1406, in generate
    return self.greedy_search(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2201, in greedy_search
    outputs = self(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 738, in forward
    outputs = self.model(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 627, in forward
    layer_outputs = decoder_layer(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 312, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/Users/kevin/miniconda3/envs/tagllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/kevin/Tag-LLM/src/tag_llama.py", line 228, in forward
    raise ValueError(
ValueError: Attention mask should be of size (1, 1, 1, 25), but is torch.Size([1, 1, 25, 25])
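For context, this is the shape the check expects during cached greedy decoding (an illustrative sketch, not the repo's code): on the first forward pass the causal mask covers the whole prompt, but on later steps only the newest token is fed, so the query dimension should be 1.

import torch

bsz, kv_len = 1, 25

# First forward pass: full causal mask over the 25-token prompt.
prompt_mask = torch.ones(bsz, 1, kv_len, kv_len)  # (1, 1, 25, 25)

# Later cached steps: a single new query token attends over the cache,
# which is the (1, 1, 1, 25) shape the assertion expects.
step_mask = torch.ones(bsz, 1, 1, kv_len)         # (1, 1, 1, 25)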
