
stanford_alpaca's Introduction


Stanford Alpaca: An Instruction-following LLaMA Model


This is the repo for the Stanford Alpaca project, which aims to build and share an instruction-following LLaMA model. The repo contains:

  • the 52K data used for fine-tuning the model;
  • the code for generating the data;
  • the code for fine-tuning the model;
  • the code for recovering Alpaca-7B weights from the released weight diff.

Note: We thank the community for its feedback on Stanford Alpaca and for supporting our research. Our live demo is suspended until further notice.

Usage and License Notices: Alpaca is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The weight diff is also CC BY NC 4.0 (allowing only non-commercial use).

Overview

The current Alpaca model is fine-tuned from a 7B LLaMA model [1] on 52K instruction-following data generated by the techniques in the Self-Instruct [2] paper, with some modifications that we discuss in the next section. In a preliminary human evaluation, we found that the Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite [2].

Alpaca is still under development, and there are many limitations that have to be addressed. Importantly, we have not yet fine-tuned the Alpaca model to be safe and harmless. We thus encourage users to be cautious when interacting with Alpaca, and to report any concerning behavior to help improve the safety and ethical considerations of the model.

Our initial release contains the data generation procedure, dataset, and training recipe. We intend to release the model weights if we are given permission to do so by the creators of LLaMA. For now, we have chosen to host a live demo to help readers better understand the capabilities and limits of Alpaca, as well as to help us better evaluate Alpaca's performance with a broader audience.

Please read our release blog post for more details about the model, our discussion of the potential harm and limitations of Alpaca models, and our thought process for releasing a reproducible model.

[1]: LLaMA: Open and Efficient Foundation Language Models. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. https://arxiv.org/abs/2302.13971v1

[2]: Self-Instruct: Aligning Language Model with Self Generated Instructions. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. https://arxiv.org/abs/2212.10560

Data Release

alpaca_data.json contains the 52K instruction-following examples we used for fine-tuning the Alpaca model. This JSON file is a list of dictionaries; each dictionary contains the following fields:

  • instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
  • input: str, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
  • output: str, the answer to the instruction as generated by text-davinci-003.
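
For reference, here is a minimal sketch (ours, not part of the repo) for loading and inspecting the dataset with Python's standard json module:

import json

# Load the 52K instruction-following examples released with this repo.
with open("alpaca_data.json") as f:
    examples = json.load(f)

print(len(examples))                    # ~52K examples
print(sorted(examples[0].keys()))       # ['input', 'instruction', 'output']
# Fraction of examples with a non-empty input field (around 40%, per the note above).
print(sum(1 for ex in examples if ex["input"]) / len(examples))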

We used the following prompts for fine-tuning the Alpaca model:

  • for examples with a non-empty input field:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
  • for examples with an empty input field:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:

During inference (e.g., for the web demo), we use the user instruction with an empty input field (second option).
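
The two templates above can be expressed as a small helper; the sketch below is illustrative (the function name build_prompt is ours, not part of this repo):

def build_prompt(instruction: str, input: str = "") -> str:
    # Examples with a non-empty input field use the longer template.
    if input:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input}\n\n"
            "### Response:"
        )
    # Examples with an empty input field (and inference-time prompts) use the shorter template.
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:"
    )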

Data Generation Process

Running the code
  1. Set the environment variable OPENAI_API_KEY to your OpenAI API key.
  2. Install the dependencies with pip install -r requirements.txt.
  3. Run python -m generate_instruction generate_instruction_following_data to generate the data.
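
Put together, the three steps above look like this in a shell (substitute your own API key):

export OPENAI_API_KEY=<your_openai_api_key>
pip install -r requirements.txt
python -m generate_instruction generate_instruction_following_data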

We built on the data generation pipeline from self-instruct and made the following modifications:

  • We used text-davinci-003 to generate the instruction data instead of davinci.
  • We wrote a new prompt (prompt.txt) that explicitly gives the requirements of instruction generation to text-davinci-003. Note: there is a slight error in the prompt we used, and future users should incorporate the edit in #24.
  • We adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
  • We simplified the data generation pipeline by discarding the difference between classification and non-classification instructions.
  • We only generated a single instance for each instruction, instead of 2 to 3 instances as in [1].

This produced an instruction-following dataset with 52K examples at a much lower cost (less than $500). In a preliminary study, we also found our 52K generated data to be much more diverse than the data released by self-instruct. We plot the figure below (in the style of Figure 2 in the self-instruct paper) to demonstrate the diversity of our data. The inner circle of the plot represents the root verb of the instructions, and the outer circle represents the direct objects.
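
As a rough illustration (not the repo's parse analysis code), one can extract the root verb and its direct object from each instruction with spaCy as a stand-in parser:

import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def verb_object(instruction: str):
    # Return (root verb lemma, direct object lemma) for a single instruction.
    doc = nlp(instruction)
    root = next((tok for tok in doc if tok.dep_ == "ROOT"), None)
    if root is None:
        return None, None
    dobj = next((tok for tok in root.children if tok.dep_ == "dobj"), None)
    return root.lemma_, dobj.lemma_ if dobj is not None else None

print(verb_object("Summarize the following article"))  # e.g. ('summarize', 'article')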

Fine-tuning

We fine-tune our models using standard Hugging Face training code. We fine-tune LLaMA-7B and LLaMA-13B with the following hyperparameters:

Hyperparameter LLaMA-7B LLaMA-13B
Batch size 128 128
Learning rate 2e-5 1e-5
Epochs 3 5
Max length 512 512
Weight decay 0 0

To reproduce our fine-tuning runs for LLaMA, first install the requirements

pip install -r requirements.txt

Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode. We were able to reproduce a model of similar quality as the one we hosted in our demo with the following command using Python 3.10. Replace <your_random_port> with a port of your own, <your_path_to_hf_converted_llama_ckpt_and_tokenizer> with the path to your converted checkpoint and tokenizer (following instructions in the PR), and <your_output_dir> with where you want to store your outputs.

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True

The same script also works for OPT fine-tuning. Here's an example for fine-tuning OPT-6.7B

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path "facebook/opt-6.7b" \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
    --tf32 True

Note that the given training script is meant to be simple and easy to use, and is not particularly optimized. To run on more GPUs, you may prefer to turn down gradient_accumulation_steps to keep a global batch size of 128 (see the calculation below). Global batch size has not been tested for optimality.
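
For reference, the effective global batch size of the command above works out as:

global batch size = nproc_per_node x per_device_train_batch_size x gradient_accumulation_steps
  4 GPUs: 4 x 4 x 8 = 128   (the command above)
  8 GPUs: 8 x 4 x 4 = 128   (e.g., pass --gradient_accumulation_steps 4)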

Addressing OOM

Naively, fine-tuning a 7B model requires about 7 (billion parameters) x 4 (bytes per fp32 value) x 4 (weights, gradients, and the two Adam optimizer states) = 112 GB of VRAM. The commands given above enable parameter sharding, so no redundant model copy is stored on any GPU. If you'd like to further reduce the memory footprint, here are some options:

  • Turn on CPU offload for FSDP with --fsdp "full_shard auto_wrap offload". This saves VRAM at the cost of longer runtime.
  • In our experience, DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP with offload. Here's an example to use DeepSpeed stage-3 with 4 GPUs with both parameter and optimizer offload:
    pip install deepspeed
    torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
        --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
        --data_path ./alpaca_data.json \
        --bf16 True \
        --output_dir <your_output_dir> \
        --num_train_epochs 3 \
        --per_device_train_batch_size 4 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 2000 \
        --save_total_limit 1 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --deepspeed "./configs/default_offload_opt_param.json" \
        --tf32 True
    • The DeepSpeed library also provides some helpful functions to estimate memory usage.
  • LoRA fine-tunes low-rank slices of the query, key, and value embedding heads. This can reduce the total memory footprint from 112 GB to about 7 x 4 = 28 GB. We may release our re-implementation of this in the future, but for now the peft codebase can be a useful resource; a minimal sketch follows this list.
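
As a starting point, here is a minimal LoRA sketch built on the peft codebase; it is not our re-implementation, and the model path and LoRA hyperparameters below are illustrative assumptions:

import transformers
from peft import LoraConfig, TaskType, get_peft_model

# Load the converted LLaMA checkpoint (placeholder path).
model = transformers.AutoModelForCausalLM.from_pretrained(
    "<your_path_to_hf_converted_llama_ckpt_and_tokenizer>"
)

# Attach low-rank adapters to the query/key/value projections; r and alpha are assumptions.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights require gradients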

Recovering Alpaca Weights

The weight diff between Alpaca-7B and LLaMA-7B is located here. To recover the original Alpaca-7B weights, follow these steps:

1. Convert Meta's released weights into Hugging Face format by following this guide:
    https://huggingface.co/docs/transformers/main/model_doc/llama
2. Make sure you cloned the released weight diff into your local machine. The weight diff is located at:
    https://huggingface.co/tatsu-lab/alpaca-7b/tree/main
3. Run the recovery script with the correct paths, e.g.,
    python weight_diff.py recover --path_raw <path_to_step_1_dir> --path_diff <path_to_step_2_dir> --path_tuned <path_to_store_recovered_weights>

Once step 3 completes, you should have a directory with the recovered weights, from which you can load the model as follows:

import transformers
alpaca_model = transformers.AutoModelForCausalLM.from_pretrained("<path_to_store_recovered_weights>")
alpaca_tokenizer = transformers.AutoTokenizer.from_pretrained("<path_to_store_recovered_weights>")
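
For a quick sanity check, here is a minimal inference sketch using the recovered weights and the empty-input prompt template from the Data Release section; the example instruction and generation settings are illustrative, not the settings used in our demo:

# Build a prompt with the empty-input template and generate a response.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
)
inputs = alpaca_tokenizer(prompt, return_tensors="pt")
output_ids = alpaca_model.generate(**inputs, max_new_tokens=256)
print(alpaca_tokenizer.decode(output_ids[0], skip_special_tokens=True))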

Authors

All grad students below contributed equally, and the order is determined by random draw: Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, and Xuechen Li.

All advised by Tatsunori B. Hashimoto. Yann is also advised by Percy Liang and Xuechen is also advised by Carlos Guestrin.

Citation

Please cite this repo if you use its data or code.

@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}

Naturally, you should also cite the original LLaMA paper [1] and the Self-Instruct paper [2].

Acknowledgements

We thank Yizhong Wang for his help in explaining the data generation pipeline in Self-Instruct and providing the code for the parse analysis plot. We thank Yifan Mai for helpful support, and members of the Stanford NLP Group as well as the Center for Research on Foundation Models (CRFM) for their helpful feedback.

stanford_alpaca's People

Contributors

eltociear, lxuechen, rtaori, tiiiger, yanndubs


stanford_alpaca's Issues

Do you shift the output label?

From your training code, the output labels and input labels are the same. Where do you shift the output labels? Will this happen automatically inside the Trainer?

Public release of model weights

Congratulations on the fine-tune! We have observed some fantastic performance through the provided web interface.

AFAIK the original LLaMA model was released under the GNU GPL, so you should be able to distribute derivative work respecting this original license, correct? (Even if the original model weights have not officially been distributed to the public yet.)

Will you provide some sort of wait-list to notify us when the model weights are made available?

Interested in as much information as you may share on this. Again, congratulations, and thank you for your impressive work!

https://github.com/facebookresearch/llama/blob/main/LICENSE

Reduce reproduction cost 96%, from $600 to $24, by releasing the instruct dataset only

The blog post says $500 was spent producing the dataset.
The blog post also says $100 was spent on 3xA100 80GB for 3 hours.
The market rate for 4xA100 is around $8 per hour. (See vast.ai for example)

If the dataset is provided for fine-tuning, then Alpaca could be reproduced for just about $24, and we would not have to wait for Facebook's response regarding sharing of the pre-trained model.

How to inference after finetuning ?

Thanks for sharing the training code. I've finished a 3-epoch fine-tuning.
However, I can't find the inference code.
Would you please give some advice on it, or share the inference code?
Thanks again!

When can we support airgap installation?

Hi guys,
This one is awesome. When do you plan to support airgap installation? In other words, can the end user run it on their laptop or on any VM in a public cloud?

Fine-Tuning very slow (6h->24h??)

Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.

I am running the fine-tuning script on 4x A100-SXM4-80GB and currently getting a 24-hour ETA, which doesn't really line up with the reported "3 hours on 8 80GB A100s" mentioned at https://crfm.stanford.edu/2023/03/13/alpaca.html. Shouldn't it be around 6 hours, or even 12 hours considering that the script "is not particularly optimized"?

Is anyone else encountering this issue? And if this is expected, then what were the methods you used to optimize the fine-tuning process?

Running on CUDA 12.1, Torch 1.13, and the transformers fork of llama at the commit you mentioned.

Thanks.

Not quite understand the importance of this repo.

Hi, devs at Stanford. Today I tried out your project and ran the command to generate the data. After a while, it output a JSON file, regen.json. So I'm a little confused; forgive my ignorance, but I really don't know how to make something cool with this regen.json file. You know what I mean: I got a file, but what can I do with it? I'm guessing people might be able to create something similar to ChatGPT but weaker. Please enlighten me, thanks.

Generation problem after / before instruction fine-tuning

Environment: 6xA6000 48GB with Ubuntu 22.04, Pytorch 1.13.0

I ran into a generation problem after following your instructions to convert the LLaMA-7B weights using your attached script.

I simply used the following script to directly test generation after loading the converted LLaMA-7B model:

tokenizer.batch_decode(model.generate(**tokenizer('I want to ', return_tensors="pt")))

The output of above code is:

'I want to acoérницschutzirectorioieckťDEX threshold släktetolasĭüttpiel'

The problem happens both before and after following your README for instruction fine-tuning. (Note that the loss is decreasing over time during the fine-tuning stage, which seems OK.)

I have no problem running generation using the original code from LLaMA. May I know your generation script so that I can test what caused the problem? Thanks.

Training recipe??

The blog says the training recipe is also released in the code, but I cannot find it. Can you update the repo with the code used for training the model, along with the required dependencies/guide, etc., to help us do the same, maybe with bigger models?
Thanks for this awesome repo.

infer cost

Hi,

Can a consumer-level GPU run inference with the Alpaca-7B model?

why 52K?

Hello, thank you for open-sourcing your training details! I just tried your demo and found the responses surprisingly fluent.

I'm wondering if your decision to train on a 52K instruction dataset was influenced by some criteria. Is there a floor below which you found responses to be qualitatively inferior, or did going beyond 52K not yield better results?

No evaluation dataset was given for the trainer

Hi there, I just finished the fine-tuning process as introduced in train.py. However, I encountered one problem with trainer.evaluate().

{'loss': 0.3974, 'learning_rate': 3.5380966993958655e-11, 'epoch': 3.0}
{'loss': 0.4492, 'learning_rate': 0.0, 'epoch': 3.0}
{'train_runtime': 17758.138, 'train_samples_per_second': 8.785, 'train_steps_per_second': 0.069, 'train_loss': 0.7304400721402787, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1218/1218 [4:55:48<00:00, 14.57s/it]
Traceback (most recent call last):
  File "/home/codes/finetune_llama/alpaca/train.py", line 233, in <module>
    train()
  File "/home/codes/finetune_llama/alpaca/train.py", line 227, in train
    trainer.evaluate()
  File "/home/anaconda3/envs/hawq/lib/python3.9/site-packages/transformers/trainer.py", line 2920, in evaluate
    eval_dataloader = self.get_eval_dataloader(eval_dataset)
  File "/home/anaconda3/envs/hawq/lib/python3.9/site-packages/transformers/trainer.py", line 934, in get_eval_dataloader
    raise ValueError("Trainer: evaluation requires an eval_dataset.")
ValueError: Trainer: evaluation requires an eval_dataset.

Should I give an eval_dataset here?

Support for gpt-3.5-turbo

gpt-3.5-turbo is cheaper and faster than davinci. I'm not 100% sure whether it will actually work better for Alpaca but figure it may be worth a trial. Any interest in taking a PR?

Question about training precision

In the provided training command:

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True

Why is --bf16 used, if the model checkpoints were originally fp16? Is it simply overridden by the --tf32 flag later?

Bigger LLaMA models

Dear Stanford researchers, professors, and students (all geniuses), thank you for your amazing work!
Would the tuning code you released in this repo (and the dataset) be fit for fine-tuning larger LLaMA models like 13B/30B/65B?

How would the computational effort scale with such models?

No checkpoint and no eval_dataset

It seems there is no eval_dataset, and thus no checkpoint is stored?

(For privacy, I hid the absolute file path and replaced it with <path>.)

Traceback (most recent call last):
  File "<path>/stanford_alpaca/train.py", line 232, in <module>
    train()
  File "<path>/stanford_alpaca/train.py", line 226, in train
    trainer.evaluate()
  File "<path>/stanford_alpaca/transformers-68d640f7c368bcaaaecfc678f11908ebbd3d6176/src/transformers/trainer.py", line 2920, in evaluate
    eval_dataloader = self.get_eval_dataloader(eval_dataset)
  File "<path>/stanford_alpaca/transformers-68d640f7c368bcaaaecfc678f11908ebbd3d6176/src/transformers/trainer.py", line 934, in get_eval_dataloader
    raise ValueError("Trainer: evaluation requires an eval_dataset.")
ValueError: Trainer: evaluation requires an eval_dataset.

Example of Instruction-Tuning Training

Hello, thank you for open-sourcing this work. We are now interested in generating our own instructions to fine-tune the LLaMA model based on your documentation and approach. Could you please advise on any resources or references we can use? Also, is this code available on Hugging Face?

'type' object is not subscriptable

The exception can be fixed by replacing 'dict' with 'Dict':

from typing import Optional, Sequence, Union
...
def openai_completion(
    prompts: Union[str, Sequence[str], Sequence[dict[str, str]], dict[str, str]],

-->

from typing import Optional, Sequence, Union, Dict
...
def openai_completion(
    prompts: Union[str, Sequence[str], Sequence[Dict[str, str]], Dict[str, str]],

Reduce the length of your prompt.

prompt_batches: 0%| | 0/1 [00:00<?, ?it/s]WARNING:root:OpenAIError: This model's maximum context length is 4097 tokens, however you requested 4162 tokens (1090 in your prompt; 3072 for the completion). Please reduce your prompt; or completion length..

Inquiry: Inference Parameters used for Gradio Demo

As an independent researcher I'm interested in knowing what generation parameters are used in the Gradio web demo, such as temperature and repetition penalty. If you have used even more advanced samplers like Typical Sampling or Tail Free Sampling, I'd be interested to know that as well. From my brief testing it appears that some parameter or setting is hampering creativity; perhaps that is intentional for the demo?
Thanks in advance!

Questions on fine-tuning process

I have three questions regarding the fine-tuning process.

  1. How does the max length hyperparameter work? Does each training sample concatenate multiple examples until it reaches the max length, or does each training sample include only a single example padded to the max length?
  2. Is the cross-entropy loss applied to all tokens including the input tokens (instruction + input), just the output tokens (response), or a weighted sum?
  3. How is a user prompt processed at test time? Is it treated as an example with an empty input field?

Thank you in advance.

OOM issue

Can this fine-tuning script fit on an A10, which only has 24 GB of GPU memory? I am trying to fine-tune the model on 4 A10 GPUs using a batch size of 1, but I still get an OOM error.

Training code detail.

Thanks for sharing this project. I have been trying to train the larger model for an offline-first, free education assistant for poor students preparing for competitive exams. Sharing the training code, even if only in a PR, would be really helpful for me to fine-tune an education assistant.

Resuming from checkpoint

My first run of the trainer could not save the model because the evaluate() call fails. I have removed that method call and now would like to resume from the last checkpoint. However, I cannot seem to get that working. Is there some disparity between the model architecture and the checkpoint architecture? The change I made to accommodate checkpoint resumption and the error I get are shown below.

Change for checkpoint resumption

data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
transformers.logging.set_verbosity_info()
trainer.train()
# trainer.train("output/checkpoint-18000")
# trainer.evaluate()
trainer.save_state()
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)

Error stacktrace

Loading model from output/checkpoint-18000/.
Traceback (most recent call last):
  File "/home/ubuntu/alpaca/stanford_alpaca/train.py", line 246, in <module>
    train()
  File "/home/ubuntu/alpaca/stanford_alpaca/train.py", line 239, in train
    trainer.train("output/checkpoint-18000/")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1617, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2120, in _load_from_checkpoint
    load_result = load_sharded_checkpoint(model, resume_from_checkpoint, strict=is_sagemaker_mp_enabled())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 385, in load_sharded_checkpoint
    state_dict = torch.load(os.path.join(folder, shard_file), map_location="cpu")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_utils.py", line 169, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_utils.py", line 148, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122406 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122407 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122409 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 122408) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

CUDA out of memory

Hi

Great work! In the README, you mention that 4 A100 80G GPUs can train this model, but when I try 8 40G A100s, it hits a CUDA OOM error.

Finetuning using standard hugging face training code

In the README I saw that the model is fine-tuned using the standard Hugging Face setup. I tried it but am getting this error. Could someone help with loading the LLaMA weights using HF?

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Bitsy/llama-7b-hfcompatible-clean")

Error :

KeyError                                  Traceback (most recent call last)
in <module>
      1 from transformers import AutoModelForCausalLM
      2
----> 3 model = AutoModelForCausalLM.from_pretrained("Bitsy/llama-7b-hfcompatible-clean")

2 frames
/usr/local/lib/python3.9/dist-packages/transformers/models/auto/configuration_auto.py in __getitem__(self, key)
    577             return self._extra_content[key]
    578         if key not in self._mapping:
--> 579             raise KeyError(key)
    580         value = self._mapping[key]
    581         module_name = model_type_to_module_name(key)

KeyError: 'llama'

inference kwargs

Thanks for the great work. I reproduced the training, but at inference time it tends to generate shorter text. I am using:

generated = model.generate(batch["input_ids"], max_length=512)

Does the interface on the demo web page adjust other kwargs?
Thanks

Confusion about input ids

Hi, thanks for sharing such great work.
I've read your fine-tuning code and I'm a little confused about the inputs to the model.
From the code, the input to the model should be, for example: ### Instruction:{instruction}### Input:{input}### Response:{response}. So input_ids = tokenizer(example), label_ids = tokenizer(example), and label_ids[:source_len] = IGNORE_INDEX.
I would like to ask: why do the input ids contain the response token ids? Won't the target leak into the input?

I am looking forward to your reply. Thank you very much.
