what do you mean? you don't have to put a system prompt in
The idea is to mix instruct data into the pre-training data mix during the later stages of pre-training. Ideally, when constructing such a mix, the loss for instruct examples would be computed only on the completions.
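To illustrate the loss-on-completions point, a hedged sketch (field names follow the prompt/response convention used by llm-foundry's finetuning data; the text itself is made up):

```yaml
# One instruct example: loss would be computed on the response tokens only,
# while the prompt tokens are masked out of the loss.
prompt: "Translate to French: Good morning"
response: "Bonjour"
```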
Yes, you must use streaming (i.e., MDS-converted) data to use `streams`.
yup, https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/gpt2-arc-easy-cpu-streaming-dataset.yaml and https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_finetuning_dataset.py
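For reference, a minimal sketch of a streaming finetuning loader in the spirit of that example YAML (the local/remote paths are placeholders; the MDS data would come from convert_finetuning_dataset.py):

```yaml
train_loader:
  name: finetuning
  dataset:
    remote: s3://MY_BUCKET/ft_data   # placeholder: MDS output of convert_finetuning_dataset.py
    local: /tmp/ft_data              # placeholder: local cache for streamed shards
    split: train
    max_seq_len: ${max_seq_len}
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
```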
@dakinggg thank you for the answer! But I have one more question: can I use both pretraining (from convert_dataset_hf.py) and finetuning (from convert_finetuning_dataset.py) datasets in one config?
like this:

```yaml
train_loader:
  name: finetuning
  dataset:
    decoder_only_format: true
    max_seq_len: ${max_seq_len}
    shuffle: true
    streams:
      PRETRAIN_DATASET:  # from https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_dataset_hf.py
        local: /data/PRETRAIN_PATH
        remote: s3://PRETRAIN_PATH
        repeat: 1.0
        split: train
      FT_DATASET:  # from https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_finetuning_dataset.py
        local: /data/FT_PATH
        remote: s3://FT_PATH
        repeat: 1.0
        split: train
  drop_last: true
  num_workers: 8
```
Now I'm getting:

```
ValueError: In the dataset config, you must set either `hf_name` to use a
HuggingFace dataset or set `remote` to use a streaming dataset, but both were
None.
```
are you using the latest version of foundry? and no, you can't mix pretraining and finetuning data, although you can construct finetuning data that looks like pretraining data if you want.
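A hedged sketch of that second option (illustrative text; assuming an empty prompt is accepted): put the raw text in the response and leave the prompt empty, so the loss covers essentially the whole sequence:

```yaml
# Finetuning-format record that behaves like pretraining data:
prompt: ""                                    # empty prompt: nothing masked out of the loss
response: "Raw pre-training text goes here."  # loss over the full text
```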
Yes, I'm using the latest version. I haven't tried multiple streams of finetuning data yet; I get this error when I mix pretraining and FT data.
Thank you for the answers!
> are you using the latest version of foundry? and no, you can't mix pretraining and finetuning data, although you can construct finetuning data that looks like pretraining data if you want.
This will work, but with possible unwanted memorisation of low-variability system prompts from the instruct data.
Hm, maybe you're right; we'll try it like that.
> yup, https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/gpt2-arc-easy-cpu-streaming-dataset.yaml and https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_finetuning_dataset.py
Yes, but in that example I have to define local/remote paths to MDS-tokenized data produced by the conversion script. Can I use on-the-fly tokenization like this instead?
```yaml
train_loader:
  name: finetuning
  dataset:
    hf_name: parquet
    streams:
      FT_SET_1:
        hf_name: FT_SET_1  # assuming data files are json formatted
        hf_kwargs:
          data_dir: /data/FT_SET_1/data/
        split: train
        preprocessing_fn: preprocessing:preprocessing_ruorca
      FT_SET_2:
        hf_name: FT_SET_2  # assuming data files are json formatted
        hf_kwargs:
          data_dir: /data/FT_SET_2/data/
        split: train
        preprocessing_fn: preprocessing:preprocessing_ruorca
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    packing_ratio: 9
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 10
```
Or do HF datasets with on-the-fly tokenization not work with streams?
@dakinggg I guess I can't run it like this, and can only run it with data already tokenized by convert_finetuning_dataset.py?
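For comparison, a hedged sketch of the on-the-fly form that the stock finetuning YAMLs use: a single HF dataset configured at the top level of `dataset` (no `streams`), reusing the placeholder names from the config above:

```yaml
train_loader:
  name: finetuning
  dataset:
    hf_name: FT_SET_1                  # placeholder HF dataset name
    hf_kwargs:
      data_dir: /data/FT_SET_1/data/   # placeholder local data dir
    split: train
    preprocessing_fn: preprocessing:preprocessing_ruorca  # user-defined, as above
    max_seq_len: ${max_seq_len}
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
```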
Seems like this is resolved. Feel free to open a new issue if more questions.