
Comments (12)

dakinggg commented on June 10, 2024

What do you mean? You don't have to put a system prompt in.


AnatoliiPotapov commented on June 10, 2024

The idea is to mix instruct data into the pre-training data mix in the later stages of pre-training.
Ideally, such a mix would be constructed during training so that, when an example comes from the instruct data, the loss is computed only on the completions.
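
A minimal sketch of what computing loss only on completions could look like, assuming a Hugging Face-style tokenizer and a trainer whose cross-entropy loss ignores label -100; the function and field names are illustrative, not llm-foundry's actual API:

# Hypothetical sketch: mask prompt tokens so the loss covers only the completion.
# Assumes the downstream loss ignores positions where labels == -100.
import torch

def build_completion_only_example(tokenizer, prompt: str, response: str, max_seq_len: int):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_seq_len]
    labels = list(input_ids)
    n_prompt = min(len(prompt_ids), len(labels))
    labels[:n_prompt] = [-100] * n_prompt  # no loss on the prompt tokens
    return {"input_ids": torch.tensor(input_ids), "labels": torch.tensor(labels)}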


dakinggg commented on June 10, 2024

Yes, you must use streaming to do streams.
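
For context, "streaming" here refers to MosaicML's streaming library: each stream must point at data converted into MDS shards. A rough sketch of such a conversion, assuming a recent version of the streaming package; the paths and the single 'text' column are placeholders:

# Rough sketch: write samples into MDS shards so they can be referenced as a stream,
# then read them back locally or from remote storage. Paths are placeholders.
from streaming import MDSWriter, StreamingDataset

samples = [{"text": "first document ..."}, {"text": "second document ..."}]

with MDSWriter(out="/data/PRETRAIN_PATH", columns={"text": "str"}, compression="zstd") as writer:
    for sample in samples:
        writer.write(sample)

dataset = StreamingDataset(local="/data/PRETRAIN_PATH", shuffle=False)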


dakinggg commented on June 10, 2024

yup, https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/gpt2-arc-easy-cpu-streaming-dataset.yaml and https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_finetuning_dataset.py


germanjke commented on June 10, 2024

@dakinggg thank you for the answer! But I have one more question:

Can I use both pretraining (from convert_dataset_hf.py) and finetuning (from convert_finetuning_dataset.py) datasets in one config?

Like this:

train_loader:
  dataset:
    decoder_only_format: true
    max_seq_len: ${max_seq_len}
    shuffle: true
    streams:
      PRETRAIN_DATASET: # from https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_dataset_hf.py
        local: /data/PRETRAIN_PATH
        remote: s3://PRETRAIN_PATH
        repeat: 1.0
        split: train
      FT_DATASET: # from https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_finetuning_dataset.py
        local: /data/FT_PATH
        remote: s3://FT_PATH
        repeat: 1.0
        split: train
  drop_last: true
  name: finetuning
  num_workers: 8

Now I'm getting

ValueError: In the dataset config, you must set either `hf_name` to use a
HuggingFace dataset or set `remote` to use a streaming dataset, but both were
None.


dakinggg commented on June 10, 2024

Are you using the latest version of foundry? And no, you can't mix pretraining and finetuning data, although you can construct finetuning data that looks like pretraining data if you want.
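
One reading of "finetuning data that looks like pretraining data" is to emit prompt/response records with an empty prompt and the raw document as the response, so nearly every token contributes to the loss. A hypothetical sketch (verify the exact schema expected by convert_finetuning_dataset.py):

# Hypothetical sketch: wrap raw pretraining documents as prompt/response pairs
# so they can flow through the finetuning data pipeline.
import json

raw_documents = ["first raw document ...", "second raw document ..."]

with open("pretrain_as_ft.jsonl", "w") as f:
    for doc in raw_documents:
        # Empty prompt, whole document as the response.
        f.write(json.dumps({"prompt": "", "response": doc}) + "\n")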


germanjke commented on June 10, 2024

Yes, I'm using the latest version. I haven't tried multiple streams of finetuning data yet; I get this error when I mix pretraining and FT data.

Thank you for the answers!


AnatoliiPotapov commented on June 10, 2024

Are you using the latest version of foundry? And no, you can't mix pretraining and finetuning data, although you can construct finetuning data that looks like pretraining data if you want.

This will work, but with possible unwanted memorisation of low-variability system prompts from the instruct data.


AnatoliiPotapov commented on June 10, 2024

Hm, maybe you're right, we'll try it like that.


germanjke commented on June 10, 2024

yup, https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/gpt2-arc-easy-cpu-streaming-dataset.yaml and https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_finetuning_dataset.py

Yes, you gave me an example where I need to define local/remote for MDS tokenized data that I get from the tokenization script,

but can I use on-the-fly tokenization like this?

train_loader:
  name: finetuning
  dataset:
    hf_name: parquet
    streams:
        FT_SET_1:
            hf_name: FT_SET_1 # assuming data files are json formatted
            hf_kwargs:
                data_dir: /data/FT_SET_1/data/
            split: train
            preprocessing_fn: preprocessing:preprocessing_ruorca
        FT_SET_2:
            hf_name: FT_SET_2 # assuming data files are json formatted
            hf_kwargs:
                data_dir: /data/FT_SET_2/data/
            split: train
            preprocessing_fn: preprocessing:preprocessing_ruorca
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    packing_ratio: 9
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 10

Or do HF datasets with on-the-fly tokenization not work with streams?
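
For reference, a preprocessing_fn such as the preprocessing:preprocessing_ruorca referenced above is a Python callable that maps one raw example to a prompt/response dict. A hypothetical sketch; the raw field names ('system', 'question', 'answer') are assumptions about the source dataset, not llm-foundry requirements:

# preprocessing.py (hypothetical sketch)
# Importable via `preprocessing_fn: preprocessing:preprocessing_ruorca`.
# The raw field names below are assumed for illustration.
def preprocessing_ruorca(example: dict) -> dict:
    prompt = f"{example.get('system', '')}\n{example['question']}\n"
    response = example["answer"]
    return {"prompt": prompt, "response": response}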


germanjke commented on June 10, 2024

@dakinggg I guess I can't run it like this; I can only run it with data already tokenized with convert_finetuning_dataset.py?


dakinggg commented on June 10, 2024

Seems like this is resolved. Feel free to open a new issue if you have more questions.
