
Comments (12)

dakinggg commented on June 10, 2024

What do you mean? You don't have to put a system prompt in.


AnatoliiPotapov commented on June 10, 2024

The idea is to mix instruct data into the pre-training data mix in the later stages of pre-training.
Ideally, such a mix would be constructed during training so that, when an example comes from the instruct data, the loss is computed only on the completions.
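
A minimal sketch of what computing loss only on completions could look like, assuming a Hugging Face-style tokenizer and a trainer whose cross-entropy loss ignores label -100; the function and field names are illustrative, not llm-foundry's actual API:

# Hypothetical sketch: mask prompt tokens so the loss covers only the completion.
# Assumes the downstream loss ignores positions where labels == -100.
import torch

def build_completion_only_example(tokenizer, prompt: str, response: str, max_seq_len: int):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_seq_len]
    labels = list(input_ids)
    n_prompt = min(len(prompt_ids), len(labels))
    labels[:n_prompt] = [-100] * n_prompt  # no loss on the prompt tokens
    return {"input_ids": torch.tensor(input_ids), "labels": torch.tensor(labels)}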


dakinggg commented on June 10, 2024

Yes, you must use streaming to do streams.
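
For context, "streaming" here refers to MosaicML's streaming library: each stream must point at data converted into MDS shards. A rough sketch of such a conversion, assuming a recent version of the streaming package; the paths and the single 'text' column are placeholders:

# Rough sketch: write samples into MDS shards so they can be referenced as a stream,
# then read them back locally or from remote storage. Paths are placeholders.
from streaming import MDSWriter, StreamingDataset

samples = [{"text": "first document ..."}, {"text": "second document ..."}]

with MDSWriter(out="/data/PRETRAIN_PATH", columns={"text": "str"}, compression="zstd") as writer:
    for sample in samples:
        writer.write(sample)

dataset = StreamingDataset(local="/data/PRETRAIN_PATH", shuffle=False)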


dakinggg commented on June 10, 2024

yup, https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/gpt2-arc-easy-cpu-streaming-dataset.yaml and https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_finetuning_dataset.py


germanjke commented on June 10, 2024

@dakinggg thank you for the answer! But I have one more question:

Can I use both pretraining (from convert_dataset_hf.py) and finetuning (from convert_finetuning_dataset.py) datasets in one config?

Like this:

train_loader:
  dataset:
    decoder_only_format: true
    max_seq_len: ${max_seq_len}
    shuffle: true
    streams:
      PRETRAIN_DATASET: # from https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_dataset_hf.py
        local: /data/PRETRAIN_PATH
        remote: s3://PRETRAIN_PATH
        repeat: 1.0
        split: train
      FT_DATASET: # from https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_finetuning_dataset.py
        local: /data/FT_PATH
        remote: s3://FT_PATH
        repeat: 1.0
        split: train
  drop_last: true
  name: finetuning
  num_workers: 8

Now I'm getting

ValueError: In the dataset config, you must set either `hf_name` to use a
HuggingFace dataset or set `remote` to use a streaming dataset, but both were
None.


dakinggg commented on June 10, 2024

Are you using the latest version of foundry? And no, you can't mix pretraining and finetuning data, although you can construct finetuning data that looks like pretraining data if you want.
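
One reading of "finetuning data that looks like pretraining data" is to emit prompt/response records with an empty prompt and the raw document as the response, so nearly every token contributes to the loss. A hypothetical sketch (verify the exact schema expected by convert_finetuning_dataset.py):

# Hypothetical sketch: wrap raw pretraining documents as prompt/response pairs
# so they can flow through the finetuning data pipeline.
import json

raw_documents = ["first raw document ...", "second raw document ..."]

with open("pretrain_as_ft.jsonl", "w") as f:
    for doc in raw_documents:
        # Empty prompt, whole document as the response.
        f.write(json.dumps({"prompt": "", "response": doc}) + "\n")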


germanjke commented on June 10, 2024

Yes, I'm using the latest version. I haven't tried multiple streams of finetuning data yet; I get this error when I mix pretraining and FT data.

Thank you for the answers!


AnatoliiPotapov commented on June 10, 2024

Are you using the latest version of foundry? And no, you can't mix pretraining and finetuning data, although you can construct finetuning data that looks like pretraining data if you want.

This will work, but with possible unwanted memorisation of low-variability system prompts from the instruct data.


AnatoliiPotapov commented on June 10, 2024

Hm, maybe you're right, we'll try it like that.


germanjke commented on June 10, 2024

yup, https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/gpt2-arc-easy-cpu-streaming-dataset.yaml and https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_finetuning_dataset.py

Yes, you gave me an example where I need to define local/remote for MDS tokenized data that I get from the tokenization script,

but can I use on-the-fly tokenization like this?

train_loader:
  name: finetuning
  dataset:
    hf_name: parquet
    streams:
        FT_SET_1:
            hf_name: FT_SET_1 # assuming data files are json formatted
            hf_kwargs:
                data_dir: /data/FT_SET_1/data/
            split: train
            preprocessing_fn: preprocessing:preprocessing_ruorca
        FT_SET_2:
            hf_name: FT_SET_2 # assuming data files are json formatted
            hf_kwargs:
                data_dir: /data/FT_SET_2/data/
            split: train
            preprocessing_fn: preprocessing:preprocessing_ruorca
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    packing_ratio: 9
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 10

Or do HF datasets with on-the-fly tokenization not work with streams?
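
For reference, a preprocessing_fn such as the preprocessing:preprocessing_ruorca referenced above is a Python callable that maps one raw example to a prompt/response dict. A hypothetical sketch; the raw field names ('system', 'question', 'answer') are assumptions about the source dataset, not llm-foundry requirements:

# preprocessing.py (hypothetical sketch)
# Importable via `preprocessing_fn: preprocessing:preprocessing_ruorca`.
# The raw field names below are assumed for illustration.
def preprocessing_ruorca(example: dict) -> dict:
    prompt = f"{example.get('system', '')}\n{example['question']}\n"
    response = example["answer"]
    return {"prompt": prompt, "response": response}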


germanjke commented on June 10, 2024

@dakinggg I guess I can't run it like this; I can only run it with data already tokenized with convert_finetuning_dataset.py?


dakinggg commented on June 10, 2024

Seems like this is resolved. Feel free to open a new issue if you have more questions.
