
Comments (6)

ConnorJL commented on July 20, 2024

Hi there! There are scripts in the dataset directory that roughly show what needs to be done, but it's all very rough, sorry; I still haven't found the time to refactor everything. Basically, you need to convert your text files into tfrecords using the create_tfrecords.py script (you'll need to modify it by hand to pick up your text files), then place those in a Google Cloud Storage bucket. Finally, you need to modify the input function in inputs.py (or create a new one). You can see how it works by looking at the openwebtext function: you create a list of your train and eval file names and pass them, as shown there, to the bpe_text function, which returns a TF dataset that your training can use.
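As a rough sketch of the "list of train and eval file names" step, assuming your tfrecords are numbered shards in one bucket directory: the helper name `tfrecord_paths`, the bucket path, and the `mydata` prefix below are all hypothetical, and the call into bpe_text is left out on purpose since its exact arguments should be copied from the openwebtext function in inputs.py.

```python
def tfrecord_paths(bucket_dir, prefix, indices):
    """Build full paths for numbered tfrecords shards (hypothetical naming scheme)."""
    base = bucket_dir.rstrip("/")
    return [f"{base}/{prefix}_{i}.tfrecords" for i in indices]


# Hypothetical bucket and prefix: replace with your own.
BUCKET_DIR = "gs://my-bucket/my-dataset"

# Keep train and eval shards disjoint so evaluation uses unseen data.
train_files = tfrecord_paths(BUCKET_DIR, "mydata", range(0, 8))
eval_files = tfrecord_paths(BUCKET_DIR, "mydata", range(8, 10))
```

These two lists are what a new input function would hand to bpe_text, in the same way the openwebtext function does.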

Hope that helps.

from gpt2.

khaerulumam42 commented on July 20, 2024

Yes, I created a function for my dataset and it works, thanks. I will try to open a PR to make your code more flexible. I have some questions about your code:

  1. What does stitch mean in inputs.py? I tried changing the stitch value from 42 to 2 to solve my OutOfRange error, but the error still appears when the iterations reach 10000.
  2. I split my 2GB dataset into txt files of 10MB each, and some of the files cannot be converted to tfrecords. I haven't figured out why this happens. Are there any rules for making tfrecords data?

Thank you :)


ConnorJL commented on July 20, 2024

Thanks, I'm hoping to find some time this week to polish some code and write a better tutorial, we'll see. Looking forward to your PR!

  1. To train the model, you need to feed it chunks of text that are n_ctx+1 tokens long. Since most texts won't be that long, I concatenate multiple texts with "<|endoftext|>" between them to reach that length. Stitch determines how many such texts are loaded and concatenated before n_ctx+1 tokens are sliced out of the result. That means stitch must be set so that (minimum-length-of-your-texts * stitch) >= n_ctx+1. Setting that correctly should fix your OutOfRange error. Note that "length" is in BPE tokens, not Unicode characters.
  2. By default, the script throws out files that are composed only of zeros or are shorter than a certain length (25 BPE tokens by default, iirc). It should also throw out anything that raises an error during reading or during ftfy's fixing process. So I would expect that the text files are either too small or contain some kind of totally corrupt Unicode. You can see the encoding/writing process here.
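The stitch constraint from point 1 can be sketched in plain Python. This is only an illustration of the arithmetic and the concatenate-then-slice idea, not the code in inputs.py: the function names and the separator token id are hypothetical, and the sketch simply takes the first n_ctx+1 tokens, which may differ from how the repo picks its slice.

```python
def min_stitch(min_text_len, n_ctx):
    """Smallest stitch with min_text_len * stitch >= n_ctx + 1 (lengths in BPE tokens)."""
    if min_text_len <= 0:
        raise ValueError("min_text_len must be positive")
    # Integer ceiling division of (n_ctx + 1) by min_text_len.
    return -(-(n_ctx + 1) // min_text_len)


def stitch_and_slice(texts, n_ctx, eot_token):
    """Join tokenized texts with a separator token, then keep n_ctx + 1 tokens."""
    joined = []
    for i, tokens in enumerate(texts):
        if i > 0:
            joined.append(eot_token)  # stands in for "<|endoftext|>"
        joined.extend(tokens)
    if len(joined) < n_ctx + 1:
        # Too few tokens were stitched together for one training chunk.
        raise ValueError("not enough tokens; increase stitch")
    return joined[: n_ctx + 1]


# With texts of at least 25 tokens and n_ctx = 1024, at least 41 texts are needed.
needed = min_stitch(25, 1024)
```

The separator tokens add a little extra length, so the inequality above is slightly conservative.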

Hope that helps!


kbrajwani commented on July 20, 2024

Please make a sample Colab notebook on data preprocessing. I am also getting an OutOfRange error. I tried changing the dataset and the stitch values, but that didn't work for me.


ConnorJL commented on July 20, 2024

Hi @kbrajwani, I'm afraid I do not maintain this repo anymore. I would recommend using the Hugging Face transformers library instead. Good luck!


kbrajwani commented on July 20, 2024

Hey @ConnorJL, no problem. I tried to do that; see huggingface/transformers#6672. They have some issues with TPU.

