Wow nice repository, I also find GPT2 repo to train on TPU because I just got access g

format dataset about gpt2 HOT 6 OPEN

connorjl commented on July 20, 2024

format dataset

from gpt2.

Comments (6)

ConnorJL commented on July 20, 2024

Hi there! There are scripts in the dataset directory that roughly show what needs to be done, but it's all super rough, I'm sorry; I still haven't found the time to refactor everything. Basically, you need to convert your text files into tfrecords using the create_tfrecords.py script (you'll need to modify it by hand to pick up your text files). Then you place those in a Google Cloud Storage bucket. Finally, you need to modify the input function (or create a new one) in inputs.py. You can see how it works by looking at the openwebtext function. You just need to create a list of your train and eval file names and pass them, as shown there, to the bpe_text function, which returns a TF dataset that your training can use.

Hope that helps.
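
For readers following along, here is a minimal sketch of the "create a list of your train and eval file names" step. It is not the repo's code: the bucket name, the shard naming pattern, and the helpers `shard_paths` and `split_train_eval` are all hypothetical. The call to bpe_text is left out because its signature is not shown in this thread.

```python
def shard_paths(bucket, prefix, num_shards):
    """Build gs:// paths for tfrecords shards (naming pattern is an assumption)."""
    return [f"gs://{bucket}/{prefix}_{i}.tfrecords" for i in range(num_shards)]


def split_train_eval(paths, num_eval):
    """Hold out the last `num_eval` shards for evaluation."""
    if not 0 < num_eval < len(paths):
        raise ValueError("num_eval must be between 1 and len(paths) - 1")
    return paths[:-num_eval], paths[-num_eval:]


# Hypothetical bucket and prefix; replace with your own upload location.
all_paths = shard_paths("my-bucket", "mydata", 10)
train_files, eval_files = split_train_eval(all_paths, 2)
print(len(train_files), len(eval_files))
```

The two lists would then be handed to a new input function in inputs.py, modeled on the existing openwebtext function.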


khaerulumam42 commented on July 20, 2024

Yes, I created an input function for my dataset and it works, thanks. I will try to open a PR to make your code more flexible. I have some questions about your code:

What does stitch mean in inputs.py? I tried changing the stitch value from 42 to 2 to solve my OutOfRange error, but the error still appears when the iterations reach 10000.
I split my 2GB dataset into txt files of 10MB each, and some files cannot be converted to tfrecords data. I haven't figured out why this happens. Are there any rules for making tfrecords data?

Thank you :)


ConnorJL commented on July 20, 2024

Thanks, I'm hoping to find some time this week to polish some code and write a better tutorial, we'll see. Looking forward to your PR!

To train the model, you need to feed it chunks of text that are n_ctx+1 tokens long. Since most texts won't be that long, I concatenate multiple texts with "<|endoftext|>" between them to reach that length. Stitch determines how many such texts are loaded and concatenated before n_ctx+1 symbols are sliced out of the result. That means stitch must be set so that (minimum-length-of-your-texts * stitch) >= n_ctx+1. Setting that correctly should fix your OutOfRange error. Note that "length" is in BPE tokens, not Unicode symbols.
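
As an illustration of the stitch rule above, here is a minimal pure-Python sketch, not the repo's TensorFlow input code. The helper names `min_stitch` and `stitch_and_slice` are hypothetical, and picking the slice at a random offset is an assumption; the thread only says that n_ctx+1 symbols are sliced out.

```python
import random

END_OF_TEXT = 50256  # id of "<|endoftext|>" in the standard GPT-2 BPE vocabulary


def min_stitch(min_text_len, n_ctx):
    """Smallest stitch satisfying min_text_len * stitch >= n_ctx + 1."""
    if min_text_len <= 0:
        raise ValueError("min_text_len must be positive")
    return (n_ctx + min_text_len) // min_text_len  # integer ceiling division


def stitch_and_slice(texts, stitch, n_ctx, rng):
    """Join `stitch` tokenized texts and cut one chunk of n_ctx + 1 tokens."""
    if len(texts) < stitch:
        raise ValueError("not enough texts to stitch")
    joined = []
    for i, tokens in enumerate(texts[:stitch]):
        if i > 0:
            joined.append(END_OF_TEXT)  # separator goes between texts
        joined.extend(tokens)
    if len(joined) < n_ctx + 1:
        raise ValueError("stitched texts are shorter than n_ctx + 1")
    # Assumption: the chunk is taken at a random offset.
    start = rng.randint(0, len(joined) - (n_ctx + 1))
    return joined[start:start + n_ctx + 1]


# With texts of at least 25 tokens and n_ctx = 1024, stitch must be >= 41.
print(min_stitch(25, 1024))  # 41
```

Under these assumptions, lowering stitch makes the stitched sequence shorter, so a value that is too small fails as soon as a batch of short texts is drawn.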
By default, the script throws out files that are composed only of zeros or are shorter than a certain length (25 BPE tokens by default, iirc). It should also throw out anything that raises an error during reading or during ftfy's fixing process. So I would expect that the text files are either too small or contain some kind of totally corrupt Unicode. You can see the encoding/writing process here.

Hope that helps!
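
To check your own files against the filtering rules described in this comment, a rough sketch like the following may help. It is an approximation, not the script's actual logic. Assumptions: "zeros" is read as NUL bytes, files are UTF-8, the ftfy step is skipped, and `encode` is a hypothetical stand-in for whatever BPE encoder you use.

```python
MIN_TOKENS = 25  # assumption: the default minimum length recalled in the comment


def keep_text(raw_bytes, encode, min_tokens=MIN_TOKENS):
    """Return the token list if the text passes the filters, else None."""
    # Drop empty files and files made up only of NUL bytes.
    if not raw_bytes.strip(b"\x00"):
        return None
    try:
        text = raw_bytes.decode("utf-8")
        tokens = encode(text)
    except (UnicodeDecodeError, ValueError):
        # Unreadable or corrupt input is discarded rather than repaired.
        return None
    if len(tokens) < min_tokens:
        return None
    return tokens


# Whitespace splitting stands in for a real BPE encoder in this demo.
sample = " ".join(["word"] * 30).encode("utf-8")
print(keep_text(sample, str.split) is not None)
```

Running each 10MB file through a check like this should show whether it is being dropped for length or for encoding problems.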


kbrajwani commented on July 20, 2024

Please make a sample Colab notebook on data preprocessing; I am also getting the OutOfRange error. I tried changing the dataset and the stitch values, but it didn't work for me.


ConnorJL commented on July 20, 2024

Hi @kbrajwani, I'm afraid I do not maintain this repo anymore. I would recommend using the Hugging Face transformers library instead. Good luck!


kbrajwani commented on July 20, 2024

Hey @ConnorJL, no problem. I tried that; see huggingface/transformers#6672. They have some issues with TPU.
