Comments (6)
Hi there! There are scripts in the dataset directory that roughly show what needs to be done, but it's all very rough, sorry. I still haven't found the time to refactor everything. Basically, you need to convert your text files into tfrecords using the create_tfrecords.py script (you'll need to modify it by hand to pick up your text files). Then you place those in a Google Cloud Storage bucket. Finally, you need to modify the input function (or create a new one) in inputs.py. You can see how it works by looking at the openwebtext function. You just need to create a list of your train and eval file names and pass them, as shown there, to the bpe_text function, which returns a TF dataset that your training can use.
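As a rough illustration of the "list of train and eval file names" step, here is a minimal sketch in plain Python. The bucket name, the prefix, and the `train_N.tfrecords` / `eval_N.tfrecords` naming scheme are all hypothetical; use whatever names your modified create_tfrecords.py actually wrote. The helper `make_file_lists` is not part of the repo.

```python
def make_file_lists(bucket, prefix, n_train, n_eval):
    """Build lists of gs:// paths for train and eval tfrecords files.

    The file naming scheme here is an assumption for illustration only.
    """
    base = "gs://{}/{}".format(bucket, prefix)
    train_files = ["{}/train_{}.tfrecords".format(base, i) for i in range(n_train)]
    eval_files = ["{}/eval_{}.tfrecords".format(base, i) for i in range(n_eval)]
    return train_files, eval_files


# Hypothetical bucket and prefix names.
train_files, eval_files = make_file_lists("my-bucket", "my_dataset", 3, 1)
```

The two lists would then be handed to bpe_text in the same way the openwebtext function in inputs.py does it; the exact arguments are not reproduced here, so check that function for them.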
Hope that helps.
from gpt2.
Yes, I created a function for my dataset and it works, thanks. I will try to open a PR to make your code more flexible. I have some questions about your code:
- What does stitch mean in inputs.py? I tried changing the stitch value from 42 to 2 to solve my OutOfRange error, but the error still appears when the iterations reach 10000.
- I split my 2GB dataset into 10MB txt files, and some of the files cannot be converted to tfrecords. I haven't figured out why this happens. Are there any rules for making tfrecords data?
Thank you :)
Thanks, I'm hoping to find some time this week to polish some code and write a better tutorial, we'll see. Looking forward to your PR!
- To train the model, you need to feed it chunks of text that are n_ctx+1 tokens long. Since most texts won't be that long, I concatenate multiple texts with "<|endoftext|>" between them to reach that length. Stitch determines how many such texts are loaded and concatenated before n_ctx+1 symbols are sliced out of the result. That means stitch must be set so that (minimum length of your texts) * stitch >= n_ctx+1. Setting that correctly should fix your OutOfRange error. Note that "length" is in BPE tokens, not Unicode characters.
- By default, the script throws out files that are composed only of zeros or are shorter than a certain length (25 BPE tokens by default, iirc). It should also throw out anything that raises an error during reading or during ftfy's fixing process. So I would expect that the text files are either too small or contain some kind of totally corrupt Unicode. You can see the encoding/writing process here.
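The stitch rule from the first point can be sketched in plain Python. This is only an illustration under assumptions, not the code in inputs.py: `min_stitch` and `stitch_and_slice` are hypothetical helpers, texts are represented as lists of BPE token ids, and the slice simply takes the first n_ctx+1 tokens, which may differ from how the real input function picks its window.

```python
def min_stitch(min_text_len, n_ctx):
    """Smallest stitch such that min_text_len * stitch >= n_ctx + 1.

    Lengths are counted in BPE tokens.
    """
    if min_text_len <= 0:
        raise ValueError("min_text_len must be positive")
    # Integer ceiling division of (n_ctx + 1) by min_text_len.
    return (n_ctx + 1 + min_text_len - 1) // min_text_len


def stitch_and_slice(texts, stitch, n_ctx, eot_id):
    """Join the first `stitch` token lists with eot_id between them,
    then return the first n_ctx + 1 tokens (assumed slicing strategy)."""
    joined = []
    for i, tokens in enumerate(texts[:stitch]):
        if i > 0:
            joined.append(eot_id)  # separator standing in for "<|endoftext|>"
        joined.extend(tokens)
    if len(joined) < n_ctx + 1:
        # Analogue of running out of tokens when stitch is too small.
        raise IndexError("stitched text is shorter than n_ctx + 1 tokens")
    return joined[: n_ctx + 1]
```

With n_ctx = 1024 and a minimum text length of 25 tokens, `min_stitch(25, 1024)` gives 41, so a stitch of 2 is only safe if every text has at least 513 tokens.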
Hope that helps!
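To check which of your own files would be dropped, something like the following sketch may help. It assumes "composed only of zeros" means every token id is 0 and that the length threshold is 25 tokens as recalled above; `keep_document` and `encode_or_skip` are hypothetical helpers, and `encode` stands for whatever tokenizer function you pass in, not a specific function from the repo.

```python
def keep_document(tokens, min_len=25):
    """Return True if a tokenized document passes both checks."""
    if len(tokens) < min_len:
        return False  # too short
    if all(t == 0 for t in tokens):
        return False  # nothing but zeros (assumed interpretation)
    return True


def encode_or_skip(text, encode, min_len=25):
    """Encode text with the supplied `encode` callable.

    Returns the token list, or None if encoding fails or the
    document is filtered out.
    """
    try:
        tokens = encode(text)
    except Exception:
        return None  # unreadable or corrupt input is skipped
    return tokens if keep_document(tokens, min_len) else None
```

Running this over each 10MB file and logging the ones that return None should show whether they are too short or fail during encoding.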
Please make a sample Colab notebook for the data preprocessing. I am also getting the OutOfRange error. I tried changing the dataset and the stitch values, but it didn't work for me.
Hi @kbrajwani, I'm afraid I do not maintain this repo anymore. I would recommend using the Hugging Face transformers library instead. Good luck!
Hey @ConnorJL, no problem. I tried to do that, see huggingface/transformers#6672. They have some issues with TPU.