Comments (4)
Currently this repo is a bit of a mess, and fine-tuning is not as user-friendly as I'd like it to be. I hope to improve things at some point, but for now you'll have to tweak a lot of code by hand.
The bug you posted is strange; I've never seen it before. From what I can read, I'd guess either the CPU is too slow or you're running out of RAM.
from gpt2.
I'm trying to train a model following your guide. I created a new folder to save my model checkpoints, converted my dataset to a TFRecord file, and wrote a JSON config like this:
{
  "n_head": 16,
  "encoder_path": "encoder",
  "n_vocab": 50257,
  "embed_dropout": 0.1,
  "lr": 0.00025,
  "warmup_steps": 2000,
  "weight_decay": 0.01,
  "beta1": 0.9,
  "beta2": 0.98,
  "epsilon": 1e-9,
  "opt_name": "adam",
  "train_batch_size": 256,
  "attn_dropout": 0.1,
  "train_steps": 100,
  "eval_steps": 10,
  "max_steps": 500000,
  "data_path": "datasets/openwebtext/",
  "scale": 0.20412414523193154,
  "res_dropout": 0.1,
  "predict_batch_size": 8,
  "eval_batch_size": 8,
  "iterations": 500,
  "n_embd": 1024,
  "input": "openwebtext",
  "model": "GPT2",
  "model_path": "mymodel",
  "n_ctx": 1024,
  "predict_path": "mymodelprediction.txt",
  "n_layer": 24
}
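For reference, the out-of-memory warnings further down line up with these numbers: with train_batch_size 256, n_ctx 1024, and n_embd 1024, a single fp32 activation tensor for the 4*n_embd MLP hidden layer comes to exactly the 4294967296 bytes (4 GiB) the log later complains about. A back-of-the-envelope check (my own arithmetic, assuming the usual GPT-2 block sizes, not something from the repo):

```python
# Rough size of one fp32 MLP hidden activation implied by the config above.
train_batch_size = 256
n_ctx = 1024
n_embd = 1024
bytes_per_float = 4  # fp32

# The transformer MLP block widens to 4 * n_embd:
mlp_hidden_bytes = train_batch_size * n_ctx * 4 * n_embd * bytes_per_float
print(mlp_hidden_bytes)  # 4294967296 bytes = 4 GiB
```

Dropping train_batch_size (e.g. to 8 or 16) shrinks these tensors proportionally and is usually the first thing to try on a RAM-limited machine.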
I put my TFRecord data in datasets/openwebtext/ and changed the path in inputs.py:
files = [os.path.join(params["data_path"], "movie.tfrecords")]
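As a side note (my own observation, not something from the repo), os.path.join discards every preceding component when a later argument is absolute, so a leading slash on the filename would silently drop data_path:

```python
import os.path

params = {"data_path": "datasets/openwebtext/"}

# With a leading slash, the second argument is absolute and
# os.path.join discards data_path entirely:
print(os.path.join(params["data_path"], "/movie.tfrecords"))
# -> /movie.tfrecords

# Without the slash, the pieces are joined as intended:
print(os.path.join(params["data_path"], "movie.tfrecords"))
# -> datasets/openwebtext/movie.tfrecords
```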
When I started training, I got an error saying the optimizer failed:
Done calling model_fn.
Create CheckpointSaverHook.
Graph was finalized.
2019-06-14 15:52:29.207255: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-06-14 15:52:29.207564: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2ecd6c0 executing computations on platform Host. Devices:
2019-06-14 15:52:29.207636: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-06-14 15:52:43.704444: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.
Running local_init_op.
Done running local_init_op.
Saving checkpoints for 0 into mymodel/model.ckpt.
2019-06-14 15:55:08.456842: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x19407a000 @ 0x7fc3cad00b6b 0x7fc3cad20379 0x7fc3b5d3e437 0x7fc3b5cee4bf 0x7fc3b59f08d9 0x7fc3b59f86ff ... 0x502d6f 0x506859 0x502209
[the allocation warning and tcmalloc stack trace repeat several more times]
2019-06-14 16:00:04.472067: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:499] arithmetic_optimizer failed: Deadline exceeded: arithmetic_optimizer exceeded deadline., time = 368.891ms.
^C
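For what it's worth, the arithmetic_optimizer line at the end is Grappler (TensorFlow's graph optimizer) timing out on one rewrite pass; it's a warning, not the crash itself. If it's suspected of causing trouble, that single pass can be disabled through the session config. This is a sketch against the TF 1.x Estimator API the log suggests, not tested here:

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Turn off only Grappler's arithmetic-optimization pass (TF 1.x).
session_config = tf.compat.v1.ConfigProto()
session_config.graph_options.rewrite_options.arithmetic_optimization = (
    rewriter_config_pb2.RewriterConfig.OFF
)

# Hand the config to the Estimator via its RunConfig.
run_config = tf.estimator.RunConfig(session_config=session_config)
```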
If anyone has any idea what's going on, please help.
Running local_init_op.
Done running local_init_op.
Saving checkpoints for 0 into lol/model.ckpt.
2019-06-15 06:46:31.778847: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff ... 0x502d6f 0x506859 0x502209
[the allocation warning and tcmalloc stack trace repeat many more times]
tcmalloc: large alloc 17179869184 bytes == 0x394786000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d0588b2b 0x7f65d0556736 ... 0x7f65e419266f 0x7f65e52746db 0x7f65e55ad88f
^C
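Incidentally, that final 17179869184-byte (16 GiB) allocation also falls straight out of the config: it matches one fp32 attention-score tensor of shape [batch, n_head, n_ctx, n_ctx] with train_batch_size 256, n_head 16, and n_ctx 1024 (my own arithmetic, assuming the standard GPT-2 attention layout):

```python
# Size of one fp32 attention-score tensor [batch, n_head, n_ctx, n_ctx].
batch, n_head, n_ctx = 256, 16, 1024
bytes_per_float = 4  # fp32

attn_scores_bytes = batch * n_head * n_ctx * n_ctx * bytes_per_float
print(attn_scores_bytes)  # 17179869184 bytes = 16 GiB
```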
It looks the same today. I'm using Google Colaboratory.
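A standard Colab runtime has on the order of 12-13 GB of RAM (worth verifying on your own instance), so a single 16 GiB tensor can never fit. A quick stdlib check, assuming a Linux VM like Colab's:

```python
import os

# Total physical RAM on a Linux machine, via sysconf.
page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per page
num_pages = os.sysconf("SC_PHYS_PAGES")  # total physical pages
total_ram = page_size * num_pages
print(f"Total RAM: {total_ram / 2**30:.1f} GiB")

# The largest single allocation in the log above:
largest_alloc = 17179869184  # 16 GiB
print("Fits in RAM:", largest_alloc < total_ram)
```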
Thank you very much!