Comments (4)

ConnorJL commented on July 20, 2024

Currently this repo is a bit of a mess, and finetuning is not as user-friendly as I'd like it to be. I hope to improve things at some point, but for now you'll have to tweak a lot of code by hand.

The bug you posted is strange; I've never seen it before. From what I can read in it, I'd guess that either the CPU is too slow or you are running out of RAM.

from gpt2.

wjy979769265 commented on July 20, 2024

I'm trying to train a model by following your guide. I created a new folder to save my model checkpoints, converted my dataset to a tfrecords file, and created a JSON file like this:

{
  "n_head": 16,
  "encoder_path": "encoder",
  "n_vocab": 50257,
  "embed_dropout": 0.1,
  "lr": 0.00025,
  "warmup_steps": 2000,
  "weight_decay": 0.01,
  "beta1": 0.9,
  "beta2": 0.98,
  "epsilon": 1e-9,
  "opt_name": "adam",
  "train_batch_size": 256,
  "attn_dropout": 0.1,
  "train_steps": 100,
  "eval_steps": 10,
  "max_steps": 500000,
  "data_path": "datasets/openwebtext/",
  "scale": 0.20412414523193154,
  "res_dropout": 0.1,
  "predict_batch_size": 8,
  "eval_batch_size": 8,
  "iterations": 500,
  "n_embd": 1024,
  "input": "openwebtext",
  "model": "GPT2",
  "model_path": "mymodel",
  "n_ctx": 1024,
  "predict_path": "mymodelprediction.txt",
  "n_layer": 24
}

I put my tfrecords data in datasets/openwebtext/ and changed the path in inputs.py:
files = [os.path.join(params["data_path"], "/movie.tfrecords")]
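
One thing that may be worth checking in the line above: this is an observation about Python's os.path.join, not something confirmed in this thread. When a later argument starts with "/", os.path.join treats it as an absolute path and discards everything before it, so data_path may not be used at all. A minimal sketch, assuming a POSIX system such as a Colab VM and reusing the data_path value from the config (the variable names as_posted and relative are only for illustration):

```python
import os

# data_path copied from the posted config.
params = {"data_path": "datasets/openwebtext/"}

# As posted: the second argument starts with "/", so os.path.join treats it
# as absolute and drops the data_path component entirely.
as_posted = os.path.join(params["data_path"], "/movie.tfrecords")

# Without the leading slash, the file name is appended to data_path.
relative = os.path.join(params["data_path"], "movie.tfrecords")

print(as_posted)
print(relative)
```

If the first path is not where the file actually lives, dropping the leading slash would point the input pipeline back at datasets/openwebtext/. Whether this is related to the memory warnings in the log is not clear from the output.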

When I started to train the model, I got the error below; the optimizer failed.
Done calling model_fn.
I0614 15:52:23.765870 140478902515584 estimator.py:1147] Done calling model_fn.
Create CheckpointSaverHook.
I0614 15:52:23.768234 140478902515584 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
Graph was finalized.
I0614 15:52:29.201557 140478902515584 monitored_session.py:240] Graph was finalized.
2019-06-14 15:52:29.207255: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-06-14 15:52:29.207564: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2ecd6c0 executing computations on platform Host. Devices:
2019-06-14 15:52:29.207636: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-06-14 15:52:43.704444: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Running local_init_op.
I0614 15:52:57.521352 140478902515584 session_manager.py:500] Running local_init_op.
Done running local_init_op.
I0614 15:52:57.879982 140478902515584 session_manager.py:502] Done running local_init_op.
Saving checkpoints for 0 into mymodel/model.ckpt.
I0614 15:53:11.620566 140478902515584 basic_session_run_hooks.py:606] Saving checkpoints for 0 into mymodel/model.ckpt.
2019-06-14 15:55:08.456842: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
2019-06-14 15:55:12.012810: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x19407a000 @ 0x7fc3cad00b6b 0x7fc3cad20379 0x7fc3b5d3e437 0x7fc3b5cee4bf 0x7fc3b59f08d9 0x7fc3b59f86ff 0x7fc3bccbbbe2 0x7fc3bccbda3e 0x7fc3bccbdc37 0x7fc3bccb6375 0x7fc3bcc5a2e1 0x7fc3bcc5b495 0x7fc3bcb57cec 0x7fc3bcb58ec5 0x7fc3bcb5a880 0x7fc3bcb5d18f 0x7fc3bcb4ec89 0x7fc3bcb50d1b 0x7fc3ba6f573a 0x7fc3ba6f6dc4 0x7fc3ba6f8f41 0x7fc3ba6fa768 0x7fc3b86525fd 0x7fc3ba731e2d 0x7fc3ba732b0c 0x7fc3b864f634 0x7fc3b864f6f2 0x7fc3b860679e 0x502d6f 0x506859 0x502209
2019-06-14 15:55:13.181140: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x19407a000 @ 0x7fc3cad00b6b 0x7fc3cad20379 0x7fc3b5d3e437 0x7fc3b5cee4bf 0x7fc3b59f08d9 0x7fc3b59f86ff 0x7fc3bccbbbe2 0x7fc3bccbda3e 0x7fc3bccbdc37 0x7fc3bccb6375 0x7fc3bcc5a2e1 0x7fc3bcc5b495 0x7fc3bcb57cec 0x7fc3bcb58ec5 0x7fc3bcb5a880 0x7fc3bcb5d18f 0x7fc3bcb4ec89 0x7fc3bcb50d1b 0x7fc3ba6f573a 0x7fc3ba6f6dc4 0x7fc3ba6f8f41 0x7fc3ba6fa768 0x7fc3b86525fd 0x7fc3ba731e2d 0x7fc3ba732b0c 0x7fc3b864f634 0x7fc3b864f6f2 0x7fc3b860679e 0x502d6f 0x506859 0x502209
2019-06-14 15:55:14.330157: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x19407a000 @ 0x7fc3cad00b6b 0x7fc3cad20379 0x7fc3b5d3e437 0x7fc3b5cee4bf 0x7fc3b59f08d9 0x7fc3b59f86ff 0x7fc3bccbbbe2 0x7fc3bccbda3e 0x7fc3bccbdc37 0x7fc3bccb6375 0x7fc3bcc5a2e1 0x7fc3bcc5b495 0x7fc3bcb57cec 0x7fc3bcb58ec5 0x7fc3bcb5a880 0x7fc3bcb5d18f 0x7fc3bcb4ec89 0x7fc3bcb50d1b 0x7fc3ba6f573a 0x7fc3ba6f6dc4 0x7fc3ba6f8f41 0x7fc3ba6fa768 0x7fc3b86525fd 0x7fc3ba731e2d 0x7fc3ba732b0c 0x7fc3b864f634 0x7fc3b864f6f2 0x7fc3b860679e 0x502d6f 0x506859 0x502209
2019-06-14 15:55:15.507901: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
....
2019-06-14 16:00:04.472067: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:499] arithmetic_optimizer failed: Deadline exceeded: arithmetic_optimizer exceeded deadline., time = 368.891ms.
^C

If anyone has an idea, please help me.

wjy979769265 commented on July 20, 2024

Running local_init_op.
I0615 06:44:22.530402 140075623929728 session_manager.py:500] Running local_init_op.
Done running local_init_op.
I0615 06:44:22.885511 140075623929728 session_manager.py:502] Done running local_init_op.
Saving checkpoints for 0 into lol/model.ckpt.
I0615 06:44:36.133187 140075623929728 basic_session_run_hooks.py:606] Saving checkpoints for 0 into lol/model.ckpt.
2019-06-15 06:46:31.778847: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
2019-06-15 06:46:35.063397: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
2019-06-15 06:46:36.083030: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
2019-06-15 06:46:37.083019: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
2019-06-15 06:46:38.082609: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 17179869184 bytes == 0x394786000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d0588b2b 0x7f65d0556736 0x7f65d0556c27 0x7f65d0556cf8 0x7f65d7251cab 0x7f65d7255db7 0x7f65d07f54bb 0x7f65d07e7995 0x7f65d089ee99 0x7f65d089bd78 0x7f65e419266f 0x7f65e52746db 0x7f65e55ad88f
^C

It looks the same today. I'm using Google Colaboratory.
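
For what it's worth, the two allocation sizes in these logs can be reproduced from the posted config with simple float32 arithmetic. The sketch below is only a guess at which tensors they might correspond to: the log does not say what was being allocated, and the feed-forward width of 4 * n_embd and the helper tensor_bytes are assumptions for illustration, not taken from the repo's code.

```python
# Values copied from the posted JSON config.
train_batch_size = 256
n_ctx = 1024
n_embd = 1024
n_head = 16

BYTES_PER_FLOAT32 = 4


def tensor_bytes(*dims):
    """Size in bytes of a dense float32 tensor with the given dimensions."""
    size = BYTES_PER_FLOAT32
    for dim in dims:
        size *= dim
    return size


# Assumption: a feed-forward hidden activation of width 4 * n_embd,
# as in the usual GPT-2 layout.
mlp_hidden = tensor_bytes(train_batch_size, n_ctx, 4 * n_embd)

# Assumption: attention scores, one n_ctx x n_ctx matrix per head per example.
attn_scores = tensor_bytes(train_batch_size, n_head, n_ctx, n_ctx)

print(mlp_hidden)   # 4294967296, the size in the repeated warnings
print(attn_scores)  # 17179869184, the size in the last tcmalloc line
```

Both numbers scale linearly with train_batch_size, so if the machine has less memory than a single 16 GiB buffer, a much smaller train_batch_size (the config uses 8 for eval and predict) might be a reasonable first thing to try. This fits the out-of-RAM guess above but does not prove it.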

wjy979769265 commented on July 20, 2024

Currently this repo is a bit of a mess, and finetuning is not as user-friendly as I'd like it to be. I hope to improve things at some point, but for now you'll have to tweak a lot of code by hand.

The bug you posted is strange; I've never seen it before. From what I can read in it, I'd guess that either the CPU is too slow or you are running out of RAM.

Thank you very much!
