Comments (4)
How did you solve this problem in the end? Thanks.
from dkn.
Hi! This is kind of weird because the default batch size is not that large. Reducing the batch size might help.
from dkn.
Thank you for your reply.
I tried setting batch_size to 64 and even 32, but I still get the error.
I found that the problem comes from this code in the train() function of train.py:
# evaluation
**train_auc = model.eval(sess, get_feed_dict(model, train_data, 0, train_data.size))**
It loads the entire train_data set into a single feed_dict.
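That matches the OOM message quoted later in this thread: the failing tensor's first dimension, 442410, equals 14747 training examples × 30 (30 presumably being a per-example word dimension), i.e. the whole training set goes through the conv layer as one batch. A minimal sketch of batched evaluation, where eval_in_batches is a hypothetical helper and model.eval(sess, feed_dict) is assumed to return the AUC for whatever examples it is fed:

def eval_in_batches(model, sess, data, batch_size):
    # Hypothetical helper: evaluate in slices instead of one giant feed_dict.
    aucs = []
    for start in range(0, data.size, batch_size):
        end = min(start + batch_size, data.size)
        aucs.append(model.eval(sess, get_feed_dict(model, data, start, end)))
    # Mean of per-batch AUCs only approximates the AUC over the full set.
    return sum(aucs) / len(aucs)

train_auc = eval_in_batches(model, sess, train_data, args.batch_size)

An exact alternative is to collect per-batch scores and labels across all batches and compute the AUC once over the concatenated arrays.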
In addition, when I use nvidia-smi to see how the GPU memory gets exhausted, right after running this code:
def train(args, train_data, test_data):
    model = DKN(args)
    with tf.Session() as sess:
        ...
my GPUs have already used almost all of their memory, as shown below:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17       Driver Version: 419.17       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  WDDM | 00000000:05:00.0  On |                  N/A |
|  0%   51C    P8    16W / 275W |   9863MiB / 11264MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  WDDM | 00000000:09:00.0  On |                  N/A |
|  0%   51C    P2    64W / 275W |   9429MiB / 11264MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
How can I solve this problem, please?
train_data size: 14747
test_data size: 408
word_embs size: 401650
entity_embs size: 91000
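As an aside, both GPUs showing ~9.5 GiB in use immediately after tf.Session() is created is TensorFlow 1.x's default behavior: the process maps nearly all memory on every visible GPU up front, so this by itself is not the OOM. The standard way to avoid the up-front mapping (tried in the next comment) is:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
with tf.Session(config=config) as sess:
    ...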
from dkn.
I tried to use:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
config.log_device_placement = True
Although GPU memory usage is lower, it still crashes with an OOM error when running eval.
Thu Jul 18 10:41:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17       Driver Version: 419.17       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  WDDM | 00000000:05:00.0  On |                  N/A |
|  0%   56C    P2    64W / 275W |   8936MiB / 11264MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  WDDM | 00000000:09:00.0  On |                  N/A |
|  0%   49C    P8    17W / 275W |    602MiB / 11264MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
So I tried to use one GPU for training and another GPU for evaluation, with the code below:
with tf.device('/gpu:0'):
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        for step in range(args.n_epochs):
            # training
            start_list = list(range(0, train_data.size, args.batch_size))
            np.random.shuffle(start_list)
            for start in start_list:
                end = start + args.batch_size
                model.train(sess, get_feed_dict(model, train_data, start, end))
            config2 = tf.ConfigProto(device_count={'GPU': 1}, log_device_placement=True)
            config2.gpu_options.allow_growth = True
            with tf.Session(config=config2) as sess2:
                sess2.run(tf.global_variables_initializer())
                sess2.run(tf.local_variables_initializer())
                # evaluation
                train_auc = model.eval(sess2, get_feed_dict(model, train_data, 0, int(train_data.size)))
                test_auc = model.eval(sess2, get_feed_dict(model, test_data, 0, test_data.size))
                print('epoch %d train_auc: %.4f test_auc: %.4f' % (step, train_auc, test_auc))
But it does not work: GPU 0 is still used for eval, showing "W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[442410,128,9,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"
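Three things conspire here. First, the failing tensor alone needs roughly 442410 × 128 × 9 × 4 bytes ≈ 2.0 GB of float32, so an all-at-once eval feed is heavy no matter which GPU runs it. Second, device_count={'GPU': 1} only caps how many GPUs the process sees, and the one that remains is always GPU 0; it does not select GPU 1. Third, the graph was built under with tf.device('/gpu:0'), which pins every op to device 0 for any session. A sketch of pinning the eval session to the second physical GPU instead (real TF 1.x options; note GPU visibility is effectively fixed once per process, so the reliable route is a separate eval process):

import os
# Option 1: set before TensorFlow touches CUDA (e.g. at the top of a separate
# eval script); only physical GPU 1 is then visible, and it appears as '/gpu:0'.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

# Option 2: per-session config; may be ignored if another session in this
# process has already initialized the GPUs with a different visible set.
config2 = tf.ConfigProto(log_device_placement=True)
config2.gpu_options.visible_device_list = '1'  # physical GPU index, as a string
config2.gpu_options.allow_growth = True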
from dkn.