
tensorspark's People

Contributors

ctn, luongthevinh, masoodk, phoaivu, phvu


tensorspark's Issues

TensorSpark productionized in yarn-cluster mode

Hey Arimo contributors,
Thanks for open-sourcing your TensorSpark!
I have made a few modifications in a fork of your repo that are mostly relevant to anyone interested in taking TensorSpark to production in yarn-cluster mode on CPU machines. You can go through my commit comments to see if there is anything you'd like to bring into your repo; feel free to let me know and I'll send a PR for the branch, and you can then "git cherry-pick" the commits you're interested in.
I've been working with the MNIST dataset; if I find the time, I'll try to set up the ImageNet/AlexNet scenario, which I see is on your roadmap as well.
Thanks.

Some modifications to accelerate the train function

In the file "paramservermodel.py", in the function def train(self, labels, features), the gradients are accumulated one tensor at a time:

    for i in range(len(self.compute_gradients)):
        self.gradients[i] += self.compute_gradients[i][0].eval(feed_dict=feed)

The for loop is not necessary here; it can be replaced by a single session.run call:

    grads, test_error_rate = self.session.run(
        [self.compute_gradients, self.error_rate], feed_dict=feed)
    self.gradients[:] = [g[0] for g in grads]

This saves several GPU round-trips per training step; on my machine, one MNIST iteration went from 7 ms down to 1 ms.
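The speedup comes from batching: each session.run (or Tensor.eval) is a separate round-trip to the TensorFlow runtime, so fetching all the gradients plus the error rate in one call amortizes that fixed per-call cost. A toy sketch, with a hypothetical FakeSession standing in for tf.Session (no TensorFlow needed) that just counts round-trips:

```python
# Hypothetical stand-in for tf.Session: each run() call represents one
# round-trip to the runtime/GPU, which dominates per-step cost for small models.
class FakeSession:
    def __init__(self):
        self.calls = 0

    def run(self, fetches, feed_dict=None):
        self.calls += 1
        # Pretend every fetch evaluates to 0.0 (values don't matter here).
        return [0.0 for _ in fetches]

# Eight (gradient, variable) pairs, mirroring self.compute_gradients.
compute_gradients = [("g%d" % i, "v%d" % i) for i in range(8)]

# Old pattern: one eval per gradient, plus one more for the error rate.
slow = FakeSession()
for g, _ in compute_gradients:
    slow.run([g])
slow.run(["error_rate"])

# New pattern: fetch everything in a single run().
fast = FakeSession()
fast.run([g for g, _ in compute_gradients] + ["error_rate"])

print(slow.calls, fast.calls)  # 9 round-trips vs 1
```

With a real tf.Session the per-call overhead includes feed-dict marshalling and a device launch, which is why collapsing nine calls into one shows up so clearly on MNIST-sized models.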

TensorSpark on HPC cluster testing

Hi,

I have been able to run this code on an HPC cluster. Now I'm trying to figure out how to test it with this setup.

Any suggestions would be appreciated.
Thank you!

How to save the trained model to HDFS?

After training the model, I cannot find an interface to save it to HDFS. Is there a way to do this? Many thanks.
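TensorSpark doesn't appear to expose a save API, but a common workaround is to checkpoint with tf.train.Saver to local disk on the driver and then push the files to HDFS with hdfs dfs -put. A hedged sketch: export_model is a hypothetical helper, and the HDFS path is only an example.

```python
import os
import subprocess
import tempfile

def export_model(save_fn, dest_dir, put_cmd=("hdfs", "dfs", "-put", "-f")):
    """Save a model to a local temp directory, then copy it to dest_dir.

    save_fn:  callable that writes checkpoint files to the path it is given,
              e.g. lambda p: saver.save(session, p) with a tf.train.Saver.
    dest_dir: destination, e.g. "hdfs:///models/mnist" (example path).
    put_cmd:  command used for the copy; defaults to `hdfs dfs -put -f`.
    """
    local_dir = tempfile.mkdtemp(prefix="tensorspark_ckpt_")
    save_fn(os.path.join(local_dir, "model.ckpt"))
    subprocess.check_call(list(put_cmd) + [local_dir, dest_dir])
    return local_dir

# Usage with TensorFlow (not executed here):
#   saver = tf.train.Saver()
#   export_model(lambda p: saver.save(session, p), "hdfs:///models/mnist")
```

Saving locally first matters because tf.train.Saver writes to a filesystem path, not to HDFS directly; the copy step is the only part that touches the cluster.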

A GPU OOM bug when running on more than one Spark worker node

For example: I changed the files to load my own images of shape [None, 32, 32, 3]. Everything is OK until I set the partition count to 2, 4, 8, ... My machine is a GTX 1070, Ubuntu 14.04, 8 GB. I also changed the model-initialization code to allow several processes to share one GPU:

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.allocator_type = 'BFC'
    # config.gpu_options.per_process_gpu_memory_fraction = 0.2
    session = tf.Session(config=config)

The bug: after the program runs for some epochs, nvidia-smi shows GPU memory growing without stopping, from 800 MB to 2 GB, 4 GB, 8 GB... until it finally fails with a CUDA OOM error.

My fix: after investigating, I found one function that leads to the GPU memory leak:

    def reset_gradients(self):
        # with self.session.as_default():
        #     self.gradients = [tf.zeros(g[1].get_shape()).eval() for g in self.compute_gradients]
        self.gradients = [0.0] * len(self.compute_gradients)  # my modification
        self.num_gradients = 0

Though I don't know the details of why this change works, it does.
Email: [email protected]
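The change works because of how TensorFlow 1.x graphs behave: tf.zeros(...) builds new graph nodes on every call, so invoking it inside reset_gradients each mini-batch grows the graph (and its GPU allocations) without bound, while plain Python/NumPy zeros never touch the graph. A minimal sketch of the fixed buffer, with made-up gradient shapes and NumPy standing in for the evaluated gradients:

```python
import numpy as np

class GradientBuffer:
    """Sketch of the fixed reset_gradients: buffers live outside the TF graph."""

    def __init__(self, shapes):
        self.shapes = shapes          # one shape per trainable variable
        self.reset_gradients()

    def reset_gradients(self):
        # Plain NumPy zeros: no new graph nodes, so repeated calls are cheap
        # and the graph (and GPU memory) stays a constant size.
        self.gradients = [np.zeros(s) for s in self.shapes]
        self.num_gradients = 0

    def accumulate(self, grads):
        for buf, g in zip(self.gradients, grads):
            buf += g                  # in-place accumulation
        self.num_gradients += 1

buf = GradientBuffer([(2,), (3, 3)])
buf.accumulate([np.ones(2), np.ones((3, 3))])
buf.reset_gradients()                 # safe to call every mini-batch
```

The original `[0.0] * len(...)` fix works the same way: scalar zeros broadcast when the first gradient is added, and nothing is ever added to the TensorFlow graph after construction.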
