
Comments (23)

geohot commented on September 28, 2024

Oh, it's the res connections (add) launching two backward passes. Need to toposort first; my backward algo is too simple.
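
For reference, a minimal sketch of the idea - with hypothetical parents/backward_fn attributes, not tinygrad's actual internals. A residual add gives a node two consumers, so a naive recursive backward walks shared ancestors once per consumer; topologically sorting first guarantees each node's gradient is fully accumulated before it propagates further:

    def toposort(node, visited=None, order=None):
        # post-order DFS: parents are appended before their consumers
        if visited is None:
            visited, order = set(), []
        if node not in visited:
            visited.add(node)
            for parent in getattr(node, "parents", ()):
                toposort(parent, visited, order)
            order.append(node)
        return order

    def backward(loss):
        loss.grad = 1.0
        for node in reversed(toposort(loss)):  # consumers before producers
            if getattr(node, "backward_fn", None) is None:
                continue  # leaf tensor, nothing to propagate
            for parent, g in zip(node.parents, node.backward_fn(node.grad)):
                # accumulate instead of overwrite: this is what makes the two
                # gradients arriving at a residual add combine correctly
                # (assumes fresh nodes start with grad unset or None)
                parent.grad = g if getattr(parent, "grad", None) is None else parent.grad + g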

marcelbischoff commented on September 28, 2024

FYI, I have it kind of running on Kaggle (pyopencl device 'Tesla P100-PCIE-16GB'). I am ignoring that the batchnorms are not correctly implemented (best is probably to freeze them). It needs an order of magnitude more memory and is two orders of magnitude slower than PyTorch: tinygrad maxes out at BS=16 (why?) and takes about 7h for one epoch (probably the backward pass of conv2d), while similar code in PyTorch does BS=128 and one epoch in about 4min.
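
One way to do the freezing, sketched with illustrative attribute names rather than tinygrad's exact API: run the batchnorms with their stored statistics and keep their parameters out of the optimizer.

    # Hypothetical sketch: freeze BatchNorm2D layers so their (incorrectly
    # trained) statistics and parameters stay fixed during fine-tuning.
    bn_params = []
    for layer in model.layers:              # assumes the model exposes its layers
        if isinstance(layer, BatchNorm2D):
            layer.training = False          # use running mean/var, don't update them
            bn_params += [layer.weight, layer.bias]

    frozen = {id(p) for p in bn_params}
    trainable = [p for p in model.parameters() if id(p) not in frozen]
    optimizer = SGD(trainable, lr=0.001)    # only non-BN params get updates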

adamritter commented on September 28, 2024

The exact case that needs to be supported is:

    grad shape must match tensor shape in <tinygrad.ops.Sub object at 0x000002265A26E970>, (4, 32, 112, 112) != (1, 32, 1, 1)
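
That shape pair is the classic broadcasting case: a (1, 32, 1, 1) batchnorm parameter was broadcast against a (4, 32, 112, 112) activation in the forward pass, so its gradient has to be summed back down to the parameter's shape. A small NumPy sketch of the usual fix (not tinygrad's code):

    import numpy as np

    def unbroadcast(grad, shape):
        # sum out leading axes that broadcasting prepended
        while grad.ndim > len(shape):
            grad = grad.sum(axis=0)
        # sum over axes the input held at size 1
        for axis, size in enumerate(shape):
            if size == 1:
                grad = grad.sum(axis=axis, keepdims=True)
        return grad

    g = unbroadcast(np.ones((4, 32, 112, 112)), (1, 32, 1, 1))
    assert g.shape == (1, 32, 1, 1)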

marcelbischoff commented on September 28, 2024

#118 might fix this... I can't tell yet because it is unbelievably slow!

marcelbischoff commented on September 28, 2024

I think it gets stuck in an infinite loop!

geohot commented on September 28, 2024

Weird, try with DEBUG=1

marcelbischoff commented on September 28, 2024

That's what I did. It seems to compute similar backward passes over and over; I killed it after several hours.

marcelbischoff commented on September 28, 2024

Now it seems to be stuck here:

          LogSoftmax :    0.06 ms  [(1, 1000)]
                 Mul :    0.01 ms  [(1, 1000), (1, 1000)]
                 Sum :    0.02 ms  [(1, 1000)]
                 Mul :    0.01 ms  [(1,), (1,)]

adamritter commented on September 28, 2024

marcelbischoff commented on September 28, 2024

Thx, now it works!

geohot commented on September 28, 2024

Now to fix it for the GPU so we can use it!

marcelbischoff commented on September 28, 2024

I am trying to train EfficientNet (preloaded weights without the last layer) on a small subset of the Kaggle dogs vs. cats data set. One epoch would take about 1-2h, but it allocates about 24GB of memory after 2-3 batches and swapping kills the "performance" completely. Not sure yet whether it's tinygrad or my bad code!

I am training all layers. Training only the last layer should be no problem, and that should already work on the GPU.

adamritter commented on September 28, 2024

Do you zero the gradients? It's important not to have any references left in memory, or else the garbage collector won't work.

marcelbischoff commented on September 28, 2024

> Do you zero the gradients? It's important not to have any references left in memory, or else the garbage collector won't work.


Where would I do this? optim.zero_grad() doesn't do the job... If you want, you can take a look here: https://github.com/marcelbischoff/tinygrad/blob/tsteffnet/examples/train_efficientnet_catsvsdogs.py

BTW, GPU=1 examples/train_efficientnet.py now works!?!


geohot commented on September 28, 2024

GPU=1 training should work.

Hmm, so who knows what tinygrad actually frees? There's been no thought of lifecycle management.

Also, BatchNorm2D is not correct for training.

(I don't think zero_grad should change anything with the GC.)

marcelbischoff commented on September 28, 2024

I don't, but it doesn't seem to free anything.

geohot commented on September 28, 2024

lol, that's probably right

adamritter commented on September 28, 2024

    losses.append(loss)

You don't detach the results, so every ancestor (parents, grandparents, ...) of the loss is kept alive through the losses list and can't be garbage collected. Detach should probably just make a copy of the tensor without the parents.
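
A minimal sketch of that idea, with illustrative names (Tensor, _ctx) rather than a confirmed tinygrad API:

    # Hypothetical sketch: detach returns a tensor that shares the data but
    # drops the link back into the graph, so appending it to `losses` keeps
    # only the value alive, not every ancestor of the loss.
    def detach(tensor):
        out = Tensor(tensor.data)  # new node over the same buffer
        out._ctx = None            # no parents: the GC can free the graph
        return out

    # in the training loop:
    losses.append(detach(loss))   # instead of losses.append(loss)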

adamritter commented on September 28, 2024

Also, I think you could put the code into the repo so that other people can help clean it up.

adamritter commented on September 28, 2024

See #123 for a detach implementation.

geohot commented on September 28, 2024

See #140 and #141.

hemangjoshi37a commented on September 28, 2024

A key challenge here seems to be memory management, particularly gradient accumulation and tensor lifecycle. While Python's garbage collector should handle dereferenced objects, it becomes a problem when references are held in unexpected places - for instance, in the gradient computation graph. If tensors are not properly detached, the entire computation history persists and memory usage piles up.

One way to mitigate this is to explicitly drop references to tensors once they are no longer needed, using Python's del statement. Consider inserting these at the end of each training iteration:

    del loss
    del inputs
    del labels
    torch.cuda.empty_cache()  # if you're using PyTorch

Note that torch.cuda.empty_cache() is specific to PyTorch: it releases memory held by PyTorch's CUDA caching allocator, which can persist even after tensors are deleted. A tinygrad equivalent would need a similar operation for whatever caching mechanism it uses.

Another potential improvement is in the backward operation. A possible bottleneck is launching two backward computations for residual connections; as @geohot mentioned, a topological sort would compute the backward pass in order, removing redundant work and extra memory usage.

Also, @adamritter's suggestion to implement a detach function is worth keeping in mind. Detaching results ensures unnecessary computation history is not carried forward, again mitigating memory pressure.
