Comments (23)
Oh, it's the res connections (the add) launching two backward passes. Need to toposort first; my backward algo is too simple.
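For anyone following along, a rough sketch of what a toposort-based backward could look like (hypothetical code, not the actual tinygrad implementation; it assumes each tensor keeps the op that created it in a _ctx attribute whose parents list holds its inputs):
# Hypothetical sketch of toposorting the autograd graph before backward.
# Assumes node._ctx is the op that produced the node (None for leaves)
# and node._ctx.parents lists that op's input nodes.
def toposort(node, visited=None, order=None):
    visited = set() if visited is None else visited
    order = [] if order is None else order
    if node not in visited:
        visited.add(node)
        if getattr(node, "_ctx", None) is not None:
            for parent in node._ctx.parents:
                toposort(parent, visited, order)
        order.append(node)  # appended only after all of its ancestors
    return order

def backward(loss):
    # Walking the graph in reverse topological order means each node's
    # backward runs exactly once, even when one tensor feeds two consumers
    # (the residual-add case that was launching two backward passes).
    for node in reversed(toposort(loss)):
        if getattr(node, "_ctx", None) is not None:
            node._ctx.run_backward(node)  # hypothetical per-op backward hook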
FYI, I have it kind of running on Kaggle (pyopencl.Device 'Tesla P100-PCIE-16GB'). I am ignoring that batchnorms are not correctly implemented (best is probably to freeze them). It needs an order of magnitude more memory and is two orders of magnitude slower than PyTorch. Tinygrad maxes out at BS=16 (why?) and takes like 7h for one epoch (probably the backward pass of conv2d). Similar code in PyTorch does BS=128 and one epoch in like 4 minutes.
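For reference, "freezing" the batchnorms would roughly mean always normalizing with the stored running statistics and treating the pretrained scale/shift as constants kept out of the optimizer; a minimal, framework-agnostic sketch (not tinygrad code):
# Hypothetical frozen batchnorm: use the stored running statistics instead of
# batch statistics, and treat the pretrained weight/bias as fixed constants.
def frozen_batchnorm(x, running_mean, running_var, weight, bias, eps=1e-5):
    x_hat = (x - running_mean) / (running_var + eps) ** 0.5
    return weight * x_hat + bias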
The exact case that needed to be supported is: grad shape must match tensor shape in <tinygrad.ops.Sub object at 0x000002265A26E970>, (4, 32, 112, 112) != (1, 32, 1, 1)
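For context, this is the usual broadcasting issue in a binary-op backward: the incoming gradient has the broadcast shape and has to be summed back down to the operand's shape. A small numpy sketch of that reduction (illustrative, not the actual tinygrad fix):
import numpy as np

# Hypothetical helper: reduce a gradient computed at the broadcast shape
# back to the original operand's shape by summing over the broadcast axes.
def unbroadcast(grad, shape):
    while grad.ndim > len(shape):           # axes prepended by broadcasting
        grad = grad.sum(axis=0)
    for axis, size in enumerate(shape):     # axes that were size 1 originally
        if size == 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

g = np.ones((4, 32, 112, 112))              # grad at the broadcast shape
print(unbroadcast(g, (1, 32, 1, 1)).shape)  # -> (1, 32, 1, 1)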
#118 might fix this... I can't tell yet because it is unbelievably slow!
I think it gets stuck in an infinite loop!
Weird, try with DEBUG=1
That's what I did. Seems to compute similar backward passes for hours. Killed it after several hours.
Now it seems to be stuck here:
LogSoftmax : 0.06 ms [(1, 1000)]
Mul : 0.01 ms [(1, 1000), (1, 1000)]
Sum : 0.02 ms [(1, 1000)]
Mul : 0.01 ms [(1,), (1,)]
Thx, now it works!
Now to fix it for the GPU so we can use it!
I am trying to train EfficientNet (preloaded weights without the last layer) on a small subset of the Kaggle dogs vs. cats dataset. One epoch would be like 1-2h, but it allocates like 24GB of memory after 2-3 batches and swapping kills the "performance" completely. Not sure yet if it is tinygrad or my bad code!
I am training all layers. Only training the last layer should be no problem, and that one should already work on the GPU.
Do you zero the gradients? It's important not to have any references left in memory, or else the garbage collector won't work.
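To illustrate the suggestion, the loop-end hygiene would look something like this (a hypothetical sketch; the model, optimizer, data iterator, and losses list are assumed to exist already):
# Hypothetical training-loop tail: zero the grads every step and keep only
# a plain Python float, so no reference into the autograd graph survives
# the iteration and the garbage collector can reclaim the intermediates.
for x, y in batches:                      # assumed data iterator
    optimizer.zero_grad()
    loss = compute_loss(model, x, y)      # assumed forward pass + loss
    loss.backward()
    optimizer.step()
    losses.append(float(loss.data))       # store a scalar, not the Tensor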
Where would I do this? The optim.zero_grad() doesn't do the job... if you want, you can take a look here: https://github.com/marcelbischoff/tinygrad/blob/tsteffnet/examples/train_efficientnet_catsvsdogs.py
BTW, GPU=1 examples/train_efficientnet.py now works!?!
GPU=1 training should work.
Hmm, so who knows what tinygrad actually frees? There's been no thought of lifecycle management.
Also, BatchNorm2D is not correct for training.
(I don't think zero_grad should change anything with the GC.)
I don't, but it doesn't seem to free anything.
lol, that's probably right
losses.append(loss)
You don't detach the results, so all the ancestors (parents...) of the loss will be appended and can't be garbage collected. Detach should probably just make a copy of the tensor without the parents.
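A rough sketch of what such a detach might look like (hypothetical; see #123 below for the actual implementation): copy the data into a fresh tensor that has no link back to the op that created it.
# Hypothetical detach: a new Tensor with the same data but no _ctx, so none
# of the loss's ancestors are kept alive just because the loss was stored.
def detach(t):
    out = Tensor(t.data)
    out._ctx = None                 # drop the link back into the graph
    return out

losses.append(detach(loss))         # or simply: losses.append(float(loss.data))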
Also, I think you could put the code into the repo so that other people can help clean it up.
See #123 for detach implementation
A key challenge here seems to be memory management, particularly gradient accumulation and tensor lifecycle. While Python's garbage collector should ideally handle unused or dereferenced objects, it becomes problematic if references are held in unexpected places, for instance in the tensor's gradient computation graph. If tensors are not properly detached, the entire computation history can persist, causing memory usage to pile up.
One way to prevent this is to explicitly remove references to tensors once they're no longer needed with Python's del statement, freeing the memory. Consider inserting these statements at the end of each training iteration:
del loss
del inputs
del labels
torch.cuda.empty_cache() # if you're using PyTorch
It's worth noting that torch.cuda.empty_cache() is specific to PyTorch and is used to clear the GPU cache, which can hold onto memory even after tensors are deleted. For a tinygrad equivalent, you'd need a similar operation that clears whatever caching mechanism is present.
Another potential improvement is in the backward operation. A likely bottleneck is launching two backward computations for residual connections. As @geohot mentioned, introducing a topological sort would ensure each node's backward pass runs only once, reducing redundancy and memory usage.
Also, it's worth keeping @adamritter's suggestion of implementing a detach function in mind. Detaching results helps ensure that unnecessary computation history is not carried forward, again mitigating the memory issues.