Comments (23)
Oh, it's the res connections (the add) launching two backward passes. Need to toposort first; my backward algo is too simple.
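For anyone following along, a rough sketch of what a toposort-based backward could look like (hypothetical code, not the actual tinygrad implementation; it assumes each tensor keeps the op that created it in a _ctx attribute whose parents list holds its inputs):
# Hypothetical sketch of toposorting the autograd graph before backward.
# Assumes node._ctx is the op that produced the node (None for leaves)
# and node._ctx.parents lists that op's input nodes.
def toposort(node, visited=None, order=None):
    visited = set() if visited is None else visited
    order = [] if order is None else order
    if node not in visited:
        visited.add(node)
        if getattr(node, "_ctx", None) is not None:
            for parent in node._ctx.parents:
                toposort(parent, visited, order)
        order.append(node)  # appended only after all of its ancestors
    return order

def backward(loss):
    # Walking the graph in reverse topological order means each node's
    # backward runs exactly once, even when one tensor feeds two consumers
    # (the residual-add case that was launching two backward passes).
    for node in reversed(toposort(loss)):
        if getattr(node, "_ctx", None) is not None:
            node._ctx.run_backward(node)  # hypothetical per-op backward hook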
FYI, I have it kind of running on Kaggle (pyopencl.Device 'Tesla P100-PCIE-16GB'). I am ignoring that batchnorms are not correctly implemented (best is probably to freeze them). It needs an order of magnitude more memory and is two orders of magnitude slower than PyTorch. Tinygrad maxes out at BS=16 (why?) and takes like 7h for one epoch (probably the backward pass of conv2d). Similar code in PyTorch does BS=128 and one epoch in like 4 minutes.
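For reference, "freezing" the batchnorms would roughly mean always normalizing with the stored running statistics and treating the pretrained scale/shift as constants kept out of the optimizer; a minimal, framework-agnostic sketch (not tinygrad code):
# Hypothetical frozen batchnorm: use the stored running statistics instead of
# batch statistics, and treat the pretrained weight/bias as fixed constants.
def frozen_batchnorm(x, running_mean, running_var, weight, bias, eps=1e-5):
    x_hat = (x - running_mean) / (running_var + eps) ** 0.5
    return weight * x_hat + bias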
The exact case that needed to be supported is: grad shape must match tensor shape in <tinygrad.ops.Sub object at 0x000002265A26E970>, (4, 32, 112, 112) != (1, 32, 1, 1)
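For context, this is the usual broadcasting issue in a binary-op backward: the incoming gradient has the broadcast shape and has to be summed back down to the operand's shape. A small numpy sketch of that reduction (illustrative, not the actual tinygrad fix):
import numpy as np

# Hypothetical helper: reduce a gradient computed at the broadcast shape
# back to the original operand's shape by summing over the broadcast axes.
def unbroadcast(grad, shape):
    while grad.ndim > len(shape):           # axes prepended by broadcasting
        grad = grad.sum(axis=0)
    for axis, size in enumerate(shape):     # axes that were size 1 originally
        if size == 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

g = np.ones((4, 32, 112, 112))              # grad at the broadcast shape
print(unbroadcast(g, (1, 32, 1, 1)).shape)  # -> (1, 32, 1, 1)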
#118 might fix this... I can't tell yet because it is unbelievably slow!
I think it gets stuck in an infinite loop!
Weird, try with DEBUG=1
That's what I did. Seems to compute similar backward passes for hours. Killed it after several hours.
Now it seems to be stuck here:
LogSoftmax : 0.06 ms [(1, 1000)]
Mul : 0.01 ms [(1, 1000), (1, 1000)]
Sum : 0.02 ms [(1, 1000)]
Mul : 0.01 ms [(1,), (1,)]
Thx, now it works!
Now to fix it for the GPU so we can use it!
I am trying to train EfficientNet (preloaded weights without the last layer) on a small subset of the Kaggle dogs vs. cats dataset. One epoch would be like 1-2h, but it allocates like 24GB of memory after 2-3 batches and swapping kills the "performance" completely. Not sure yet if it is tinygrad or my bad code!
I am training all layers. Only training the last layer should be no problem, and that one should already work on the GPU.
Do you zero the gradients? It's important not to have any references left in memory, or else the garbage collector won't work.
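To illustrate the suggestion, the loop-end hygiene would look something like this (a hypothetical sketch; the model, optimizer, data iterator, and losses list are assumed to exist already):
# Hypothetical training-loop tail: zero the grads every step and keep only
# a plain Python float, so no reference into the autograd graph survives
# the iteration and the garbage collector can reclaim the intermediates.
for x, y in batches:                      # assumed data iterator
    optimizer.zero_grad()
    loss = compute_loss(model, x, y)      # assumed forward pass + loss
    loss.backward()
    optimizer.step()
    losses.append(float(loss.data))       # store a scalar, not the Tensor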
Where would I do this? The optim.zero_grad() doesn't do the job... if you want, you can take a look here: https://github.com/marcelbischoff/tinygrad/blob/tsteffnet/examples/train_efficientnet_catsvsdogs.py
BTW, GPU=1 examples/train_efficientnet.py now works!?!
GPU=1 training should work.
Hmm, so who knows what tinygrad actually frees? There's been no thought of lifecycle management.
Also, BatchNorm2D is not correct for training.
(I don't think zero_grad should change anything with the GC.)
I don't, but it doesn't seem to free anything.
lol, that's probably right
losses.append(loss)
You don't detach the results, so all the ancestors (parents...) of the loss will be appended and can't be garbage collected. Detach should probably just make a copy of the tensor without the parents.
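A rough sketch of what such a detach might look like (hypothetical; see #123 below for the actual implementation): copy the data into a fresh tensor that has no link back to the op that created it.
# Hypothetical detach: a new Tensor with the same data but no _ctx, so none
# of the loss's ancestors are kept alive just because the loss was stored.
def detach(t):
    out = Tensor(t.data)
    out._ctx = None                 # drop the link back into the graph
    return out

losses.append(detach(loss))         # or simply: losses.append(float(loss.data))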
Also, I think you could put the code into the repo so that other people can help clean it up.
See #123 for detach implementation
A key challenge here seems to be memory management, particularly gradient accumulation and tensor lifecycle. While Python's garbage collector should ideally handle unused or dereferenced objects, it becomes problematic if references are held in unexpected places, for instance in the tensor's gradient computation graph. If tensors are not properly detached, the entire computation history can persist, causing memory usage to pile up.
One way to prevent this is to explicitly remove references to tensors once they're no longer needed with Python's del statement, freeing the memory. Consider inserting these statements at the end of each training iteration:
del loss
del inputs
del labels
torch.cuda.empty_cache() # if you're using PyTorch
It's worth noting that torch.cuda.empty_cache() is specific to PyTorch and is used to clear the GPU cache, which can hold onto memory even after tensors are deleted. For a tinygrad equivalent, you'd need a similar operation that clears whatever caching mechanism is present.
Another potential improvement is in the backward operation. A likely bottleneck is launching two backward computations for residual connections. As @geohot mentioned, introducing a topological sort would ensure each node's backward pass runs only once, reducing redundancy and memory usage.
Also, it's worth keeping @adamritter's suggestion of implementing a detach function in mind. Detaching results helps ensure that unnecessary computation history is not carried forward, again mitigating the memory issues.