Comments (16)
@karanchahal @williamFalcon I ported the pure Python code to Cython and got significant speedups.
My experiments are on a 3x64x64 input tensor, and the filter bank is 256x3x3x3.
Pure Python:
50% sparse --> 45 seconds
90% sparse --> 11 seconds
100% sparse --> 60 ms
Cython-optimized:
50% sparse --> 13 ms
90% sparse --> 5 ms
100% sparse --> 661 microseconds
For reference: PyTorch's conv2d took 1.9 ms on my machine (CPU). (The previous results were on Colab (CPU).)
Google Drive links to the .pyx and .ipynb files:
https://drive.google.com/open?id=1gnrbFNWJBZbyPH6KKnCLmrPBNqOFtKUD
https://drive.google.com/open?id=1--_B89H4iSZuJuj9QKqBRrB5Tlr7DMnH
Link to compiled C file:
https://drive.google.com/open?id=1nCGKRmM4AGcmepEJCkWAl_SBZc2l-rrA
I am looking at more ways to optimize the Cython code now.
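For anyone who wants to reproduce this kind of comparison, here is a rough sketch of the reference measurement under the setup described above (3x64x64 input, 256 filters of 3x3x3). The `sparse_conv2d` call at the end is a hypothetical placeholder for the notebook's function, not code from the linked files:

```python
import time
import torch
import torch.nn.functional as F

# Setup from the comment above: 3x64x64 input, 256 filters of size 3x3x3.
x = torch.randn(1, 3, 64, 64)
w = torch.randn(256, 3, 3, 3)

# Zero out ~90% of the weights to mimic a pruned (sparse) filter bank.
mask = (torch.rand_like(w) >= 0.9).float()
w_sparse = w * mask

# Time PyTorch's dense conv2d as the reference point.
start = time.perf_counter()
for _ in range(10):
    F.conv2d(x, w_sparse, padding=1)
print("dense conv2d:", (time.perf_counter() - start) / 10, "s per call")

# The sparse implementation from the linked notebook would be timed the same
# way (e.g. sparse_conv2d(x, w_sparse)) once it is importable -- hypothetical name.
```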
@shivamsaboo17 @karanchahal https://gitter.im/PyTorch-Lightning/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge
super excited about this feature!
Great! Will start reading these papers.
The paper actually mentions using the CSR format, as row slicing is very fast. I'm not sure the COO format would be as efficient, but we can try. Converting from COO to CSR should be possible (though I'm not sure how) with small computational overhead.
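For the COO-to-CSR question: in SciPy the conversion is a one-liner, and the overhead is roughly a sort over the nonzeros. A small sketch, assuming SciPy is acceptable for the CPU path:

```python
import numpy as np
from scipy.sparse import coo_matrix

# A small sparse matrix in COO form (row/col/value triplets).
rows = np.array([0, 1, 2])
cols = np.array([2, 0, 1])
vals = np.array([1.0, 2.0, 3.0])
coo = coo_matrix((vals, (rows, cols)), shape=(3, 3))

# Conversion to CSR is cheap; CSR then gives fast row slicing,
# which is what the paper's algorithm relies on.
csr = coo.tocsr()
print(csr[1])        # fast row slice
print(csr.indptr)    # row-pointer array characteristic of CSR
```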
@karanchahal this sounds great. Let's add both, and we can use the official PyTorch version when it's ready!
The first as a Trainer option:
Trainer(quantize_bits=4)
The second, after training, can be called on the Module:
trainer.fit(model)
model.quantize(bits=8)
@karanchahal submit a PR and we can walk through the implementation!
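For reference, a rough sketch of what the post-training path could wrap internally, using PyTorch's dynamic quantization API; the model.quantize(bits=8) call above is the proposed Lightning interface, not something that exists yet:

```python
import torch
import torch.nn as nn

# A plain model standing in for a trained LightningModule.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Post-training dynamic quantization: weights of the listed module types are
# converted to int8, and activations are quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# A proposed model.quantize(bits=8) could wrap a call like this internally.
print(quantized)
```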
@karanchahal can you please check the link you provided for the pruning notebook? I think it's the same link as the quantization notebook.
Also, regarding the implementation of neural network pruning, I found that masking the weights we want to prune is very simple to implement, but if we keep the weight tensors in the same dense datatype as before, we still have to do the entire matrix multiplication. Even if multiplications with zeros take less time, I believe this is really inefficient when you prune 90% of the weights but still perform the full matrix multiplication. Are you familiar with a way to handle sparse weights more efficiently in PyTorch, or some other way to restructure the network based on the pruned weights (assuming unstructured pruning)?
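To make the masking point concrete, here is a minimal sketch of magnitude-based unstructured pruning on a linear layer (not code from the notebooks above); the masked weights are zero, but the forward pass is still a full dense matmul:

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)

# Prune ~90% of the weights by magnitude: zero out everything below the
# 90th-percentile absolute value.
with torch.no_grad():
    threshold = layer.weight.abs().flatten().kthvalue(
        int(0.9 * layer.weight.numel())
    ).values
    mask = (layer.weight.abs() > threshold).float()
    layer.weight.mul_(mask)

x = torch.randn(32, 256)
# Still a dense (32x256) @ (256x128) multiplication -- the zeros are
# multiplied like any other value, so pruning alone gives no speedup here.
out = layer(x)
print(mask.mean())  # roughly 0.1 of the weights survive
```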
Thanks for the reply! I too was unaware of how many challenges there are in working with sparse tensors.
But I am really interested in implementing custom layers in PyTorch just for inference (writing only the forward pass, perhaps using the torch.sparse API) once we have all the boolean masks. Would you be interested in collaborating on such layers? Perhaps we can start specifically with linear layers and then extend to other layer types.
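As a starting point for the linear case, here is a minimal sketch of an inference-only sparse linear layer built on torch.sparse.mm, assuming the boolean mask from pruning is already available (the class name and structure are just an illustration):

```python
import torch
import torch.nn as nn

class SparseLinearInference(nn.Module):
    """Inference-only linear layer whose weight is stored as a sparse COO tensor."""

    def __init__(self, weight, mask, bias):
        super().__init__()
        # Keep only the surviving weights, stored in sparse COO format.
        self.weight_sparse = (weight * mask).to_sparse()
        self.bias = bias

    def forward(self, x):
        # torch.sparse.mm(sparse, dense): weight is (out, in), so multiply by x^T.
        return torch.sparse.mm(self.weight_sparse, x.t()).t() + self.bias

dense = nn.Linear(256, 128)
mask = (torch.rand_like(dense.weight) > 0.9).float()  # keep ~10% of the weights
layer = SparseLinearInference(dense.weight.detach(), mask, dense.bias.detach())
print(layer(torch.randn(32, 256)).shape)  # torch.Size([32, 128])
```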
I read through the ICLR '17 paper and implemented their algorithm in Python (link to Colab). It is not the most efficient implementation, as I used Python loops, but the key takeaway is that speed increases as the sparsity of the weights increases, whereas PyTorch's conv2d needs almost the same time at every sparsity level (even with all-zero weights). I will perhaps try to implement the algorithm using PyTorch's C++ extension functionality (I haven't worked with it before), but before that I need to figure out how to use CSR sparse matrices in PyTorch (currently I am using SciPy).
If you have any suggestions please let me know!
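Without having seen the notebook, here is roughly how I picture the Python-loop version: the filters are flattened into a SciPy CSR matrix (one row per output channel), and the convolution iterates only over the nonzero weights, so runtime shrinks as sparsity grows. A sketch under those assumptions, not the actual notebook code:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_conv2d(x, w_csr, k, out_h, out_w):
    """Direct convolution that visits only the nonzero weights.

    x     : padded input, shape (in_channels, H, W)
    w_csr : CSR matrix of shape (out_channels, in_channels * k * k)
    """
    out = np.zeros((w_csr.shape[0], out_h, out_w), dtype=x.dtype)
    for oc in range(w_csr.shape[0]):
        row = w_csr[oc]                   # fast CSR row slice
        for idx, val in zip(row.indices, row.data):
            c, rest = divmod(idx, k * k)  # recover (channel, kernel row, kernel col)
            kr, kc = divmod(rest, k)
            # Each nonzero weight contributes one shifted window of the input.
            out[oc] += val * x[c, kr:kr + out_h, kc:kc + out_w]
    return out

x = np.random.randn(3, 66, 66)             # 64x64 input padded by 1
w = np.random.randn(256, 3 * 3 * 3)
w[np.random.rand(*w.shape) < 0.9] = 0.0     # 90% sparse filters
print(sparse_conv2d(x, csr_matrix(w), 3, 64, 64).shape)  # (256, 64, 64)
```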
I used the Numba jit decorator on the sparse convolution function and it runs on CPU (implemented using SciPy sparse arrays). I expected it to compile the Python loops down to native code, but when I use nopython=True to compile the entire function I get an error, because Numba cannot recognize the SciPy sparse matrix format and treats it as a regular Python object.
I also think I should first try to make the implementation work with Cython and Numba before attempting a C++ implementation.
Regarding PyTorch's conv, I think it uses im2col, but I'm not sure. I also think that if we can somehow implement this paper's algorithm using torch's built-in functions and/or optimize the loops, we can get a faster layer.
Will try out a few things this weekend and let you know if I get any improvements.
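One possible way around the nopython error is to unpack the SciPy CSR matrix into its three raw NumPy arrays (data, indices, indptr) before calling the jitted function; nopython mode handles plain arrays even though it cannot handle the scipy.sparse object itself. A hedged sketch of that pattern on a sparse matrix-vector product:

```python
import numpy as np
from numba import njit
from scipy.sparse import random as sparse_random

@njit
def csr_matvec(data, indices, indptr, x, n_rows):
    # Sparse matrix-vector product over raw CSR buffers; compiles in nopython
    # mode because only NumPy arrays and scalars are touched.
    out = np.zeros(n_rows)
    for i in range(n_rows):
        for j in range(indptr[i], indptr[i + 1]):
            out[i] += data[j] * x[indices[j]]
    return out

w = sparse_random(256, 27, density=0.1, format="csr")  # stand-in for sparse filters
x = np.random.randn(27)

# Pass the raw CSR buffers instead of the scipy.sparse object itself.
result = csr_matvec(w.data, w.indices, w.indptr, x, w.shape[0])
print(np.allclose(result, w @ x))  # True
```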
@sidhanthholalkere @karanchahal spoke with @soumith about this. I think this is better added to core PyTorch. Check out this issue.
Once it's merged and live there we can do whatever we need to do to support it.
Closing to move this work to the PyTorch issue.
Note that we have a notebook with a preview tutorial on eager-mode post-training quantization in core PyTorch over in pytorch/pytorch#18318 ... please check it out and leave feedback.