Comments (16)
@karanchahal @williamFalcon I ported the pure Python code to Cython and got significant speedups.
My experiments are on a 3x64x64 input tensor, and the filter bank is 256x3x3x3.
Pure Python:
50% sparse --> 45 seconds
90% sparse --> 11 seconds
100% sparse --> 60 ms
Cython-optimized:
50% sparse --> 13 ms
90% sparse --> 5 ms
100% sparse --> 661 microseconds
For reference: PyTorch's conv2d took 1.9 ms on my machine (CPU). (The previous results were on Colab (CPU).)
Google Drive links to the .pyx and .ipynb files:
https://drive.google.com/open?id=1gnrbFNWJBZbyPH6KKnCLmrPBNqOFtKUD
https://drive.google.com/open?id=1--_B89H4iSZuJuj9QKqBRrB5Tlr7DMnH
Link to compiled C file:
https://drive.google.com/open?id=1nCGKRmM4AGcmepEJCkWAl_SBZc2l-rrA
I am looking at more ways to optimize the Cython code now.
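For anyone who wants to reproduce this kind of comparison, here is a rough sketch of the reference measurement under the setup described above (3x64x64 input, 256 filters of 3x3x3). The `sparse_conv2d` call at the end is a hypothetical placeholder for the notebook's function, not code from the linked files:

```python
import time
import torch
import torch.nn.functional as F

# Setup from the comment above: 3x64x64 input, 256 filters of size 3x3x3.
x = torch.randn(1, 3, 64, 64)
w = torch.randn(256, 3, 3, 3)

# Zero out ~90% of the weights to mimic a pruned (sparse) filter bank.
mask = (torch.rand_like(w) >= 0.9).float()
w_sparse = w * mask

# Time PyTorch's dense conv2d as the reference point.
start = time.perf_counter()
for _ in range(10):
    F.conv2d(x, w_sparse, padding=1)
print("dense conv2d:", (time.perf_counter() - start) / 10, "s per call")

# The sparse implementation from the linked notebook would be timed the same
# way (e.g. sparse_conv2d(x, w_sparse)) once it is importable -- hypothetical name.
```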
@shivamsaboo17 @karanchahal https://gitter.im/PyTorch-Lightning/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge
super excited about this feature!
Great! Will start reading these papers.
The paper actually mentions using the CSR format, as row slicing is very fast. I'm not sure the COO format would be as efficient, but we can try. Converting from COO to CSR should be possible (though I'm not sure how) with small computational overhead.
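For the COO-to-CSR question: in SciPy the conversion is a one-liner, and the overhead is roughly a sort over the nonzeros. A small sketch, assuming SciPy is acceptable for the CPU path:

```python
import numpy as np
from scipy.sparse import coo_matrix

# A small sparse matrix in COO form (row/col/value triplets).
rows = np.array([0, 1, 2])
cols = np.array([2, 0, 1])
vals = np.array([1.0, 2.0, 3.0])
coo = coo_matrix((vals, (rows, cols)), shape=(3, 3))

# Conversion to CSR is cheap; CSR then gives fast row slicing,
# which is what the paper's algorithm relies on.
csr = coo.tocsr()
print(csr[1])        # fast row slice
print(csr.indptr)    # row-pointer array characteristic of CSR
```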
@karanchahal this sounds great. Let's add both, and we can use the official PyTorch version when it's ready!
The first as a Trainer option:
Trainer(quantize_bits=4)
The second, after training, can be called on the Module:
trainer.fit(model)
model.quantize(bits=8)
@karanchahal submit a PR and we can walk through the implementation!
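For reference, a rough sketch of what the post-training path could wrap internally, using PyTorch's dynamic quantization API; the model.quantize(bits=8) call above is the proposed Lightning interface, not something that exists yet:

```python
import torch
import torch.nn as nn

# A plain model standing in for a trained LightningModule.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Post-training dynamic quantization: weights of the listed module types are
# converted to int8, and activations are quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# A proposed model.quantize(bits=8) could wrap a call like this internally.
print(quantized)
```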
@karanchahal can you please check the link you provided for the pruning notebook? I think it's the same link as the quantization notebook.
Also, regarding the implementation of neural network pruning, I found that masking the weights we want to prune is very simple to implement, but if we keep the weight tensors in the same dense datatype as before, we still have to do the entire matrix multiplication. Even if multiplications with zeros take less time, I believe this is really inefficient when you prune 90% of the weights but still perform the full matrix multiplication. Are you familiar with a way to handle sparse weights more efficiently in PyTorch, or some other way to restructure the network based on the pruned weights (assuming unstructured pruning)?
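To make the masking point concrete, here is a minimal sketch of magnitude-based unstructured pruning on a linear layer (not code from the notebooks above); the masked weights are zero, but the forward pass is still a full dense matmul:

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)

# Prune ~90% of the weights by magnitude: zero out everything below the
# 90th-percentile absolute value.
with torch.no_grad():
    threshold = layer.weight.abs().flatten().kthvalue(
        int(0.9 * layer.weight.numel())
    ).values
    mask = (layer.weight.abs() > threshold).float()
    layer.weight.mul_(mask)

x = torch.randn(32, 256)
# Still a dense (32x256) @ (256x128) multiplication -- the zeros are
# multiplied like any other value, so pruning alone gives no speedup here.
out = layer(x)
print(mask.mean())  # roughly 0.1 of the weights survive
```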
Thanks for the reply! I too was unaware of how many challenges there are in working with sparse tensors.
But I am really interested in implementing custom layers in PyTorch just for inference (writing only the forward pass, perhaps using the torch.sparse API) once we have all the boolean masks. Would you be interested in collaborating on such layers? Perhaps we can start specifically with linear layers and then extend to other layer types.
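As a starting point for the linear case, here is a minimal sketch of an inference-only sparse linear layer built on torch.sparse.mm, assuming the boolean mask from pruning is already available (the class name and structure are just an illustration):

```python
import torch
import torch.nn as nn

class SparseLinearInference(nn.Module):
    """Inference-only linear layer whose weight is stored as a sparse COO tensor."""

    def __init__(self, weight, mask, bias):
        super().__init__()
        # Keep only the surviving weights, stored in sparse COO format.
        self.weight_sparse = (weight * mask).to_sparse()
        self.bias = bias

    def forward(self, x):
        # torch.sparse.mm(sparse, dense): weight is (out, in), so multiply by x^T.
        return torch.sparse.mm(self.weight_sparse, x.t()).t() + self.bias

dense = nn.Linear(256, 128)
mask = (torch.rand_like(dense.weight) > 0.9).float()  # keep ~10% of the weights
layer = SparseLinearInference(dense.weight.detach(), mask, dense.bias.detach())
print(layer(torch.randn(32, 256)).shape)  # torch.Size([32, 128])
```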
I read through the ICLR '17 paper and implemented their algorithm in Python (link to Colab). It is not the most efficient implementation, as I used Python loops, but the key takeaway is that speed increases as the sparsity of the weights increases, whereas PyTorch's conv2d needs almost the same time at every sparsity level (even with all-zero weights). I will perhaps try to implement the algorithm using PyTorch's C++ extension functionality (I haven't worked with it before), but before that I need to figure out how to use CSR sparse matrices in PyTorch (currently I am using SciPy).
If you have any suggestions please let me know!
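Without having seen the notebook, here is roughly how I picture the Python-loop version: the filters are flattened into a SciPy CSR matrix (one row per output channel), and the convolution iterates only over the nonzero weights, so runtime shrinks as sparsity grows. A sketch under those assumptions, not the actual notebook code:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_conv2d(x, w_csr, k, out_h, out_w):
    """Direct convolution that visits only the nonzero weights.

    x     : padded input, shape (in_channels, H, W)
    w_csr : CSR matrix of shape (out_channels, in_channels * k * k)
    """
    out = np.zeros((w_csr.shape[0], out_h, out_w), dtype=x.dtype)
    for oc in range(w_csr.shape[0]):
        row = w_csr[oc]                   # fast CSR row slice
        for idx, val in zip(row.indices, row.data):
            c, rest = divmod(idx, k * k)  # recover (channel, kernel row, kernel col)
            kr, kc = divmod(rest, k)
            # Each nonzero weight contributes one shifted window of the input.
            out[oc] += val * x[c, kr:kr + out_h, kc:kc + out_w]
    return out

x = np.random.randn(3, 66, 66)             # 64x64 input padded by 1
w = np.random.randn(256, 3 * 3 * 3)
w[np.random.rand(*w.shape) < 0.9] = 0.0     # 90% sparse filters
print(sparse_conv2d(x, csr_matrix(w), 3, 64, 64).shape)  # (256, 64, 64)
```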
I used the Numba jit decorator on the sparse convolution function and it runs on CPU (implemented using SciPy sparse arrays). I expected it to compile the Python loops down to native code, but when I use nopython=True to compile the entire function I get an error, because Numba cannot recognize the SciPy sparse matrix format and treats it as a regular Python object.
I also think I should first try to make the implementation work with Cython and Numba before attempting a C++ implementation.
Regarding PyTorch's conv, I think it uses im2col, but I'm not sure. I also think that if we can somehow implement this paper's algorithm using torch's built-in functions and/or optimize the loops, we can get a faster layer.
Will try out a few things this weekend and let you know if I get any improvements.
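One possible way around the nopython error is to unpack the SciPy CSR matrix into its three raw NumPy arrays (data, indices, indptr) before calling the jitted function; nopython mode handles plain arrays even though it cannot handle the scipy.sparse object itself. A hedged sketch of that pattern on a sparse matrix-vector product:

```python
import numpy as np
from numba import njit
from scipy.sparse import random as sparse_random

@njit
def csr_matvec(data, indices, indptr, x, n_rows):
    # Sparse matrix-vector product over raw CSR buffers; compiles in nopython
    # mode because only NumPy arrays and scalars are touched.
    out = np.zeros(n_rows)
    for i in range(n_rows):
        for j in range(indptr[i], indptr[i + 1]):
            out[i] += data[j] * x[indices[j]]
    return out

w = sparse_random(256, 27, density=0.1, format="csr")  # stand-in for sparse filters
x = np.random.randn(27)

# Pass the raw CSR buffers instead of the scipy.sparse object itself.
result = csr_matvec(w.data, w.indices, w.indptr, x, w.shape[0])
print(np.allclose(result, w @ x))  # True
```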
@sidhanthholalkere @karanchahal spoke with @soumith about this. I think this is better added to core PyTorch. Check out this issue.
Once it's merged and live there we can do whatever we need to do to support it.
Closing to move this work to the PyTorch issue.
Note that we have a notebook with a preview tutorial on eager-mode post-training quantization in core PyTorch over in pytorch/pytorch#18318 ... please check it out and leave feedback.