Comments (9)
Hello @hamza-alethea. Sorry, the current conversion script is for checkpoints produced by our training framework; we will fix it very soon. By the way, FasterTransformer does not support inference on V100 machines. A quantized version of GLM-130B that allows efficient INT8 inference on the V100 will be released in the next few days, so please be patient and keep an eye on our GitHub repo.
from glm-130b.
Thank you!
Do you know of any other method that would help me reduce the response time of GLM-130B?
We have just released the quantized version of GLM-130B. V100 servers can efficiently run GLM-130B in INT8 precision; see Quantization of GLM-130B for details.
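For readers wondering what "INT8 precision" involves here: weight-only INT8 quantization stores each weight matrix as 8-bit integers plus a per-channel scale, and dequantizes on the fly at inference time. The following is a minimal NumPy sketch of the general per-channel absmax scheme, not the GLM-130B repo's actual implementation:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-row (output-channel) absmax quantization of a weight matrix.

    Returns int8 weights plus a float scale per row, so that
    w is approximately q * scale[:, None].
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.squeeze(1)

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 weight matrix."""
    return q.astype(np.float32) * scale[:, None]
```

This halves weight memory relative to FP16 (which is why a V100 server can hold the model), at the cost of a small per-channel rounding error.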
Hello, can the quantization method referred to in the link also be applied to the GLM-10B model?
We haven't tried it, but I think a smaller model might be easier to quantize.
I just converted GLM-10B to 4-way model parallelism, but when I run generate_block.sh it fails to load the model.
I changed MPSIZE to 4 in the script.
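For context on what a "4-way" conversion entails: tensor-model parallelism splits each parallel weight into one shard per rank, and loading fails whenever MPSIZE disagrees with the number of shard files or their shapes. Here is a rough sketch of such a split; the parameter names and split axes are illustrative conventions (Megatron-style), not GLM-10B's actual checkpoint layout:

```python
import numpy as np

def split_checkpoint(state: dict, mp_size: int) -> list:
    """Split a single-rank state dict into mp_size model-parallel shards.

    Column-parallel weights are split along the output dimension,
    row-parallel weights along the input dimension, and everything
    else (e.g. layernorms) is replicated on every rank.
    """
    shards = [dict() for _ in range(mp_size)]
    for name, w in state.items():
        if name.endswith("dense_h_to_4h.weight"):    # column-parallel
            parts = np.split(w, mp_size, axis=0)
        elif name.endswith("dense_4h_to_h.weight"):  # row-parallel
            parts = np.split(w, mp_size, axis=1)
        else:                                        # replicated
            parts = [w] * mp_size
        for rank, part in enumerate(parts):
            shards[rank][name] = part
    return shards
```

Dimensions that are not divisible by mp_size cannot be split this way, which is one common cause of load failures after changing the parallelism degree.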
When accelerating inference with FasterTransformer, should I specify a single .pt file, or the whole folder of .pt files?
Hello @jiangliqin! This repo is only for the GLM-130B model; we have not yet done quantization for GLM-10B.
Related Issues (20)
- 6 cards inference HOT 1
- [Question] Does the GLM-130B model have a vocab file? HOT 1
- Question about the GLM-130B model architecture hyperparameters
- Question about the figure in docs/quantization.md
- Training objective
- Question about the FT inference benchmark numbers
- Per-token latency fluctuates in pulses
- The GLM-130B docs say the model weights need 260 GB of GPU memory, but the demo actually uses about 240 GB in total; what is the reason?
- How to set up a model-parallel cluster
- Can GLM cite the sources of referenced content while generating output?
- Cannot submit an application on the model application page HOT 1
- Is there a plan to open-source a chat version based on 130B?
- The model download links received in the application email have all expired HOT 5
- Can FasterTransformer support GLM-6B?
- Will a glm2-130B be made? HOT 1
- Where is the course link? HOT 1
- RuntimeError: probability tensor contains either `inf`, `nan` or element < 0, raised at `answers, answers_with_style, blanks = fill_blanks(raw_text, model, tokenizer, strategy)`
- 8-card FasterTransformer inference error: RuntimeError: [FT][ERROR] Assertion fail: /home/young.ruan/FasterTransformer/src/fastertransformer/th_op/glm/GlmOp.h:539
- Error when running bash scripts/generate.sh --input-source interactive. Please help! HOT 1
- Clarification Request on GLM-130B Model Architecture and Licensing for Commercial Use