Recent pull 23 generating jammed sentence output with new quantized neox20b 4bit model about autogptq HOT 4 CLOSED

GenTxt commented on May 20, 2024

Recent pull 23 generating jammed sentence output with new quantized neox20b 4bit model

from autogptq.

Comments (4)

qwopqwop200 commented on May 20, 2024

If this is true, it's very strange. I've coded the result so that it doesn't change.

PanQiWei commented on May 20, 2024

@GenTxt can you share your quantization code and model with us so that we can try to reproduce the problem and figure out what went wrong?

You could also try the latest commit on the main branch; it may solve your problem.

GenTxt commented on May 20, 2024

https://huggingface.co/kz919/gpt-neox-20b-8k-longtuning/tree/main

Converted above to safetensors with text generation webui script.

CUDA_VISIBLE_DEVICES="0" python quant_with_alpaca.py --pretrained_model_dir models/neox20b_8192_safe --quantized_model_dir 4bit_converted --bits 4 --group_size 128 --fast_tokenizer --save_and_reload

I deleted the old models because the current triton kernel can cause errors on the refurbished 6000 (see triton-lang/triton#1556).

For the specific code above, this error:

Occurs on NVIDIA GeForce RTX 2080 Ti (similar to original 6000 - gpu1)
Doesn't occur on NVIDIA GeForce RTX 3090 (works fine on same - gpu0)

Quantized in latest cuda main and not encountering the error. False alarm. Closing here and carefully testing each update.

Thanks
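
To make the per-GPU comparison above repeatable, the same command can be built once and pinned to each card in turn. This is only a minimal sketch: the flags are copied from the command quoted above, the GPU indices 0 and 1 follow the gpu0/gpu1 labels in the comment, and the helper names `build_run` and `describe` are hypothetical. It prints the commands rather than running them.

```python
import os
import shlex

# Flags copied verbatim from the command quoted in this thread.
QUANT_ARGS = [
    "python", "quant_with_alpaca.py",
    "--pretrained_model_dir", "models/neox20b_8192_safe",
    "--quantized_model_dir", "4bit_converted",
    "--bits", "4",
    "--group_size", "128",
    "--fast_tokenizer",
    "--save_and_reload",
]


def build_run(gpu_index, base_env=None):
    """Return (argv, env) for one quantization run pinned to a single GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # Only the chosen device is visible to the process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return list(QUANT_ARGS), env


def describe(gpu_index):
    """Render the run as a shell one-liner, for logging or copy-paste."""
    argv, env = build_run(gpu_index, base_env={})
    prefix = 'CUDA_VISIBLE_DEVICES="{}" '.format(env["CUDA_VISIBLE_DEVICES"])
    return prefix + shlex.join(argv)


if __name__ == "__main__":
    for gpu in (0, 1):
        print(describe(gpu))
        # To actually launch a run instead of printing it:
        #   argv, env = build_run(gpu)
        #   subprocess.run(argv, env=env, check=True)
```

Both runs write to the same `4bit_converted` directory, so the output path would need to differ per GPU if both results are to be kept for comparison.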

PeiyuZ-star commented on May 20, 2024

Hi, I've also tried quantizing neox20b. The inference speed I got is 16 tokens/s, which isn't fast enough. Could you share your results?
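
For comparing speed figures like the one above, it helps if everyone measures the same way. The following is a rough sketch, not anything taken from AutoGPTQ: `tokens_per_second` is a hypothetical helper, and `generate` stands in for whatever callable runs one generation and returns the number of newly generated tokens.

```python
import time


def tokens_per_second(generate, n_runs=3):
    """Average throughput over n_runs calls.

    `generate` is a hypothetical zero-argument callable that performs one
    generation and returns how many new tokens it produced.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(n_runs):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in for a real model call: pretends to emit 8 tokens in 50 ms.
    def fake_generate():
        time.sleep(0.05)
        return 8

    print("{:.1f} tokens/s".format(tokens_per_second(fake_generate)))
```

On a GPU, the first call usually includes warm-up cost, so discarding it before timing and reporting the prompt length and number of generated tokens alongside the figure makes results easier to compare.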
