casper-hansen / autoawq Goto Github PK

View Code? Open in Web Editor NEW

1.3K 1.3K 148.0 7.46 MB

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:

Home Page: https://casper-hansen.github.io/AutoAWQ/

License: MIT License

Python 99.74% Shell 0.26%

autoawq's People

Contributors

Stargazers

Watchers

Forkers

jamesdborin curname boehm-e tnbase bdambrosio abacaj qwopqwop200 webbigdata-jp mildmillard abaso007 thebloke zeroxclem pyning jan-karsten-kuhnke vikparuchuri s4rduk4r mmnga gtimothee danyray420 ariwaranosai sycomix zyssae minyang-chen stjordanis sebastianbodza compressa-ai twaka muxodious bleugreen ftgreat sombressoul mayi140611 tomchapin sanyamlakhanpal huangyuxiang03 parthb3 asdlei99 drxd1000 sanster peterding icyxp guhaifudeng chenxu2048 oieieio ahhr80 binaryninja rycont vtars hanrui1sensetime aoyuqc jundolc starmys moomou ojus1 dbindioracle yumianhuli1 techcto seungrokj jebtang revsic sycao5 lr1729 sundogs8603 mejerry215 namtranase choronz xnul zymrael djicode xjpang moreh-dev linotfan shashipal95 davidepaglieri ilyasmoutawwakil pariyat justinlin610 cloud9wj sharphl zwd003 ccc0168 britvabo dreamgenx michaelfeil cybersys sushruta surajx kevin3314 jade2290 jinz2014 remusn3t suparious labsimilk victorsanh penghuicheng sangh0 oscarsavolainendr techxgenus hlearning weishengying

autoawq's Issues

支持awq8bit量化吗？

CUDA out of memory when quantizing LLaMA-70b

I tried to quantize llama-70b model on a machine with 8 A100 GPUs, but CUDA OOM error occured (while llama-13b can be quantized successfully). How to quantize such large models with AutoAWQ? Is there some extra params concerning parallelism should be added when loading the model? Thanks!

I have incompatible transformers

I installed AutoAWQ through "pip install autoawq" in the Conda environment of Oobabooga but it didn't appear in the model loader when I started it up. Then i tried upgrading it but it gave me this message

"ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autoawq 0.1.0 requires transformers>=4.32.0, but you have transformers 4.30.2 which is incompatible."

Do I need to upgrade the transformers? I'm not sure how to do that.

Support batch input for performance test

@casper-hansen Thanks for your contribution on AWQ.

In order to better compare the acceleration effect of INT4 relative to FP16 models, could you please support the following features in run_speed() function ?

Support batch input
Support running baseline on FP16 models

Add unit/integration testing

A nice list of tests that I would like to implement in order to more easily make sure everything works.

test each model architecture by generating 1 token (fused + unfused)
test batched input
test quantization
test CUDA kernels
test multi-GPU
test fusing qkv operations

push_to_hub error

AttributeError Traceback (most recent call last)
Cell In[5], line 1
----> 1 model.push_to_hub('rshrott/description-together-ai-4bit-awq')
2 tokenizer.push_to_hub('rshrott/description-together-ai-4bit-awq')

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1614, in Module.getattr(self, name)
1612 if name in modules:
1613 return modules[name]
-> 1614 raise AttributeError("'{}' object has no attribute '{}'".format(
1615 type(self).name, name))

AttributeError: 'LlamaAWQForCausalLM' object has no attribute 'push_to_hub'

Implement weight map and weight sharding

Huggingface format usually involves sharded files in 10GB chunks, and alongside the sharded format, an index file is created to keep track of the layers in the model in a model_name.bin.index.json file.

This is especially useful for larger models.

Support for `gpt-neox` model

Thanks for your amazing work! Hope this can boost up the whole llm community.
Do you have any plans to support gpt-neox model (https://github.com/EleutherAI/gpt-neox) as another base model?

Thanks.

Implement BigCode models (StarCoder etc.)

https://huggingface.co/bigcode/starcoder

Recursion error when creating AutoTokenizer for llama-13b-hf

When creating AutoTokenizer for llama-13b-hf, the problem of RecursionError: maximum recursion depth exceeded while getting the str of an object occurred.

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

in https://github.com/casper-hansen/AutoAWQ/blob/main/awq/entry.py

Maybe the original code in llm-awq is better for different model types:

# all hf model
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
if "mpt" in config.__class__.__name__.lower():
    enc = AutoTokenizer.from_pretrained(config.tokenizer_name, trust_remote_code=True)
else:
    enc = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)

Any plan on updating bloom benchmark?

Hi casper, this is amazing work! I saw you already have bloom model supported, do you happen to run any benchmark on it? If so, can you update the benchmark numbers? Thanks a lot!

AWQ inference is 50% slower than GPTQ

Hey Casper,

System: Ubuntu 22.04 RTX3090 CUDA 118 Python 3.10 AutoAWQ 0.1.0

I'm only seeing 50% of the performance of a GPTQ model in ExLlamaV2 which is surprising.

Also the in device memory use is 15% higher for the same model, AWQ loaded in AutoAWQ vs GPTQ loaded in ExLlamaV2. Here's a few stats with a max_seq_len of 4096.

model                                   max seq      memory     tokens/s
TheBloke_Llama-2-7B-chat-AWQ               4096      7560MB         60.7
TheBloke_Llama-2-7B-chat-GPTQ              4096      6622MB        139.9

TheBloke_Llama-2-13B-chat-AWQ              4096     12532MB         38.7
TheBloke_Llama-2-13B-chat-GPTQ             4096     10828MB         84.9

TheBloke_Phind-CodeLlama-34B-v2-AWQ        4096     21162MB         21.5 (would not complete)
TheBloke_Phind-CodeLlama-34B-v2-GPTQ       4096     18370MB         40.2

Implement exllama q4_matmul kernel as alternative

ExLlama has implemented very optimized CUDA kernels. We should import the kernels to see just how efficient it could be in AWQ.

https://github.com/turboderp/exllama/blob/master/exllama_ext/exllama_ext.cpp#L199

Add multi-gpu support to fused layers

Currently, fused layers only work on single GPUs. If you have multiple GPUs, you would have to set from_quantized(fuse_layers=False).

(Please help) ImportError: DLL load failed while importing awq_inference_engine: The specified module could not be found.

I'll do my best to give some background information here.

I've set everything up,
-the conda environment.
-the pytorch with cudatookit 11.8.
-i have the nvidia toolkit installed.
-installed autoawq
I've confirmed my steps with somebody else. I've set it up correctly. they also are getting the same errors.

when I go to run the basic_generate.py I get the error saying:
ImportError: DLL load failed while importing awq_inference_engine: The specified module could not be found.
[Error Below]
(autoawq) C:\Users\cf902\Desktop\oogabooga-TextGenerationWebui\AWQ\AutoAWQ-main>python examples/basic_generate.py Traceback (most recent call last): File "C:\Users\cf902\Desktop\oogabooga-TextGenerationWebui\AWQ\AutoAWQ-main\examples\basic_generate.py", line 1, in <module> from awq import AutoAWQForCausalLM File "C:\Users\cf902\miniconda3\envs\autoawq\lib\site-packages\awq\__init__.py", line 2, in <module> from awq.models.auto import AutoAWQForCausalLM File "C:\Users\cf902\miniconda3\envs\autoawq\lib\site-packages\awq\models\__init__.py", line 1, in <module> from .mpt import MptAWQForCausalLM File "C:\Users\cf902\miniconda3\envs\autoawq\lib\site-packages\awq\models\mpt.py", line 1, in <module> from .base import BaseAWQForCausalLM File "C:\Users\cf902\miniconda3\envs\autoawq\lib\site-packages\awq\models\base.py", line 11, in <module> from awq.quantize.quantizer import AwqQuantizer File "C:\Users\cf902\miniconda3\envs\autoawq\lib\site-packages\awq\quantize\quantizer.py", line 10, in <module> from awq.modules.linear import WQLinear_GEMM, WQLinear_GEMV File "C:\Users\cf902\miniconda3\envs\autoawq\lib\site-packages\awq\modules\linear.py", line 4, in <module> import awq_inference_engine # with CUDA kernels ImportError: DLL load failed while importing awq_inference_engine: The specified module could not be found.

from what I've been able to diagnose myself.
anywhere in the awq site package. where import awq_inference_engine is called.
it always says the same error:
ImportError: DLL load failed while importing awq_inference_engine: The specified module could not be found.
however, the file awq_inference_engine.cp310-win_amd64.pyd is there.
so its failing or something? i don't know how to diagnose this further. I'd appreciate the help because until this is fixed i don't think i can run AWQ models.

additional notes:
OS: Windows
Discord: craser1 <--(contact me if needed) - (if additional information is needed please ask)

also if this is something I should be directing somewhere else please tell me. I don't know how to leave a bug report...

📌 AutoAWQ Roadmap

Optimization

Fused layers of LLaMa models
Implement GEMV kernel #40
Implement ExLlama kernels #313
More fused layers for implemented models #40
INT8 quantization #45
Optimize split_k_iters #39

More models

Ease of access

Software integration and quality

Unit & integration testing #31
Integrate into Huggingface optimum/transformers
Quantization config #8
Model weight sharding and shard index #36

TypeError: AutoAWQForCausalLM.from_quantized() got an unexpected keyword argument 'safetensors'

I decided to try this instead of AutoGPTQ. I am trying this simplified snippet

pip install huggingface_hub transformers torch einops hf_transfer autoawq

from awq import AutoAWQForCausalLM
self.model = AutoAWQForCausalLM.from_quantized(
    MODEL_NAME, 
    fuse_layers=True,
    trust_remote_code=False, 
    safetensors=True
)

and I get the error

TypeError: AutoAWQForCausalLM.from_quantized() got an unexpected keyword argument 'safetensors'

I'm using Modal.com service and nvidia/cuda:11.8.0-devel-ubuntu22.04 base image. Any ideas why I'd be getting this?

INT8 quantization support

The motivation for INT8 is to keep even more accuracy while still getting some gains on inference speed. I experimented with implementing dequantization for INT8 and ultimately need more work on this before it will be usable.

Edit: Implement SmoothQuant instead. Here is a fork of SmoothQuant that supports LLaMa models. Integrate this into AutoAWQ. https://github.com/AniZpZ/smoothquant/tree/llama-dev

__device__ uint8_t dequantize_s8_to_fp16x2(uint32_t const& source)
{
    // https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/cutlass_extensions/include/cutlass_extensions/interleaved_numeric_conversion.h#L54
    uint8_t result;

    uint32_t*      h   = reinterpret_cast<uint32_t*>(&result);
    uint32_t const i8s = reinterpret_cast<uint32_t const&>(source);

    // Casper: Original was 0x64646464 = {1124, 1124}
    // Optimize to 0x64806480 because divisible by 8, 16, 32, 64, 128
    // NOTE: Test out {1280, 1280} since it's also divisible by 256
    static constexpr uint32_t mask_for_elt_01     = 0x5250;
    static constexpr uint32_t mask_for_elt_23     = 0x5351;
    static constexpr uint32_t start_byte_for_fp16 = 0x64806480; 
    asm volatile("prmt.b32 %0,%1,%2,%3;\n" : "=r"(h[0]) : "r"(i8s), "n"(start_byte_for_fp16), "n"(mask_for_elt_01));
    asm volatile("prmt.b32 %0,%1,%2,%3;\n" : "=r"(h[1]) : "r"(i8s), "n"(start_byte_for_fp16), "n"(mask_for_elt_23));

    // Lastly, we subtract 1152 from our constructed number using fp16 math to get our signed integer as fp16.
    // Casper 0x64806480 = {1152, 1152}
    static constexpr uint32_t I8s_TO_F16s_MAGIC_NUM = 0x64806480; 
    asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[0]) : "r"(h[0]), "r"(I8s_TO_F16s_MAGIC_NUM));
    asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[1]) : "r"(h[1]), "r"(I8s_TO_F16s_MAGIC_NUM));
}

Falcon-7b quantization failure

using tituae/falcon 7b
compiled pull from main today
successfully quantized bloom-1b1, llama-2-7b, llama-7b, mpt-7b, vicuna-7b-v1.5

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/home/bruce/Downloads/models/AutoAWQ/falcon-7b'
quant_path = 'falcon-7b-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

Load model

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, load_safetensors=True, trust_remote_code=True)

Quantize

model.quantize(tokenizer, quant_config=quant_config)

Save quantized model

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

results in

Loading checkpoint shards: 100%|██████████████████████████████████████| 2/2 [00:07<00:00, 3.85s/it]
Found cached dataset json (/home/bruce/.cache/huggingface/datasets/mit-han-lab___json/mit-han-lab--pile-val-backup-39bc257d0ce73de2/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached shuffled indices for dataset at /home/bruce/.cache/huggingface/datasets/mit-han-lab___json/mit-han-lab--pile-val-backup-39bc257d0ce73de2/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-27bcfdae7a102b8f.arrow
Token indices sequence length is longer than the specified maximum sequence length for this model (8752 > 2048). Running this sequence through the model will result in indexing errors
AWQ Search: 0%| | 0/32 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/bruce/Downloads/AutoAWQ/unit_tests/quant-falcon-7b.py", line 15, in
model.quantize(tokenizer, quant_config=quant_config)
File "/home/bruce/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/bruce/Downloads/AutoAWQ/awq/models/base.py", line 54, in quantize
self.search_result = self._awq_search(
File "/home/bruce/Downloads/AutoAWQ/awq/models/base.py", line 193, in _awq_search
scales_list = auto_scale_block(
File "/home/bruce/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/bruce/Downloads/AutoAWQ/awq/quantize/auto_scale.py", line 179, in auto_scale_block
scales_list = [_auto_get_scale(**layer) for layer in layers]
File "/home/bruce/Downloads/AutoAWQ/awq/quantize/auto_scale.py", line 179, in
scales_list = [_auto_get_scale(**layer) for layer in layers]
File "/home/bruce/Downloads/AutoAWQ/awq/quantize/auto_scale.py", line 171, in _auto_get_scale
scales = _search_module_scale(module2inspect, layers, inp, kwargs)
File "/home/bruce/Downloads/AutoAWQ/awq/quantize/auto_scale.py", line 143, in _search_module_scale
fc.weight.data = w_quantize_func(
File "/home/bruce/Downloads/AutoAWQ/awq/quantize/auto_scale.py", line 101, in w_quantize_func
def w_quantize_func(p): return pseudo_quantize_tensor(p, w_bit=quant_config["w_bit"], q_group_size=quant_config["q_group_size"]).detach()
File "/home/bruce/Downloads/AutoAWQ/awq/quantize/quantizer.py", line 12, in pseudo_quantize_tensor
assert org_w_shape[-1] % q_group_size == 0
AssertionError

btw, get the 'Token indices sequence...' message on all models, since none have 8k context. Havent found option to change it yet, but hasn't stopped other models from quantizing. gptj also fails, need to recheck, haven't run OPT yet.

add llava model support

Original llm-awq project support LlavaL model, hope this could be add too.

AWQ_CAUSAL_LM_MODEL_MAP = {
    "mpt": MptAWQForCausalLM,
    "llama": LlamaAWQForCausalLM,
    "opt": OptAWQForCausalLM,
    "RefinedWeb": FalconAWQForCausalLM,
    "RefinedWebModel": FalconAWQForCausalLM,
    "bloom": BloomAWQForCausalLM,
}

Compatibility in Python 3.8 when running entry.py

The argparse.BooleanOptionalAction feature is only available in Python 3.9 and above. If we try to run in Python 3.8, the error occurred.

change
action=argparse.BooleanOptionalAction
to
action='store_true'
in https://github.com/casper-hansen/AutoAWQ/blob/main/awq/entry.py#L155

works fine.

Benchmark test data

Based on the newest released version AutoAWQ==0.0.2, the benchmark tests haved been conducted on 1*A100 80GB, the test data listed below.

It seemed that the acceleration achieved (compared to FP16) through INT4 quantization is not meeting the expected levels (2x).

Considering that the quantized INT4 model has been fused, what are the reasons for this?

The run_speed() in v0.0.2 using model.generate() API, which differs from v0.0.1 using model(). So, is this one of the reasons?

AutoAWQ adopation in other porjects yet?

Hi,

Just want to know the current state of adopation of AutoAWQ. Has there been any integration as a AWQ loader for other higher-level projects yet?

Support GPT-2

Implement faster LayerNorm than nn.LayerNorm

A variety of transformer models today use nn.LayerNorm (MPT, Falcon, GPT2). Some models have 2x LayerNorm per block, so there is room for optimization with a faster layernorm. 64 LayerNorms for 7B models and 96 LayerNorms for 30B models, so for every token generated, 64-96 LayerNorms can run faster.

Investigate the following:

xformers triton layernorm link
fastertransformers cuda layernorm link
transformer_engine layernorm link
llama.cpp f32 layernorm link
tvm.relax.op.nn.layer_norm link
CTranslate2 layernorm link

Create tuning section in quant_config

The idea is to tune all WQLinear on the split_k_iters in the range of [1,2,4,8,16,32] to find which value is the optimal value for each layer based on the latency. This can be run right after quantization and saved in the quant_config.json file, and then it can load the optimal values for each layer once you load the quantized model.

Source of idea:
https://github.com/mit-han-lab/llm-awq/blob/main/tinychat/utils/tune.py

Add LoRA fine-tuning to AWQ

It would be fantastic if we could add the ability to do LoRA fine-tuning and merging of adapters.

Background on QLoRA

Interestingly, for many fine-tunings, the results of QLoRA are very similar to doing an unquantized LoRA fine-tune. Of course, QLoRA allows for fine-tuning with one third of the VRAM requirement (if doing 4bit).

The two common libraries I use are:

bitsandbytes: This does not allow correct merging of adapters to the dequantized base model.
gptq: also does not allow merging of adapters, plus the perplexity is worse than awq (and bnb in some cases).

Why add LoRA to AWQ

AWQ has the best perplexity and good inference speed
If it were possible to do QLoRA AND merge adapters to the base dequantized model, AWQ would be the best available solution for doing fine-tuning, at least in quantized form.

Cuda issue when trying to install

Hej Casper (and whoever might read this),

I'm having some strange cuda issue awq when trying to install on a machine with py3.10, cu118.

`Installing collected packages: awq
Running setup.py develop for awq
error: subprocess-exited-with-error

× python setup.py develop did not run successfully.
│ exit code: 1
╰─> [224 lines of output]
    running develop
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/command/develop.py:40: EasyInstallDeprecationWarning: easy_install command is deprecated.
    !!
    
            ********************************************************************************
            Please avoid running ``setup.py`` and ``easy_install``.
            Instead, use pypa/build, pypa/installer or other
            standards-based tools.
    
            See https://github.com/pypa/setuptools/issues/917 for details.
            ********************************************************************************
    
    !!
      easy_install.initialize_options(self)
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
    !!
    
            ********************************************************************************
            Please avoid running ``setup.py`` directly.
            Instead, use pypa/build, pypa/installer or other
            standards-based tools.
    
            See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
            ********************************************************************************
    
    !!
      self.initialize_options()
    running egg_info
    writing awq.egg-info/PKG-INFO
    writing dependency_links to awq.egg-info/dependency_links.txt
    writing requirements to awq.egg-info/requires.txt
    writing top-level names to awq.egg-info/top_level.txt
    reading manifest file 'awq.egg-info/SOURCES.txt'
    adding license file 'LICENSE'
    writing manifest file 'awq.egg-info/SOURCES.txt'
    running build_ext
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py:398: UserWarning: There are no g++ version bounds defined for CUDA version 11.8
      warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
    building 'awq_inference_engine' extension
    Emitting ninja build file /workspace/AutoAWQ/build/temp.linux-x86_64-cpython-310/build.ninja...
    Compiling objects...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    [1/1] /usr/local/cuda/bin/nvcc  -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/miniconda3/envs/py3.10/include/python3.10 -c -c /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu -o /workspace/AutoAWQ/build/temp.linux-x86_64-cpython-310/awq_cuda/quantization/gemm_cuda_gen.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90
    FAILED: /workspace/AutoAWQ/build/temp.linux-x86_64-cpython-310/awq_cuda/quantization/gemm_cuda_gen.o
    /usr/local/cuda/bin/nvcc  -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/miniconda3/envs/py3.10/include/python3.10 -c -c /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu -o /workspace/AutoAWQ/build/temp.linux-x86_64-cpython-310/awq_cuda/quantization/gemm_cuda_gen.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=awq_inference_engine -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(35): warning #177-D: variable "scaling_factors_shared" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(36): warning #177-D: variable "zeros_shared" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(39): warning #177-D: variable "blockIdx_x" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(53): warning #177-D: variable "ld_zero_flag" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(228): warning #177-D: variable "scaling_factors_shared" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(229): warning #177-D: variable "zeros_shared" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(233): warning #177-D: variable "blockIdx_x" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(247): warning #177-D: variable "ld_zero_flag" was declared but never referenced
    
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
              detected during:
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
    (61): here
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/c10/core/TensorImpl.h(77): here
    
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
              detected during:
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
    (61): here
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/ATen/core/qualified_name.h(73): here
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(22): warning #177-D: function "__pack_half2" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(35): warning #177-D: variable "scaling_factors_shared" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(36): warning #177-D: variable "zeros_shared" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(39): warning #177-D: variable "blockIdx_x" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(53): warning #177-D: variable "ld_zero_flag" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(228): warning #177-D: variable "scaling_factors_shared" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(229): warning #177-D: variable "zeros_shared" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(233): warning #177-D: variable "blockIdx_x" was declared but never referenced
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(247): warning #177-D: variable "ld_zero_flag" was declared but never referenced
    
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
              detected during:
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
    (61): here
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/c10/core/TensorImpl.h(77): here
    
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
              detected during:
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
    (61): here
                instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
    /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/ATen/core/qualified_name.h(73): here
    
    /workspace/AutoAWQ/awq_cuda/quantization/gemm_cuda_gen.cu(22): warning #177-D: function "__pack_half2" was declared but never referenced
    
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 892; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 892; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 900; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 900; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 908; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 908; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 916; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 916; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 924; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 924; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 928; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 932; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 936; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 940; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 944; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 948; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 952; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 956; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 964; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 964; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 972; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 972; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 980; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 980; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 988; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 988; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 996; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 996; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1000; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1004; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1008; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1012; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1016; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1020; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1024; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1028; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1834; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1834; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1842; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1842; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1850; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1850; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1854; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1858; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1862; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1866; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1874; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1874; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1882; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1882; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1890; error   : Feature 'ldmatrix' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1890; error   : Modifier '.m8n8' requires .target sm_75 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1894; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1898; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1902; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas /tmp/tmpxft_00000313_00000000-10_gemm_cuda_gen.compute_70.ptx, line 1906; error   : Feature '.m16n8k16' requires .target sm_80 or higher
    ptxas fatal   : Ptx assembly aborted due to errors
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
        subprocess.run(
      File "/root/miniconda3/envs/py3.10/lib/python3.10/subprocess.py", line 526, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/workspace/AutoAWQ/setup.py", line 41, in <module>
        setup(
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/__init__.py", line 107, in setup
        return distutils.core.setup(**attrs)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
        return run_commands(dist)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
        dist.run_commands()
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
        self.run_command(cmd)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/dist.py", line 1234, in run_command
        super().run_command(command)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
        cmd_obj.run()
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/command/develop.py", line 34, in run
        self.install_for_development()
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/command/develop.py", line 111, in install_for_development
        self.run_command('build_ext')
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
        self.distribution.run_command(command)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/dist.py", line 1234, in run_command
        super().run_command(command)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
        cmd_obj.run()
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
        _build_ext.run(self)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
        self.build_extensions()
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
        build_ext.build_extensions(self)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
        self._build_extensions_serial()
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
        self.build_extension(ext)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
        _build_ext.build_extension(self, ext)
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
        objects = self.compiler.compile(
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
        _write_ninja_file_and_compile_objects(
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
        _run_ninja_build(
      File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension
    [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.

error: subprocess-exited-with-error`

Seems to happen when running pip install -e .

Implement metal kernel for GPUs on Mac

MIT released their TinyChatEngine which includes kernels for every kind of platform. To ensure wide availability, we should integrate the kernels for Metal which are kernels for running quantized on MacOS.

Requirements:

Run the Metal kernel through Python (maybe run on the fly?)
Automatically switch to the Metal kernel if torch.backends.mps.is_available(), otherwise use CUDA.
Figure out: How to support either GEMV/GEMM format with Metal

Kernel:
https://github.com/mit-han-lab/TinyChatEngine/blob/main/kernels/metal/kernel/op.metal#L10

... is not a folder containing a `.index.json` file or a pytorch_model.bin file

Does this library support the official META based LLama2 model(s), or just the huggingface flavor?

Hey there,

Thanks for releasing this! Does this library support the official implementations directly from META, or is the just the huggingface variety? Thanks!

Strange interaction for DeepZero via text-generation-webui?

I'm trying to integrate AutoAWQ into webui: oobabooga/text-generation-webui#3999
Unfortunately it is causing problems if DeepZero is used, I can't tell where the problem lies.

text-generation-webui  | 2023-09-19 13:13:18 INFO:Loading TheBloke_vicuna-13B-v1.5-16K-AWQ...
text-generation-webui  | [2023-09-19 13:13:19,051] [INFO] [partition_parameters.py:347:__exit__] finished initializing model - num_params = 1, num_elems = 0.16B
text-generation-webui  | 2023-09-19 13:13:19 ERROR:Failed to load the model.
text-generation-webui  | Traceback (most recent call last):
text-generation-webui  |   File "/app/modules/ui_model_menu.py", line 194, in load_model_wrapper
text-generation-webui  |     shared.model, shared.tokenizer = load_model(shared.model_name, loader)
text-generation-webui  |   File "/app/modules/models.py", line 78, in load_model
text-generation-webui  |     output = load_func_map[loader](model_name)
text-generation-webui  |   File "/app/modules/models.py", line 285, in AutoAWQ_loader
text-generation-webui  |     return AutoAWQForCausalLM.from_quantized(quant_path=Path(f'{shared.args.model_dir}/{model_name}'),
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/awq/models/auto.py", line 44, in from_quantized
text-generation-webui  |     return AWQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized(
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/awq/models/base.py", line 338, in from_quantized
text-generation-webui  |     model = AutoModelForCausalLM.from_config(config=config, torch_dtype=torch_dtype, trust_remote_code=trust_remote_code)
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 446, in from_config
text-generation-webui  |     return model_class._from_config(config, **kwargs)
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1168, in _from_config
text-generation-webui  |     model = cls(config, **kwargs)
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
text-generation-webui  |     f(module, *args, **kwargs)
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 747, in __init__
text-generation-webui  |     self.model = LlamaModel(config)
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 458, in wrapper
text-generation-webui  |     f(module, *args, **kwargs)
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 576, in __init__
text-generation-webui  |     self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 465, in wrapper
text-generation-webui  |     self._post_init_method(module)
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 984, in _post_init_method
text-generation-webui  |     param.data = param.data.to(self.local_device)
text-generation-webui  | NotImplementedError: Cannot copy out of meta tensor; no data!

Any ideas?

不使用显存加载模型，仅使用内存和cpu

什么时候可以仅使用cpu和内存进行量化模型啊

使用conda安装autoawq，报错

使用conda安装autoawq，报错。

Traceback (most recent call last):
File "/root/bei/Models/autoawq_quention.py", line 1, in
from awq import AutoAWQForCausalLM
File "/root/miniconda/envs/autoawq/lib/python3.10/site-packages/awq/init.py", line 2, in
from awq.models.auto import AutoAWQForCausalLM
File "/root/miniconda/envs/autoawq/lib/python3.10/site-packages/awq/models/init.py", line 1, in
from .mpt import MptAWQForCausalLM
File "/root/miniconda/envs/autoawq/lib/python3.10/site-packages/awq/models/mpt.py", line 1, in
from .base import BaseAWQForCausalLM
File "/root/miniconda/envs/autoawq/lib/python3.10/site-packages/awq/models/base.py", line 11, in
from awq.quantize.quantizer import AwqQuantizer
File "/root/miniconda/envs/autoawq/lib/python3.10/site-packages/awq/quantize/quantizer.py", line 10, in
from awq.modules.linear import WQLinear_GEMM, WQLinear_GEMV
File "/root/miniconda/envs/autoawq/lib/python3.10/site-packages/awq/modules/linear.py", line 4, in
import awq_inference_engine # with CUDA kernels
ImportError: libc10_cuda.so: cannot open shared object file: No such file or directory

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

cuda环境

'ellipsis' object has no attribute 'shape'

Getting this error when trying to load a model for inference:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

AttributeError Traceback (most recent call last)
Cell In[1], line 10
7 model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
8 tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
---> 10 model.generate(...)

File /workspace/AutoAWQ/awq/models/base.py:37, in BaseAWQForCausalLM.generate(self, *args, **kwargs)
35 def generate(self, *args, **kwargs):
36 with torch.inference_mode():
---> 37 return self.model.generate(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1455, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1447 # 3. Define model inputs
1448 # inputs_tensor has to be defined
1449 # model_input_name is defined if model-specific keyword input is passed
1450 # otherwise model_input_name is None
1451 # all model-specific keyword inputs are removed from model_kwargs
1452 inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(
1453 inputs, generation_config.bos_token_id, model_kwargs
1454 )
-> 1455 batch_size = inputs_tensor.shape[0]
1457 # 4. Define other model kwargs
1458 model_kwargs["output_attentions"] = generation_config.output_attentions

AttributeError: 'ellipsis' object has no attribute 'shape'

Experiment with implementing AWQ for BERT models

If we can speed up the BERT model, we will significantly increase the throughput of many cases. Experiment with SentenceTransformers first.

Windows build support

Resolve LNK2001: unresolved external symbol errors when building on Windows.

Also needs to work with GitHub workflows.

Bug hunt: illegal memory access

This occurs when modifying a few things in AWQ. I have not yet found the reason why but this bug needs to be squashed because it is preventing a few upgrades.

RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

You can trigger this by modifying batch_size from 1 to 2 here:

AutoAWQ/awq/entry.py

Line 105 in 560fbe5

ids = torch.randint(0, tokenizer.vocab_size, (1, n_context)).cuda()

It also triggers on the exllama branch when running the speed test:

AutoAWQ/awq/entry.py

Line 141 in 855854a

    
               python -m awq.entry --entry_type speed --model_path vicuna-7b-v1.5-awq --quant_file awq_model_w4_g128.pt

CodeLlama 34B errors out after 3+ completions

Running codellama 34b using latest autoawq (installed from repo):

  File "/home/anton/personal/transformer-experiments/env/lib/python3.10/site-packages/awq/modules/fused/attn.py", line 183, in forward
    self.cache_v[:bsz, :, self.start_pos : self.start_pos + seqlen, :] = values_store
RuntimeError: The expanded size of the tensor (0) must match the existing size (17) at non-singleton dimension 2.  Target sizes: [1, 8, 0, 128].  Tensor sizes: [8, 17, 128]

To reproduce:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/CodeLlama-34B-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
    max_new_tokens=1024,
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

tokens = tokenizer(
    "# Write a python function to loop to 1000\n\ndef", return_tensors="pt"
).to("cuda")

# Generate output
for _ in range(10):
    generation_output = model.generate(
        **tokens,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        top_k=0,
        max_new_tokens=512,
    )

    print(tokenizer.decode(generation_output[0], skip_special_tokens=True))

Quantize models with custom datasets

Currently, the only way to quantize using AWQ is with the validation dataset from The Pile. This is good for general use cases, but it would be beneficial to find the optimal scaling factor based on custom datasets for minimum accuracy loss.

Requirements:

either pass a huggingface path to a dataset
or pass a custom function that loads the dataset and applies custom preprocessing before quantization

Clean up fused modules

Fused modules were moved from the original TinyChat into the core of AWQ. Work on generalizing the code better with the parameters in the fused modules.

A few things that need cleaning:

https://github.com/casper-hansen/AutoAWQ/blob/main/awq/modules/fused_attn.py#L12
https://github.com/casper-hansen/AutoAWQ/blob/main/awq/modules/fused_attn.py#L107
https://github.com/casper-hansen/AutoAWQ/blob/main/awq/modules/fused_attn.py#L139
https://github.com/casper-hansen/AutoAWQ/blob/main/awq/modules/fused_mlp.py#L79C8-L79C31
https://github.com/casper-hansen/AutoAWQ/blob/main/awq/modules/fused_norm.py#L27

Additionally, move the code for isinstance(m, LlamaMLP) into the actual model class instead.

Interested in Hugging Face transformers integration?

Hi @casper-hansen !
Amazing work on building this repository! We were interested in integrating AWQ into transformers to the official list of supported quantized methods together with bitsandbytes (LLM.int8 & QLoRA) and recently auto-gptq.

https://huggingface.co/blog/gptq-integration#native-support-of-gptq-models-in--transformers

Would you be happy to collaborate on this?
Thanks and looking forward to your reply!

Must split_k_iters be 8 for GEMM or GEMMv2 kernel?

Thanks for your jobs!

I'd like to inference with larger batchsize(e.g. 64).
For better performance, can I adjust the value of split_k_iters?

Optimize GEMV kernel - context and batch size

Currently, the GEMV kernel is blazing fast at token generation - it achieves 90-95% of ExLlama V2 in terms of speed. However, where it lacks is in processing the context. Specifically, processing the context is 10x slower than with the GEMM kernel which achieves 75-80% of ExLlama V2 speed.

This issue is simple: optimize the GEMV kernel such that we keep the fast token generation but get faster context processing (ideally close to same speed as GEMM).

Create class QuantConfig

Should implement:

from_pretrained
save_pretrained

A few more requirements:

Move config loading into this class.

Handle `n_kv_heads` for fused layers

Currently, the TinyLlama model fails with fused layers because the shapes are not set up correctly for that specific model due to not handling n_kv_heads.

The general calculation seems to be:
n_tokens*(n_heads+(n_kv_heads*2))*(hidden_size // n_heads)

This should fix problems for all models with GQA.

EDIT: Potentially also fix shapes during forward of GEMM.

print(x.reshape(-1, x.shape[-1]).shape[1] / self.scales.shape[0]) is 4 for TinyLlama, needs to be multiple of 32

EDIT: Probably best to take inspiration from Llama / CodeLlama

No kv_heads:
https://github.com/facebookresearch/llama/blob/1076b9c51c77ad06e9d7ba8a4c6df775741732bd/llama/model.py#L76

With kv_heads:
https://github.com/facebookresearch/codellama/blob/427d6ac90f0b7db206bc4c62f4c5d38f92ca4d10/llama/model.py#L90

!python -m pip install --upgrade pip
!pip install git+https://github.com/huggingface/transformers
!git clone https://github.com/casper-hansen/AutoAWQ
%cd AutoAWQ
!pip3 install .

Replication:

model_name_or_path = "TheBloke/Llama-2-7B-chat-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

prompt = "Tell me about AI"
prompt_template=f'''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]

'''

print("\n\n*** Generate:")

tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

print("Output: ", tokenizer.decode(generation_output[0]))

Error:

TypeError                                 Traceback (most recent call last)
Cell In[12], line 24
     18 tokens = tokenizer(
     19     prompt_template,
     20     return_tensors='pt'
     21 ).input_ids.cuda()
     23 # Generate output
---> 24 generation_output = model.generate(
     25     tokens,
     26     do_sample=True,
     27     temperature=0.7,
     28     top_p=0.95,
     29     top_k=40,
     30     max_new_tokens=512
     31 )
     33 print("Output: ", tokenizer.decode(generation_output[0]))

File /usr/local/lib/python3.10/dist-packages/awq/models/base.py:36, in BaseAWQForCausalLM.generate(self, *args, **kwargs)
     34 def generate(self, *args, **kwargs):
     35     with torch.inference_mode():
---> 36         return self.model.generate(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1652, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1644     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1645         input_ids=input_ids,
   1646         expand_size=generation_config.num_return_sequences,
   1647         is_encoder_decoder=self.config.is_encoder_decoder,
   1648         **model_kwargs,
   1649     )
   1651     # 13. run sample
-> 1652     return self.sample(
   1653         input_ids,
   1654         logits_processor=logits_processor,
   1655         logits_warper=logits_warper,
   1656         stopping_criteria=stopping_criteria,
   1657         pad_token_id=generation_config.pad_token_id,
   1658         eos_token_id=generation_config.eos_token_id,
   1659         output_scores=generation_config.output_scores,
   1660         return_dict_in_generate=generation_config.return_dict_in_generate,
   1661         synced_gpus=synced_gpus,
   1662         streamer=streamer,
   1663         **model_kwargs,
   1664     )
   1666 elif generation_mode == GenerationMode.BEAM_SEARCH:
   1667     # 11. prepare beam search scorer
   1668     beam_scorer = BeamSearchScorer(
   1669         batch_size=batch_size,
   1670         num_beams=generation_config.num_beams,
   (...)
   1675         max_length=generation_config.max_length,
   1676     )

File /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:2734, in GenerationMixin.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2731 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2733 # forward pass to get next token
-> 2734 outputs = self(
   2735     **model_inputs,
   2736     return_dict=True,
   2737     output_attentions=output_attentions,
   2738     output_hidden_states=output_hidden_states,
   2739 )
   2741 if synced_gpus and this_peer_finished:
   2742     continue  # don't waste resources running the code we don't need

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:1034, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1031 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   1033 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1034 outputs = self.model(
   1035     input_ids=input_ids,
   1036     attention_mask=attention_mask,
   1037     position_ids=position_ids,
   1038     past_key_values=past_key_values,
   1039     inputs_embeds=inputs_embeds,
   1040     use_cache=use_cache,
   1041     output_attentions=output_attentions,
   1042     output_hidden_states=output_hidden_states,
   1043     return_dict=return_dict,
   1044 )
   1046 hidden_states = outputs[0]
   1047 if self.config.pretraining_tp > 1:

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:921, in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    917     layer_outputs = torch.utils.checkpoint.checkpoint(
    918         create_custom_forward(decoder_layer), hidden_states, attention_mask, position_ids
    919     )
    920 else:
--> 921     layer_outputs = decoder_layer(
    922         hidden_states,
    923         attention_mask=attention_mask,
    924         position_ids=position_ids,
    925         past_key_value=past_key_value,
    926         output_attentions=output_attentions,
    927         use_cache=use_cache,
    928         padding_mask=padding_mask,
    929     )
    931 hidden_states = layer_outputs[0]
    933 if use_cache:

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:631, in LlamaDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, padding_mask)
    628 hidden_states = self.input_layernorm(hidden_states)
    630 # Self Attention
--> 631 hidden_states, self_attn_weights, present_key_value = self.self_attn(
    632     hidden_states=hidden_states,
    633     attention_mask=attention_mask,
    634     position_ids=position_ids,
    635     past_key_value=past_key_value,
    636     output_attentions=output_attentions,
    637     use_cache=use_cache,
    638     padding_mask=padding_mask,
    639 )
    640 hidden_states = residual + hidden_states
    642 # Fully Connected

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

TypeError: QuantAttentionFused.forward() got an unexpected keyword argument 'padding_mask'