
Comments (23)

PanQiWei commented on May 20, 2024

Downloading quantized models from the HF Hub is a feature I plan to add in v0.2.0. I'm now writing the feature plans for v0.2.0 and v0.3.0; anyone interested can see them in Projects later.

TheBloke commented on May 20, 2024

Try this:

! BUILD_CUDA_EXT=0 pip install auto-gptq

vrunm commented on May 20, 2024

@TheBloke That's been my concern from the start. I was trying versions of Alpaca, GPT-J, Bloom, OPT, and Pegasus but was not able to load them from huggingface.

vrunm commented on May 20, 2024

Sure, I have used bitsandbytes and accelerate, and have experimented with Alpaca, OPT, and GPT-J using their 8-bit versions.

vrunm commented on May 20, 2024
  Preparing metadata (setup.py) ... done
Requirement already satisfied: accelerate>=0.18.0 in /opt/conda/lib/python3.10/site-packages (from auto-gptq==0.1.0.dev0) (0.18.0)
Requirement already satisfied: datasets in /opt/conda/lib/python3.10/site-packages (from auto-gptq==0.1.0.dev0) (2.1.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from auto-gptq==0.1.0.dev0) (1.23.5)
Requirement already satisfied: rouge in /opt/conda/lib/python3.10/site-packages (from auto-gptq==0.1.0.dev0) (1.0.1)
Requirement already satisfied: torch>=1.13.0 in /opt/conda/lib/python3.10/site-packages (from auto-gptq==0.1.0.dev0) (2.0.0)
Requirement already satisfied: safetensors in /opt/conda/lib/python3.10/site-packages (from auto-gptq==0.1.0.dev0) (0.3.1)
Requirement already satisfied: transformers>=4.26.1 in /opt/conda/lib/python3.10/site-packages (from auto-gptq==0.1.0.dev0) (4.28.1)
Requirement already satisfied: psutil in /opt/conda/lib/python3.10/site-packages (from accelerate>=0.18.0->auto-gptq==0.1.0.dev0) (5.9.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from accelerate>=0.18.0->auto-gptq==0.1.0.dev0) (21.3)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.10/site-packages (from accelerate>=0.18.0->auto-gptq==0.1.0.dev0) (6.0)
Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch>=1.13.0->auto-gptq==0.1.0.dev0) (1.11.1)
Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch>=1.13.0->auto-gptq==0.1.0.dev0) (3.1.2)
Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch>=1.13.0->auto-gptq==0.1.0.dev0) (3.1)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.10/site-packages (from torch>=1.13.0->auto-gptq==0.1.0.dev0) (4.5.0)
Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from torch>=1.13.0->auto-gptq==0.1.0.dev0) (3.11.0)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/conda/lib/python3.10/site-packages (from transformers>=4.26.1->auto-gptq==0.1.0.dev0) (0.13.3)
Requirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /opt/conda/lib/python3.10/site-packages (from transformers>=4.26.1->auto-gptq==0.1.0.dev0) (0.13.4)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.10/site-packages (from transformers>=4.26.1->auto-gptq==0.1.0.dev0) (2023.3.23)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from transformers>=4.26.1->auto-gptq==0.1.0.dev0) (2.28.2)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.10/site-packages (from transformers>=4.26.1->auto-gptq==0.1.0.dev0) (4.64.1)
Requirement already satisfied: fsspec[http]>=2021.05.0 in /opt/conda/lib/python3.10/site-packages (from datasets->auto-gptq==0.1.0.dev0) (2023.4.0)
Requirement already satisfied: pyarrow>=5.0.0 in /opt/conda/lib/python3.10/site-packages (from datasets->auto-gptq==0.1.0.dev0) (10.0.1)
Requirement already satisfied: dill in /opt/conda/lib/python3.10/site-packages (from datasets->auto-gptq==0.1.0.dev0) (0.3.6)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.10/site-packages (from datasets->auto-gptq==0.1.0.dev0) (3.2.0)
Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from datasets->auto-gptq==0.1.0.dev0) (1.5.3)
Requirement already satisfied: responses<0.19 in /opt/conda/lib/python3.10/site-packages (from datasets->auto-gptq==0.1.0.dev0) (0.18.0)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.10/site-packages (from datasets->auto-gptq==0.1.0.dev0) (0.70.14)
Requirement already satisfied: aiohttp in /opt/conda/lib/python3.10/site-packages (from datasets->auto-gptq==0.1.0.dev0) (3.8.4)
Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from rouge->auto-gptq==0.1.0.dev0) (1.16.0)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->auto-gptq==0.1.0.dev0) (2.1.1)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->auto-gptq==0.1.0.dev0) (6.0.4)
Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->auto-gptq==0.1.0.dev0) (1.8.2)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->auto-gptq==0.1.0.dev0) (4.0.2)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->auto-gptq==0.1.0.dev0) (1.3.3)
Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->auto-gptq==0.1.0.dev0) (22.2.0)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->auto-gptq==0.1.0.dev0) (1.3.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.10/site-packages (from packaging>=20.0->accelerate>=0.18.0->auto-gptq==0.1.0.dev0) (3.0.9)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests->transformers>=4.26.1->auto-gptq==0.1.0.dev0) (2022.12.7)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->transformers>=4.26.1->auto-gptq==0.1.0.dev0) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests->transformers>=4.26.1->auto-gptq==0.1.0.dev0) (1.26.15)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2->torch>=1.13.0->auto-gptq==0.1.0.dev0) (2.1.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets->auto-gptq==0.1.0.dev0) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets->auto-gptq==0.1.0.dev0) (2023.3)
Requirement already satisfied: mpmath>=0.19 in /opt/conda/lib/python3.10/site-packages (from sympy->torch>=1.13.0->auto-gptq==0.1.0.dev0) (1.3.0)
Building wheels for collected packages: auto-gptq
  Building wheel for auto-gptq (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [45 lines of output]
      /opt/conda/lib/python3.10/site-packages/setuptools/dist.py:493: UserWarning: Normalizing 'v0.1.0-dev' to '0.1.0.dev0'
        warnings.warn(tmpl.format(**locals()))
      running bdist_wheel
      running build
      running build_py
      running build_ext
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/kaggle/working/AutoGPTQ/setup.py", line 49, in <module>
          setup(
        File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/opt/conda/lib/python3.10/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 343, in run
          self.run_command("build")
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/opt/conda/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/opt/conda/lib/python3.10/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 499, in build_extensions
          _check_cuda_version(compiler_name, compiler_version)
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 387, in _check_cuda_version
          raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
      RuntimeError:
      The detected CUDA version (12.1) mismatches the version that was used to compile
      PyTorch (11.3). Please make sure to use the same CUDA versions.
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for auto-gptq
  Running setup.py clean for auto-gptq
Failed to build auto-gptq
Installing collected packages: auto-gptq
  Running setup.py install for auto-gptq ... error
  error: subprocess-exited-with-error
  
  × Running setup.py install for auto-gptq did not run successfully.
  │ exit code: 1
  ╰─> [90 lines of output]
      /opt/conda/lib/python3.10/site-packages/setuptools/dist.py:493: UserWarning: Normalizing 'v0.1.0-dev' to '0.1.0.dev0'
        warnings.warn(tmpl.format(**locals()))
      running install
      /opt/conda/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.10
      creating build/lib.linux-x86_64-3.10/auto_gptq
      copying auto_gptq/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq
      creating build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      copying auto_gptq/nn_modules/qlinear.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      copying auto_gptq/nn_modules/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      copying auto_gptq/nn_modules/qlinear_triton.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules
      creating build/lib.linux-x86_64-3.10/auto_gptq/utils
      copying auto_gptq/utils/data_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/utils
      copying auto_gptq/utils/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/utils
      creating build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/_base.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/text_summarization_task.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/sequence_classification_task.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/language_modeling_task.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      copying auto_gptq/eval_tasks/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks
      creating build/lib.linux-x86_64-3.10/auto_gptq/quantization
      copying auto_gptq/quantization/gptq.py -> build/lib.linux-x86_64-3.10/auto_gptq/quantization
      copying auto_gptq/quantization/quantizer.py -> build/lib.linux-x86_64-3.10/auto_gptq/quantization
      copying auto_gptq/quantization/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/quantization
      creating build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/moss.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/auto.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/_base.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/gptj.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/gpt2.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/_const.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/opt.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/gpt_neox.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/bloom.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      copying auto_gptq/modeling/llama.py -> build/lib.linux-x86_64-3.10/auto_gptq/modeling
      creating build/lib.linux-x86_64-3.10/auto_gptq/nn_modules/triton_utils
      copying auto_gptq/nn_modules/triton_utils/custom_autotune.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules/triton_utils
      copying auto_gptq/nn_modules/triton_utils/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/nn_modules/triton_utils
      creating build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/__init__.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/generation_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      copying auto_gptq/eval_tasks/_utils/classification_utils.py -> build/lib.linux-x86_64-3.10/auto_gptq/eval_tasks/_utils
      running build_ext
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/kaggle/working/AutoGPTQ/setup.py", line 49, in <module>
          setup(
        File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/opt/conda/lib/python3.10/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/setuptools/command/install.py", line 68, in run
          return orig.install.run(self)
        File "/opt/conda/lib/python3.10/distutils/command/install.py", line 568, in run
          self.run_command('build')
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/opt/conda/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/opt/conda/lib/python3.10/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 499, in build_extensions
          _check_cuda_version(compiler_name, compiler_version)
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 387, in _check_cuda_version
          raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
      RuntimeError:
      The detected CUDA version (12.1) mismatches the version that was used to compile
      PyTorch (11.3). Please make sure to use the same CUDA versions.
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> auto-gptq

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

I tried that and still ran into the same issue.

TheBloke commented on May 20, 2024

Are you sure you ran it exactly like I showed?

! BUILD_CUDA_EXT=0 pip install auto-gptq

This is wrong syntax:

BUILD_CUDA_EXT=0
!pip install auto-gptq
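
A side note on why the split form fails, assuming a Jupyter/IPython notebook like Kaggle's: a bare BUILD_CUDA_EXT=0 on its own line in a Python cell only creates a Python variable, and every ! command runs in its own subshell, so the setting never reaches pip. If you prefer not to inline it, a sketch of an alternative is to export it via os.environ first, since shells spawned by ! inherit the notebook process environment:

import os
os.environ["BUILD_CUDA_EXT"] = "0"  # now visible to any !pip subshell in this session

!pip install auto-gptq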

Please show the output of running:

! BUILD_CUDA_EXT=0 pip install -v auto-gptq

vrunm commented on May 20, 2024

Sure, I tried that and it worked.

While trying the demo below, I ran into an error:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig


pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"


tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
)

# load un-quantized model, the model will always be force loaded into cpu
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask" 
# with value under torch.LongTensor type.
model.quantize(examples, use_triton=False)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# load quantized model, currently only support cpu or single gpu
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to("cuda:0"))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
╭────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /tmp/ipykernel_32/3071398276.py:38 in <module>                                                   │
│                                                                                                  │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_32/3071398276.py'                           │
│                                                                                                  │
│ /kaggle/working/AutoGPTQ/auto_gptq/modeling/_base.py:371 in generate                             │
│                                                                                                  │
│   368 │   def generate(self, **kwargs):                                                          │
│   369 │   │   """shortcut for model.generate"""                                                  │
│   370 │   │   with torch.inference_mode(), torch.amp.autocast(device_type=self.device.type):     │
│ ❱ 371 │   │   │   return self.model.generate(**kwargs)                                           │
│   372 │                                                                                          │
│   373 │   def prepare_inputs_for_generation(self, *args, **kwargs):                              │
│   374 │   │   """shortcut for model.prepare_inputs_for_generation"""                             │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:115 in decorate_context       │
│                                                                                                  │
│   112 │   @functools.wraps(func)                                                                 │
│   113 │   def decorate_context(*args, **kwargs):                                                 │
│   114 │   │   with ctx_factory():                                                                │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                                                   │
│   116 │                                                                                          │
│   117 │   return decorate_context                                                                │
│   118                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1437 in generate        │
│                                                                                                  │
│   1434 │   │   │   │   )                                                                         │
│   1435 │   │   │                                                                                 │
│   1436 │   │   │   # 11. run greedy search                                                       │
│ ❱ 1437 │   │   │   return self.greedy_search(                                                    │
│   1438 │   │   │   │   input_ids,                                                                │
│   1439 │   │   │   │   logits_processor=logits_processor,                                        │
│   1440 │   │   │   │   stopping_criteria=stopping_criteria,                                      │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:2248 in greedy_search   │
│                                                                                                  │
│   2245 │   │   │   model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)  │
│   2246 │   │   │                                                                                 │
│   2247 │   │   │   # forward pass to get next token                                              │
│ ❱ 2248 │   │   │   outputs = self(                                                               │
│   2249 │   │   │   │   **model_inputs,                                                           │
│   2250 │   │   │   │   return_dict=True,                                                         │
│   2251 │   │   │   │   output_attentions=output_attentions,                                      │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py:938 in forward   │
│                                                                                                  │
│    935 │   │   return_dict = return_dict if return_dict is not None else self.config.use_return  │
│    936 │   │                                                                                     │
│    937 │   │   # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)   │
│ ❱  938 │   │   outputs = self.model.decoder(                                                     │
│    939 │   │   │   input_ids=input_ids,                                                          │
│    940 │   │   │   attention_mask=attention_mask,                                                │
│    941 │   │   │   head_mask=head_mask,                                                          │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py:704 in forward   │
│                                                                                                  │
│    701 │   │   │   │   │   None,                                                                 │
│    702 │   │   │   │   )                                                                         │
│    703 │   │   │   else:                                                                         │
│ ❱  704 │   │   │   │   layer_outputs = decoder_layer(                                            │
│    705 │   │   │   │   │   hidden_states,                                                        │
│    706 │   │   │   │   │   attention_mask=causal_attention_mask,                                 │
│    707 │   │   │   │   │   layer_head_mask=(head_mask[idx] if head_mask is not None else None),  │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py:329 in forward   │
│                                                                                                  │
│    326 │   │   │   hidden_states = self.self_attn_layer_norm(hidden_states)                      │
│    327 │   │                                                                                     │
│    328 │   │   # Self Attention                                                                  │
│ ❱  329 │   │   hidden_states, self_attn_weights, present_key_value = self.self_attn(             │
│    330 │   │   │   hidden_states=hidden_states,                                                  │
│    331 │   │   │   past_key_value=past_key_value,                                                │
│    332 │   │   │   attention_mask=attention_mask,                                                │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py:174 in forward   │
│                                                                                                  │
│    171 │   │   bsz, tgt_len, _ = hidden_states.size()                                            │
│    172 │   │                                                                                     │
│    173 │   │   # get query proj                                                                  │
│ ❱  174 │   │   query_states = self.q_proj(hidden_states) * self.scaling                          │
│    175 │   │   # get key, value proj                                                             │
│    176 │   │   if is_cross_attention and past_key_value is not None:                             │
│    177 │   │   │   # reuse k,v, cross_attentions                                                 │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /kaggle/working/AutoGPTQ/auto_gptq/nn_modules/qlinear.py:189 in forward                          │
│                                                                                                  │
│   186 │   │   │   elif self.bits == 3:                                                           │
│   187 │   │   │   │   quant_cuda.vecquant3matmul(x.float(), self.qweight, out, self.scales.flo   │
│   188 │   │   │   elif self.bits == 4:                                                           │
│ ❱ 189 │   │   │   │   quant_cuda.vecquant4matmul(x.float(), self.qweight, out, self.scales.flo   │
│   190 │   │   │   elif self.bits == 8:                                                           │
│   191 │   │   │   │   quant_cuda.vecquant8matmul(x.float(), self.qweight, out, self.scales.flo   │
│   192 │   │   │   else:                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: module 'quant_cuda' has no attribute 'vecquant4matmul'

The bug seems to be referenced here

TheBloke commented on May 20, 2024

I think the reason for the error is that you have another version of quant-cuda already installed, likely from GPTQ-for-LLaMa. You should first do: pip uninstall quant-cuda.
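
If you want to confirm which copy of quant_cuda Python is actually picking up before uninstalling, a quick diagnostic sketch:

import quant_cuda

# Shows where the module was installed from (e.g. a leftover GPTQ-for-LLaMa build)
print(quant_cuda.__file__)
# Lists the matmul kernels this particular build actually exports
print([name for name in dir(quant_cuda) if "vecquant" in name])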

However, you can't use the CUDA version of AutoGPTQ anyway. You just told auto-gptq to install without CUDA (BUILD_CUDA_EXT=0), and you need to do that because your CUDA Toolkit version doesn't match the version used to compile pytorch.

If you want to fix that, you could uninstall CUDA Toolkit 12.1 and install CUDA Toolkit 11.8 instead. Or you could build pytorch from source so it can use CUDA Toolkit 12.1; there are no pre-built pytorch binaries for CUDA 12.x yet. Instructions for doing that are on the pytorch Github - it takes a while though, at least an hour to build.
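
To see exactly which versions are in play before choosing, a quick check (a sketch assuming a standard PyTorch install, with the system toolkit checked separately via nvcc --version):

import torch

print(torch.version.cuda)         # CUDA version PyTorch was compiled against (11.3 in your traceback)
print(torch.cuda.is_available())  # whether the GPU runtime is usable at all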

Or forget about CUDA and use Triton instead:

Here is a simple example of using Triton to load this GPTQ model: https://huggingface.co/TheBloke/stable-vicuna-13B-GPTQ

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "/path/to/stable-vicuna-13B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

def get_config(has_desc_act):
    return BaseQuantizeConfig(
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
        desc_act=has_desc_act
    )

def get_model(model_base, triton, model_has_desc_act):
    return AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_safetensors=True, model_basename=model_base, device="cuda:0", use_triton=triton, quantize_config=get_config(model_has_desc_act))

# Prevent printing spurious transformers error
logging.set_verbosity(logging.CRITICAL)

prompt='''### Human: Write a story about llamas
### Assistant:'''

model = get_model("stable-vicuna-13B-GPTQ-4bit.compat.no-act-order", triton=True, model_has_desc_act=False)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print("### Inference:")
print(pipe(prompt)[0]['generated_text'])

vrunm commented on May 20, 2024

It seems the repo path is wrong:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /tmp/ipykernel_32/3845330370.py:6 in <module>                                                    │
│                                                                                                  │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_32/3845330370.py'                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:642 in     │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   639 │   │   │   return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *input   │
│   640 │   │                                                                                      │
│   641 │   │   # Next, let's try to use the tokenizer_config file to get the tokenizer class.     │
│ ❱ 642 │   │   tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)   │
│   643 │   │   if "_commit_hash" in tokenizer_config:                                             │
│   644 │   │   │   kwargs["_commit_hash"] = tokenizer_config["_commit_hash"]                      │
│   645 │   │   config_tokenizer_class = tokenizer_config.get("tokenizer_class")                   │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:486 in     │
│ get_tokenizer_config                                                                             │
│                                                                                                  │
│   483 │   tokenizer_config = get_tokenizer_config("tokenizer-test")                              │
│   484 │   ```"""                                                                                 │
│   485 │   commit_hash = kwargs.get("_commit_hash", None)                                         │
│ ❱ 486 │   resolved_config_file = cached_file(                                                    │
│   487 │   │   pretrained_model_name_or_path,                                                     │
│   488 │   │   TOKENIZER_CONFIG_FILE,                                                             │
│   489 │   │   cache_dir=cache_dir,                                                               │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:409 in cached_file             │
│                                                                                                  │
│    406 │   user_agent = http_user_agent(user_agent)                                              │
│    407 │   try:                                                                                  │
│    408 │   │   # Load from URL or cache if already cached                                        │
│ ❱  409 │   │   resolved_file = hf_hub_download(                                                  │
│    410 │   │   │   path_or_repo_id,                                                              │
│    411 │   │   │   filename,                                                                     │
│    412 │   │   │   subfolder=None if len(subfolder) == 0 else subfolder,                         │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:112 in _inner_fn    │
│                                                                                                  │
│   109 │   │   │   kwargs.items(),  # Kwargs values                                               │
│   110 │   │   ):                                                                                 │
│   111 │   │   │   if arg_name in ["repo_id", "from_id", "to_id"]:                                │
│ ❱ 112 │   │   │   │   validate_repo_id(arg_value)                                                │
│   113 │   │   │                                                                                  │
│   114 │   │   │   elif arg_name == "token" and arg_value is not None:                            │
│   115 │   │   │   │   has_token = True                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:160 in              │
│ validate_repo_id                                                                                 │
│                                                                                                  │
│   157 │   │   raise HFValidationError(f"Repo id must be a string, not {type(repo_id)}: '{repo_   │
│   158 │                                                                                          │
│   159 │   if repo_id.count("/") > 1:                                                             │
│ ❱ 160 │   │   raise HFValidationError(                                                           │
│   161 │   │   │   "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"            │
│   162 │   │   │   f" '{repo_id}'. Use `repo_type` argument if needed."                           │
│   163 │   │   )                                                                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 
'/path/to/stable-vicuna-13B-GPTQ'. Use `repo_type` argument if needed.

TheBloke commented on May 20, 2024

My example code assumes the model is downloaded locally. Download the model, then update quantized_model_dir to point to the folder where you downloaded it.

AutoGPTQ does not yet support loading a model directly from Hugging Face.
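
As a workaround sketch until that lands, you can fetch the whole repo with huggingface_hub (already installed as a transformers dependency) and point quantized_model_dir at the returned folder; the repo id below is just the example model from above:

from huggingface_hub import snapshot_download

# Downloads every file in the repo to the local HF cache and returns the folder path
local_dir = snapshot_download(repo_id="TheBloke/stable-vicuna-13B-GPTQ")
# quantized_model_dir = local_dir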

vrunm commented on May 20, 2024

@TheBloke Is there an open PR to do this?

TheBloke commented on May 20, 2024

No, no-one is looking at it yet to my knowledge. Remember that AutoGPTQ is still new and under active development. Such improvements will come over time.

It's easy to download the model first. If you want a quick way to do that, clone text-generation-webui and call its download-model.py:

root@bce51ed9603e:/workspace# python /root/text-generation-webui/download-model.py -h
usage: download-model.py [-h] [--branch BRANCH] [--threads THREADS] [--text-only] [--output OUTPUT] [--clean] [--check] [MODEL]

positional arguments:
  MODEL

options:
  -h, --help         show this help message and exit
  --branch BRANCH    Name of the Git branch to download from.
  --threads THREADS  Number of files to download simultaneously.
  --text-only        Only download text files (txt/json).
  --output OUTPUT    The folder where the model should be saved.
  --clean            Does not resume the previous download.
  --check            Validates the checksums of model files.

root@bce51ed9603e:/workspace# python /root/text-generation-webui/download-model.py TheBloke/stable-vicuna-13B-GPTQ
Downloading the model to models/TheBloke_stable-vicuna-13B-GPTQ
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13.2k /13.2k  11.9MiB/s
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21.0  /21.0   22.7kiB/s
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 587   /587    649kiB/s
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 137   /137    157kiB/s
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81.0  /81.0   93.1kiB/s
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 96.0  /96.0   107kiB/s
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.26G /7.26G  36.8MiB/s
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M /1.84M  3.27MiB/s
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k  /500k   1.97MiB/s
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 715   /715    808kiB/s

root@bce51ed9603e:/workspace# ll models/TheBloke_stable-vicuna-13B-GPTQ/
total 7093305
drwxrwxrwx 2 root root    3000675 May  4 08:30 ./
drwxrwxrwx 7 root root    3002107 May  4 08:27 ../
-rw-rw-rw- 1 root root      13247 May  4 08:27 README.md
-rw-rw-rw- 1 root root         21 May  4 08:27 added_tokens.json
-rw-rw-rw- 1 root root        587 May  4 08:27 config.json
-rw-rw-rw- 1 root root        137 May  4 08:27 generation_config.json
-rw-rw-rw- 1 root root        333 May  4 08:27 huggingface-metadata.txt
-rw-rw-rw- 1 root root         81 May  4 08:27 quantize_config.json
-rw-rw-rw- 1 root root         96 May  4 08:27 special_tokens_map.json
-rw-rw-rw- 1 root root 7255179696 May  4 08:30 stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors
-rw-rw-rw- 1 root root    1842847 May  4 08:30 tokenizer.json
-rw-rw-rw- 1 root root     499723 May  4 08:30 tokenizer.model
-rw-rw-rw- 1 root root        715 May  4 08:30 tokenizer_config.json
root@bce51ed9603e:/workspace#

In this example, using my example code you would then set

quantized_model_dir =  "/workspace/models/TheBloke_stable-vicuna-13B-GPTQ"

vrunm commented on May 20, 2024

@TheBloke Sure, I will try that. While loading TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g from huggingface, it takes very long to load on a Kaggle CPU. Is there any way around that?

TheBloke commented on May 20, 2024

I don't think so, no. Are you trying to use the CPU only, or do you have a GPU for inference?

vrunm commented on May 20, 2024

@TheBloke I am trying CPU only, but even with GPU only it does not work. I think it requires 28GB of GPU RAM to load?

TheBloke commented on May 20, 2024

CPU only will definitely be terribly slow.

For a 13B 4bit model like stable-vicuna-13B you need around 9GB VRAM. Nothing like 28GB - that would be for an unquantised fp16 model.
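
Rough arithmetic behind those numbers, as a sanity check: 13B parameters at 4 bits is roughly 13e9 × 0.5 bytes ≈ 6.5 GB of quantized weights, with activations, the KV cache and CUDA overhead adding the rest, which lands near 9 GB. At fp16 it is 13e9 × 2 bytes ≈ 26 GB of weights alone, which is where figures around 28 GB come from.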

TheBloke commented on May 20, 2024

nvidia-smi stats while running the example code I showed before on stable-vicuna-13B:

timestamp, name, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/05/04 08:41:04.156, NVIDIA RTX A4500, 20 %, 7 %, 20470 MiB, 11292 MiB, 8894 MiB

The last column is the used VRAM.

vrunm commented on May 20, 2024

I have 14GB of VRAM, but it still is not able to run?

TheBloke commented on May 20, 2024

Please show the full output of running the script.

vrunm commented on May 20, 2024

After running the model I got this:


/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[1], line 5
      1 from transformers import AutoTokenizer, AutoModelForCausalLM
      3 tokenizer = AutoTokenizer.from_pretrained("TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g")
----> 5 model = AutoModelForCausalLM.from_pretrained("TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g")

File /opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:471, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    469 elif type(config) in cls._model_mapping.keys():
    470     model_class = _get_model_class(config, cls._model_mapping)
--> 471     return model_class.from_pretrained(
    472         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    473     )
    474 raise ValueError(
    475     f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
    476     f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    477 )

File /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:2511, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2505             raise EnvironmentError(
   2506                 f"{pretrained_model_name_or_path} does not appear to have a file named"
   2507                 f" {_add_variant(WEIGHTS_NAME, variant)} but there is a file without the variant"
   2508                 f" {variant}. Use `variant=None` to load this model from those weights."
   2509             )
   2510         else:
-> 2511             raise EnvironmentError(
   2512                 f"{pretrained_model_name_or_path} does not appear to have a file named"
   2513                 f" {_add_variant(WEIGHTS_NAME, variant)}, {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME} or"
   2514                 f" {FLAX_WEIGHTS_NAME}."
   2515             )
   2516 except EnvironmentError:
   2517     # Raise any environment error raise by `cached_file`. It will have a helpful error message adapted
   2518     # to the original exception.
   2519     raise

OSError: TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

TheBloke commented on May 20, 2024

This is wrong:

model = AutoModelForCausalLM.from_pretrained("TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g")

It should be AutoGPTQForCausalLM, and needs the other arguments I showed in my example code.

Please use the example code I provided as a base.

vrunm commented on May 20, 2024

@TheBloke I had taken it from the huggingface model card. Do all your models require AutoGPTQ, or do you have support for huggingface transformers?

TheBloke commented on May 20, 2024

@TheBloke I had taken it from the huggingface model card. Do all your models require AutoGPTQ, or do you have support for huggingface transformers?

I have many models that support unquantised transformers inference. But they will need a lot more VRAM. I don't think you can load any of them in fp16 in 16GB, but you could load them in 8bit.

Here are my HF models: https://huggingface.co/models?search=thebloke%20hf

A 7B fp16 model may load in 16GB VRAM. A 13B fp16 definitely will not.

But either way I recommend you add load_in_8bit=True to the AutoModelForCausalLM.from_pretrained() call, and then it will require half as much VRAM. This requires the bitsandbytes library: pip install bitsandbytes (you may have it installed already).
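
A minimal sketch of that call (the repo id is illustrative; substitute any of the unquantised HF repos from the link above):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/stable-vicuna-13B-HF"  # illustrative unquantised fp16 repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
# load_in_8bit needs bitsandbytes; device_map="auto" needs accelerate
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)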

If you are loading unquantised HF models then that is not relevant to AutoGPTQ. For further support with that, please open a Discussion on the HF model page for whichever of my models you try to use.
