Comments (10)
Thank you for this! My best guess is that the number of tokens exceeds the cache size. I will have to investigate this.
from autoawq.
I've seen the same with other models. Thanks for the script @abacaj I'm going to run some other models through their paces to see if I can reproduce.
AutoAWQ=0.1.0, python=3.10, cuda=11.8, rtx 3090
I can reproduce the error with any model:
- TheBloke/Llama-2-7b-Chat-AWQ
- TheBloke/vicuna-7B-v1.5-AWQ
- casperhansen/vicuna-7B-v1.5-AWQ
File "/home/alansrobotlab/anaconda3/envs/textgen/lib/python3.11/site-packages/awq/modules/fused/attn.py", line 183, in forward
    self.cache_v[:bsz, :, self.start_pos : self.start_pos + seqlen, :] = values_store
RuntimeError: The expanded size of the tensor (0) must match the existing size (24) at non-singleton dimension 2. Target sizes: [1, 32, 0, 128]. Tensor sizes: [32, 24, 128]
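For intuition about the traceback: `self.start_pos` has advanced past the end of the fixed-size cache, so the left-hand slice along the sequence dimension has length 0 while `values_store` still carries 24 new positions. A minimal pure-Python analogue of that failure mode (a hypothetical illustration, not AutoAWQ code; names and sizes are made up):

```python
# Hypothetical illustration (not AutoAWQ code) of a fixed-size KV cache
# whose write position keeps advancing across generate() calls and is
# never reset.
MAX_SEQ_LEN = 2048

def store(cache, start_pos, values):
    """Write `values` at cache[start_pos:end_pos]; fail when they no longer fit."""
    end_pos = start_pos + len(values)
    if end_pos > MAX_SEQ_LEN:
        # List analogue of the tensor error: the target slice is shorter
        # than the values being assigned into it.
        raise RuntimeError(
            f"slot [{start_pos}:{end_pos}] exceeds cache of {MAX_SEQ_LEN}"
        )
    cache[start_pos:end_pos] = values
    return end_pos

cache = [None] * MAX_SEQ_LEN
pos = store(cache, 0, ["tok"] * 2040)      # fits: 2040 <= 2048
try:
    pos = store(cache, pos, ["tok"] * 24)  # 2040 + 24 > 2048 -> error
except RuntimeError as err:
    print("overflow:", err)
```

Once the cumulative generated length crosses the allocated cache size, every subsequent write fails the same way, which matches the error only appearing after many iterations.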
I switched up the approach entirely, and we are now rolling over the cache. This seems to produce correct outputs, and we get as close as possible to HF output with the fused modules. They are not meant to be exactly the same outputs, since slight numerical differences will lead to different outputs in some cases; however, they are very close now.
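A rough sketch of what "rolling over" the cache can mean (a hypothetical illustration, not the actual AutoAWQ implementation): when new keys/values would overflow the fixed-size cache, the oldest entries are shifted out so the write position stays in bounds and the most recent context is kept.

```python
# Hypothetical sketch of cache roll-over along the sequence dimension,
# using a plain list in place of the cache_v tensor.
MAX_SEQ_LEN = 8

def roll_and_store(cache, start_pos, values):
    """Store `values`, shifting out the oldest entries if they would overflow."""
    overflow = start_pos + len(values) - MAX_SEQ_LEN
    if overflow > 0:
        # Drop the `overflow` oldest entries; the newest context survives.
        cache[:] = cache[overflow:] + [None] * overflow
        start_pos -= overflow
    cache[start_pos:start_pos + len(values)] = values
    return start_pos + len(values)

cache = [None] * MAX_SEQ_LEN
pos = roll_and_store(cache, 0, list("abcdef"))  # fills 6 of 8 slots
pos = roll_and_store(cache, pos, list("ghij"))  # overflows by 2, rolls
print(cache)  # ['c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
print(pos)    # 8
```

The trade-off is that positions shifted out are lost, so outputs that depend on the evicted context can drift from the unfused path, which is consistent with the small numerical differences described above.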
I have closed this issue as the main error has been solved. However, it seems there is a problem with the fused modules and the CodeLlama models, even though they should already be supported since GQA is implemented.
Setting fuse_layers=False seems to work with the same code (though with slower generations).
Fixed this now in #75; at least I cannot reproduce this error anymore, even when running for 1000 iterations.
@abacaj and @gestalt73, I would appreciate it if you could take the time to test the pull request to see if anything else breaks:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "casperhansen/vicuna-7b-v1.5-awq"
max_new_tokens = 1024

# Load model
model = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    quant_filename="awq_model_w4_g128.pt",
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
    max_new_tokens=max_new_tokens,
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

tokens = tokenizer(
    "# Write a python function to loop to 1000\n\ndef", return_tensors="pt"
).to("cuda")

# Generate repeatedly; reset the counter once the cache budget is exceeded
cumulative_tokens = 0
for i in range(1000):
    if cumulative_tokens > max_new_tokens:
        cumulative_tokens = 0
    generation_output = model.generate(
        **tokens,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        top_k=0,
        max_new_tokens=512,
    )
    num_tokens = len(generation_output[0])
    cumulative_tokens += num_tokens
    print(i, num_tokens, cumulative_tokens)
    # print(tokenizer.decode(generation_output[0], skip_special_tokens=True))
Hey @casper-hansen, I ran it a bit with TheBloke/Llama-2-7b-Chat-AWQ and things look normal until the first cache clear; then things get weird. It doesn't error out, though.
Take a look at the output after the first set of cache clear messages around line 192.
Output is consistent for the first x generations; then, after the resetting-cache message, generation starts out OK but gets interesting towards the end of line 209. From there on out it's hit or miss, and I'm also seeing the huge runs of newlines that I would occasionally see in 0.1.0.
Attachment: Fix KV cache shapes error 75 results.txt
I don't see the expanded tensor error anymore, but model generations using fused=True are different (worse) compared to fused=False.
Added fused_true and fused_false samples here. I turned sampling off, so it should be greedy generation. For fused=False the output seems good:
https://gist.github.com/abacaj/aefb5e9dd85a6fc8b54b5b655a9a632e
Thank you all for testing. The fact that the outputs after resetting the cache get weird or do not work as expected is not good enough for me to merge the PR. I will have to explore:
- How to reset the cache without weird outputs
- How to increase allocated cache dynamically as inputs are run through the model
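The second bullet could look roughly like this (a hypothetical sketch, not AutoAWQ code): reallocate the cache to the next power of two whenever an incoming sequence would not fit, keeping the existing contents, so no context is evicted and no reset is needed.

```python
# Hypothetical sketch of growing a cache on demand instead of rolling it,
# using a plain list in place of the KV cache tensor.

def grow_to_fit(cache, needed):
    """Return a cache with capacity >= `needed`, doubling until it fits."""
    capacity = len(cache)
    if needed <= capacity:
        return cache  # already large enough, no reallocation
    while capacity < needed:
        capacity *= 2
    # Reallocate and copy the old contents over (list concat stands in
    # for allocating a bigger tensor and copying the old cache into it).
    return cache + [None] * (capacity - len(cache))

cache = [None] * 4
cache = grow_to_fit(cache, 3)
print(len(cache))   # 4 (still fits)
cache = grow_to_fit(cache, 9)
print(len(cache))   # 16 (doubled twice)
```

Doubling keeps reallocations rare (amortized constant copies per stored position), at the cost of GPU memory growing with the longest sequence seen so far.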