
Comments (10)

casper-hansen avatar casper-hansen commented on June 26, 2024 1

Thank you for this! My best guess is that the number of tokens exceeds the cache size. I will have to investigate this.

from autoawq.

gestalt73 avatar gestalt73 commented on June 26, 2024 1

I've seen the same with other models. Thanks for the script, @abacaj. I'm going to run some other models through their paces to see if I can reproduce.

AutoAWQ=0.1.0, python=3.10, cuda=11.8, rtx 3090

I can reproduce the error with any model:

  • TheBloke/Llama-2-7b-Chat-AWQ
  • TheBloke/vicuna-7B-v1.5-AWQ
  • casperhansen/vicuna-7B-v1.5-AWQ

File "/home/alansrobotlab/anaconda3/envs/textgen/lib/python3.11/site-packages/awq/modules/fused/attn.py", line 183, in forward
    self.cache_v[:bsz, :, self.start_pos : self.start_pos + seqlen, :] = values_store
RuntimeError: The expanded size of the tensor (0) must match the existing size (24) at non-singleton dimension 2. Target sizes: [1, 32, 0, 128]. Tensor sizes: [32, 24, 128]
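The shape mismatch falls out of simple slice arithmetic: once start_pos reaches the end of the preallocated cache, the write slice is empty while the incoming keys/values still hold seqlen entries (24 in the traceback above). A minimal sketch of that arithmetic, with illustrative names rather than AutoAWQ's actual internals:

```python
# Sketch of the overflow condition, assuming a cache preallocated to
# max_seq_len positions (names are illustrative, not AutoAWQ's code).
max_seq_len = 2048
start_pos = 2048   # every slot already consumed by earlier generations
seqlen = 24        # new tokens that need to be stored

# A slice cache[start_pos : start_pos + seqlen] is clamped to the cache
# end, so the number of writable slots is:
write_len = max(0, min(start_pos + seqlen, max_seq_len) - start_pos)

# 0 writable slots vs. 24 incoming values -> the "expanded size (0) must
# match the existing size (24)" RuntimeError.
print(write_len)
```

This is why the error only appears after enough iterations: each call to generate advances start_pos, and nothing resets it before it runs past the end of the cache.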


casper-hansen avatar casper-hansen commented on June 26, 2024 1

I switched up the approach entirely, and we now roll over the cache. This seems to produce correct outputs, and with the FT modules we get as close to HF output as possible. The outputs are not meant to be exactly the same, as slight numerical differences will lead to diverging outputs in some cases; however, they are very close now.
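The roll-over idea can be sketched with a plain Python list: when the next write would run past the end of the cache, restart at position zero instead of failing. This is only an illustration of the indexing; AutoAWQ's real implementation operates on key/value tensors.

```python
# Illustrative roll-over cache (list-based sketch, not AutoAWQ's code).
def store_rollover(cache, start_pos, values):
    """Write values into cache; if they would overflow, roll back to 0.

    Assumes len(values) <= len(cache).
    """
    if start_pos + len(values) > len(cache):
        start_pos = 0  # roll over: overwrite from the start of the cache
    for i, v in enumerate(values):
        cache[start_pos + i] = v
    return start_pos + len(values)  # new start position

cache = [None] * 8
pos = store_rollover(cache, 0, ["a", "b", "c", "d", "e", "f"])  # pos == 6
pos = store_rollover(cache, pos, ["g", "h", "i"])  # overflows, rolls to 0
```

The trade-off, visible in the later comments, is that rolling over discards the oldest cached positions, so generations that span a roll-over can degrade.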


casper-hansen avatar casper-hansen commented on June 26, 2024 1

I have closed this issue as the main error has been solved. However, it seems there is a problem with the fused modules and the CodeLlama models, although they should already be supported since GQA is implemented.


abacaj avatar abacaj commented on June 26, 2024

Setting fuse_layers=False seems to work with the same code (though slower generations).


casper-hansen avatar casper-hansen commented on June 26, 2024

Fixed this now in #75; at least, I cannot reproduce this error anymore, even when running for 1000 iterations:

@abacaj and @gestalt73, I would appreciate it if you could take the time to test the pull request and see if anything else breaks.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "casperhansen/vicuna-7b-v1.5-awq"
max_new_tokens = 1024

# Load model
model = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    quant_filename="awq_model_w4_g128.pt",
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
    max_new_tokens=1024,
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

tokens = tokenizer(
    "# Write a python function to loop to 1000\n\ndef", return_tensors="pt"
).to("cuda")

# Generate output
cumulative_tokens = 0

for i in range(1000):
    if cumulative_tokens > max_new_tokens:
        cumulative_tokens = 0
    
    generation_output = model.generate(
        **tokens,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        top_k=0,
        max_new_tokens=512,
    )

    num_tokens = len(generation_output[0])
    cumulative_tokens += num_tokens

    print(i, num_tokens, cumulative_tokens)

    # print(tokenizer.decode(generation_output[0], skip_special_tokens=True))


gestalt73 avatar gestalt73 commented on June 26, 2024

Hey @casper-hansen, I ran it for a bit with TheBloke/Llama-2-7b-Chat-AWQ, and things look normal until the first cache clear; then things get weird. It doesn't error out, though.

Take a look at the output after the first set of cache clear messages around line 192.

Output is consistent for the first x generations; then, after the "resetting cache" message, a generation starts out OK but gets interesting towards the end of line 209. From there on out it's hit or miss, but I'm also seeing the huge runs of newlines that I would occasionally see in 0.1.0.

Fix KV cache shapes error 75 results.txt


abacaj avatar abacaj commented on June 26, 2024

I don't see the expanded-tensor error anymore, but model generations with fuse_layers=True are different (worse) compared to fuse_layers=False.


abacaj avatar abacaj commented on June 26, 2024

Added fused_true and fused_false samples here. I turned sampling off, so it should be greedy generation. For fused=False the output seems good:

https://gist.github.com/abacaj/aefb5e9dd85a6fc8b54b5b655a9a632e


casper-hansen avatar casper-hansen commented on June 26, 2024

Thank you all for testing. The fact that the outputs after resetting the cache are getting weird or not working as expected is not good enough for me to merge the PR. I will have to explore:

  1. How to reset the cache without weird outputs
  2. How to increase allocated cache dynamically as inputs are run through the model
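Option 2 above could look something like the following list-based sketch: check the required capacity before each write and extend the cache in fixed increments when needed. A tensor version would instead concatenate a zero-filled extension along the cache's sequence dimension; the names here are illustrative, not AutoAWQ's.

```python
# Sketch of growing a cache on demand (illustrative, list-based).
def ensure_capacity(cache, needed, grow_by=256):
    """Extend cache in grow_by increments until it holds `needed` slots.

    A tensor version would torch.cat a zeros tensor along the sequence
    dimension instead of concatenating lists.
    """
    while len(cache) < needed:
        cache = cache + [None] * grow_by
    return cache

cache = [None] * 512
cache = ensure_capacity(cache, 600)  # one growth step: 512 -> 768
```

Unlike rolling over, growing the cache preserves all previous positions, at the cost of extra memory allocations during generation.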

