Comments (6)
Could you please check whether you can run the original Qwen2-0.5B? Also, can you run your fine-tuned model on other devices, e.g. CUDA?
from mlc-llm.
Yes, I can run the original Qwen2-0.5B (compiled from the source weights) on WebLLM, and I can run the fine-tuned model on Metal with the mlc-llm Python library. It's only the fine-tuned model that fails on WebLLM.
This seems to be related to how we package the model and the latest wasm runtime. If you have a custom compile that runs the original Qwen and reproduces the error, that would be helpful. Alternatively, it would be great if you could share a reproducible command together with the model that caused the error.
These are the weights that I'm trying to deploy; they work fine with the Python backend on Metal:
from mlc_llm import MLCEngine

# Create engine
model = "HF://OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot"
engine = MLCEngine(model)

# Run chat completion in OpenAI API style.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()
/opt/homebrew/Caskroom/miniforge/base/envs/mlc/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
[2024-06-25 10:15:01] INFO auto_device.py:88: Not found device: cuda:0
[2024-06-25 10:15:02] INFO auto_device.py:88: Not found device: rocm:0
[2024-06-25 10:15:03] INFO auto_device.py:79: Found device: metal:0
[2024-06-25 10:15:04] INFO auto_device.py:88: Not found device: vulkan:0
[2024-06-25 10:15:05] INFO auto_device.py:88: Not found device: opencl:0
[2024-06-25 10:15:05] INFO auto_device.py:35: Using device: metal:0
[2024-06-25 10:15:05] INFO download_cache.py:227: Downloading model from HuggingFace: HF://OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
[2024-06-25 10:15:05] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-06-25 10:15:05] INFO download_cache.py:166: Weights already downloaded: /Users/User/.cache/mlc_llm/model_weights/hf/OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
[2024-06-25 10:15:05] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-06-25 10:15:05] INFO jit.py:158: Using cached model lib: /Users/User/.cache/mlc_llm/model_lib/c5f2c474b97ac6bb95cf167c9cc9dba8.dylib
[2024-06-25 10:15:05] INFO engine_base.py:179: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-06-25 10:15:05] INFO engine_base.py:204: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[2024-06-25 10:15:05] INFO engine_base.py:209: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:748: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:753: Estimated total single GPU memory usage: 2959.723 MB (Parameters: 265.118 MB. KVCache: 152.245 MB. Temporary buffer: 2542.361 MB). The actual usage might be slightly larger than the estimated number.
As an AI language model, I don't have personal beliefs or experiences. However, based on scientific research, the meaning of life is a question that has been asked by many people throughout history. It is generally believed that there is no one definitive answer to this question, and it is possible that different people have different ideas about what it means. Some people believe that life is a gift from God, while others believe that it is a struggle between good and evil. Ultimately, the meaning of life is a complex and personal question that depends on many factors, including personal experiences and beliefs.
Using WebLLM:

Original weights:
- weights: https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot
- library: https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm?download=true
--> works as expected
Fine-tuned weights:
- weights: https://huggingface.co/julientfai/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
- library: https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm?download=true
--> fails with the bug shown earlier
Interestingly, we also tried the following combinations:
- Original weights with the fine-tuned wasm library --> works as expected
- Fine-tuned weights with the original wasm library --> same bug
So the problem seems to be with the weights, but then why do they work with the Python library?
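For reference, the combination matrix above can be reproduced on the WebLLM side by pointing a custom model-list entry at a weight repo and a wasm model library independently. A minimal sketch of the config wiring (the `model_id` and the `makeAppConfig` helper are illustrative, not WebLLM API; the commented engine call assumes the `@mlc-ai/web-llm` package and a WebGPU-capable browser):

```typescript
// Shape of a custom model record in a WebLLM-style app config (subset of fields).
interface ModelRecord {
  model: string;     // URL of the HF repo holding the converted weights
  model_id: string;  // identifier passed when creating the engine
  model_lib: string; // URL of the compiled webgpu wasm model library
}

// Hypothetical helper: build an app config from independent weight/lib URLs,
// so the two can be swapped to isolate which side causes a failure.
function makeAppConfig(
  weightsUrl: string,
  libUrl: string,
  modelId: string,
): { model_list: ModelRecord[] } {
  return {
    model_list: [{ model: weightsUrl, model_id: modelId, model_lib: libUrl }],
  };
}

// Fine-tuned weights paired with the original wasm library (the failing combo):
const appConfig = makeAppConfig(
  "https://huggingface.co/julientfai/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot",
  "https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm",
  "qwen2-0.5B-pii-q4f16_1",
);

// In a WebGPU-capable browser, with @mlc-ai/web-llm installed:
//   import { CreateMLCEngine } from "@mlc-ai/web-llm";
//   const engine = await CreateMLCEngine("qwen2-0.5B-pii-q4f16_1", { appConfig });
```

Swapping only `weightsUrl` between the original and fine-tuned repos, while keeping `libUrl` fixed (and vice versa), is what isolates the failure to the weights rather than the wasm.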
@JLKaretis Maybe you can now try with the unquantized qwen2-0.5B, because the mlc-llm team hasn't released a q4f16 webgpu.wasm, only q0f16, and there may be some problem with your generated wasm that somehow does not show up on the original model but does on the fine-tuned one. I am also waiting for an official qwen2-0.5B q4f16 wasm. I suggest trying your fine-tuned q4f16 model again after the official wasm is released.
@bil-ash I did try with the q0f16 quant last week and got the same error. I don't think there is a difference between the wasm I generated and the "official" one.