
Comments (6)

Hzfengsy commented on July 17, 2024

Could you please check whether you can run the original Qwen2-0.5B? Also, can you run your fine-tuned model on other devices, e.g. CUDA?
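For example, something along these lines could exercise each backend in turn — a minimal sketch, assuming MLCEngine accepts a device argument as in recent mlc-llm releases (the model name is taken from the snippet later in this thread):

from mlc_llm import MLCEngine

model = "HF://OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot"

# Try the same fine-tuned weights on every backend available on this machine.
# Devices that are absent (or that fail at runtime) are simply reported.
for device in ["cuda", "metal", "vulkan"]:
    try:
        engine = MLCEngine(model, device=device)
        response = engine.chat.completions.create(
            messages=[{"role": "user", "content": "Hello"}],
            model=model,
        )
        print(device, "->", response.choices[0].message.content[:80])
        engine.terminate()
    except Exception as err:
        print(device, "-> failed:", err)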


JLKaretis commented on July 17, 2024

Yes, I can run the original Qwen2-0.5B (compiled from the source weights) on WebLLM, and I can run the fine-tuned model on Metal with the mlc-llm Python library. It's only the fine-tuned model that fails on WebLLM.


tqchen commented on July 17, 2024

This seems to be related to how we package the latest wasm runtime. If you have a custom compile that runs the original Qwen and reproduces the error, that would be helpful. Alternatively, it would be great if you could share a reproducible command with the model that causes the error.


JLKaretis commented on July 17, 2024

These are the weights I'm trying to deploy; they work fine with the Python backend on Metal:

from mlc_llm import MLCEngine

# Create engine
model = "HF://OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot"
engine = MLCEngine(model)

# Run a streaming chat completion using the OpenAI-style API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

/opt/homebrew/Caskroom/miniforge/base/envs/mlc/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[2024-06-25 10:15:01] INFO auto_device.py:88: Not found device: cuda:0
[2024-06-25 10:15:02] INFO auto_device.py:88: Not found device: rocm:0
[2024-06-25 10:15:03] INFO auto_device.py:79: Found device: metal:0
[2024-06-25 10:15:04] INFO auto_device.py:88: Not found device: vulkan:0
[2024-06-25 10:15:05] INFO auto_device.py:88: Not found device: opencl:0
[2024-06-25 10:15:05] INFO auto_device.py:35: Using device: metal:0
[2024-06-25 10:15:05] INFO download_cache.py:227: Downloading model from HuggingFace: HF://OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
[2024-06-25 10:15:05] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-06-25 10:15:05] INFO download_cache.py:166: Weights already downloaded: /Users/User/.cache/mlc_llm/model_weights/hf/OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
[2024-06-25 10:15:05] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-06-25 10:15:05] INFO jit.py:158: Using cached model lib: /Users/User/.cache/mlc_llm/model_lib/c5f2c474b97ac6bb95cf167c9cc9dba8.dylib
[2024-06-25 10:15:05] INFO engine_base.py:179: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-06-25 10:15:05] INFO engine_base.py:204: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[2024-06-25 10:15:05] INFO engine_base.py:209: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:748: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:753: Estimated total single GPU memory usage: 2959.723 MB (Parameters: 265.118 MB. KVCache: 152.245 MB. Temporary buffer: 2542.361 MB). The actual usage might be slightly larger than the estimated number.
As an AI language model, I don't have personal beliefs or experiences. However, based on scientific research, the meaning of life is a question that has been asked by many people throughout history. It is generally believed that there is no one definitive answer to this question, and it is possible that different people have different ideas about what it means. Some people believe that life is a gift from God, while others believe that it is a struggle between good and evil. Ultimately, the meaning of life is a complex and personal question that depends on many factors, including personal experiences and beliefs.

Using WebLLM:

Original weights: (screenshot not captured)

Fine-tuned weights: (screenshot not captured)

Interestingly, we also tested the following combinations:

  • Original weights with the fine-tuned wasm library --> works as expected
  • Fine-tuned weights with the original wasm library --> same bug

So the problem seems to be with the weights, but then why do they work with the Python library?
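One way to narrow this down would be to diff the weight manifests that both runtimes read. Below is a minimal sketch, assuming both weight directories contain the standard MLC ndarray-cache.json manifest; the local directory names are placeholders:

import json

# Compare the tensor manifests of the original and fine-tuned weight
# directories: any mismatch in names, shapes, or dtypes would be a
# natural suspect for a runtime that works with one set but not the other.
def load_params(path):
    with open(f"{path}/ndarray-cache.json") as f:
        cache = json.load(f)
    return {
        rec["name"]: (rec["shape"], rec["dtype"])
        for shard in cache["records"]
        for rec in shard["records"]
    }

original = load_params("Qwen2-0.5B-q4f16_1-MLC")
finetuned = load_params("qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot")

for name in sorted(original.keys() | finetuned.keys()):
    if original.get(name) != finetuned.get(name):
        print(name, original.get(name), "!=", finetuned.get(name))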


bil-ash commented on July 17, 2024

@JLKaretis Maybe you can try the unquantized qwen2-0.5B for now, because the mlc-llm team hasn't released a q4f16 webgpu.wasm, only q0f16, and there may be some problem with your generated wasm that somehow doesn't cause issues with the original model but does with the fine-tuned one. I am also waiting for an official qwen2-0.5B q4f16 wasm. I suggest trying your fine-tuned q4f16 model again after the official wasm is released.


JLKaretis commented on July 17, 2024

@bil-ash I did try the q0f16 quant last week and got the same error. I don't think there is a difference between the wasm I generated and the "official" one.
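Another sanity check: as far as I understand, the wasm library is compiled against the values in mlc-chat-config.json, so the weights and the library have to agree on them. A minimal sketch with placeholder paths, printing a few fields that could plausibly differ after fine-tuning:

import json

# Fields from mlc-chat-config.json that the compiled library is built
# against; a mismatch between the weights' config and the config used to
# build the wasm would be expected to fail at runtime.
FIELDS = ["model_type", "quantization", "vocab_size", "context_window_size"]

def summarize(path):
    with open(f"{path}/mlc-chat-config.json") as f:
        config = json.load(f)
    return {field: config.get(field) for field in FIELDS}

print("original  :", summarize("qwen2-0.5B-q0f16"))
print("fine-tuned:", summarize("qwen2-0.5B-pii-masking-lora-merged-q0f16"))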


