Comments (6)
Could you please check whether you can run the original Qwen2-0.5B? Also, can you run your fine-tuned model on other devices, e.g. CUDA?
from mlc-llm.
Yes, I can run the original Qwen2-0.5B (compiled from the source weights) on WebLLM, and I can run the fine-tuned model on Metal with the mlc-llm Python library. It's only the fine-tuned model that fails on WebLLM.
This seems to be related to how we package the model and the latest wasm runtime. If you have a custom compile that runs the original Qwen and reproduces the error, that would be helpful. Alternatively, it would be great if you could share a reproducible command together with the model that caused the error.
These are the weights that I'm trying to deploy; they work fine with the Python backend on Metal:
from mlc_llm import MLCEngine

# Create engine
model = "HF://OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot"
engine = MLCEngine(model)

# Run chat completion in OpenAI API style.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()
/opt/homebrew/Caskroom/miniforge/base/envs/mlc/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
[2024-06-25 10:15:01] INFO auto_device.py:88: Not found device: cuda:0
[2024-06-25 10:15:02] INFO auto_device.py:88: Not found device: rocm:0
[2024-06-25 10:15:03] INFO auto_device.py:79: Found device: metal:0
[2024-06-25 10:15:04] INFO auto_device.py:88: Not found device: vulkan:0
[2024-06-25 10:15:05] INFO auto_device.py:88: Not found device: opencl:0
[2024-06-25 10:15:05] INFO auto_device.py:35: Using device: metal:0
[2024-06-25 10:15:05] INFO download_cache.py:227: Downloading model from HuggingFace: HF://OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
[2024-06-25 10:15:05] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-06-25 10:15:05] INFO download_cache.py:166: Weights already downloaded: /Users/User/.cache/mlc_llm/model_weights/hf/OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
[2024-06-25 10:15:05] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-06-25 10:15:05] INFO jit.py:158: Using cached model lib: /Users/User/.cache/mlc_llm/model_lib/c5f2c474b97ac6bb95cf167c9cc9dba8.dylib
[2024-06-25 10:15:05] INFO engine_base.py:179: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-06-25 10:15:05] INFO engine_base.py:204: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[2024-06-25 10:15:05] INFO engine_base.py:209: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:748: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:753: Estimated total single GPU memory usage: 2959.723 MB (Parameters: 265.118 MB. KVCache: 152.245 MB. Temporary buffer: 2542.361 MB). The actual usage might be slightly larger than the estimated number.
As an AI language model, I don't have personal beliefs or experiences. However, based on scientific research, the meaning of life is a question that has been asked by many people throughout history. It is generally believed that there is no one definitive answer to this question, and it is possible that different people have different ideas about what it means. Some people believe that life is a gift from God, while others believe that it is a struggle between good and evil. Ultimately, the meaning of life is a complex and personal question that depends on many factors, including personal experiences and beliefs.
Using WebLLM:

Original weights:
- weights: https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot
- library: https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm?download=true
--> works as expected
Fine-tuned weights:
- weights: https://huggingface.co/julientfai/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
- library: https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm?download=true
--> fails with the bug shown earlier
Interestingly, we also tried the following combinations:
- Original weights with the fine-tuned wasm library --> works as expected
- Fine-tuned weights with the original wasm library --> same bug
So the problem seems to be with the weights, but then why do they work with the Python library?
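For reference, the combination matrix above can be reproduced on the WebLLM side by pointing a custom model-list entry at a weight repo and a wasm model library independently. A minimal sketch of the config wiring (the `model_id` and the `makeAppConfig` helper are illustrative, not WebLLM API; the commented engine call assumes the `@mlc-ai/web-llm` package and a WebGPU-capable browser):

```typescript
// Shape of a custom model record in a WebLLM-style app config (subset of fields).
interface ModelRecord {
  model: string;     // URL of the HF repo holding the converted weights
  model_id: string;  // identifier passed when creating the engine
  model_lib: string; // URL of the compiled webgpu wasm model library
}

// Hypothetical helper: build an app config from independent weight/lib URLs,
// so the two can be swapped to isolate which side causes a failure.
function makeAppConfig(
  weightsUrl: string,
  libUrl: string,
  modelId: string,
): { model_list: ModelRecord[] } {
  return {
    model_list: [{ model: weightsUrl, model_id: modelId, model_lib: libUrl }],
  };
}

// Fine-tuned weights paired with the original wasm library (the failing combo):
const appConfig = makeAppConfig(
  "https://huggingface.co/julientfai/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot",
  "https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm",
  "qwen2-0.5B-pii-q4f16_1",
);

// In a WebGPU-capable browser, with @mlc-ai/web-llm installed:
//   import { CreateMLCEngine } from "@mlc-ai/web-llm";
//   const engine = await CreateMLCEngine("qwen2-0.5B-pii-q4f16_1", { appConfig });
```

Swapping only `weightsUrl` between the original and fine-tuned repos, while keeping `libUrl` fixed (and vice versa), is what isolates the failure to the weights rather than the wasm.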
@JLKaretis Maybe you can now try with the unquantized qwen2-0.5B, because the mlc-llm team hasn't released a q4f16 webgpu.wasm, only q0f16, and there may be some problem with your generated wasm that somehow does not show up on the original model but does on the fine-tuned one. I am also waiting for an official qwen2-0.5B q4f16 wasm. I suggest trying your fine-tuned q4f16 model again after the official wasm is released.
@bil-ash I did try with the q0f16 quant last week and got the same error. I don't think there is a difference between the wasm I generated and the "official" one.