Most of my repositories fit one of the following categories:
- Digital Forensics / Reverse Engineering / Online Privacy
- Looking Glass holograms
- Depth-map generation
- Various hacks / workarounds
Run LLaMA (and Stanford-Alpaca) inference on Apple Silicon GPUs.
License: GNU General Public License v3.0
I noticed Apple supports ANE Transformers.
According to their own words, it requires an "M1 or newer chip to achieve up to 10 times faster and 14 times lower peak memory consumption".
Does that mean running 30B or 65B will be possible on small-memory MacBooks?
Here are a few links:
https://github.com/apple/ml-ane-transformers
https://machinelearning.apple.com/research/neural-engine-transformers
As this project is the top LLaMA implementation that leverages the Apple GPU, would it be possible to support the ANE too?
Running the command python3 chat.py --ckpt_dir ./models/7B --tokenizer_path ./models/tokenizer.model --max_batch_size=8 --max_seq_len=512
I get this error:
File "/home/LLaMA_MPS/chat.py", line 106, in main
  generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size)
File "/home/LLaMA_MPS/chat.py", line 80, in load
  tokenizer = Tokenizer(model_path=tokenizer_path)
File "/home/LLaMA_MPS/llama/tokenizer.py", line 16, in __init__
  assert os.path.isfile(model_path), model_path
AssertionError: ./models/tokenizer.model
I checked, and there was no tokenizer.model under ./models.
macOS 14.0
MacBook Pro M1 Max
Autocomplete and instruction-response modes give the same result:
$python3 chat.py --ckpt_dir models/7B-alpaca --tokenizer_path models/tokenizer.model --max_batch_size 8 --max_seq_len 256
Seed: 30112
Loading checkpoint
Loaded in 94.05 seconds
Running the fine-tuned 'alpaca' model in an instruction-response mode.
Instruction: hello
loc("mps_transpose"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/d8ee83b8-11b4-11ee-a66d-46d450270006/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":206:0)): error: 'anec.transpose' op Height/Width dimensions must be less than 16384
Response: hello, hello
It seems like it doesn't work as expected. I waited a couple of minutes and got just "hello, hello".
In your conclusion, MPS performance is worse than llama.cpp CPU performance at the same fp16. Why? Is there some kernel MPS doesn't support that falls back to the CPU (and so hurts performance)?
I figure you mean that MPS shaders are compiled just-in-time, so performance is worse than ahead-of-time compiled CPU code? Am I wrong?
Hi
EDIT: after further investigation I have realised that the problems I describe below are entirely caused by pytorch/pytorch#92311 . See last comment for a (hacky) fix.
Thanks very much for releasing this repo. Since I started getting into AI I've really wanted to try some local inference on my AMD 6900XT in macOS. I'm running Ventura 13.3 and I read about Apple recently making MPS available in PyTorch, with support for Intel Macs as well as Apple Silicon, so I was hoping I might get somewhere.
I've tried your repo and it runs without errors and is definitely using the GPU. Unfortunately, in the output I don't see any readable tokens; instead there's a series of ??s.
I realise you've written and tested this for Apple Silicon, but based on what I've read it should in theory be possible to do this on Intel as well, and it definitely does seem to be running - so maybe there's some fixable bug here that's stopping it working on Intel? Unfortunately my PyTorch knowledge is super limited at the moment so I'm not sure where to start looking.
Here's an example invocation to show the problem:
tomj@Eddie ~/src/LLaMA_MPS (main)$ /usr/local/anaconda3/bin/python3.10 chat.py --ckpt_dir ~/Downloads/Torrents/Done/LLaMA/7B --tokenizer_path ~/src/llama.cpp/models/tokenizer.model --max_batch_size 8 --max_seq_len 256
Seed: 29437
Loading checkpoint
Loaded in 13.75 seconds
Running the raw 'llama' model in an auto-complete mode.
Enter your LLaMA prompt: write something about fishing
Thinking...
β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β
Inferred in 102.06 seconds
==================================
Enter your LLaMA prompt:
I've tried with PyTorch 2.0.0 and also the latest dev version, 2.1.0. I've tried adding PYTORCH_ENABLE_MPS_FALLBACK=1 and/or PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0, but this doesn't change anything.
It is definitely using the GPU; I can see that in Activity Monitor.
It's just the output is unreadable! Thanks very much in advance for any help or advice.
How did you get powermetrics to show the output in the README.md?
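Not the author, but on Apple Silicon the GPU figures in the README look like the output of powermetrics' GPU sampler. A typical invocation (macOS only, needs root; exact sampler names vary by macOS version) is:

```shell
# Sample GPU power/utilization once per second until interrupted
sudo powermetrics --samplers gpu_power -i 1000
```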
I get the following error:
File "/home/LLaMA_MPS/llama/model.py", line 102, in __init__
  self.cache_k = torch.zeros(
RuntimeError: PyTorch is not linked with support for mps devices
I ran the code on this environment:
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: None
OS: Linux 5.14.0-162.6.1.el9_1.0.1.x86_64
CMake version: 3.20.2
Python version: Python 3.9.14
Python platform: Linux-5.14.0-162.6.1.el9_1.0.1.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA_MODULE_LOADING set to:
GPU models and configuration: 0,1,2,3
Nvidia driver version: 525.60.13
cuDNN version: 8500
Any idea what is going wrong?
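What is going wrong: the cu117 wheel in that environment dump was built without the MPS backend, which only exists in macOS builds of PyTorch (1.12+), so the `mps` device can never work on that Linux box. The device-selection logic involved can be sketched as plain Python (hypothetical helper; in real code the two flags come from torch.backends.mps.is_available() and torch.cuda.is_available()):

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    # Prefer MPS (Apple GPU), then CUDA, then CPU. A Linux + CUDA build
    # like the one in the dump above never reports mps_available=True.
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"
```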
I noticed you are using venv & pip. I assumed from your powermetrics output that your torch is able to take full advantage of the GPU? Apple Silicon is new to me, and I thought you had to use conda-forge for the package. I just received an M2 Max with 96 GB; I will try this out and see how much improvement there is over the M1.
Nice work! I've forked this for DirectML use, and it seems to work: it loads up on my AMD card and I get to the prompt, but after I enter the prompt I get the following error. I haven't been able to find a solution; any ideas?
(llama_dml) G:\LLaMA_dml>python chat.py --ckpt_dir models/7B --tokenizer_path G:/llama/models/tokenizer.model --max_batch_size 1 --max_seq_len 256
Seed: 19266
Loading checkpoint
Loaded in 10.67 seconds
Running the raw 'llama' model in an auto-complete mode.
Enter your LLaMA prompt: Facebook is good because
Thinking...
Traceback (most recent call last):
  File "chat.py", line 146, in <module>
    fire.Fire(main)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "chat.py", line 129, in main
    results = generator.generate(
  File "G:\LLaMA_dml\llama\generation.py", line 46, in generate
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "G:\LLaMA_dml\llama\model.py", line 265, in forward
    h = self.tok_embeddings(tokens)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\nn\modules\sparse.py", line 162, in forward
    return F.embedding(
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\nn\functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Cannot set version_counter for inference tensor
I've been trying to get the 30B model up, but my output is total garbage. Example: Trust Delete Π²ΠΈΠΊΠΎΡΠΈ Π³Π΅CES/$ ΡΠ²ΠΎΠ±ΠΎ voicepull mediumΠ²ΡΠΈLongrightarrow fed NormalJe zespo installer ΠΏΡΠΎΠ±Π»Π΅ ΠΊΠΎΠ½ΡΠ΅ attacks genu genericituteAX language hy Jurpring lange)))) ΠΡΡΠΈΠ²Π½Π°Ρ. This is in contrast to the 7B and 13B models, which work well.
From the readme, it seems like someone at least managed to measure memory usage on 30B, but is there an indication they were able to produce reasonable text?
Possibly related, I needed to do the resharding and arrow conversion on a second machine with more RAM, maybe there is some problem doing the conversion on a different machine than inference? Is there a 'known good' arrow version I could compare with?
Side note: the readme takes the view that the comparison with llama.cpp is about performance. But in my experience, the 13B here is much better quality than 65B on llama.cpp. I have several theories about why this is, but they all suggest that getting the 30B working here would produce stronger output than the alternatives.
I pulled down llama-7b-hf and errored out with
FileNotFoundError: [Errno 2] No such file or directory: 'models/llama-7b-hf/params.json'
Is it still compatible or do I need to do the whole torrent thing? Can I bodge or bum a params.json file and avoid another 30 hour download?
Thanks
Hey there,
What changes were made to the model to provide MPS support? I'm having a hard time seeing the difference in the code.
Thanks!
I know this might be improper to ask here, but how can I get access to the model weights more easily?
I tried to request access but haven't received an email yet.
Thanks.
Hi - Thanks for building this, it looks like a great way to try out the model.
I wasn't able to follow the instructions exactly: pip3 install -r requirements.txt reported No matching distribution found for torch. The python I have installed is 3.11, so I'm explicitly using pip3.9 / python3.9. I don't know if this is related.
Anyway, when I run the example chat command, I get prompted for my input, and when I enter it, about 30 seconds later, I get this:
Traceback (most recent call last):
File "/Users/stevex/temp/llama/LLaMA_MPS/chat.py", line 130, in <module>
fire.Fire(main)
File "/opt/homebrew/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/homebrew/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/homebrew/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/Users/stevex/temp/llama/LLaMA_MPS/chat.py", line 113, in main
results = generator.generate(
File "/Users/stevex/temp/llama/LLaMA_MPS/llama/generation.py", line 63, in generate
next_token = torch.multinomial(
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Using a Mac Studio with 32 GB RAM.
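The crash itself happens because non-finite values reach torch.multinomial, which refuses to sample from a distribution containing inf/nan (the root cause is usually upstream, e.g. an overflow in the fp16 MPS path). The guard involved can be illustrated in plain Python (a sketch, not the repo's code):

```python
import math

def safe_probs(logits):
    # Mask non-finite logits with a very negative value, then softmax;
    # the result is a valid probability vector even if the model
    # emitted inf/nan, so a sampler never sees bad probabilities.
    cleaned = [x if math.isfinite(x) else -1e30 for x in logits]
    m = max(cleaned)
    exps = [math.exp(x - m) for x in cleaned]
    total = sum(exps)
    return [e / total for e in exps]
```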
Hello,
I am new to the AI field and still trying to understand how things work. I was wondering if it's possible to apply this implementation in the fine-tuning process. Like: https://github.com/lxe/llama-tune or https://github.com/tloen/alpaca-lora
I would be grateful for any examples or tutorials that explain how to apply this implementation to the fine-tuning process. Thank you in advance for your help!