

llama_mps's People

Contributors

birch-san, jankais3r



llama_mps's Issues

Support Apple Neural Engine (ANE) Transformers

I noticed Apple supports ANE Transformers.

According to their own words:

M1 or newer chip to achieve up to 10 times faster and 14 times lower peak memory

Does that mean running 30B or 65B will be possible on small-memory MacBooks?

Here are a few links:
https://github.com/apple/ml-ane-transformers
https://machinelearning.apple.com/research/neural-engine-transformers

As this project is the leading LLaMA implementation that leverages the Apple GPU, would it be possible to support the ANE as well?

AssertionError: ./models/tokenizer.model

Running the command python3 chat.py --ckpt_dir ./models/7B --tokenizer_path ./models/tokenizer.model --max_batch_size=8 --max_seq_len=512
I get this error:
File "/home/LLaMA_MPS/chat.py", line 106, in main generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size) File "/home/LLaMA_MPS/chat.py", line 80, in load tokenizer = Tokenizer(model_path=tokenizer_path) File "/home/LLaMA_MPS/llama/tokenizer.py", line 16, in __init__ assert os.path.isfile(model_path), model_path AssertionError: ./models/tokenizer.model
I checked, and there is no tokenizer.model under ./models.
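
For reference, a minimal sketch of the check that fails here (it mirrors the assertion in llama/tokenizer.py shown in the traceback); the path passed via --tokenizer_path must point at the tokenizer.model file itself, which is not created by this repo and has to be copied into ./models/ from the original LLaMA download:

    import os

    # Hypothetical pre-flight check mirroring llama/tokenizer.py's assertion.
    tokenizer_path = "./models/tokenizer.model"
    assert os.path.isfile(tokenizer_path), (
        f"{tokenizer_path} not found - copy tokenizer.model from the original "
        "LLaMA download into ./models/"
    )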

op Height/Width dimensions must be less than 16384

macOS 14.0
MacBook Pro M1 Max

Autocomplete and instruction-response modes give the same result:

$python3 chat.py --ckpt_dir models/7B-alpaca --tokenizer_path models/tokenizer.model --max_batch_size 8 --max_seq_len 256
Seed: 30112
Loading checkpoint
Loaded in 94.05 seconds
Running the fine-tuned 'alpaca' model in an instruction-response mode.
Instruction: hello
loc("mps_transpose"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/d8ee83b8-11b4-11ee-a66d-46d450270006/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":206:0)): error: 'anec.transpose' op Height/Width dimensions must be less than 16384
Response: hello, hello

It seems like it does not work as expected. I waited a couple of minutes and got just "hello, hello".

Why is fp16 MPS performance worse than CPU?

In your conclusion, MPS performance is worse than llama.cpp's CPU performance at the same fp16 precision. Why? Is there some kernel that MPS doesn't support that falls back to the CPU (and thereby hurts performance)?

You said this:
[image]

I figure you mean that the MPS shaders are compiled just-in-time, so performance is worse than ahead-of-time-compiled CPU code? Am I wrong?
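
A rough timing sketch of that hypothesis (not from this repo, and assuming a PyTorch 2.x build with MPS support): the first call to an op pays the MPS graph/shader compilation cost, while later calls reuse the compiled kernels, so steady-state timings should be compared rather than first-run timings.

    import time
    import torch

    # Illustrative only: time the same fp16 matmul a few times on the MPS device.
    # The first iteration typically includes kernel/graph compilation.
    device = torch.device("mps")
    a = torch.randn(4096, 4096, dtype=torch.float16, device=device)
    b = torch.randn(4096, 4096, dtype=torch.float16, device=device)

    for i in range(3):
        torch.mps.synchronize()
        start = time.perf_counter()
        c = a @ b
        torch.mps.synchronize()
        print(f"matmul run {i}: {time.perf_counter() - start:.4f} s")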

Corrupted output due to PyTorch MPS issue affecting torch.argmax() and torch.multinomial()

Hi

EDIT: after further investigation I have realised that the problems I describe below are entirely caused by pytorch/pytorch#92311. See the last comment for a (hacky) fix.

Thanks very much for releasing this repo. Since I started getting into AI I've really wanted to try some local inference on my AMD 6900XT under macOS. I'm running Ventura 13.3, and I read about Apple recently making MPS available in PyTorch, with support for Intel Macs as well as Apple Silicon, so I was hoping I might get somewhere.

I've tried your repo and it runs without errors and is definitely using the GPU. Unfortunately, the output contains no readable tokens, just a series of ??s.

I realise you've written and tested this for Apple Silicon, but based on what I've read it should in theory be possible on Intel as well, and it definitely does seem to be running, so maybe there's some fixable bug here that's stopping it from working on Intel? Unfortunately my PyTorch knowledge is very limited at the moment, so I'm not sure where to start looking.

Here's an example invocation to show the problem:

tomj@Eddie ~/src/LLaMA_MPS (main●)$ /usr/local/anaconda3/bin/python3.10 chat.py --ckpt_dir ~/Downloads/Torrents/Done/LLaMA/7B --tokenizer_path ~/src/llama.cpp/models/tokenizer.model --max_batch_size 8 --max_seq_len 256
Seed: 29437
Loading checkpoint
Loaded in 13.75 seconds
Running the raw 'llama' model in an auto-complete mode.
Enter your LLaMA prompt: write something about fishing
Thinking...
 ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇

Inferred in 102.06 seconds
==================================

Enter your LLaMA prompt:

I've tried with PyTorch 2.0.0 and also the latest dev version, 2.1.0. I've tried adding PYTORCH_ENABLE_MPS_FALLBACK=1 and/or PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 but this doesn't change anything.

It is definitely using the GPU; I can see that in Activity Monitor:
[image]

It's just that the output is unreadable! Thanks very much in advance for any help or advice.
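
For anyone hitting the same thing, a minimal sketch of the kind of CPU-fallback workaround referred to in the edit above (names are illustrative, not the repo's code): do the token-selection ops off the MPS device, since pytorch/pytorch#92311 reports torch.argmax()/torch.multinomial() returning wrong values there.

    import torch

    # Hypothetical helper: compute probabilities, then sample on the CPU and
    # move the chosen token back to the original device.
    def pick_token(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
        probs = torch.softmax(logits / temperature, dim=-1).to("cpu")
        next_token = torch.multinomial(probs, num_samples=1)
        return next_token.to(logits.device)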

powermetrics

How did you get powermetrics to show the output in the README.md?

MPS device support

I get the following error:
File "/home/LLaMA_MPS/llama/model.py", line 102, in __init__ self.cache_k = torch.zeros( RuntimeError: PyTorch is not linked with support for mps devices
I ran the code on this environment:
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: None
OS: Linux 5.14.0-162.6.1.el9_1.0.1.x86_64
CMake version: 3.20.2
Python version: Python 3.9.14
Python platform: Linux-5.14.0-162.6.1.el9_1.0.1.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA_MODULE_LOADING set to:
GPU models and configuration: 0,1,2,3
Nvidia driver version: 525.60.13
cuDNN version: 8500
Any idea what is going wrong?
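
A quick sanity check, assuming PyTorch 1.13 or newer: MPS support is only compiled into the macOS builds of PyTorch, so a Linux + CUDA wheel like the one above will always report it as unavailable.

    import torch

    # True only on macOS wheels built with MPS; False on Linux/CUDA builds.
    print("MPS built into this wheel:", torch.backends.mps.is_built())
    # Also requires running on Apple hardware with a supported macOS version.
    print("MPS available on this machine:", torch.backends.mps.is_available())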

Trial runs at Llama 7B (success) and 65B (fail)

I noticed you are using venv and pip. I assumed from your powermetrics output that your torch build is able to take full advantage of the GPU? Apple Silicon is new to me, and I thought you had to use conda-forge for the package. I just received an M2 Max with 96 GB; I will try this out and see how much of an improvement it is over the M1.

RuntimeError: Cannot set version_counter for inference tensor

Nice work! I've forked this for DirectML use, and it seems to work: it loads the model on my AMD card and I get to the prompt, but after I enter the prompt I get the following error. I haven't been able to find a solution; any ideas?

(llama_dml) G:\LLaMA_dml>python chat.py --ckpt_dir models/7B --tokenizer_path G:/llama/models/tokenizer.model --max_batch_size 1 --max_seq_len 256
Seed: 19266
Loading checkpoint
Loaded in 10.67 seconds
Running the raw 'llama' model in an auto-complete mode.
Enter your LLaMA prompt: Facebook is good because
Thinking...
Traceback (most recent call last):
  File "chat.py", line 146, in <module>
    fire.Fire(main)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "chat.py", line 129, in main
    results = generator.generate(
  File "G:\LLaMA_dml\llama\generation.py", line 46, in generate
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "G:\LLaMA_dml\llama\model.py", line 265, in forward
    h = self.tok_embeddings(tokens)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\nn\modules\sparse.py", line 162, in forward
    return F.embedding(
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\nn\functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Cannot set version_counter for inference tensor
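
Not a confirmed fix, but one pattern that has helped with backends that don't fully support inference tensors is swapping torch.inference_mode() for torch.no_grad() around generation, since no_grad still allows version counters to be updated. A minimal sketch (function and argument names are placeholders, not the repo's code):

    import torch

    # Hypothetical: decorate the generation step with no_grad() instead of
    # inference_mode(); tensors created under inference_mode() cannot have
    # their version counters set, which matches the error above.
    @torch.no_grad()
    def generate_step(model, tokens, start_pos):
        return model.forward(tokens, start_pos)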

Garbage output on 30B model?

I've been trying to get the 30B model up, but my output is total garbage. Example: Trust Delete викори геCES/$ свобо voicepull mediumвриLongrightarrow fed NormalJe zespo installer пробле конце attacks genu genericituteAX language hy Jurpring lange)))) Архивная. This is in contrast to the 7B and 13B models, which work well.

From the readme, it seems like someone at least managed to measure memory usage on 30B, but is there any indication that they were able to produce reasonable text?

Possibly related: I needed to do the resharding and arrow conversion on a second machine with more RAM; maybe there is some problem with doing the conversion on a different machine than the one used for inference? Is there a 'known good' arrow version I could compare with?

Side note: the readme frames the comparison with llama.cpp as being about performance, but in my experience the 13B model here is much better quality than 65B under llama.cpp. I have several theories about why, but they all suggest that getting the 30B model working here would give stronger output than the alternatives.
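
One way to check against a 'known good' conversion would be to hash the converted files on both machines and diff the results; a minimal sketch (paths are placeholders):

    import hashlib
    from pathlib import Path

    # Hash every file under a model directory so two conversions can be compared.
    def hash_dir(root: str) -> dict:
        digests = {}
        base = Path(root)
        for path in sorted(base.rglob("*")):
            if path.is_file():
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                digests[str(path.relative_to(base))] = h.hexdigest()
        return digests

    # print(hash_dir("models/30B"))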

Huggingface weights repo doesn't have params.json

I pulled down llama-7b-hf and errored out with:

    FileNotFoundError: [Errno 2] No such file or directory: 'models/llama-7b-hf/params.json'
Is it still compatible, or do I need to do the whole torrent thing? Can I bodge together or borrow a params.json file and avoid another 30-hour download?

Thanks
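
If it helps, the original Meta release ships a small params.json next to each set of weights; as far as I know the 7B values are the ones below, but treat them as an assumption and verify against an official copy (HF-converted checkpoints put this information in config.json instead). A sketch that writes one:

    import json

    # Assumed 7B hyperparameters from the original release; verify before use.
    params_7b = {
        "dim": 4096,
        "multiple_of": 256,
        "n_heads": 32,
        "n_layers": 32,
        "norm_eps": 1e-06,
        "vocab_size": -1,
    }
    with open("models/llama-7b-hf/params.json", "w") as f:
        json.dump(params_7b, f)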

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Hi - Thanks for building this, it looks like a great way to try out the model.

I wasn't able to follow the instructions exactly: pip3 install -r requirements.txt reported "No matching distribution found for torch". The Python I have installed is 3.11, so I'm explicitly using pip3.9 / python3.9. I don't know if this is related.

Anyway, when I run the example chat command, I get prompted for my input, and when I enter it, about 30 seconds later, I get this:

Traceback (most recent call last):
  File "/Users/stevex/temp/llama/LLaMA_MPS/chat.py", line 130, in <module>
    fire.Fire(main)
  File "/opt/homebrew/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/homebrew/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/homebrew/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Users/stevex/temp/llama/LLaMA_MPS/chat.py", line 113, in main
    results = generator.generate(
  File "/Users/stevex/temp/llama/LLaMA_MPS/llama/generation.py", line 63, in generate
    next_token = torch.multinomial(
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Using a Mac Studio with 32 GB of RAM.
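
A small sketch of a guard around the failing call, assuming the non-finite values come from the MPS sampling path rather than the weights themselves (not the repo's code): fail loudly on inf/nan and do the actual draw on the CPU.

    import torch

    # Hypothetical replacement for the torch.multinomial() call in generation.py.
    def sample_next_token(probs: torch.Tensor) -> torch.Tensor:
        if not torch.isfinite(probs).all():
            raise ValueError("non-finite values in the probability tensor")
        return torch.multinomial(probs.to("cpu"), num_samples=1)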
