

llama_mps's People

Contributors

birch-san, jankais3r



llama_mps's Issues

Support Apple Neural Engine (ANE) Transformers

I noticed Apple supports ANE Transformers.

According to their own words:

M1 or newer chip to achieve up to 10 times faster and 14 times lower peak memory

Does that mean running 30B or 65B will be possible on small-memory MacBooks?

Here are a few links:
https://github.com/apple/ml-ane-transformers
https://machinelearning.apple.com/research/neural-engine-transformers

As this project is the leading LLaMA implementation that leverages the Apple GPU, would it be possible to support the ANE as well?

AssertionError: ./models/tokenizer.model

Running the command python3 chat.py --ckpt_dir ./models/7B --tokenizer_path ./models/tokenizer.model --max_batch_size=8 --max_seq_len=512
I get this error:
File "/home/LLaMA_MPS/chat.py", line 106, in main generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size) File "/home/LLaMA_MPS/chat.py", line 80, in load tokenizer = Tokenizer(model_path=tokenizer_path) File "/home/LLaMA_MPS/llama/tokenizer.py", line 16, in __init__ assert os.path.isfile(model_path), model_path AssertionError: ./models/tokenizer.model
I checked, and there is no tokenizer.model under ./models.
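
For reference, a minimal sketch of the check that fails here (it mirrors the assertion in llama/tokenizer.py shown in the traceback); the path passed via --tokenizer_path must point at the tokenizer.model file itself, which is not created by this repo and has to be copied into ./models/ from the original LLaMA download:

    import os

    # Hypothetical pre-flight check mirroring llama/tokenizer.py's assertion.
    tokenizer_path = "./models/tokenizer.model"
    assert os.path.isfile(tokenizer_path), (
        f"{tokenizer_path} not found - copy tokenizer.model from the original "
        "LLaMA download into ./models/"
    )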

op Height/Width dimensions must be less than 16384

macOS 14.0
MacBook Pro M1 Max

Autocomplete and instruction-response modes give the same result:

$python3 chat.py --ckpt_dir models/7B-alpaca --tokenizer_path models/tokenizer.model --max_batch_size 8 --max_seq_len 256
Seed: 30112
Loading checkpoint
Loaded in 94.05 seconds
Running the fine-tuned 'alpaca' model in an instruction-response mode.
Instruction: hello
loc("mps_transpose"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/d8ee83b8-11b4-11ee-a66d-46d450270006/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":206:0)): error: 'anec.transpose' op Height/Width dimensions must be less than 16384
Response: hello, hello

It seems like it does not work as expected. I waited a couple of minutes and got just "hello, hello".

Why is fp16 MPS performance worse than CPU?

In your conclusion, MPS performance is worse than llama.cpp's CPU performance at the same fp16 precision. Why? Is there some kernel that MPS doesn't support that falls back to the CPU (and thereby hurts performance)?

You said this:
[image]

I figure you mean that the MPS shaders are compiled just-in-time, so performance is worse than ahead-of-time-compiled CPU code? Am I wrong?
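
A rough timing sketch of that hypothesis (not from this repo, and assuming a PyTorch 2.x build with MPS support): the first call to an op pays the MPS graph/shader compilation cost, while later calls reuse the compiled kernels, so steady-state timings should be compared rather than first-run timings.

    import time
    import torch

    # Illustrative only: time the same fp16 matmul a few times on the MPS device.
    # The first iteration typically includes kernel/graph compilation.
    device = torch.device("mps")
    a = torch.randn(4096, 4096, dtype=torch.float16, device=device)
    b = torch.randn(4096, 4096, dtype=torch.float16, device=device)

    for i in range(3):
        torch.mps.synchronize()
        start = time.perf_counter()
        c = a @ b
        torch.mps.synchronize()
        print(f"matmul run {i}: {time.perf_counter() - start:.4f} s")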

Corrupted output due to PyTorch MPS issue affecting torch.argmax() and torch.multinomial()

Hi

EDIT: after further investigation I have realised that the problems I describe below are entirely caused by pytorch/pytorch#92311. See the last comment for a (hacky) fix.

Thanks very much for releasing this repo. Since I started getting into AI I've really wanted to try some local inference on my AMD 6900XT under macOS. I'm running Ventura 13.3, and I read about Apple recently making MPS available in PyTorch, with support for Intel Macs as well as Apple Silicon, so I was hoping I might get somewhere.

I've tried your repo and it runs without errors and is definitely using the GPU. Unfortunately, the output contains no readable tokens, just a series of ??s.

I realise you've written and tested this for Apple Silicon, but based on what I've read it should in theory be possible on Intel as well, and it definitely does seem to be running, so maybe there's some fixable bug here that's stopping it from working on Intel? Unfortunately my PyTorch knowledge is very limited at the moment, so I'm not sure where to start looking.

Here's an example invocation to show the problem:

tomj@Eddie ~/src/LLaMA_MPS (main●)$ /usr/local/anaconda3/bin/python3.10 chat.py --ckpt_dir ~/Downloads/Torrents/Done/LLaMA/7B --tokenizer_path ~/src/llama.cpp/models/tokenizer.model --max_batch_size 8 --max_seq_len 256
Seed: 29437
Loading checkpoint
Loaded in 13.75 seconds
Running the raw 'llama' model in an auto-complete mode.
Enter your LLaMA prompt: write something about fishing
Thinking...
 ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇

Inferred in 102.06 seconds
==================================

Enter your LLaMA prompt:

I've tried with PyTorch 2.0.0 and also the latest dev version, 2.1.0. I've tried adding PYTORCH_ENABLE_MPS_FALLBACK=1 and/or PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 but this doesn't change anything.

It is definitely using the GPU; I can see that in Activity Monitor:
[image]

It's just that the output is unreadable! Thanks very much in advance for any help or advice.
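
For anyone hitting the same thing, a minimal sketch of the kind of CPU-fallback workaround referred to in the edit above (names are illustrative, not the repo's code): do the token-selection ops off the MPS device, since pytorch/pytorch#92311 reports torch.argmax()/torch.multinomial() returning wrong values there.

    import torch

    # Hypothetical helper: compute probabilities, then sample on the CPU and
    # move the chosen token back to the original device.
    def pick_token(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
        probs = torch.softmax(logits / temperature, dim=-1).to("cpu")
        next_token = torch.multinomial(probs, num_samples=1)
        return next_token.to(logits.device)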

powermetrics

How did you get powermetrics to show the output in the README.md?

MPS device support

I get the following error:
File "/home/LLaMA_MPS/llama/model.py", line 102, in __init__ self.cache_k = torch.zeros( RuntimeError: PyTorch is not linked with support for mps devices
I ran the code on this environment:
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: None
OS: Linux 5.14.0-162.6.1.el9_1.0.1.x86_64
CMake version: 3.20.2
Python version: Python 3.9.14
Python platform: Linux-5.14.0-162.6.1.el9_1.0.1.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA_MODULE_LOADING set to:
GPU models and configuration: 0,1,2,3
Nvidia driver version: 525.60.13
cuDNN version: 8500
Any idea what is going wrong?
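
A quick sanity check, assuming PyTorch 1.13 or newer: MPS support is only compiled into the macOS builds of PyTorch, so a Linux + CUDA wheel like the one above will always report it as unavailable.

    import torch

    # True only on macOS wheels built with MPS; False on Linux/CUDA builds.
    print("MPS built into this wheel:", torch.backends.mps.is_built())
    # Also requires running on Apple hardware with a supported macOS version.
    print("MPS available on this machine:", torch.backends.mps.is_available())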

Trial runs at Llama 7B (success) and 65B (fail)

I noticed you are using venv and pip. I assumed from your powermetrics output that your torch build is able to take full advantage of the GPU? Apple Silicon is new to me, and I thought you had to use conda-forge for the package. I just received an M2 Max with 96 GB; I will try this out and see how much of an improvement it is over the M1.

RuntimeError: Cannot set version_counter for inference tensor

Nice work! I've forked this for DirectML use, and it seems to work: it loads the model on my AMD card and I get to the prompt, but after I enter the prompt I get the following error. I haven't been able to find a solution; any ideas?

(llama_dml) G:\LLaMA_dml>python chat.py --ckpt_dir models/7B --tokenizer_path G:/llama/models/tokenizer.model --max_batch_size 1 --max_seq_len 256
Seed: 19266
Loading checkpoint
Loaded in 10.67 seconds
Running the raw 'llama' model in an auto-complete mode.
Enter your LLaMA prompt: Facebook is good because
Thinking...
Traceback (most recent call last):
  File "chat.py", line 146, in <module>
    fire.Fire(main)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "chat.py", line 129, in main
    results = generator.generate(
  File "G:\LLaMA_dml\llama\generation.py", line 46, in generate
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "G:\LLaMA_dml\llama\model.py", line 265, in forward
    h = self.tok_embeddings(tokens)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\nn\modules\sparse.py", line 162, in forward
    return F.embedding(
  File "C:\ProgramData\Anaconda3\envs\llama_dml\lib\site-packages\torch\nn\functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Cannot set version_counter for inference tensor
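
Not a confirmed fix, but one pattern that has helped with backends that don't fully support inference tensors is swapping torch.inference_mode() for torch.no_grad() around generation, since no_grad still allows version counters to be updated. A minimal sketch (function and argument names are placeholders, not the repo's code):

    import torch

    # Hypothetical: decorate the generation step with no_grad() instead of
    # inference_mode(); tensors created under inference_mode() cannot have
    # their version counters set, which matches the error above.
    @torch.no_grad()
    def generate_step(model, tokens, start_pos):
        return model.forward(tokens, start_pos)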

Garbage output on 30B model?

I've been trying to get the 30B model up, but my output is total garbage. Example: Trust Delete викори геCES/$ свобо voicepull mediumвриLongrightarrow fed NormalJe zespo installer пробле конце attacks genu genericituteAX language hy Jurpring lange)))) Архивная. This is in contrast to the 7B and 13B models, which work well.

From the readme, it seems like someone at least managed to measure memory usage on 30B, but is there any indication that they were able to produce reasonable text?

Possibly related: I needed to do the resharding and arrow conversion on a second machine with more RAM; maybe there is some problem with doing the conversion on a different machine than the one used for inference? Is there a 'known good' arrow version I could compare with?

Side note: the readme frames the comparison with llama.cpp as being about performance, but in my experience the 13B model here is much better quality than 65B under llama.cpp. I have several theories about why, but they all suggest that getting the 30B model working here would give stronger output than the alternatives.
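
One way to check against a 'known good' conversion would be to hash the converted files on both machines and diff the results; a minimal sketch (paths are placeholders):

    import hashlib
    from pathlib import Path

    # Hash every file under a model directory so two conversions can be compared.
    def hash_dir(root: str) -> dict:
        digests = {}
        base = Path(root)
        for path in sorted(base.rglob("*")):
            if path.is_file():
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                digests[str(path.relative_to(base))] = h.hexdigest()
        return digests

    # print(hash_dir("models/30B"))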

Huggingface weights repo doesn't have params.json

I pulled down llama-7b-hf and errored out with:

    FileNotFoundError: [Errno 2] No such file or directory: 'models/llama-7b-hf/params.json'
Is it still compatible, or do I need to do the whole torrent thing? Can I bodge together or borrow a params.json file and avoid another 30-hour download?

Thanks
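
If it helps, the original Meta release ships a small params.json next to each set of weights; as far as I know the 7B values are the ones below, but treat them as an assumption and verify against an official copy (HF-converted checkpoints put this information in config.json instead). A sketch that writes one:

    import json

    # Assumed 7B hyperparameters from the original release; verify before use.
    params_7b = {
        "dim": 4096,
        "multiple_of": 256,
        "n_heads": 32,
        "n_layers": 32,
        "norm_eps": 1e-06,
        "vocab_size": -1,
    }
    with open("models/llama-7b-hf/params.json", "w") as f:
        json.dump(params_7b, f)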

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Hi - Thanks for building this, it looks like a great way to try out the model.

I wasn't able to follow the instructions exactly: pip3 install -r requirements.txt reported "No matching distribution found for torch". The Python I have installed is 3.11, so I'm explicitly using pip3.9 / python3.9. I don't know if this is related.

Anyway, when I run the example chat command, I get prompted for my input, and when I enter it, about 30 seconds later, I get this:

Traceback (most recent call last):
  File "/Users/stevex/temp/llama/LLaMA_MPS/chat.py", line 130, in <module>
    fire.Fire(main)
  File "/opt/homebrew/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/homebrew/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/homebrew/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Users/stevex/temp/llama/LLaMA_MPS/chat.py", line 113, in main
    results = generator.generate(
  File "/Users/stevex/temp/llama/LLaMA_MPS/llama/generation.py", line 63, in generate
    next_token = torch.multinomial(
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Using a Mac Studio with 32 GB of RAM.
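
A small sketch of a guard around the failing call, assuming the non-finite values come from the MPS sampling path rather than the weights themselves (not the repo's code): fail loudly on inf/nan and do the actual draw on the CPU.

    import torch

    # Hypothetical replacement for the torch.multinomial() call in generation.py.
    def sample_next_token(probs: torch.Tensor) -> torch.Tensor:
        if not torch.isfinite(probs).all():
            raise ValueError("non-finite values in the probability tensor")
        return torch.multinomial(probs.to("cpu"), num_samples=1)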
