cafaxo / llama2.jl
Julia package for inference and training of Llama-style language models
License: MIT License
I'm new to Julia. This package runs faster than the llama2.c implementation: on average it's 300%+ faster, and Julia utilizes all the CPU cores.
I'm curious to know:
Is there any way we can run Meta's Llama chat models in Julia?
Can Julia use OpenMP to share the load with the GPU?
Does Julia use the CPU's AVX2 instructions, and would it run faster with AVX2? (A quick way to check is sketched below.)
Thanks!
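On the AVX2 question, one quick way to check what the CPU and the generated code actually use (Julia vectorizes through LLVM and LoopVectorization.jl rather than hand-written AVX2 kernels; the toy kernel `f` below is just for illustration):

```julia
# Inspect the detected CPU and the native code for a small dot-product-like kernel.
@show Sys.CPU_NAME                     # detected microarchitecture string

f(x, y) = sum(x .* y)                  # toy kernel, only for inspection
@code_native f(rand(Float32, 128), rand(Float32, 128))
# With AVX2/FMA available, the output shows ymm registers and vfmadd… instructions.
```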
I tried finding and downloading the llama-2-7b-chat.ggmlv3.q4_K_S.bin file from HuggingFace, but was unable to do so.
What does one need to do to actually obtain that .bin file? I tried the model.safetensors files from the repo, but those do not seem to be the right format.
The tokenizer is currently O(N^2) in its input, which makes it impossible to tokenize large amounts of text for training.
We could either use a more efficient algorithm or just pre-tokenize on whitespace (a sketch follows below). We should also check what SentencePiece does.
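A minimal sketch of the pre-tokenization idea, assuming an `encode(tokenizer, text)` entry point (the name is an assumption, not the current API); the quadratic cost then only applies per whitespace-delimited chunk rather than to the whole input, at the price of simplified whitespace handling:

```julia
# Split on whitespace first so the O(N^2) tokenizer only ever sees short pieces.
function encode_pretokenized(tokenizer, text::AbstractString)
    ids = Int[]
    for (i, word) in enumerate(split(text))
        # keep the leading-space convention used by SentencePiece-style tokenizers
        piece = i == 1 ? word : " " * word
        append!(ids, encode(tokenizer, piece))
    end
    return ids
end
```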
We should have a function that builds a vocabulary for the byte pair encoder from some given text. (Relevant for training models.)
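As a starting point, here is a rough character-level sketch of such a vocabulary builder (greedy pair merging; the name and approach are not tied to the package's tokenizer types, and this is far simpler than what SentencePiece does on real training data):

```julia
# Build a BPE-style vocabulary by repeatedly merging the most frequent adjacent pair.
function build_bpe_vocab(text::AbstractString, vocab_size::Int)
    tokens = [string(c) for c in text]            # start from single characters
    vocab  = Set(tokens)
    while length(vocab) < vocab_size
        counts = Dict{Tuple{String,String},Int}()
        for i in 1:length(tokens)-1
            pair = (tokens[i], tokens[i+1])
            counts[pair] = get(counts, pair, 0) + 1
        end
        isempty(counts) && break
        pair   = first(argmax(last, collect(counts)))   # most frequent adjacent pair
        merged = pair[1] * pair[2]
        push!(vocab, merged)
        # rewrite the token stream with the new merged symbol
        out, i = String[], 1
        while i <= length(tokens)
            if i < length(tokens) && (tokens[i], tokens[i+1]) == pair
                push!(out, merged); i += 2
            else
                push!(out, tokens[i]); i += 1
            end
        end
        tokens = out
    end
    return vocab
end
```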
Thanks for this excellent work!
I was able to train and infer, but I do not see an option to save the model. It would be awesome if that option were added.
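A minimal sketch of what a save/load path could look like using the Serialization standard library (this is not an existing Llama2.jl API; the function names and file name are made up, and a format like BSON.jl or field-by-field saving would work just as well):

```julia
using Serialization

# Write the in-memory model struct to disk and read it back.
save_model(path::AbstractString, model) = open(io -> serialize(io, model), path, "w")
load_model(path::AbstractString)        = open(deserialize, path)

# usage (hypothetical):
# save_model("tinystories.jls", model)
# model = load_model("tinystories.jls")
```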
We should add some basic info to the readme that explains how to get this to run.
I think there could be a GPU version of the inference code.
We need to find a way to track down what causes the differences between the two implementations.
The goal is to get the same or nearly identical results at temp=0. We ran some tests with the new .gguf
files, since that format has seen such wide adoption.
Llama2.jl test:
using Llama2
model = load_gguf_model("/path/to/llama-2-7b-chat.Q4_K_S.gguf");
sample(model, "Tim was happy."; temperature = 0.0f0)
llama.cpp .gguf test:
./main -m /Users/lukasmayrhofer/Downloads/llama-2-7b-chat.Q4_K_S.gguf --samplers "temp" --temp 0 -p "Tim was happy."
Current Llama2.jl results:
Tim was happy. Einzelnes, but he was also very proud of his son. He had always known that Tim was special, and he was thrilled to see him finally getting the recognition he deserved.\nAs the two of them sat in the stands, watching the game, Tim couldn't help but feel a sense of pride and joy. He was so grateful to have" ⋯ 667 bytes ⋯ ". \"I'm lucky to have you too.\"\nAs they walked out of the restaurant, Tim felt a sense of contentment and happiness. He knew that he had a wonderful son, and he was grateful for every moment they spent together. He was proud of Tim, and he knew that he would always be there to support and encourage him, no matter what.
Current llama.cpp results:
Tim was happy.
He had just received a new job offer and he was excited to start his new career. He had been searching for a new opportunity for months, and now it seemed like all his hard work had paid off.
As he walked into the office building, he couldn't help but feel a sense of pride. He had worked hard to get where he was, and he knew that this new job would be a great opportunity for him.
Tim took a deep breath as he entered the office. He was greeted by a friendly receptionist who offered him a warm smile. "Hello there," she said. "Welcome to Tim's new workplace."
Tim felt a sense of excitement as he walked through the office. He couldn't wait to meet his new colleagues and start working on his new projects. He knew that this was going to be a great opportunity for him, and he was eager to get started. [end of text]
We need to find an efficient way to pin down what causes the differences between the two.
Support for setting a prompt has been added to llama2.c.
I think it would be nice to have some minimal, self-contained Julia code for training a small Llama2 model.
From some experiments, I already have some CPU-only code that would be easy to adapt to llama2.
Should I adapt this code and push it to a training subdirectory? Opinions welcome.
We should investigate how difficult it would be to add quantization support.
It would be great if we could run the Llama2 7B model on a machine with 8 GB of RAM.
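Rough arithmetic for why 8 GB seems feasible, assuming Q4_K_S averages roughly 4.5 bits per weight (that figure is an estimate, not measured from this package):

```julia
# Weights only: 7e9 parameters × ~4.5 bits ÷ 8 bits/byte ≈ 3.9 GB,
# leaving room for the KV cache and activations on an 8 GB machine.
bytes = 7e9 * 4.5 / 8          # ≈ 3.94e9 bytes
gib   = bytes / 2^30           # ≈ 3.7 GiB
```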
Apple M2 ARM processor running native Julia (aarch64)
~/p/Llama2.jl (master)> julia-native --project=. -tauto
[Julia startup banner: Version 1.9.2 (2023-07-05), official https://julialang.org/ release]
julia> using Llama2, Random
[ Info: Precompiling Llama2 [7841fa2c-192d-471c-ae30-1f93a4daddfc]
julia> model = load_ggml_model("data/llama-2-7b-chat.ggmlv3.q4_K_S.bin");
Loading model... 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:00
julia> sample(model, "The Julia programming language is")
<s>
ERROR: TaskFailedException
nested task error: bitcast: target type not a leaf primitive type
Stacktrace:
[1] reinterpret
@ ./essentials.jl:513 [inlined]
[2] dot(x::SubArray{Llama2.block_q4_K, 1, Matrix{Llama2.block_q4_K}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, y::Vector{Llama2.block_q8_K})
@ Llama2 ~/projects/Llama2.jl/src/quantization/q4.jl:226
[3] macro expansion
@ ~/projects/Llama2.jl/src/matmul.jl:23 [inlined]
[4] (::Llama2.var"#47#threadsfor_fun#24"{Llama2.var"#47#threadsfor_fun#23#25"{Vector{Float32}, Matrix{Llama2.block_q4_K}, Vector{Llama2.block_q8_K}, UnitRange{Int64}}})(tid::Int64; onethread::Bool)
@ Llama2 ./threadingconstructs.jl:194
[5] #47#threadsfor_fun
@ ./threadingconstructs.jl:161 [inlined]
[6] (::Base.Threads.var"#1#2"{Llama2.var"#47#threadsfor_fun#24"{Llama2.var"#47#threadsfor_fun#23#25"{Vector{Float32}, Matrix{Llama2.block_q4_K}, Vector{Llama2.block_q8_K}, UnitRange{Int64}}}, Int64})()
@ Base.Threads ./threadingconstructs.jl:139
...and 7 more exceptions.
I was considering adding LoRA to this repo and figured I'd share my thoughts in case there is interest upstream.
I don't have anything crazy in mind, mostly an implementation similar to https://github.com/microsoft/LoRA/blob/main/loralib/layers.py. Layer weights start in an "unmerged" mode, train with the low-rank parameters, and eventually merge when the network is serialized. A flag in the train function would control whether regular or LoRA linear layers are used.
Let me know if you have any other ideas/suggestions.
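A rough sketch of what such a layer could look like in Julia, loosely following loralib (the struct name, fields, and hyperparameter defaults here are assumptions, not existing Llama2.jl types):

```julia
# LoRA-style dense layer: frozen base weight W plus a trainable low-rank update scale * B*A.
mutable struct LoRADense{TW<:AbstractMatrix,TL<:AbstractMatrix{Float32}}
    W::TW              # frozen base weight, size (out, in)
    A::TL              # low-rank factor, size (r, in), trainable
    B::TL              # low-rank factor, size (out, r), trainable, starts at zero
    scale::Float32     # alpha / r
    merged::Bool
end

function LoRADense(W::AbstractMatrix; r::Int = 8, alpha::Real = 16)
    nout, nin = size(W)
    A = 0.01f0 .* randn(Float32, r, nin)
    B = zeros(Float32, nout, r)                # B = 0 so the layer starts exactly equal to W
    return LoRADense(W, A, B, Float32(alpha / r), false)
end

# Forward pass: while unmerged, add the low-rank correction on the fly.
(l::LoRADense)(x::AbstractVector) = l.merged ? l.W * x : l.W * x .+ l.scale .* (l.B * (l.A * x))

# Fold the low-rank update into W before the model is serialized.
function merge_lora!(l::LoRADense)
    l.merged && return l
    l.W .+= l.scale .* (l.B * l.A)
    l.merged = true
    return l
end
```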
Thank you all for putting this package together.
I am interested in seeing the operations expressed at the highest possible level. At times it feels as if we are manually undoing the flattening that was needed to express everything as 1-d arrays of floats.
For instance, the following
fc = freq_cis_real_row .+ freq_cis_imag_row .* im
QQ = reshape(reinterpret(ComplexF32, s.q), (head_size ÷ 2, p.n_heads))
KK = reshape(reinterpret(ComplexF32, s.k), (head_size ÷ 2, p.n_heads))
# apply RoPE rotation to the q and k vectors for each head
# rotate q and k by the freq_cis_real and freq_cis_imag
QQ .= fc .* QQ
KK .= fc .* KK
replaces some 20+ lines of code...
Not really an issue, but it does simplify the code. Nevertheless, great job on making it all happen.
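For anyone comparing the two forms, here is a small self-contained check (not package code) that the complex multiply above is the same operation as the explicit two-element RoPE rotation:

```julia
head_size = 8
q  = randn(Float32, head_size)
θ  = randn(Float32, head_size ÷ 2)     # one rotation angle per (real, imag) pair
fc = cis.(θ)                           # cos.(θ) .+ im .* sin.(θ)

# complex-number form (as in the snippet above)
qc = reinterpret(ComplexF32, q) .* fc

# explicit two-element rotation form
q2 = similar(q)
for i in 1:head_size ÷ 2
    c, s = cos(θ[i]), sin(θ[i])
    q2[2i-1] = c * q[2i-1] - s * q[2i]
    q2[2i]   = s * q[2i-1] + c * q[2i]
end

reinterpret(Float32, qc) ≈ q2          # true
```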
We should compare the perplexity against llama.cpp to test the quantization code.
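For reference, perplexity is just the exponential of the mean negative log-likelihood of the evaluation text; a sketch assuming we can collect per-token log-probabilities from the forward pass (Llama2.jl does not currently expose a hook for that, so `logprobs` here is hypothetical). llama.cpp's perplexity tool evaluates over fixed-size context windows, so that is probably the procedure to match.

```julia
# ppl = exp( -(1/N) * Σ_t log p(x_t | x_<t) )
perplexity(logprobs::AbstractVector{<:Real}) = exp(-sum(logprobs) / length(logprobs))
```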
Once a query is done, it keeps printing for a while. Is there a programmatic way to stop it?
The readability of the code in src/quantization (which I ported from llama.cpp) is not acceptable.
Weight decay could be quite easily added to the Adam optimizer code (it would then become AdamW).
For example, before this line, add
@turbo @. x *= 1-α*λ
According to the Llama 2 paper, training was done with λ = 0.1.
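For context, here is a sketch of what the full update could look like with that line in place (the function name and hyperparameter defaults are placeholders, not the package's optimizer code):

```julia
using LoopVectorization

# One AdamW step: decoupled weight decay applied to the parameters x before the Adam update.
# x: parameters, g: gradient, m/v: first/second moment buffers, t: step counter.
function adamw_step!(x, g, m, v, t; α = 1f-4, β1 = 0.9f0, β2 = 0.95f0, ϵ = 1f-8, λ = 0.1f0)
    @turbo @. m = β1 * m + (1 - β1) * g
    @turbo @. v = β2 * v + (1 - β2) * g * g
    c1 = 1 - β1^t                       # bias corrections
    c2 = 1 - β2^t
    @turbo @. x *= 1 - α * λ            # decoupled weight decay (this is the AdamW part)
    @turbo @. x -= α * (m / c1) / (sqrt(v / c2) + ϵ)
    return x
end
```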
It would be nice to support more model formats, such as ggmlv3.q8_0, ggml-q4, gguf, ...
@cafaxo
I'm trying to follow the instructions in the README and immediately get the above error.
The Llama2 chat models need a special prompt template to produce good output, see https://github.com/facebookresearch/llama#fine-tuned-chat-models.
We should add a function that automatically provides the proper prompt template to the model.
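A possible helper, following the template documented in the facebookresearch/llama README (the function name and default system prompt below are made up, not existing Llama2.jl API):

```julia
# Wrap a user message in the Llama 2 chat template: [INST] <<SYS>> … <</SYS>> … [/INST]
function chat_prompt(user_msg::AbstractString;
                     system_msg::AbstractString = "You are a helpful, honest assistant.")
    return "[INST] <<SYS>>\n$(system_msg)\n<</SYS>>\n\n$(user_msg) [/INST]"
end

# e.g. sample(model, chat_prompt("Explain multiple dispatch in one paragraph."); temperature = 0.9f0)
```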