cafaxo / llama2.jl
Julia package for inference and training of Llama-style language models
License: MIT License
I'm new to Julia. This package runs faster than the llama2.c implementation: on average it's 300%+ faster, and Julia utilizes all the CPU cores.
I'm curious to know:
Is there any way we can run Meta's Llama chat models in Julia?
Can Julia use OpenMP to share the load with the GPU?
Does Julia use the CPU's AVX2 instructions, and would it run faster with AVX2? (A quick way to check is sketched below.)
Thanks!
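On the AVX2 question, one quick way to check what the CPU and the generated code actually use (Julia vectorizes through LLVM and LoopVectorization.jl rather than hand-written AVX2 kernels; the toy kernel `f` below is just for illustration):

```julia
# Inspect the detected CPU and the native code for a small dot-product-like kernel.
@show Sys.CPU_NAME                     # detected microarchitecture string

f(x, y) = sum(x .* y)                  # toy kernel, only for inspection
@code_native f(rand(Float32, 128), rand(Float32, 128))
# With AVX2/FMA available, the output shows ymm registers and vfmadd… instructions.
```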
I tried finding and downloading the llama-2-7b-chat.ggmlv3.q4_K_S.bin file from HuggingFace, but was unable to do so.
What does one need to do to actually obtain that .bin file? I tried the model.safetensors files from the repo, but those do not seem to be the right format.
The tokenizer is currently O(N^2) in its input, which makes it impossible to tokenize large amounts of text for training.
We could either use a more efficient algorithm or just pre-tokenize on whitespace (a sketch follows below). We should also check what SentencePiece does.
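A minimal sketch of the pre-tokenization idea, assuming an `encode(tokenizer, text)` entry point (the name is an assumption, not the current API); the quadratic cost then only applies per whitespace-delimited chunk rather than to the whole input, at the price of simplified whitespace handling:

```julia
# Split on whitespace first so the O(N^2) tokenizer only ever sees short pieces.
function encode_pretokenized(tokenizer, text::AbstractString)
    ids = Int[]
    for (i, word) in enumerate(split(text))
        # keep the leading-space convention used by SentencePiece-style tokenizers
        piece = i == 1 ? word : " " * word
        append!(ids, encode(tokenizer, piece))
    end
    return ids
end
```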
We should have a function that builds a vocabulary for the byte pair encoder from some given text. (Relevant for training models.)
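As a starting point, here is a rough character-level sketch of such a vocabulary builder (greedy pair merging; the name and approach are not tied to the package's tokenizer types, and this is far simpler than what SentencePiece does on real training data):

```julia
# Build a BPE-style vocabulary by repeatedly merging the most frequent adjacent pair.
function build_bpe_vocab(text::AbstractString, vocab_size::Int)
    tokens = [string(c) for c in text]            # start from single characters
    vocab  = Set(tokens)
    while length(vocab) < vocab_size
        counts = Dict{Tuple{String,String},Int}()
        for i in 1:length(tokens)-1
            pair = (tokens[i], tokens[i+1])
            counts[pair] = get(counts, pair, 0) + 1
        end
        isempty(counts) && break
        pair   = first(argmax(last, collect(counts)))   # most frequent adjacent pair
        merged = pair[1] * pair[2]
        push!(vocab, merged)
        # rewrite the token stream with the new merged symbol
        out, i = String[], 1
        while i <= length(tokens)
            if i < length(tokens) && (tokens[i], tokens[i+1]) == pair
                push!(out, merged); i += 2
            else
                push!(out, tokens[i]); i += 1
            end
        end
        tokens = out
    end
    return vocab
end
```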
Thanks for this excellent work!
I was able to train and infer, but I do not see an option to save the model. It would be awesome if that option were added.
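A minimal sketch of what a save/load path could look like using the Serialization standard library (this is not an existing Llama2.jl API; the function names and file name are made up, and a format like BSON.jl or field-by-field saving would work just as well):

```julia
using Serialization

# Write the in-memory model struct to disk and read it back.
save_model(path::AbstractString, model) = open(io -> serialize(io, model), path, "w")
load_model(path::AbstractString)        = open(deserialize, path)

# usage (hypothetical):
# save_model("tinystories.jls", model)
# model = load_model("tinystories.jls")
```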
We should add some basic info to the readme that explains how to get this to run.
I think there could be a GPU version of the inference code.
We need to find a way to track down what causes the differences between the two implementations.
The goal is to get the same or nearly identical results at temp=0. We ran some tests with the new .gguf
files, since that format has seen such wide adoption.
Llama2.jl test:
using Llama2
model = load_gguf_model("/path/to/llama-2-7b-chat.Q4_K_S.gguf");
sample(model, "Tim was happy."; temperature = 0.0f0)
llama.cpp .gguf test:
./main -m /Users/lukasmayrhofer/Downloads/llama-2-7b-chat.Q4_K_S.gguf --samplers "temp" --temp 0 -p "Tim was happy."
Current Llama2.jl results:
Tim was happy. Einzelnes, but he was also very proud of his son. He had always known that Tim was special, and he was thrilled to see him finally getting the recognition he deserved.\nAs the two of them sat in the stands, watching the game, Tim couldn't help but feel a sense of pride and joy. He was so grateful to have" ⋯ 667 bytes ⋯ ". \"I'm lucky to have you too.\"\nAs they walked out of the restaurant, Tim felt a sense of contentment and happiness. He knew that he had a wonderful son, and he was grateful for every moment they spent together. He was proud of Tim, and he knew that he would always be there to support and encourage him, no matter what.
Current llama.cpp results:
Tim was happy.
He had just received a new job offer and he was excited to start his new career. He had been searching for a new opportunity for months, and now it seemed like all his hard work had paid off.
As he walked into the office building, he couldn't help but feel a sense of pride. He had worked hard to get where he was, and he knew that this new job would be a great opportunity for him.
Tim took a deep breath as he entered the office. He was greeted by a friendly receptionist who offered him a warm smile. "Hello there," she said. "Welcome to Tim's new workplace."
Tim felt a sense of excitement as he walked through the office. He couldn't wait to meet his new colleagues and start working on his new projects. He knew that this was going to be a great opportunity for him, and he was eager to get started. [end of text]
We need to find an efficient way to pin down what causes the differences between the two.
Support for setting a prompt has been added to llama2.c.
I think it would be nice to have some minimal, self-contained Julia code for training a small Llama2 model.
From some experiments, I already have some CPU-only code that would be easy to adapt to llama2.
Should I adapt this code and push it to a training subdirectory? Opinions welcome.
We should investigate how difficult it would be to add quantization support.
It would be great if we could run the Llama2 7B model on a machine with 8 GB of RAM.
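Rough arithmetic for why 8 GB seems feasible, assuming Q4_K_S averages roughly 4.5 bits per weight (that figure is an estimate, not measured from this package):

```julia
# Weights only: 7e9 parameters × ~4.5 bits ÷ 8 bits/byte ≈ 3.9 GB,
# leaving room for the KV cache and activations on an 8 GB machine.
bytes = 7e9 * 4.5 / 8          # ≈ 3.94e9 bytes
gib   = bytes / 2^30           # ≈ 3.7 GiB
```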
Apple M2 ARM processor running native Julia (aarch64)
~/p/Llama2.jl (master)> julia-native --project=. -tauto
[Julia startup banner: Version 1.9.2 (2023-07-05), official https://julialang.org/ release]
julia> using Llama2, Random
[ Info: Precompiling Llama2 [7841fa2c-192d-471c-ae30-1f93a4daddfc]
julia> model = load_ggml_model("data/llama-2-7b-chat.ggmlv3.q4_K_S.bin");
Loading model... 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:00
julia> sample(model, "The Julia programming language is")
<s>
ERROR: TaskFailedException
nested task error: bitcast: target type not a leaf primitive type
Stacktrace:
[1] reinterpret
@ ./essentials.jl:513 [inlined]
[2] dot(x::SubArray{Llama2.block_q4_K, 1, Matrix{Llama2.block_q4_K}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, y::Vector{Llama2.block_q8_K})
@ Llama2 ~/projects/Llama2.jl/src/quantization/q4.jl:226
[3] macro expansion
@ ~/projects/Llama2.jl/src/matmul.jl:23 [inlined]
[4] (::Llama2.var"#47#threadsfor_fun#24"{Llama2.var"#47#threadsfor_fun#23#25"{Vector{Float32}, Matrix{Llama2.block_q4_K}, Vector{Llama2.block_q8_K}, UnitRange{Int64}}})(tid::Int64; onethread::Bool)
@ Llama2 ./threadingconstructs.jl:194
[5] #47#threadsfor_fun
@ ./threadingconstructs.jl:161 [inlined]
[6] (::Base.Threads.var"#1#2"{Llama2.var"#47#threadsfor_fun#24"{Llama2.var"#47#threadsfor_fun#23#25"{Vector{Float32}, Matrix{Llama2.block_q4_K}, Vector{Llama2.block_q8_K}, UnitRange{Int64}}}, Int64})()
@ Base.Threads ./threadingconstructs.jl:139
...and 7 more exceptions.
I was considering adding LoRA to this repo and figured I'd share my thoughts in case there is interest upstream.
I don't have anything crazy in mind, mostly an implementation similar to https://github.com/microsoft/LoRA/blob/main/loralib/layers.py. Layer weights start in an "unmerged" mode, train with the low-rank parameters, and eventually merge when the network is serialized. A flag in the train function would control whether regular or LoRA linear layers are used.
Let me know if you have any other ideas/suggestions.
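A rough sketch of what such a layer could look like in Julia, loosely following loralib (the struct name, fields, and hyperparameter defaults here are assumptions, not existing Llama2.jl types):

```julia
# LoRA-style dense layer: frozen base weight W plus a trainable low-rank update scale * B*A.
mutable struct LoRADense{TW<:AbstractMatrix,TL<:AbstractMatrix{Float32}}
    W::TW              # frozen base weight, size (out, in)
    A::TL              # low-rank factor, size (r, in), trainable
    B::TL              # low-rank factor, size (out, r), trainable, starts at zero
    scale::Float32     # alpha / r
    merged::Bool
end

function LoRADense(W::AbstractMatrix; r::Int = 8, alpha::Real = 16)
    nout, nin = size(W)
    A = 0.01f0 .* randn(Float32, r, nin)
    B = zeros(Float32, nout, r)                # B = 0 so the layer starts exactly equal to W
    return LoRADense(W, A, B, Float32(alpha / r), false)
end

# Forward pass: while unmerged, add the low-rank correction on the fly.
(l::LoRADense)(x::AbstractVector) = l.merged ? l.W * x : l.W * x .+ l.scale .* (l.B * (l.A * x))

# Fold the low-rank update into W before the model is serialized.
function merge_lora!(l::LoRADense)
    l.merged && return l
    l.W .+= l.scale .* (l.B * l.A)
    l.merged = true
    return l
end
```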
Thank you all for putting this package together.
I am interested in seeing the operations expressed at the highest possible level. At times it feels as if we are manually undoing the flattening that was needed to express everything as 1-d arrays of floats.
For instance, the following
fc = freq_cis_real_row .+ freq_cis_imag_row .* im
QQ = reshape(reinterpret(ComplexF32, s.q), (head_size ÷ 2, p.n_heads))
KK = reshape(reinterpret(ComplexF32, s.k), (head_size ÷ 2, p.n_heads))
# apply RoPE rotation to the q and k vectors for each head
# rotate q and k by the freq_cis_real and freq_cis_imag
QQ .= fc .* QQ
KK .= fc .* KK
replaces some 20+ lines of code...
Not really an issue, but it does simplify the code. Nevertheless, great job on making it all happen.
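For anyone comparing the two forms, here is a small self-contained check (not package code) that the complex multiply above is the same operation as the explicit two-element RoPE rotation:

```julia
head_size = 8
q  = randn(Float32, head_size)
θ  = randn(Float32, head_size ÷ 2)     # one rotation angle per (real, imag) pair
fc = cis.(θ)                           # cos.(θ) .+ im .* sin.(θ)

# complex-number form (as in the snippet above)
qc = reinterpret(ComplexF32, q) .* fc

# explicit two-element rotation form
q2 = similar(q)
for i in 1:head_size ÷ 2
    c, s = cos(θ[i]), sin(θ[i])
    q2[2i-1] = c * q[2i-1] - s * q[2i]
    q2[2i]   = s * q[2i-1] + c * q[2i]
end

reinterpret(Float32, qc) ≈ q2          # true
```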
We should compare the perplexity against llama.cpp to test the quantization code.
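For reference, perplexity is just the exponential of the mean negative log-likelihood of the evaluation text; a sketch assuming we can collect per-token log-probabilities from the forward pass (Llama2.jl does not currently expose a hook for that, so `logprobs` here is hypothetical). llama.cpp's perplexity tool evaluates over fixed-size context windows, so that is probably the procedure to match.

```julia
# ppl = exp( -(1/N) * Σ_t log p(x_t | x_<t) )
perplexity(logprobs::AbstractVector{<:Real}) = exp(-sum(logprobs) / length(logprobs))
```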
Once a query is done, it keeps printing for a while. Is there a programmatic way to stop it?
The readability of the code in src/quantization (which I ported from llama.cpp) is not acceptable.
Weight decay could be quite easily added to the Adam optimizer code (it would then become AdamW).
For example, before this line, add
@turbo @. x *= 1-α*λ
According to the Llama 2 paper, training was done with λ = 0.1.
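For context, here is a sketch of what the full update could look like with that line in place (the function name and hyperparameter defaults are placeholders, not the package's optimizer code):

```julia
using LoopVectorization

# One AdamW step: decoupled weight decay applied to the parameters x before the Adam update.
# x: parameters, g: gradient, m/v: first/second moment buffers, t: step counter.
function adamw_step!(x, g, m, v, t; α = 1f-4, β1 = 0.9f0, β2 = 0.95f0, ϵ = 1f-8, λ = 0.1f0)
    @turbo @. m = β1 * m + (1 - β1) * g
    @turbo @. v = β2 * v + (1 - β2) * g * g
    c1 = 1 - β1^t                       # bias corrections
    c2 = 1 - β2^t
    @turbo @. x *= 1 - α * λ            # decoupled weight decay (this is the AdamW part)
    @turbo @. x -= α * (m / c1) / (sqrt(v / c2) + ϵ)
    return x
end
```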
It would be nice to support more model formats, such as ggmlv3.q8_0, ggml-q4, gguf, ...
@cafaxo
I'm trying to follow the instructions in the README and immediately get the above error.
The Llama2 chat models need a special prompt template to produce good output, see https://github.com/facebookresearch/llama#fine-tuned-chat-models.
We should add a function that automatically provides the proper prompt template to the model.
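A possible helper, following the template documented in the facebookresearch/llama README (the function name and default system prompt below are made up, not existing Llama2.jl API):

```julia
# Wrap a user message in the Llama 2 chat template: [INST] <<SYS>> … <</SYS>> … [/INST]
function chat_prompt(user_msg::AbstractString;
                     system_msg::AbstractString = "You are a helpful, honest assistant.")
    return "[INST] <<SYS>>\n$(system_msg)\n<</SYS>>\n\n$(user_msg) [/INST]"
end

# e.g. sample(model, chat_prompt("Explain multiple dispatch in one paragraph."); temperature = 0.9f0)
```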