
llama-cpp-rs's Introduction

🦙 llama-cpp-rs (Docs | Latest Version | License)

This is the home for llama-cpp-2. It also contains the llama-cpp-sys-2 bindings, which are updated regularly and kept in sync with llama-cpp-2.

This project was created with the explicit goal of staying as up to date as possible with llama.cpp. As a result, it is dead simple, very close to the raw bindings, and does not follow semver meaningfully.

Check out docs.rs for the crate documentation or the README for high-level information about the project.

Try it

We maintain a super simple example of using the library:

Clone the repo

git clone --recursive https://github.com/utilityai/llama-cpp-rs
cd llama-cpp-rs

Run the simple example (add --features cuda if you have a CUDA GPU)

cargo run --release --bin simple "The way to kill a linux process is" hf-model TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf
Output
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_params { n_gpu_layers: 1000, split_mode: 1, main_gpu: 0, tensor_split: 0x0, progress_callback: None, progress_callback_user_data: 0x0, kv_overrides: 0x0, vocab_only: false, use_mmap: true, use_mlock: false }
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/marcus/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-GGUF/snapshots/b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80/llama-2-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 ''
llm_load_print_meta: EOS token        = 2 ''
llm_load_print_meta: UNK token        = 0 ''
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      CUDA0 buffer size =  3820.94 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
Loaded "/home/marcus/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-GGUF/snapshots/b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80/llama-2-7b.Q4_K_M.gguf"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    13.02 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 164.01 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 8.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   164.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
n_len = 32, n_ctx = 2048, k_kv_req = 32

The way to kill a linux process is to send it a SIGKILL signal. The way to kill a windows process is to send it a S

decoded 24 tokens in 0.23 s, speed 105.65 t/s

load time = 727.50 ms
sample time = 0.46 ms / 24 runs (0.02 ms per token, 51835.85 tokens per second)
prompt eval time = 68.52 ms / 9 tokens (7.61 ms per token, 131.35 tokens per second)
eval time = 225.70 ms / 24 runs (9.40 ms per token, 106.34 tokens per second)
total time = 954.18 ms
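
For a flavour of the wrapper API, the simple example boils down to roughly the following flow. This is a condensed, untested sketch assembled from the same calls that appear in the example code quoted later on this page (it assumes the anyhow crate, and the model path is a placeholder):

use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::model::LlamaModel;

fn main() -> anyhow::Result<()> {
    // Initialise the llama.cpp backend once per process.
    let backend = LlamaBackend::init()?;

    // Load a GGUF model from disk (placeholder path).
    let model_params = LlamaModelParams::default();
    let model = LlamaModel::load_from_file(
        &backend,
        std::path::Path::new("model.gguf"),
        &model_params,
    )?;

    // Create an inference context; tokenization, decoding and sampling then
    // happen against this context, as in the full example code further down.
    let _ctx = model.new_context(&backend, LlamaContextParams::default())?;
    Ok(())
}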

Hacking

Ensure that when you clone this project you also clone the submodules. This can be done with the following command:

git clone --recursive https://github.com/utilityai/llama-cpp-rs

or if you have already cloned the project you can run:

git submodule update --init --recursive

llama-cpp-rs's People

Contributors

actions-user, anagri, babichjacob, bruceunx, danbev, dependabot[bot], derrickpersson, estokes, hirtol, jasonmccampbell, jiabochao, kinkel-ralf, l-jasmine, luke344, marcusdunn, sepehr455, silasmarvin, systemcluster, tinglou, tommyip, vladfaust, zh217


llama-cpp-rs's Issues

Support dynamic linking llama

Since the change in llama-cpp-sys-2's build.rs, specifically in version 0.1.55, I've found that compiling the CUDA version is extremely slow (taking 3 hours) and that the build script reruns very frequently.

To speed things up, I prevent build.rs from running by overriding the `llama` links entry in .cargo/config and dynamically linking against libllama.so instead.
https://doc.rust-lang.org/stable/cargo/reference/build-scripts.html#overriding-build-scripts

However, without running build.rs, the bindings.rs file is not generated.
So I hope there could be a feature flag that allows using a pre-generated bindings.rs, or dynamically linking against llama directly, without running build.rs.
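
A rough sketch of what such an option could look like in the sys crate's build.rs. This is hypothetical, not something the crate provides today; the `dynamic` feature name and the LLAMA_LIB_DIR variable are made up for illustration:

// build.rs sketch of the requested feature (hypothetical).
fn main() {
    // If a (hypothetical) `dynamic` cargo feature is enabled, skip compiling
    // llama.cpp entirely and link against a system libllama.so instead.
    if std::env::var("CARGO_FEATURE_DYNAMIC").is_ok() {
        println!("cargo:rustc-link-lib=dylib=llama");
        // Optionally let the user point at the directory containing libllama.so.
        if let Ok(dir) = std::env::var("LLAMA_LIB_DIR") {
            println!("cargo:rustc-link-search=native={dir}");
        }
        // A pre-generated bindings.rs committed to the repo would be included
        // directly, so bindgen never needs to run on this path.
        return;
    }
    // ... existing path: compile llama.cpp and generate bindings with bindgen ...
}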

Document `sample_token_greedy` better

As seen in #161 (and a mistake I have made myself), `sample_token_greedy` is a little weird, and avoiding unintended behavior requires reading the llama.cpp docs/code.

We should document it better and provide a golden path via some replacement for the deprecated Sampler struct.

Can't make this compile...

Hey,

Really great to see this project - but unfortunately I can't seem to compile it, and this may well be my fault as I'm new to Rust.

( if so - my apologies )

Any advice or pointers would be much appreciated.

This is the error... ( on a MacBook Pro M1 Max )

The following warnings were emitted during compilation:

warning: llama.cpp/ggml.c:2225:19: warning: unused function 'ggml_up32' [-Wunused-function]
warning: static inline int ggml_up32(int n) {
warning: ^
warning: llama.cpp/ggml.c:17941:13: warning: unused function 'ggml_opt_get_grad' [-Wunused-function]
warning: static void ggml_opt_get_grad(int np, struct ggml_tensor * const ps[], float * g) {
warning: ^
warning: 2 warnings generated.
warning: llama.cpp/ggml-backend.c:1036:13: warning: unused function 'sched_print_assignments' [-Wunused-function]
warning: static void sched_print_assignments(ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
warning: ^
warning: 1 warning generated.
warning: llama.cpp/ggml-quants.c:1376:14: warning: unused function 'make_qkx1_quants' [-Wunused-function]
warning: static float make_qkx1_quants(int n, int nmax, const float * restrict x, uint8_t * restrict L, float * restrict the_min,
warning: ^
warning: 1 warning generated.
warning: llama.cpp/llama.cpp:1179:34: error: use of undeclared identifier 'RLIMIT_MEMLOCK'
warning: if (suggest && getrlimit(RLIMIT_MEMLOCK, &lock_limit)) {
warning: ^
warning: 1 error generated.

error: failed to run custom build command for llama-cpp-sys-2 v0.1.22

Caused by:
process didn't exit successfully: /Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-db28f601b850efa8/build-script-build (exit status: 1)
--- stdout
cargo:rerun-if-changed=llama.cpp
compiling ggml
TARGET = Some("aarch64-apple-darwin")
OPT_LEVEL = Some("3")
HOST = Some("aarch64-apple-darwin")
cargo:rerun-if-env-changed=CC_aarch64-apple-darwin
CC_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=CC_aarch64_apple_darwin
CC_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_CC
HOST_CC = None
cargo:rerun-if-env-changed=CC
CC = None
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("aes,crc,dit,dotprod,dpb,dpb2,fcma,fhm,flagm,fp16,frintts,jsconv,lor,lse,neon,paca,pacg,pan,pmuv3,ras,rcpc,rcpc2,rdm,sb,sha2,sha3,ssbs,vh")
cargo:rerun-if-env-changed=CFLAGS_aarch64-apple-darwin
CFLAGS_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=CFLAGS_aarch64_apple_darwin
CFLAGS_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_CFLAGS
HOST_CFLAGS = None
cargo:rerun-if-env-changed=CFLAGS
CFLAGS = None
running: env -u IPHONEOS_DEPLOYMENT_TARGET "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-std=c17" "-Wall" "-Wextra" "-DGGML_USE_K_QUANTS" "-o" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/ggml.o" "-c" "llama.cpp/ggml.c"
cargo:warning=llama.cpp/ggml.c:2225:19: warning: unused function 'ggml_up32' [-Wunused-function]

cargo:warning=static inline int ggml_up32(int n) {

cargo:warning= ^

cargo:warning=llama.cpp/ggml.c:17941:13: warning: unused function 'ggml_opt_get_grad' [-Wunused-function]

cargo:warning=static void ggml_opt_get_grad(int np, struct ggml_tensor * const ps[], float * g) {

cargo:warning= ^

cargo:warning=2 warnings generated.

exit status: 0
running: env -u IPHONEOS_DEPLOYMENT_TARGET "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-std=c17" "-Wall" "-Wextra" "-DGGML_USE_K_QUANTS" "-o" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/ggml-alloc.o" "-c" "llama.cpp/ggml-alloc.c"
exit status: 0
running: env -u IPHONEOS_DEPLOYMENT_TARGET "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-std=c17" "-Wall" "-Wextra" "-DGGML_USE_K_QUANTS" "-o" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/ggml-backend.o" "-c" "llama.cpp/ggml-backend.c"
cargo:warning=llama.cpp/ggml-backend.c:1036:13: warning: unused function 'sched_print_assignments' [-Wunused-function]

cargo:warning=static void sched_print_assignments(ggml_backend_sched_t sched, struct ggml_cgraph * graph) {

cargo:warning= ^

cargo:warning=1 warning generated.

exit status: 0
running: env -u IPHONEOS_DEPLOYMENT_TARGET "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-std=c17" "-Wall" "-Wextra" "-DGGML_USE_K_QUANTS" "-o" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/ggml-quants.o" "-c" "llama.cpp/ggml-quants.c"
cargo:warning=llama.cpp/ggml-quants.c:1376:14: warning: unused function 'make_qkx1_quants' [-Wunused-function]

cargo:warning=static float make_qkx1_quants(int n, int nmax, const float * restrict x, uint8_t * restrict L, float * restrict the_min,

cargo:warning= ^

cargo:warning=1 warning generated.

exit status: 0
cargo:rerun-if-env-changed=AR_aarch64-apple-darwin
AR_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=AR_aarch64_apple_darwin
AR_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_AR
HOST_AR = None
cargo:rerun-if-env-changed=AR
AR = None
cargo:rerun-if-env-changed=ARFLAGS_aarch64-apple-darwin
ARFLAGS_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=ARFLAGS_aarch64_apple_darwin
ARFLAGS_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_ARFLAGS
HOST_ARFLAGS = None
cargo:rerun-if-env-changed=ARFLAGS
ARFLAGS = None
running: ZERO_AR_DATE="1" "ar" "cq" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/libggml.a" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/ggml.o" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/ggml-alloc.o" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/ggml-backend.o" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/ggml-quants.o"
exit status: 0
running: "ar" "s" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/libggml.a"
exit status: 0
cargo:rustc-link-lib=static=ggml
cargo:rustc-link-search=native=/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out
compiling llama
TARGET = Some("aarch64-apple-darwin")
OPT_LEVEL = Some("3")
HOST = Some("aarch64-apple-darwin")
cargo:rerun-if-env-changed=CXX_aarch64-apple-darwin
CXX_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=CXX_aarch64_apple_darwin
CXX_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_CXX
HOST_CXX = None
cargo:rerun-if-env-changed=CXX
CXX = None
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("aes,crc,dit,dotprod,dpb,dpb2,fcma,fhm,flagm,fp16,frintts,jsconv,lor,lse,neon,paca,pacg,pan,pmuv3,ras,rcpc,rcpc2,rdm,sb,sha2,sha3,ssbs,vh")
cargo:rerun-if-env-changed=CXXFLAGS_aarch64-apple-darwin
CXXFLAGS_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=CXXFLAGS_aarch64_apple_darwin
CXXFLAGS_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_CXXFLAGS
HOST_CXXFLAGS = None
cargo:rerun-if-env-changed=CXXFLAGS
CXXFLAGS = None
running: env -u IPHONEOS_DEPLOYMENT_TARGET "c++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-std=c++17" "-Wall" "-Wextra" "-D_XOPEN_SOURCE=600" "-o" "/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/llama.o" "-c" "llama.cpp/llama.cpp"
cargo:warning=llama.cpp/llama.cpp:1179:34: error: use of undeclared identifier 'RLIMIT_MEMLOCK'

cargo:warning= if (suggest && getrlimit(RLIMIT_MEMLOCK, &lock_limit)) {

cargo:warning= ^

cargo:warning=1 error generated.

exit status: 1

--- stderr

error occurred: Command env -u IPHONEOS_DEPLOYMENT_TARGET "c++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-std=c++17" "-Wall" "-Wextra" "-D_XOPEN_SOURCE=600" "-o"

"/Users/odd/Documents/odd_LLM_rust/llamaload/target/release/build/llama-cpp-sys-2-809516a64fd910ba/out/llama.cpp/llama.o" "-c" "llama.cpp/llama.cpp" with args "c++" did not execute successfully (status code exit status: 1).

add windows to Test CI

We currently test on Linux, and only check that the project builds on Windows.

This workflow allows Linux devs to break tests for Windows folks (see #133).

We should add Windows to the test CI action.

Fails to compile on Windows with u32 != i32

Compiling on Windows results in lots of type mismatch errors like this:

error[E0308]: mismatched types
   --> llama-cpp-2\src\model.rs:335:11
    |
335 |     BPE = llama_cpp_sys_2::LLAMA_VOCAB_TYPE_BPE,
    |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `u32`, found `i32`

error[E0308]: mismatched types
   --> llama-cpp-2\src\model.rs:337:11
    |
337 |     SPM = llama_cpp_sys_2::LLAMA_VOCAB_TYPE_SPM,
    |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `u32`, found `i32`

...

error[E0308]: mismatched types
   --> llama-cpp-2\src\grammar.rs:410:28
    |
410 |                 ']' => Ok((u32::from(']'), rest)),
    |                            ^^^^^^^^^^^^^^ expected `i32`, found `u32`

error[E0308]: mismatched types
   --> llama-cpp-2\src\grammar.rs:414:17
    |
414 |             Ok((u32::from(c), &rest[c.len_utf8()..]))
    |                 ^^^^^^^^^^^^ expected `i32`, found `u32`

The types generated for enums look like pub type llama_vocab_type = ::std::os::raw::c_int;, which maps to i32, while this library expects those values to be u32.

This is essentially an issue with bindgen generating the native underlying type for enums, which on Windows is different: rust-lang/rust-bindgen#1966

I worked around it locally by adding `as _` in a couple of places, which works fine, but I'm unsure whether there's a better approach.
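
For reference, the `as _` workaround looks roughly like this in the grammar.rs-style locations. This is a self-contained sketch; the i32 return type simply stands in for the bindgen-generated type on Windows:

// Sketch of the `as _` workaround. On Windows the bindgen enum type is
// c_int (i32), elsewhere it is u32; `as _` lets the compiler pick whichever
// integer type the surrounding code expects instead of hard-coding one.
#[allow(dead_code)]
fn bracket_token() -> i32 {
    // i32 here mimics the Windows-generated type; the same expression
    // compiles unchanged where the expected type is u32 on other platforms.
    u32::from(']') as _
}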

On an M1 Max I get 12 tokens/s, while in oobabooga I get 34 tokens/s... ( same model and settings )

Hi Marcus,

Amazing you fixed the compile bug so fast...thank you !!

I now got it to work and looking at speed...

..seems I can't get the same performance as in oobabooga ( which should be comparable, as it also uses llama.cpp ). I've matched the context window, and the only difference seems to be n_threads ( which I could not find a way to change ).

On default settings I get n_gpu_layers = 0 when it should be auto-set to 1000... and if I force it to 1 or 1000, it makes no difference: still 12 tokens/s instead of 34.
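
For reference, the wrapper does let you request offload explicitly via the model params builder (the same call used in the code later in this issue). A minimal sketch, with 1000 chosen to mirror the CUDA example's value:

use llama_cpp_2::model::params::LlamaModelParams;

// Sketch: ask for (up to) 1000 layers to be offloaded instead of relying on
// the default, which on this build reportedly left n_gpu_layers at 0.
#[allow(dead_code)]
fn offloaded_params() -> LlamaModelParams {
    LlamaModelParams::default().with_n_gpu_layers(1000)
}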

Here's the oobabooga stats + vanilla stats from running my compile of llama-cpp-rs...

-- oobabooga --

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from models/toppy-m-7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = undi95_toppy-m-7b
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name = undi95_toppy-m-7b
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 4893.70 MiB, ( 4893.77 / 49152.00)
llm_load_tensors: system memory used = 4893.10 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/odd/Documents/oddabooga/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 512.00 MiB, ( 5407.33 / 49152.00)
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, ( 5407.34 / 49152.00)
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 291.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 288.02 MiB, ( 5695.34 / 49152.00)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
01:37:55-269296 INFO LOADER: llama.cpp
01:37:55-269801 INFO TRUNCATION LENGTH: 4096
01:37:55-270136 INFO INSTRUCTION TEMPLATE: Alpaca
01:37:55-270440 INFO Loaded the model in 0.14 seconds.

llama_print_timings: load time = 316.03 ms
llama_print_timings: sample time = 34.69 ms / 425 runs ( 0.08 ms per token, 12250.66 tokens per second)
llama_print_timings: prompt eval time = 315.94 ms / 38 tokens ( 8.31 ms per token, 120.27 tokens per second)
llama_print_timings: eval time = 11205.65 ms / 424 runs ( 26.43 ms per token, 37.84 tokens per second)
llama_print_timings: total time = 12201.13 ms

-->> Output generated in 12.42 seconds (34.13 tokens/s, 424 tokens, context 38, seed 1944046193) <<--

-- llama-cpp-rs --

Model params...
LlamaModelParams { params: llama_model_params { n_gpu_layers: 0, split_mode: 1, main_gpu: 0, tensor_split: 0x0, progress_callback: None, progress_callback_user_data: 0x0, kv_overrides: 0x0, vocab_only: false, use_mmap: true, use_mlock: false } }
..END
llama_model_params { n_gpu_layers: 0, split_mode: 1, main_gpu: 0, tensor_split: 0x0, progress_callback: None, progress_callback_user_data: 0x0, kv_overrides: 0x0, vocab_only: false, use_mmap: true, use_mlock: false }
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/odd/Documents/odd_LLM_rust/llama-cpp-rs-odd/target/release/model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = undi95_toppy-m-7b
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name = undi95_toppy-m-7b
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 4892.99 MiB
...................................................................................................
Loaded "/Users/odd/Documents/odd_LLM_rust/llama-cpp-rs-odd/target/release/model.gguf"
Context params...
LlamaContextParams { context_params: llama_context_params { seed: 1234, n_ctx: 4096, n_batch: 512, n_threads: 4, n_threads_batch: 4, rope_scaling_type: -1, rope_freq_base: 0.0, rope_freq_scale: 0.0, yarn_ext_factor: -1.0, yarn_attn_factor: 1.0, yarn_beta_fast: 32.0, yarn_beta_slow: 1.0, yarn_orig_ctx: 0, cb_eval: None, cb_eval_user_data: 0x0, type_k: 1, type_v: 1, mul_mat_q: true, logits_all: false, embedding: false, offload_kqv: true } }
..END
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU input buffer size = 16.02 MiB
llama_new_context_with_model: CPU compute buffer size = 308.00 MiB
llama_new_context_with_model: graph splits (measure): 1
n_len = 512, n_ctx = 4096, k_kv_req = 512

Hello my name is Katie and I am a 20 year old student studying English Literature at the University of Manchester. I have been a vegetarian for 10 years and a vegan for 2 years. I am passionate about animal rights and environmental issues and I hope to use my degree to make a difference in the world. I love to read, write, travel and spend time with my friends and family. I am also a big fan of yoga and meditation, which help me to stay grounded and focused. I am looking forward to sharing my experiences and learning from others on this journey. Namaste.

decoded 123 tokens in 10.09 s, speed 12.20 t/s

load time = 523.95 ms
sample time = 14.02 ms / 124 runs (0.11 ms per token, 8845.77 tokens per second)

-->> prompt eval time = 481.84 ms / 5 tokens (96.37 ms per token, 10.38 tokens per second) <<--

eval time = 10062.00 ms / 123 runs (81.80 ms per token, 12.22 tokens per second)
total time = 10608.13 ms

Alternative for use of unstable library feature 'ptr_from_ref'

Using llama-cpp-2 = "0.1.45" as a dependency requires nightly Rust for the unstable feature 'ptr_from_ref'.

Example build error:

error[E0658]: use of unstable library feature 'ptr_from_ref'
   --> /Users/jeadie/.cargo/registry/src/index.crates.io-6f17d22bba15001f/llama-cpp-2-0.1.45/src/token/data_array.rs:362:22
    |
362 |         let mu_ptr = ptr::from_mut(mu);
    |                      ^^^^^^^^^^^^^
    |
    = note: see issue #106116 <https://github.com/rust-lang/rust/issues/106116> for more information

Is there any simple alternative for the underlying DataArray implementation? One alternative I have verified is at Jeadie/llama-cpp-rs#c271be.
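
One stable-Rust alternative is to replace ptr::from_mut with a plain reference-to-raw-pointer cast. A minimal sketch of that approach (the fork linked above may do something different):

// Nightly (unstable `ptr_from_ref` at the time of this report):
//     let mu_ptr = std::ptr::from_mut(mu);
// Stable: an ordinary reference-to-raw-pointer cast does the same job.
#[allow(dead_code)]
fn mu_ptr(mu: &mut f32) -> *mut f32 {
    mu as *mut f32
}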

Faster Embeddings

Great crate!

I was able to speed up embeddings by making the following changes:

  1. expose n_ubatch
  2. set n_ubatch and n_batch to 2048
  3. initialize the LlamaBatch with n_tokens = 2048
  4. update line 65 to check against the n_batch size instead of n_ctx (details and a sketch below)

Line 65 - if (batch.n_tokens() as usize + tokens.len()) > n_ctx {

This needs to be n_batch and not n_ctx ( you can refer to the original llama.cpp embedding example -
https://github.com/ggerganov/llama.cpp/blob/master/examples/embedding/embedding.cpp, line 164: if (batch.n_tokens + n_toks > n_batch) { ).
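
A sketch of the corrected check from item 4, using the batch API names that already appear in this repository's example code; treat the helper itself (and the 2048 figure suggested above) as illustrative rather than the crate's current code:

use llama_cpp_2::llama_batch::LlamaBatch;
use llama_cpp_2::token::LlamaToken;

// Illustrative only: the batch should be flushed (decoded and cleared) when
// adding `tokens` would exceed the *batch* capacity, not the context size.
#[allow(dead_code)]
fn fits_in_batch(batch: &LlamaBatch, tokens: &[LlamaToken], n_batch: usize) -> bool {
    (batch.n_tokens() as usize + tokens.len()) <= n_batch
}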

Match Llama.cpp default sampling ?

I'd like to automate a few tests to make sure a model works - ( with llama.cpp as a baseline )

Currently I can't seem to match Llama.cpp's answer... ( llama-cpp-rs answers incorrectly )

..trying the llama-cpp-rs example OR my modified version ( see below )

--

..as a reference - Oobabooga using the same model gets the correct answer.

( not exactly the same, but logically correct - like llama.cpp )

--

I presume this is down to llama-cpp-rs not yet having the same sampling chain?

( we don't seem to have CFG - maybe I'm using sample greedy / sample stages / something else the wrong way )

--

That said...

Question...
Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?

Llama-cpp-rs answer... ..close but incorrect

Let's compare the cost of each type of berry:

1. Blueberries cost more than strawberries.
2. Blueberries cost less than raspberries.

From the first statement, we know that blueberries are more expensive than strawberries. From the second statement, we know that blueberries are cheaper than raspberries.

To determine if the third statement, "Raspberries cost more than strawberries and blueberries," is true, we need to compare the cost of raspberries to both strawberries and blueberries.

Since blueberries are cheaper than raspberries, but more expensive than strawberries, and we don't have enough information to compare the cost of raspberries to strawberries directly, we cannot definitively say whether the third statement is true or false based on the given information.

---> Therefore, the answer is: Insufficient information to determine.

Llama.cpp answer... ...correct

Let's compare the prices of each type of berry:
1. Blueberries cost more than strawberries.
2. Blueberries cost less than raspberries.

To determine if the third statement "Raspberries cost more than strawberries and blueberries" is true, we need to compare the price of raspberries with both strawberries and blueberries:

1. Raspberries cost more than strawberries: This is not stated directly in the given information, but it can be inferred from statement 1 (blueberries cost less than raspberries, and blueberries cost more than strawberries).
2. Raspberries cost more than blueberries: This is stated directly in the second statement.

Therefore, based on the given information, 

---> the third statement "Raspberries cost more than strawberries and blueberries" is true. [end of text]

The model
TheBloke/Mistral-7B-Instruct-v0.2-GGUF --> mistral-7b-instruct-v0.2.Q4_K_S.gguf

Default Llama.cpp sample order...
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature

Sample settings...
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.100
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000

The code... ( please forgive my Rust - only rusted for two months... )

   let model = init_model()?;
    let backend = LlamaBackend::init()?;
    let ctx_params = init_context()?;
    run_prompt("Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?", &model, &backend, &ctx_params)?;


Calling the following...

//! This is a translation of simple.cpp in llama.cpp using llama-cpp-2 -- with additional sample stages
#![allow(
    clippy::cast_possible_wrap,
    clippy::cast_possible_truncation,
    clippy::cast_precision_loss,
    clippy::cast_sign_loss
)]

use anyhow::{/* anyhow,*/ bail, Context, Result};
use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::ggml_time_us;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::llama_batch::LlamaBatch;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::model::AddBos;
use llama_cpp_2::model::LlamaModel;
use llama_cpp_2::token::data_array::LlamaTokenDataArray;

use llama_cpp_2::token::LlamaToken;

use std::io::Write;
use std::num::NonZeroU32;
use std::time::Duration;

pub fn init_model() -> Result<LlamaModel> {
    let backend = LlamaBackend::init()?;
    let model_params = LlamaModelParams::default()
        .with_n_gpu_layers(33)
        .with_use_mlock(false);
        //.with_use_mlock(true);

    let model_path = std::env::current_exe()
        .expect("Failed to get current executable path")
        .parent()
        .expect("Failed to get executable directory")
        .read_dir()
        .expect("Failed to read directory contents")
        .filter_map(|entry| entry.ok())
        .find(|entry| entry.path().extension().and_then(std::ffi::OsStr::to_str) == Some("gguf"))
        .expect("No .gguf file found in the current directory")
        .path();

    let model = LlamaModel::load_from_file(&backend, &model_path, &model_params)
        .with_context(|| "unable to load model")?;

    Ok(model)
}

pub fn init_context() -> Result<LlamaContextParams> {
    let ctx_params = LlamaContextParams::default()
        .with_n_ctx(NonZeroU32::new(2048))
        .with_seed(1234);

    Ok(ctx_params)
}

pub fn run_prompt(prompt: &str, model: &LlamaModel, backend: &LlamaBackend, ctx_params: &LlamaContextParams) -> Result<()> {
    let n_len = 512;

    let mut ctx = model
        .new_context(backend, ctx_params.clone())
        .with_context(|| "unable to create the llama_context")?;

    let tokens_list = model
        .str_to_token(prompt, AddBos::Always)
        .with_context(|| format!("failed to tokenize {prompt}"))?;

    let n_cxt = ctx.n_ctx() as i32;
    let n_kv_req = tokens_list.len() as i32 + (n_len - tokens_list.len() as i32);

    eprintln!("n_len = {n_len}, n_ctx = {n_cxt}, k_kv_req = {n_kv_req}");

    if n_kv_req > n_cxt {
        bail!(
            "n_kv_req > n_ctx, the required kv cache size is not big enough
either reduce n_len or increase n_ctx"
        )
    }

    if tokens_list.len() >= usize::try_from(n_len)? {
        bail!("the prompt is too long, it has more tokens than n_len")
    }

    // print the prompt token-by-token
    eprintln!();

    for token in &tokens_list {
        eprint!("{}", model.token_to_str(*token)?);
    }

    std::io::stderr().flush()?;

    // create a llama_batch with size 512
    // we use this object to submit token data for decoding
    let mut batch = LlamaBatch::new(512, 1);

    let last_index: i32 = (tokens_list.len() - 1) as i32;
    for (i, token) in (0_i32..).zip(tokens_list.into_iter()) {
        // llama_decode will output logits only for the last token of the prompt
        let is_last = i == last_index;
        batch.add(token, i, &[0], is_last)?;
    }

    ctx.decode(&mut batch)
        .with_context(|| "llama_decode() failed")?;

    // main loop

    let mut n_cur = batch.n_tokens();
    let mut n_decode = 0;

    let t_main_start = ggml_time_us();

    while n_cur <= n_len {
        let candidates = ctx.candidates_ith(batch.n_tokens() - 1);

        let mut candidates_p = LlamaTokenDataArray::from_iter(candidates, false);

            // Llama.cpp default sample order...
            // CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
            // --------------------------------------------------------------------------------
            // Sample settings... 
            //repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
            // top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.100
            //mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000

            //CFG seems we don't have it ?? ( only in llama.cpp )

            // Penalties
            let history = vec![
                LlamaToken::new(2),
                LlamaToken::new(1),
                LlamaToken::new(0),
                ];

            ctx.sample_repetition_penalty(&mut candidates_p, &history, 64, 1.1,
                0.0, 0.0);

      
            ctx.sample_top_k(&mut candidates_p, 40, 1); 

            ctx.sample_tail_free(&mut candidates_p, 1.0, 1); 

            ctx.sample_typical(&mut candidates_p, 1.0, 1);

            ctx.sample_top_p(&mut candidates_p, 0.950, 1);

            ctx.sample_min_p(&mut candidates_p, 0.05, 1);

            ctx.sample_temp(&mut candidates_p, 0.1);

            let new_token_id = ctx.sample_token_greedy(candidates_p);

        if new_token_id == model.token_eos() {
            eprintln!();
            break;
        }

        print!("{}", model.token_to_str(new_token_id)?);
        std::io::stdout().flush()?;

        batch.clear();
        batch.add(new_token_id, n_cur, &[0], true)?;

        n_cur += 1;

        ctx.decode(&mut batch).with_context(|| "failed to eval")?;

        n_decode += 1;
    }

    eprintln!("\n");

    let t_main_end = ggml_time_us();

    let duration = Duration::from_micros((t_main_end - t_main_start) as u64);

    eprintln!(
        "decoded {} tokens in {:.2} s, speed {:.2} t/s\n",
        n_decode,
        duration.as_secs_f32(),
        n_decode as f32 / duration.as_secs_f32()
    );

    println!("{}", ctx.timings());

    Ok(())
}



Llama.cpp full log

./main -p "Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?" -m mistral-7b-instruct-v0.2.Q4_K_S.gguf -n 512 -ngl 33 --threads 8 --temp 0.1

Log start
main: build = 2409 (306d34be)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed = 1710284006
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/odd/Documents/odd_LLM_rust/llama-cpp-rs-mod-odd/target/release/mistral-7b-instruct-v0.2.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 217 tensors
llama_model_loader: - type q5_K: 8 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attm = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 3877.58 MiB, ( 3877.64 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 3877.57 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/odd/Documents/odd_LLM_rust/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 64.00 MiB, ( 3943.45 / 49152.00)
llama_kv_cache_init: Metal KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CPU input buffer size = 10.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 73.02 MiB, ( 4016.47 / 49152.00)
llama_new_context_with_model: Metal compute buffer size = 73.00 MiB
llama_new_context_with_model: CPU compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 2

system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.100
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 1

Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?

Let's compare the prices of each type of berry:

  1. Blueberries cost more than strawberries.
  2. Blueberries cost less than raspberries.

To determine if the third statement "Raspberries cost more than strawberries and blueberries" is true, we need to compare the price of raspberries with both strawberries and blueberries:

  1. Raspberries cost more than strawberries: This is not stated directly in the given information, but it can be inferred from statement 1 (blueberries cost less than raspberries, and blueberries cost more than strawberries).
  2. Raspberries cost more than blueberries: This is stated directly in the second statement.

Therefore, based on the given information, the third statement "Raspberries cost more than strawberries and blueberries" is true. [end of text]

llama_print_timings: load time = 252.89 ms
llama_print_timings: sample time = 15.60 ms / 184 runs ( 0.08 ms per token, 11797.14 tokens per second)
llama_print_timings: prompt eval time = 164.76 ms / 43 tokens ( 3.83 ms per token, 260.98 tokens per second)
llama_print_timings: eval time = 3579.66 ms / 183 runs ( 19.56 ms per token, 51.12 tokens per second)
llama_print_timings: total time = 3782.13 ms / 226 tokens
ggml_metal_free: deallocating
Log end


LLAMA-CPP-RS - original example - full log

./llama-cpp-rs --n-len 512 "Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?" local mistral-7b-instruct-v0.2.Q4_K_S.gguf

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/odd/Documents/odd_LLM_rust/llama-cpp-rs-odd/target/release/mistral-7b-instruct-v0.2.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 217 tensors
llama_model_loader: - type q5_K: 8 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 3877.58 MiB, ( 3877.64 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 3877.57 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 256.00 MiB, ( 4135.45 / 49152.00)
llama_kv_cache_init: Metal KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU input buffer size = 13.02 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 164.02 MiB, ( 4299.47 / 49152.00)
llama_new_context_with_model: Metal compute buffer size = 164.00 MiB
llama_new_context_with_model: CPU compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
n_len = 512, n_ctx = 2048, k_kv_req = 512

Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?

Let's compare the cost of each type of berry:

  1. Blueberries cost more than strawberries.
  2. Blueberries cost less than raspberries.

From the first statement, we know that blueberries are more expensive than strawberries. From the second statement, we know that blueberries are cheaper than raspberries.

To determine if the third statement, "Raspberries cost more than strawberries and blueberries," is true, we need to compare the cost of raspberries to both strawberries and blueberries.

Since blueberries are cheaper than raspberries, but more expensive than strawberries, and we don't have enough information to compare the cost of raspberries to strawberries directly, we cannot definitively say whether the third statement is true or false based on the given information.

decoded 177 tokens in 3.46 s, speed 51.15 t/s

load time = 350.73 ms
sample time = 20.21 ms / 178 runs (0.11 ms per token, 8805.34 tokens per second)
prompt eval time = 291.05 ms / 43 tokens (6.77 ms per token, 147.74 tokens per second)
eval time = 3437.89 ms / 177 runs (19.42 ms per token, 51.49 tokens per second)
total time = 3810.63 ms
ggml_metal_free: deallocating


LLAMA-CPP-RS modified example full log

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/odd/Documents/odd_LLM_rust/llama-cpp-rs-mod-odd/target/release/mistral-7b-instruct-v0.2.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 217 tensors
llama_model_loader: - type q5_K: 8 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 3877.58 MiB, ( 3877.64 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 3877.57 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 256.00 MiB, ( 4135.45 / 49152.00)
llama_kv_cache_init: Metal KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU input buffer size = 13.02 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 164.02 MiB, ( 4299.47 / 49152.00)
llama_new_context_with_model: Metal compute buffer size = 164.00 MiB
llama_new_context_with_model: CPU compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
n_len = 512, n_ctx = 2048, k_kv_req = 512

Blueberries cost more than strawberries. Blueberries cost less than raspberries. Raspberries cost more than strawberries and blueberries. If the first two statements are true, the third statement is?

Let's compare the cost of each type of berry:

  1. Blueberries cost more than strawberries.
  2. Blueberries cost less than raspberries.

From the first statement, we know that blueberries are more expensive than strawberries. From the second statement, we know that blueberries are cheaper than raspberries.

To determine if the third statement, "Raspberries cost more than strawberries and blueberries," is true, we need to compare the cost of raspberries to both strawberries and blueberries.

Since blueberries are cheaper than raspberries, but more expensive than strawberries, and we don't have enough information to compare the cost of raspberries to strawberries directly, we cannot definitively say whether the third statement is true or false based on the given information.

Therefore, the answer is: Insufficient information to determine.

decoded 192 tokens in 3.74 s, speed 51.36 t/s

load time = 379.33 ms
sample time = 14.58 ms / 193 runs (0.08 ms per token, 13238.22 tokens per second)
prompt eval time = 293.73 ms / 43 tokens (6.83 ms per token, 146.39 tokens per second)
eval time = 3720.89 ms / 192 runs (19.38 ms per token, 51.60 tokens per second)
total time = 4116.71 ms
ggml_metal_free: deallocating

add mac to CI

I imagine #58 will not be the last time we fail to compile on Mac due to some change in llama.cpp or our own code. Ideally this should be added to CI, similar to how arm64 and amd64 are covered right now.

Build fails with `error: statement may not appear in a constexpr function`

My build fails with `error: statement may not appear in a constexpr function`.
I'm using llama-cpp-2 = "0.1.31" with the cublas feature enabled.

Cargo Output
  process didn't exit successfully: `/home/chrono/Repos/raven-ai/target/release/build/llama-cpp-sys-2-ec23417049fcda47/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed=llama.cpp
  cargo:rustc-link-lib=cuda
  cargo:rustc-link-lib=cublas
  cargo:rustc-link-lib=culibos
  cargo:rustc-link-lib=cudart
  cargo:rustc-link-lib=cublasLt
  cargo:rustc-link-lib=pthread
  cargo:rustc-link-lib=dl
  cargo:rustc-link-lib=rt
  OPT_LEVEL = Some("3")
  TARGET = Some("x86_64-unknown-linux-gnu")
  HOST = Some("x86_64-unknown-linux-gnu")
  cargo:rerun-if-env-changed=CC_x86_64-unknown-linux-gnu
  CC_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CC_x86_64_unknown_linux_gnu
  CC_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = Some("clang")
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CFLAGS_x86_64-unknown-linux-gnu
  CFLAGS_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_unknown_linux_gnu
  CFLAGS_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None
  cargo:rustc-link-lib=culibos
  cargo:rustc-link-lib=pthread
  cargo:rustc-link-lib=dl
  cargo:rustc-link-lib=rt
  cargo:rustc-link-search=native=/usr/local/cuda/lib64
  cargo:rerun-if-env-changed=CXX_x86_64-unknown-linux-gnu
  CXX_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXX_x86_64_unknown_linux_gnu
  CXX_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXX
  HOST_CXX = None
  cargo:rerun-if-env-changed=CXX
  CXX = Some("clang++")
  cargo:rerun-if-env-changed=NVCC_x86_64-unknown-linux-gnu
  NVCC_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=NVCC_x86_64_unknown_linux_gnu
  NVCC_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_NVCC
  HOST_NVCC = None
  cargo:rerun-if-env-changed=NVCC
  NVCC = None
  cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
  cargo:warning=Compiler version doesn't include clang or GCC: "nvcc" "--version"
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64-unknown-linux-gnu
  CXXFLAGS_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64_unknown_linux_gnu
  CXXFLAGS_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXXFLAGS
  HOST_CXXFLAGS = None
  cargo:rerun-if-env-changed=CXXFLAGS
  CXXFLAGS = None
  compiling ggml-cuda
  cargo:warning=nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
  cargo:warning=/sbin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/stl_algobase.h(1521): error: statement may not appear in a constexpr function
  cargo:warning=        const int __sz = sizeof(+__n);
  cargo:warning=        ^
  cargo:warning=
  cargo:warning=/sbin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/stl_algobase.h(1522): error: statement may not appear in a constexpr function
  cargo:warning=        int __w = __sz * 8 - 1;
  cargo:warning=        ^
  cargo:warning=
  cargo:warning=/sbin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/stl_algobase.h(1524): error: statement may not appear in a constexpr function
  cargo:warning=   __w -= __builtin_clzll(+__n);
  cargo:warning=   ^
  cargo:warning=
  cargo:warning=/sbin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/stl_algobase.h(1526): error: statement may not appear in a constexpr function
  cargo:warning=   __w -= __builtin_clzl(+__n);
  cargo:warning=   ^
  cargo:warning=
  cargo:warning=/sbin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/stl_algobase.h(1528): error: statement may not appear in a constexpr function
  cargo:warning=   __w -= __builtin_clz(+__n);
  cargo:warning=   ^
  cargo:warning=
  cargo:warning=/sbin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/stl_algobase.h(1527): error: statement may not appear in a constexpr function
  cargo:warning=        else if (__sz == sizeof(int))
  cargo:warning=             ^
  cargo:warning=
  cargo:warning=/sbin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/stl_algobase.h(1525): error: statement may not appear in a constexpr function
  cargo:warning=        else if (__sz == sizeof(long))
  cargo:warning=             ^
  cargo:warning=
  cargo:warning=/sbin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/stl_algobase.h(1523): error: statement may not appear in a constexpr function
  cargo:warning=        if (__sz == sizeof(long long))
  cargo:warning=        ^
  cargo:warning=
  cargo:warning=8 errors detected in the compilation of "llama.cpp/ggml-cuda.cu".

  --- stderr


  error occurred: Command "nvcc" "-ccbin=clang++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-m64" "-Xcompiler" "--target=x86_64-unknown-linux-gnu" "-Xcompiler" "-std=c++11" "-I" "llama.cpp" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-arch=all" "-std=c++11" "-DGGML_USE_CUBLAS" "-o" "/home/chrono/Repos/raven-ai/target/release/build/llama-cpp-sys-2-01d14c8a85775975/out/239022a9b6fc5d15-ggml-cuda.o" "-c" "llama.cpp/ggml-cuda.cu" with args "nvcc" did not execute successfully (status code exit status: 2).
nvcc
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
clang++
$ clang --version
clang version 16.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /sbin

Fails to build not finding llama.cpp (even when forced / put into place?)

I am not sure why it can't set up llama.cpp for me...

chris@earth llama-cpp-rs % cargo clean
     Removed 466 files, 185.7MiB total
chris@earth llama-cpp-rs % cargo build
   Compiling proc-macro2 v1.0.74
   Compiling unicode-ident v1.0.11
   Compiling libc v0.2.150
   Compiling glob v0.3.1
   Compiling prettyplease v0.2.12
   Compiling memchr v2.6.3
   Compiling regex-syntax v0.8.2
   Compiling minimal-lexical v0.2.1
   Compiling cfg-if v1.0.0
   Compiling bindgen v0.69.2
   Compiling either v1.9.0
   Compiling lazycell v1.3.0
   Compiling shlex v1.3.0
   Compiling bitflags v2.4.0
   Compiling log v0.4.20
   Compiling peeking_take_while v0.1.2
   Compiling rustc-hash v1.1.0
   Compiling lazy_static v1.4.0
   Compiling thiserror v1.0.56
   Compiling once_cell v1.18.0
   Compiling pin-project-lite v0.2.13
   Compiling libloading v0.7.4
   Compiling tracing-core v0.1.32
   Compiling clang-sys v1.6.1
   Compiling nom v7.1.3
   Compiling quote v1.0.35
   Compiling syn v2.0.46
   Compiling which v4.4.0
   Compiling cc v1.0.83
   Compiling regex-automata v0.4.3
   Compiling cexpr v0.6.0
   Compiling regex v1.10.2
   Compiling tracing-attributes v0.1.27
   Compiling thiserror-impl v1.0.56
   Compiling tracing v0.1.40
   Compiling llama-cpp-sys-2 v0.1.22 (/Users/chris/code/GAIB2.0/rsllm/llama-cpp-rs/llama-cpp-sys-2)
The following warnings were emitted during compilation:

warning: [email protected]: clang: error: no such file or directory: 'llama.cpp/ggml.c'
warning: [email protected]: clang: error: no input files

error: failed to run custom build command for `llama-cpp-sys-2 v0.1.22 (/Users/chris/code/GAIB2.0/rsllm/llama-cpp-rs/llama-cpp-sys-2)`

Caused by:
  process didn't exit successfully: `/Users/chris/code/GAIB2.0/rsllm/llama-cpp-rs/target/debug/build/llama-cpp-sys-2-e3d44ca9734b4442/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed=llama.cpp
  compiling ggml
  TARGET = Some("aarch64-apple-darwin")
  OPT_LEVEL = Some("0")
  HOST = Some("aarch64-apple-darwin")
  cargo:rerun-if-env-changed=CC_aarch64-apple-darwin
  CC_aarch64-apple-darwin = None
  cargo:rerun-if-env-changed=CC_aarch64_apple_darwin
  CC_aarch64_apple_darwin = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some("true")
  CARGO_CFG_TARGET_FEATURE = Some("aes,crc,dit,dotprod,dpb,dpb2,fcma,fhm,flagm,fp16,frintts,jsconv,lor,lse,neon,paca,pacg,pan,pmuv3,ras,rcpc,rcpc2,rdm,sb,sha2,sha3,ssbs,vh")
  cargo:rerun-if-env-changed=CFLAGS_aarch64-apple-darwin
  CFLAGS_aarch64-apple-darwin = None
  cargo:rerun-if-env-changed=CFLAGS_aarch64_apple_darwin
  CFLAGS_aarch64_apple_darwin = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None
  running: env -u IPHONEOS_DEPLOYMENT_TARGET "cc" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-gdwarf-2" "-fno-omit-frame-pointer" "-arch" "arm64" "-std=c17" "-Wall" "-Wextra" "-DGGML_USE_K_QUANTS" "-o" "/Users/chris/code/GAIB2.0/rsllm/llama-cpp-rs/target/debug/build/llama-cpp-sys-2-6cab0eaaf45c8367/out/llama.cpp/ggml.o" "-c" "llama.cpp/ggml.c"
  cargo:warning=clang: error: no such file or directory: 'llama.cpp/ggml.c'

  cargo:warning=clang: error: no input files

  exit status: 1

  --- stderr


  error occurred: Command env -u IPHONEOS_DEPLOYMENT_TARGET "cc" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-gdwarf-2" "-fno-omit-frame-pointer" "-arch" "arm64" "-std=c17" "-Wall" "-Wextra" "-DGGML_USE_K_QUANTS" "-o" "/Users/chris/code/GAIB2.0/rsllm/llama-cpp-rs/target/debug/build/llama-cpp-sys-2-6cab0eaaf45c8367/out/llama.cpp/ggml.o" "-c" "llama.cpp/ggml.c" with args "cc" did not execute successfully (status code exit status: 1).

It always makes an empty llama.cpp directory...

find . -iname llama.cpp
./target/debug/build/llama-cpp-sys-2-6cab0eaaf45c8367/out/llama.cpp
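
The empty llama.cpp directory in the build output is the usual sign that the llama.cpp git submodule was never checked out (for example after cloning without --recursive); `git submodule update --init --recursive` normally fixes it. As a sketch of how the build script could fail earlier with a clearer message, here is a hypothetical guard at the top of build.rs (not the current behaviour):

    // Hypothetical guard for build.rs: fail with an actionable message when the
    // llama.cpp submodule is missing, instead of handing clang a non-existent file.
    use std::path::Path;

    fn main() {
        let ggml = Path::new("llama.cpp/ggml.c");
        if !ggml.exists() {
            panic!(
                "could not find {}; the llama.cpp submodule is probably missing, \
                 run `git submodule update --init --recursive`",
                ggml.display()
            );
        }
        // ... the rest of the existing build logic would follow here ...
    }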

Replicate llama.cpp default settings...

When compiling llama.cpp "out of the box" and prompting it as follows (in this case on a Mac M1)...

./main -p "Write a rhyme haiku about a rabbit and a cube." -m llama-2-7b-chat.Q4_0.gguf -n 128 -ngl 33 --mlock --threads 8

We can see that llama.cpp uses the following sampling settings and order:

sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000

sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 1

The ability to replicate these settings and the sampling order would be very useful when comparing results with llama.cpp.

Also, several of these are key to adjusting LLM behaviour, such as temperature and the repeat penalty.
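
As a reference point, here is a plain-Rust sketch (not part of llama-cpp-2) that simply records the defaults printed above, so a caller can compare them against whatever sampling configuration they wire up:

    // Plain mirror of the llama.cpp `main` defaults shown above; field names
    // follow the printed setting names and the values are copied verbatim.
    #[derive(Debug, Clone)]
    struct MainSamplingDefaults {
        repeat_last_n: i32,
        repeat_penalty: f32,
        frequency_penalty: f32,
        presence_penalty: f32,
        top_k: i32,
        tfs_z: f32,
        top_p: f32,
        min_p: f32,
        typical_p: f32,
        temp: f32,
        mirostat: i32,
        mirostat_lr: f32,
        mirostat_ent: f32,
    }

    impl Default for MainSamplingDefaults {
        fn default() -> Self {
            Self {
                repeat_last_n: 64,
                repeat_penalty: 1.10,
                frequency_penalty: 0.0,
                presence_penalty: 0.0,
                top_k: 40,
                tfs_z: 1.0,
                top_p: 0.95,
                min_p: 0.05,
                typical_p: 1.0,
                temp: 0.80,
                mirostat: 0,
                mirostat_lr: 0.1,
                mirostat_ent: 5.0,
            }
        }
    }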

ggml_metal_init: ggml-common.h not found

I updated to the latest version of the library, as I needed the command-r architecture support, but the current crates.io and main versions crash on macOS because metal_hack breaks with the latest version of llama.cpp.
The culprit is ggml-common.h, which is not available to the bundled shader. I have tried replacing the .h with its actual content before putting it inside the .m loader, but it's not that simple and is not going to be maintainable at all.

ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:3:10: fatal error: 'ggml-common.h' file not found

I saw on the llama.cpp issues that this could be fixed by having the default.metallib built by the CMake project, but this would imply modifying the current build.rs heavily, and I have no CUDA-compatible machine.

It seems LlamaTokenAttr should be a bitflags type

Running Phi3 with the simple example is failing with:

thread 'main' panicked at llama-cpp-2/src/model.rs:243:52:
token type is valid: UnknownValue(264)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Looking at the code, it seems this token attr has both LLAMA_TOKEN_ATTR_CONTROL and LLAMA_TOKEN_ATTR_RSTRIP set. That is valid usage in C, so perhaps LlamaTokenAttr should be migrated to an enumflags-style variant?
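
A minimal sketch of that direction, using the bitflags crate. The bit positions are illustrative, inferred from the failing value 264 = 8 + 256 being two single-bit attributes combined; the authoritative values live in llama.h.

    use bitflags::bitflags;

    bitflags! {
        // Illustrative bit positions only: 264 = (1 << 3) | (1 << 8) is
        // consistent with CONTROL | RSTRIP; the real constants come from llama.h.
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub struct LlamaTokenAttrs: u32 {
            const CONTROL = 1 << 3;
            const RSTRIP  = 1 << 8;
        }
    }

    fn attrs_from_raw(raw: u32) -> Option<LlamaTokenAttrs> {
        // Accepts any combination of known bits instead of treating
        // CONTROL | RSTRIP as an "unknown value".
        LlamaTokenAttrs::from_bits(raw)
    }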

metal support

Currently we support CUDA as the only accelerator. Metal would be a nice addition and is very well supported by llama.cpp.

When using llama-cpp-rs alongside the whisper-rs crate in the same project, the application crashes during model loading.

LlamaModel::load_from_file(&backend, model_path, &model_params);

The issue still occurs even when the whisper model and llama model are not being executed at the same time.

The program output is as follows:

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = models--google--gemma-2b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          gemma.block_count u32              = 18
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 8
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 1
llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 13
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   37 tensors
llama_model_loader: - type q3_K:   72 tensors
llama_model_loader: - type q5_K:   54 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 1
llm_load_print_meta: n_layer          = 18
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q3_K - Large
llm_load_print_meta: model params     = 2.51 B
llm_load_print_meta: model size       = 1.36 GiB (4.66 BPW) 
llm_load_print_meta: general.name     = models--google--gemma-2b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.13 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  1391.95 MiB, ( 1392.56 / 10922.67)
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
 ELIFECYCLE  Command failed with exit code 1.

How to use GPU?

Even with
let model_params = LlamaModelParams::default().with_n_gpu_layers(10000);
Everything seems to run on the CPU; nothing gets run on the GPU. Do I need to do anything else to run the model on the GPU?
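
Most likely the crate was built without a GPU backend feature, in which case asking to offload layers has no effect. Below is a minimal sketch of the load path under that assumption (a GPU feature such as cublas, as mentioned in the build issue above, enabled at build time); module paths and signatures follow the bundled example and may differ between versions, and the model path is a placeholder.

    // Sketch only: assumes llama-cpp-2 was built with a GPU feature (e.g. cublas);
    // without one, n_gpu_layers is effectively ignored and everything stays on the CPU.
    use llama_cpp_2::llama_backend::LlamaBackend;
    use llama_cpp_2::model::params::LlamaModelParams;
    use llama_cpp_2::model::LlamaModel;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let backend = LlamaBackend::init()?;
        let model_params = LlamaModelParams::default().with_n_gpu_layers(1000);
        // "model.gguf" is a placeholder path.
        let _model = LlamaModel::load_from_file(&backend, "model.gguf", &model_params)?;
        // If offloading works, the llama.cpp log reports "offloaded N/N layers to GPU".
        Ok(())
    }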

Embeddings breaks on unknown strings

You can replicate the failure with this command after cloning the main branch.

cargo run --release --bin embeddings --package embeddings -- "The way to kill embeddings is giving tokens it can not convert back to string like தமிழ் తెలుగు for example." hf-model CompendiumLabs/bge-base-en-v1.5-gguf bge-base-en-v1.5-f16.gguf

Output:
...
Prompt 0
101 -->
1996 --> the
2126 --> way
2000 --> to
3102 --> kill
7861 --> em
8270 --> bed
4667 --> ding
2015 --> s
2003 --> is
3228 --> giving
19204 --> token
2015 --> s
2009 --> it
2064 --> can
2025 --> not
10463 --> convert
2067 --> back
2000 --> to
5164 --> string
2066 --> like
1385 --> த
29925 --> ம
Error: Unknown Token Type
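
One possible direction until the conversion itself is handled (a sketch, not the example's current code): instead of aborting the whole run on a single token that cannot be converted back to a string, substitute a marker and keep going. The helper below is hypothetical.

    // Hypothetical helper: render a token for logging even when converting it
    // back to a string fails, instead of propagating the error out of the example.
    fn token_display<E: std::fmt::Display>(id: i32, converted: Result<String, E>) -> String {
        match converted {
            Ok(text) => text,
            Err(err) => format!("<unprintable token {id}: {err}>"),
        }
    }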

Make public

There are some security concerns around our self-hosted runners being accessible from here.

Once these are addressed, publish this.

Add more sampling bindings

#108 mentions our pretty barren set of sampling options; we are missing the following bindings out of the ones used by default by main.cpp:

  • top_k
  • tfs_z
  • typical_p
  • top_p
  • min_p

These should more or less be duplicates of our current sampling implementations with minor tweaks.
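
For reference, here is a minimal sketch of what two of these samplers do over a plain (token, logit) list. This is illustrative semantics only; the real bindings would wrap llama.cpp's candidate-array functions rather than re-implementing them in Rust.

    // Illustrative semantics only, not the crate bindings.
    fn top_k(mut candidates: Vec<(u32, f32)>, k: usize) -> Vec<(u32, f32)> {
        candidates.sort_by(|a, b| b.1.total_cmp(&a.1)); // highest logit first
        candidates.truncate(k.max(1));
        candidates
    }

    fn top_p(mut candidates: Vec<(u32, f32)>, p: f32) -> Vec<(u32, f32)> {
        if candidates.is_empty() {
            return candidates;
        }
        candidates.sort_by(|a, b| b.1.total_cmp(&a.1));
        // Softmax over the logits, then keep the smallest prefix whose
        // cumulative probability reaches p.
        let max = candidates[0].1;
        let exps: Vec<f32> = candidates.iter().map(|c| (c.1 - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        let mut cumulative = 0.0;
        let mut keep = candidates.len();
        for (i, e) in exps.iter().enumerate() {
            cumulative += e / sum;
            if cumulative >= p {
                keep = i + 1;
                break;
            }
        }
        candidates.truncate(keep);
        candidates
    }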

MacOS - Build failing for missing symbol `unicode_cpt_type`

The latest main branch (0d59098) is failing (on Mac) with the error below. This is because unicode.cpp is not compiled as part of the llama-cpp-sys-2 crate. Submitting a PR to fix this.

          Undefined symbols for architecture arm64:
            "unicode_cpt_type(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)", referenced from:
                llm_tokenizer_bpe::bpe_gpt2_preprocess(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) in libllama_cpp_sys_2-8e30785903cc9911.rlib[12](239022a9b6fc5d15-llama.o)
...
            "unicode_cpts_normalize_nfd(std::__1::vector<unsigned int, std::__1::allocator<unsigned int>> const&)", referenced from:
                llm_tokenizer_wpm::preprocess(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) in libllama_cpp_sys_2-8e30785903cc9911.rlib[12](239022a9b6fc5d15-llama.o)
          ld: symbol(s) not found for architecture arm64
          clang: error: linker command failed with exit code 1 (use -v to see invocation)


error: could not compile `embeddings` (bin "embeddings") due to 1 previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `simple` (bin "simple") due to 1 previous error

Full Error trace: error.txt

API Coverage

This tracks progress on wanted and implemented parts of the llama.h header

convert `LlamaContextParams` to more opaque type

With the introduction of

    pub cb_eval: llama_cpp_sys_2::ggml_backend_sched_eval_callback,
    pub cb_eval_user_data: *mut std::ffi::c_void,

we now have to put in some effort to ensure LlamaContextParams is safe.

Task:

Change it to be more like the model params.
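
A rough sketch of the requested shape, mirroring the model params. Method names are illustrative, not a final API, and the raw struct name llama_context_params is assumed to match the bindgen output in llama-cpp-sys-2.

    // Keep the raw sys struct private and expose only builder-style setters,
    // so the unsafe invariants around cb_eval / cb_eval_user_data live in one place.
    pub struct LlamaContextParams {
        inner: llama_cpp_sys_2::llama_context_params,
    }

    impl LlamaContextParams {
        #[must_use]
        pub fn with_n_ctx(mut self, n_ctx: u32) -> Self {
            self.inner.n_ctx = n_ctx;
            self
        }

        #[must_use]
        pub fn with_cb_eval(
            mut self,
            cb: llama_cpp_sys_2::ggml_backend_sched_eval_callback,
            user_data: *mut std::ffi::c_void,
        ) -> Self {
            self.inner.cb_eval = cb;
            self.inner.cb_eval_user_data = user_data;
            self
        }
    }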
