Comments (6)
My bad, I submitted a fix. In the meantime, you can work around this by adding the appropriate metadata to the GGUF:
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "▁<PRE>" --special-token middle "▁<MID>" --special-token suffix "▁<SUF>" --special-token eot "▁<EOT>"
from llama.cpp.
Thanks, but when I try to convert this model, codeshell-chat-q4_0.gguf, I get the following error:
INFO:gguf-new-metadata:* Loading: codeshell-chat-q4_0.gguf
Traceback (most recent call last):
File "/Users/kido/Code/models/Publisher/Repository/../../../githubs/llama.cpp/gguf-py/scripts/gguf-new-metadata.py", line 242, in <module>
main()
File "/Users/kido/Code/models/Publisher/Repository/../../../githubs/llama.cpp/gguf-py/scripts/gguf-new-metadata.py", line 201, in main
ids = find_token(token_list, token)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kido/Code/models/Publisher/Repository/../../../githubs/llama.cpp/gguf-py/scripts/gguf-new-metadata.py", line 73, in find_token
raise LookupError(f'Unable to find "{token}" in token list!')
LookupError: Unable to find "▁<PRE>" in token list!
That's because that model has completely different FIM tokens (and no EOT token); see its tokenizer_config.json. For this model you need the following:
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"
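To illustrate why the first command failed, here is a minimal sketch of the lookup, assuming the vocabulary is available as a plain list of strings. The `find_token` below only mirrors the behaviour visible in the traceback (exact match, `LookupError` when nothing matches); it is not a copy of the script. `CANDIDATES` and `pick_fim_tokens` are hypothetical names introduced for the example, and the two token sets are just the ones from the commands in this thread.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical table: the two FIM token spellings used in this thread.
CANDIDATES: Dict[str, Dict[str, str]] = {
    "pre_mid_suf": {
        "prefix": "▁<PRE>", "middle": "▁<MID>",
        "suffix": "▁<SUF>", "eot": "▁<EOT>",
    },
    "fim_underscore": {
        "prefix": "<fim_prefix>", "middle": "<fim_middle>",
        "suffix": "<fim_suffix>",
    },
}


def find_token(token_list: Sequence[str], token: str) -> List[int]:
    """Return every index whose entry equals `token` exactly."""
    ids = [i for i, value in enumerate(token_list) if value == token]
    if not ids:
        raise LookupError(f'Unable to find "{token}" in token list!')
    return ids


def pick_fim_tokens(
    token_list: Sequence[str],
) -> Optional[Tuple[str, Dict[str, int]]]:
    """Return the first candidate set whose tokens all exist in the vocab."""
    for name, tokens in CANDIDATES.items():
        try:
            ids = {role: find_token(token_list, t)[0] for role, t in tokens.items()}
        except LookupError:
            continue  # at least one token is missing; try the next set
        return name, ids
    return None


# Toy vocabulary standing in for tokenizer.ggml.tokens (not real model data).
vocab = ["a", "<|endoftext|>", "<fim_prefix>", "<fim_middle>", "<fim_suffix>"]
print(pick_fim_tokens(vocab))
```

Because the match is exact, a vocabulary that spells its FIM tokens differently (or lacks an EOT token altogether) fails the first set, which is the situation the `LookupError` above reports.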
./llama-infill -t 10 -ngl 0 -m ../../models/Publisher/Repository/codellama-13b.Q3_K_S.gguf --temp 0.7 --repeat_penalty 1.1 -n 20 --in-prefix "def helloworld():\n print("hell" --in-suffix "\n print("goodbye world")\n "
That fixed the metadata, but I get a segmentation fault when calling llama-infill:
./llama-infill -t 10 -m ../../models/Publisher/Repository/codeshell_modified.gguf --temp 0.7 --repeat_penalty 1.1 -n 20 --in-prefix "def helloworld()"
Log start
main: build = 3235 (88540445)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0
main: seed = 1719505502
llama_model_loader: loaded meta data with 25 key-value pairs and 508 tensors from ../../models/Publisher/Repository/codeshell_modified.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = codeshell
llama_model_loader: - kv 1: general.name str = CodeShell
llama_model_loader: - kv 2: codeshell.context_length u32 = 8192
llama_model_loader: - kv 3: codeshell.embedding_length u32 = 4096
llama_model_loader: - kv 4: codeshell.feed_forward_length u32 = 16384
llama_model_loader: - kv 5: codeshell.block_count u32 = 42
llama_model_loader: - kv 6: codeshell.attention.head_count u32 = 32
llama_model_loader: - kv 7: codeshell.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: codeshell.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: general.file_type u32 = 2
llama_model_loader: - kv 10: codeshell.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: codeshell.rope.scale_linear f32 = 1.000000
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,70144] = ["æ½»", "æ¶ģ", "ïĴĻ", "amily...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,70144] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,70144] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,72075] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 70000
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 70000
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 70000
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 70000
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.prefix_token_id u32 = 70001
llama_model_loader: - kv 23: tokenizer.ggml.middle_token_id u32 = 70002
llama_model_loader: - kv 24: tokenizer.ggml.suffix_token_id u32 = 70003
llama_model_loader: - type f32: 338 tensors
llama_model_loader: - type q4_0: 169 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 0
llm_load_vocab: token to piece cache size = 0.2985 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = codeshell
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 70144
llm_load_print_meta: n_merges = 72075
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 42
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 16384
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 0.1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.98 B
llm_load_print_meta: model size = 4.25 GiB (4.58 BPW)
llm_load_print_meta: general.name = CodeShell
llm_load_print_meta: BOS token = 70000 '<|endoftext|>'
llm_load_print_meta: EOS token = 70000 '<|endoftext|>'
llm_load_print_meta: UNK token = 70000 '<|endoftext|>'
llm_load_print_meta: PAD token = 70000 '<|endoftext|>'
llm_load_print_meta: LF token = 28544 'ÄĬ'
llm_load_print_meta: PRE token = 70001 '<fim_prefix>'
llm_load_print_meta: SUF token = 70003 '<fim_suffix>'
llm_load_print_meta: MID token = 70002 '<fim_middle>'
llm_load_print_meta: EOT token = 70000 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.45 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size = 4201.36 MiB, ( 4201.44 / 12288.02)
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: Metal buffer size = 4201.35 MiB
llm_load_tensors: CPU buffer size = 154.12 MiB
.............................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/kido/Code/githubs/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 12884.92 MB
llama_kv_cache_init: Metal KV buffer size = 1344.00 MiB
llama_new_context_with_model: KV self size = 1344.00 MiB, K (f16): 672.00 MiB, V (f16): 672.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.27 MiB
llama_new_context_with_model: Metal compute buffer size = 564.00 MiB
llama_new_context_with_model: CPU compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 1687
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 10 / 11 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
[1] 32542 segmentation fault ./llama-infill -t 10 -m --temp 0.7 --repeat_penalty 1.1 -n 20 --in-prefix
The infill example is not very stable; it's missing a few checks. My guess is that it crashes because you're missing --in-suffix. It will also crash on models with no EOT, but only after outputting the result.
Please submit another issue.
--in-prefix "def helloworld():\n print("hell" --in-suffix "\n print("goodbye world")\n "
Yes, adding --in-suffix fixes the problem.
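For anyone scripting this, a small wrapper can refuse to launch llama-infill without a suffix and also avoid the nested-quote problem in the shell command. This is only a sketch: `build_infill_argv` is a hypothetical helper, the flags are the ones used earlier in this thread, and the model filename is just an example.

```python
import shlex
from typing import List, Optional


def build_infill_argv(
    model: str,
    prefix: str,
    suffix: Optional[str],
    n_predict: int = 20,
) -> List[str]:
    """Build an argv list for llama-infill, insisting on both FIM halves."""
    if suffix is None:
        # Per the thread, omitting --in-suffix led to a segmentation fault.
        raise ValueError("--in-suffix is required alongside --in-prefix")
    return [
        "./llama-infill",
        "-m", model,
        "-n", str(n_predict),
        "--in-prefix", prefix,
        "--in-suffix", suffix,
    ]


argv = build_infill_argv(
    "codeshell_modified.gguf",
    'def helloworld():\n    print("hell',
    '\n    print("goodbye world")\n',
)
# shlex.join quotes each argument so embedded double quotes survive the shell.
print(shlex.join(argv))
```

Passing `argv` straight to `subprocess.run(argv)` sidesteps shell quoting entirely; the `shlex.join` output is only needed when pasting the command into a terminal.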