What happened? ./llama-infill -t 10 -ngl 0 -m ../../models/Publish

--in-prefix "def helloworld(): print("hell" --in-suffix " print("good

Bug: infill reference crashed about llama.cpp HOT 6 CLOSED

kidoln commented on September 13, 2024

Bug: infill reference crashed

from llama.cpp.

Comments (6)

CISC commented on September 13, 2024

My bad, submitted a fix, in the meantime you can fix this by adding the appropriate metadata to the GGUF:

gguf-new-metadata.py input.gguf output.gguf --special-token prefix "▁<PRE>" --special-token middle "▁<MID>" --special-token suffix "▁<SUF>" --special-token eot "▁<EOT>"

kidoln commented on September 13, 2024

Thanks, but when I try to convert this model, codeshell-chat-q4_0.gguf, I get the following error:

INFO:gguf-new-metadata:* Loading: codeshell-chat-q4_0.gguf
Traceback (most recent call last):
  File "/Users/kido/Code/models/Publisher/Repository/../../../githubs/llama.cpp/gguf-py/scripts/gguf-new-metadata.py", line 242, in <module>
    main()
  File "/Users/kido/Code/models/Publisher/Repository/../../../githubs/llama.cpp/gguf-py/scripts/gguf-new-metadata.py", line 201, in main
    ids = find_token(token_list, token)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kido/Code/models/Publisher/Repository/../../../githubs/llama.cpp/gguf-py/scripts/gguf-new-metadata.py", line 73, in find_token
    raise LookupError(f'Unable to find "{token}" in token list!')
LookupError: Unable to find "▁<PRE>" in token list!
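
The traceback above suggests that the script looks up each requested special token by exact string match in the model's token list. The following is a minimal sketch of that kind of lookup, not the script's actual source; the four-entry `vocab` is hypothetical and only reuses token spellings that appear in this thread.

```python
from typing import Sequence


def find_token(token_list: Sequence[str], token: str) -> list[int]:
    """Return every index at which `token` occurs; raise if there is none."""
    ids = [i for i, piece in enumerate(token_list) if piece == token]
    if not ids:
        raise LookupError(f'Unable to find "{token}" in token list!')
    return ids


# Hypothetical vocabulary that uses the <fim_*> spellings, not the "▁<PRE>" ones.
vocab = ["<|endoftext|>", "<fim_prefix>", "<fim_middle>", "<fim_suffix>"]

print(find_token(vocab, "<fim_prefix>"))  # [1]
try:
    find_token(vocab, "▁<PRE>")  # this spelling is absent from the vocabulary
except LookupError as err:
    print(err)
```

Because the match is exact, a token name copied from a different model family fails even if the model does have FIM tokens under another spelling.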

CISC commented on September 13, 2024

That's because that model has completely different FIM tokens (and no EOT token); see tokenizer_config.json. For this model you need the following:

gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"
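
One way to find out which spellings a model uses is to scan its tokenizer_config.json for strings that look like special tokens. The sketch below makes no assumption about the file's schema and simply walks the parsed JSON; the `config` dictionary is a hypothetical stand-in for the real file, and the helper names are illustrative only.

```python
import re
from typing import Any, Iterator

# Matches strings that look like one special token, e.g. "<fim_prefix>" or "▁<PRE>".
SPECIAL = re.compile(r"^▁?<[^<>\s]+>$")


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string (keys included) found inside a parsed JSON value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def special_tokens(config: Any) -> list[str]:
    """Collect the distinct special-token-like strings in first-seen order."""
    seen: dict[str, None] = {}
    for text in iter_strings(config):
        if SPECIAL.match(text):
            seen.setdefault(text, None)
    return list(seen)


# Hypothetical stand-in for the parsed contents of tokenizer_config.json.
config = {
    "eos_token": "<|endoftext|>",
    "additional_special_tokens": ["<fim_prefix>", "<fim_middle>", "<fim_suffix>"],
    "model_max_length": 8192,
}

print(special_tokens(config))
```

The strings it reports are candidates to pass to `--special-token`; which one plays the prefix, middle, or suffix role still has to be read from the token names.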

kidoln commented on September 13, 2024

./llama-infill -t 10 -ngl 0 -m ../../models/Publisher/Repository/codellama-13b.Q3_K_S.gguf --temp 0.7 --repeat_penalty 1.1 -n 20 --in-prefix "def helloworld():\n print("hell" --in-suffix "\n print("goodbye world")\n "

That fixed the metadata, but I got a segmentation fault when running llama-infill:

./llama-infill -t 10 -m ../../models/Publisher/Repository/codeshell_modified.gguf --temp 0.7 --repeat_penalty 1.1 -n 20 --in-prefix "def helloworld()"
Log start
main: build = 3235 (88540445)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0
main: seed  = 1719505502
llama_model_loader: loaded meta data with 25 key-value pairs and 508 tensors from ../../models/Publisher/Repository/codeshell_modified.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = codeshell
llama_model_loader: - kv   1:                               general.name str              = CodeShell
llama_model_loader: - kv   2:                   codeshell.context_length u32              = 8192
llama_model_loader: - kv   3:                 codeshell.embedding_length u32              = 4096
llama_model_loader: - kv   4:              codeshell.feed_forward_length u32              = 16384
llama_model_loader: - kv   5:                      codeshell.block_count u32              = 42
llama_model_loader: - kv   6:             codeshell.attention.head_count u32              = 32
llama_model_loader: - kv   7:          codeshell.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:     codeshell.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                          general.file_type u32              = 2
llama_model_loader: - kv  10:                   codeshell.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                codeshell.rope.scale_linear f32              = 1.000000
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,70144]   = ["æ½»", "æ¶ģ", "ïĴĻ", "amily...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,70144]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,70144]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,72075]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 70000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 70000
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 70000
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 70000
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:             tokenizer.ggml.prefix_token_id u32              = 70001
llama_model_loader: - kv  23:             tokenizer.ggml.middle_token_id u32              = 70002
llama_model_loader: - kv  24:             tokenizer.ggml.suffix_token_id u32              = 70003
llama_model_loader: - type  f32:  338 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 0
llm_load_vocab: token to piece cache size = 0.2985 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = codeshell
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 70144
llm_load_print_meta: n_merges         = 72075
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 0.1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.98 B
llm_load_print_meta: model size       = 4.25 GiB (4.58 BPW)
llm_load_print_meta: general.name     = CodeShell
llm_load_print_meta: BOS token        = 70000 '<|endoftext|>'
llm_load_print_meta: EOS token        = 70000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 70000 '<|endoftext|>'
llm_load_print_meta: PAD token        = 70000 '<|endoftext|>'
llm_load_print_meta: LF token         = 28544 'ÄĬ'
llm_load_print_meta: PRE token        = 70001 '<fim_prefix>'
llm_load_print_meta: SUF token        = 70003 '<fim_suffix>'
llm_load_print_meta: MID token        = 70002 '<fim_middle>'
llm_load_print_meta: EOT token        = 70000 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.45 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =  4201.36 MiB, ( 4201.44 / 12288.02)
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors:      Metal buffer size =  4201.35 MiB
llm_load_tensors:        CPU buffer size =   154.12 MiB
.............................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/kido/Code/githubs/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 12884.92 MB
llama_kv_cache_init:      Metal KV buffer size =  1344.00 MiB
llama_new_context_with_model: KV self size  = 1344.00 MiB, K (f16):  672.00 MiB, V (f16):  672.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.27 MiB
llama_new_context_with_model:      Metal compute buffer size =   564.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1687
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 10 / 11 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
[1]    32542 segmentation fault  ./llama-infill -t 10 -m  --temp 0.7 --repeat_penalty 1.1 -n 20 --in-prefix

CISC commented on September 13, 2024

The infill example is not very stable; it's missing a few checks. My guess is that it crashes because you're missing --in-suffix. It will also crash on models with no EOT, but only after outputting the result.

Please submit another issue.
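
Until such checks exist in the example itself, a small wrapper can validate the command line before launching it. This is a sketch of a hypothetical pre-flight check, assuming the crash really is caused by the missing flag as guessed above; `infill_preflight` is not part of llama.cpp, and `model.gguf` is a placeholder path.

```python
import shlex


def infill_preflight(argv: list[str]) -> list[str]:
    """Return human-readable problems found in a llama-infill argument list."""
    problems = []
    for flag in ("--in-prefix", "--in-suffix"):
        if flag not in argv:
            problems.append(f"missing {flag}")
        elif argv.index(flag) + 1 >= len(argv):
            # The flag is the last argument, so no value follows it.
            problems.append(f"{flag} has no value")
    return problems


# Same shape as the crashing invocation: a prefix but no suffix.
crashing = shlex.split(
    './llama-infill -t 10 -m model.gguf -n 20 --in-prefix "def helloworld()"'
)
print(infill_preflight(crashing))  # ['missing --in-suffix']
```

An empty list means both flags are present with a value; it says nothing about whether the model has an EOT token.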

kidoln commented on September 13, 2024

--in-prefix "def helloworld():\n print("hell" --in-suffix "\n print("goodbye world")\n "

Yes, adding --in-suffix fixes the problem.
