Hi, I found some differneces between your baichuan2-7b-chat example and <a href="https

Baichuan2-7b-chat Tokenizer and outputs difference between your repo and officical hugginface example. about chatglm.cpp HOT 2 CLOSED

Zhenzhong1 commented on July 2, 2024

Baichuan2-7b-chat Tokenizer and outputs difference between your repo and officical hugginface example.

from chatglm.cpp.

Comments (2)

li-plus commented on July 2, 2024 1

There are actually two kinds of tokenization strategies. For base models (pretrained without SFT) like Baichuan2-7B, it's fine to directly encode the text into token ids without adding any special prefix or suffix tokens, just as your tokenizer('你好吗', return_tensors='pt'). However, for chat models (after SFT) like Baichuan2-7B-Chat, they are usually trained on a specific prompt template during SFT. When they're deployed for inference, their prompt patterns should be respected. In this case, Baichuan2-7B-Chat uses a prompt template like: (see https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat/blob/main/generation_utils.py#L30-L43)

<user_token_id>prompt1<assistant_token_id>response1<user_token_id>prompt2<assistant_token_id>response2 ...

where <user_token_id> is 195 and <assistant_token_id> is 196 according to their generation config (see https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat/blob/main/generation_config.json#L5-L6). We should add these special tokens to make it chat normally.

from chatglm.cpp.

Zhenzhong1 commented on July 2, 2024

Thanks for your reply. I closed this issue.

from chatglm.cpp.

Baichuan2-7b-chat Tokenizer and outputs difference between your repo and officical hugginface example. about chatglm.cpp HOT 2 CLOSED

Comments (2)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent