Comments (11)
```python
# Requires: transformers, tokenizers, packaging
from packaging import version
from tokenizers import AddedToken, Tokenizer, decoders, normalizers
from tokenizers.models import BPE, Unigram
from transformers.convert_slow_tokenizer import SentencePieceExtractor, SpmConverter


class YiConverter(SpmConverter):
    handle_byte_fallback = True

    def decoder(self, replacement, add_prefix_space):
        return decoders.Sequence(
            [
                decoders.Replace("▁", " "),
                decoders.ByteFallback(),
                decoders.Fuse(),
            ]
        )

    def tokenizer(self, proto):
        model_type = proto.trainer_spec.model_type
        vocab_scores = self.vocab(proto)
        if model_type == 1:  # Unigram
            import tokenizers

            if version.parse(tokenizers.__version__) < version.parse("0.14.0"):
                tokenizer = Tokenizer(Unigram(vocab_scores, 0))
            else:
                tokenizer = Tokenizer(Unigram(vocab_scores, 0, byte_fallback=True))
        elif model_type == 2:  # BPE
            _, merges = SentencePieceExtractor(self.original_tokenizer.vocab_file).extract(vocab_scores)
            bpe_vocab = {word: i for i, (word, _score) in enumerate(vocab_scores)}
            tokenizer = Tokenizer(
                BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
            )
            tokenizer.add_special_tokens(
                [
                    AddedToken("<unk>", normalized=False, special=True),
                    AddedToken("<|startoftext|>", normalized=False, special=True),
                    AddedToken("<|endoftext|>", normalized=False, special=True),
                ]
            )
        else:
            raise Exception(
                "You're trying to run a `Unigram` model but your file was trained with a different algorithm"
            )
        return tokenizer

    def normalizer(self, proto):
        return normalizers.Sequence([normalizers.Replace(pattern=" ", content="▁")])

    def pre_tokenizer(self, replacement, add_prefix_space):
        return None
```
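For illustration, here is a rough plain-Python sketch of what the three decoder steps above do (`decode_pieces` and the sample pieces are made up for this example; the real behavior lives inside the `tokenizers` library):

```python
# Plain-Python approximation of Replace("▁", " ") + ByteFallback() + Fuse().
# For illustration only; not the actual `tokenizers` implementation.

def decode_pieces(pieces):
    out = bytearray()
    for piece in pieces:
        if piece.startswith("<0x") and piece.endswith(">"):
            # ByteFallback: tokens like <0xE4> stand for single raw bytes
            out += bytes([int(piece[3:-1], 16)])
        else:
            # Replace the SentencePiece meta-symbol "▁" with a space
            out += piece.replace("▁", " ").encode("utf-8")
    # Fuse: join everything into one string
    return out.decode("utf-8", errors="replace")

# "你" is the UTF-8 byte sequence E4 BD A0; byte-fallback pieces reassemble it:
print(decode_pieces(["▁Hello", "<0xE4>", "<0xBD>", "<0xA0>"]))  # " Hello你"
```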
@Liangdi @ericzhou571 For reference, thanks.
from yi.
tokenizer.json
@Liangdi Could you help me test whether this works with candle? I gave it a quick try, and the current test cases behave as expected. But I'm not familiar with candle, so I can't test it very thoroughly.
I've recently been using candle and want to add support for the Yi series. candle uses the https://github.com/huggingface/tokenizers library, which requires a tokenizer.json; the Yi series repos don't include this file, while some other models, e.g. https://huggingface.co/bert-base-chinese and https://huggingface.co/Salesforce/blip-image-captioning-large, do ship one. Looking at the transformers docs, this seems to be handled by the fast-tokenizers module: https://huggingface.co/docs/transformers/fast_tokenizers
When I asked about ChatGLM earlier, the candle side replied as follows; I'm not sure whether the Yi series can be supported? candle issue: huggingface/candle#1177 (comment)
Some related code in transformers: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py
Here is the convert_slow_tokenizer.py that was modified for candle's marian-mt support: https://github.com/huggingface/candle/blob/main/candle-examples/examples/marian-mt/convert_slow_tokenizer.py#L1262C32-L1262C32
I'll look into the tokenizer.json issue, give me a moment~
Thanks
I will close this issue, feel free to reopen this issue or start a new one if you need any further assistance.
Thanks. We've already started working on candle support on our side. Could you commit the corresponding tokenizer.json to the hf and modelscope repos so that other developers can use it directly?
I'm not sure how easy this is to support; we need to discuss it internally (the last link is very useful).
> I'm not sure how easy this is to support; we need to discuss it internally (the last link is very useful).

Looking forward to the support. By the way, is there a WeChat technical discussion group in China?
> Is there a WeChat technical discussion group in China?

Not at the moment; you can vote here: #51
> tokenizer.json @Liangdi Could you help me test whether this works with candle? I gave it a quick try, and the current test cases behave as expected. But I'm not familiar with candle, so I can't test it very thoroughly.

@loofahcus I tested it with various Chinese and English test data and got results identical to the Python tokenizer. Awesome! Could you release the conversion script? I'll go try adapting Yi-6B with candle.
May I ask what changes this converter makes compared with llama's converter? https://github.com/huggingface/transformers/blob/04ab5605fbb4ef207b10bf2772d88c53fc242e83/src/transformers/convert_slow_tokenizer.py#L1098
We took llama's script as a base and changed the special tokens in the conversion script to Yi's; the tokenization of Chinese characters is correct, but the results still differ from the native Yi tokenizer when whitespace is involved.
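One plausible source of the whitespace difference: `YiConverter.normalizer` above replaces each literal space with "▁" and its `pre_tokenizer` returns None, whereas the llama converter uses a Metaspace pre-tokenizer that, when `add_prefix_space` is set, also prepends a "▁" to the input. A rough sketch of the resulting discrepancy (function names are illustrative, and `llama_metaspace` only approximates the real Metaspace behavior):

```python
# Illustrative comparison only; the real logic lives in the `tokenizers`
# normalizers/pre_tokenizers, not in these toy functions.

def yi_normalize(text):
    # mirrors normalizers.Replace(pattern=" ", content="▁"): no prefix space
    return text.replace(" ", "▁")

def llama_metaspace(text, add_prefix_space=True):
    # rough approximation of the Metaspace pre-tokenizer with add_prefix_space
    text = text.replace(" ", "▁")
    if add_prefix_space and not text.startswith("▁"):
        text = "▁" + text
    return text

print(yi_normalize("hello world"))     # hello▁world
print(llama_metaspace("hello world"))  # ▁hello▁world
```

So the same input yields different leading "▁" handling under the two schemes, which would show up exactly as whitespace-only mismatches.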
> tokenizer.json @Liangdi Could you help me test whether this works with candle? I gave it a quick try, and the current test cases behave as expected. But I'm not familiar with candle, so I can't test it very thoroughly.

Also, when loading with transformers' fast tokenizer, which class should be used? Just PreTrainedTokenizerFast?
🥹
Related Issues (20)
- Connecting Yi-34B-Chat-4bits to langchain's create_pandas_dataframe_agent function does not achieve the expected results HOT 4
- Introduction to the three-stage training data HOT 1
- Running the model error HOT 4
- Title Yi-6B-200K can't be used HOT 1
- Can 8x A100-40G do SFT of the 34B model? HOT 1
- Is fine-tuning and inference on NPU supported? HOT 3
- Occasional errors are reported
- V100 GPU: loading the quantized model Yi-34B-Chat-4bits, inference is very slow HOT 6
- Features : openai_api.py support multi turn dialogs. HOT 1
- Result of Yi-6B-Chat on the BBH dataset cannot be reproduced
- Does Yi-VL-34b support int4 quantization? How to do it HOT 2
- Custom data: train.jsonl has 80k+ rows and eval.jsonl has 105, so why does SFT show length of train dataset: 2852 and length of eval dataset: 9? HOT 1
- When the API is called multiple times, the GPU memory continuously increases until it overflows. HOT 1
- LLama3 has been released; when will Yi put out a new version? HOT 2
- RuntimeError: "triu_tril_cuda_template" not implemented for 'BFloat16'” HOT 4
- Test issue bot
- Test issue bot
- where can I find the training code or script for YI-VL HOT 1
- After LoRA fine-tuning yi-6b-chat, the generated output contains large numbers of newlines and spaces
- YI:9b gives abnormal answers on long contexts