
Comments (11)

loofahcus avatar loofahcus commented on May 30, 2024 3
# Imports as used by transformers' convert_slow_tokenizer.py:
from packaging import version
from tokenizers import AddedToken, Tokenizer, decoders, normalizers
from tokenizers.models import BPE, Unigram
from transformers.convert_slow_tokenizer import SentencePieceExtractor, SpmConverter


class YiConverter(SpmConverter):
    handle_byte_fallback = True

    def decoder(self, replacement, add_prefix_space):
        # Undo the "▁" word-boundary marker, decode byte-fallback pieces
        # such as "<0xE4>" back into raw bytes, then fuse into one string.
        return decoders.Sequence(
            [
                decoders.Replace("▁", " "),
                decoders.ByteFallback(),
                decoders.Fuse(),
            ]
        )

    def tokenizer(self, proto):
        model_type = proto.trainer_spec.model_type
        vocab_scores = self.vocab(proto)
        if model_type == 1:  # Unigram
            import tokenizers

            # byte_fallback support was added to the Unigram model in tokenizers 0.14.0
            if version.parse(tokenizers.__version__) < version.parse("0.14.0"):
                tokenizer = Tokenizer(Unigram(vocab_scores, 0))
            else:
                tokenizer = Tokenizer(Unigram(vocab_scores, 0, byte_fallback=True))

        elif model_type == 2:  # BPE
            _, merges = SentencePieceExtractor(self.original_tokenizer.vocab_file).extract(vocab_scores)
            bpe_vocab = {word: i for i, (word, _score) in enumerate(vocab_scores)}
            tokenizer = Tokenizer(
                BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
            )
            tokenizer.add_special_tokens(
                [
                    AddedToken("<unk>", normalized=False, special=True),
                    AddedToken("<|startoftext|>", normalized=False, special=True),
                    AddedToken("<|endoftext|>", normalized=False, special=True),
                ]
            )
        else:
            raise Exception(
                "You're trying to run a `Unigram` model but your file was trained with a different algorithm"
            )

        return tokenizer

    def normalizer(self, proto):
        return normalizers.Sequence([normalizers.Replace(pattern=" ", content="▁")])

    def pre_tokenizer(self, replacement, add_prefix_space):
        return None

@Liangdi @ericzhou571 For reference, thanks.

from yi.
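As an aside, the effect of the `decoder` pipeline in the snippet above can be illustrated with a small pure-Python sketch. This is a simplified stand-in, not the `tokenizers` library's actual implementation:

```python
# Simplified sketch of decoders.Sequence([Replace("▁", " "), ByteFallback(), Fuse()]):
# replace the SentencePiece "▁" marker with a space, turn byte-fallback pieces
# like "<0xE4>" back into raw bytes, and fuse everything into one string.
import re

def decode_pieces(pieces):
    buf = bytearray()
    for piece in pieces:
        m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
        if m:
            # ByteFallback: collect the raw byte this piece encodes
            buf.append(int(m.group(1), 16))
        else:
            # Replace("▁", " "), then keep the piece's UTF-8 bytes
            buf.extend(piece.replace("▁", " ").encode("utf-8"))
    # Fuse: decode the accumulated bytes as one final string
    return buf.decode("utf-8")

# "你" (UTF-8 bytes E4 BD A0) split into three byte-fallback pieces:
print(decode_pieces(["▁Hello", "<0xE4>", "<0xBD>", "<0xA0>"]))  # → " Hello你"
```

Byte-fallback pieces let a tokenizer whose vocabulary doesn't cover a character still round-trip arbitrary UTF-8 text, which is why `handle_byte_fallback = True` matters for Chinese input.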

loofahcus avatar loofahcus commented on May 30, 2024 2

tokenizer.json
@Liangdi Could you help me test whether this works with candle? I gave it a quick try and the current test cases match expectations, but I'm not familiar with candle, so I can't test it very thoroughly.


loofahcus avatar loofahcus commented on May 30, 2024 1

I've been using candle recently and want to add support for the Yi series. candle uses the https://github.com/huggingface/tokenizers library, which requires a tokenizer.json; the Yi series doesn't ship this file, while some other models, such as https://huggingface.co/bert-base-chinese and https://huggingface.co/Salesforce/blip-image-captioning-large, do. Looking at the transformers docs, this appears to be handled by the fast-tokenizers module: https://huggingface.co/docs/transformers/fast_tokenizers

When I previously asked about ChatGLM, the candle side replied as follows; I wonder whether the Yi series can be supported? candle issue: huggingface/candle#1177 (comment)

Some related code in transformers: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py

Below is the convert_slow_tokenizer.py that candle modified to support marian-mt: https://github.com/huggingface/candle/blob/main/candle-examples/examples/marian-mt/convert_slow_tokenizer.py#L1262C32-L1262C32

Let me look into the tokenizer.json issue, one moment~
Thanks

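For context, a tokenizer.json bundles the normalizer, pre-tokenizer, model, and decoder into one file that the `tokenizers` library (and hence candle) can load directly. An abridged, illustrative sketch of its top-level shape (the field values here are placeholders, not Yi's actual configuration):

```json
{
  "version": "1.0",
  "added_tokens": [
    { "id": 0, "content": "<unk>", "special": true }
  ],
  "normalizer": { "type": "Sequence", "normalizers": [] },
  "pre_tokenizer": null,
  "post_processor": null,
  "decoder": { "type": "Sequence", "decoders": [] },
  "model": {
    "type": "BPE",
    "vocab": { "<unk>": 0 },
    "merges": []
  }
}
```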

Liangdi avatar Liangdi commented on May 30, 2024 1

I will close this issue, feel free to reopen this issue or start a new one if you need any further assistance.

Thanks, we've already started working on candle support on our side. You could commit the corresponding tokenizer.json to the hf and modelscope repositories so that other developers can use it directly.


ZhaoFancy avatar ZhaoFancy commented on May 30, 2024

I'm not sure how easy this is to support; it needs internal discussion (the last link is very useful).


Liangdi avatar Liangdi commented on May 30, 2024

I'm not sure how easy this is to support; it needs internal discussion (the last link is very useful).

Looking forward to this being supported. By the way, is there a WeChat technical discussion group for users in China?


ZhaoFancy avatar ZhaoFancy commented on May 30, 2024

Is there a WeChat technical discussion group for users in China?

Not at the moment; you can vote here: #51


Liangdi avatar Liangdi commented on May 30, 2024

tokenizer.json @Liangdi Could you help me test whether this works with candle? I gave it a quick try and the current test cases match expectations, but I'm not familiar with candle, so I can't test it very thoroughly.

@loofahcus I tested it with various Chinese and English inputs, and the results match Python's. Awesome! Could you publish the conversion script? I'll go try adapting Yi-6B with candle.


ericzhou571 avatar ericzhou571 commented on May 30, 2024

Could you explain what changes this converter makes compared to llama's Converter? https://github.com/huggingface/transformers/blob/04ab5605fbb4ef207b10bf2772d88c53fc242e83/src/transformers/convert_slow_tokenizer.py#L1098
We started from llama's conversion script and changed the special tokens to Yi's; the tokenization results for Chinese characters are all correct, but they still differ from the native Yi tokenizer when whitespace is involved.

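One plausible source of the whitespace mismatch, sketched in pure Python under an assumption (not a confirmed diagnosis): the llama converter pre-tokenizes with Metaspace, which can prepend a "▁" marker to the input, while the YiConverter above only applies `Replace(" ", "▁")` in the normalizer and returns `None` from `pre_tokenizer`. The two pipelines then present leading whitespace differently to the model:

```python
# Simplified stand-ins for the two whitespace-handling strategies; these are
# illustrative sketches, not the actual tokenizers-library implementations.

def normalize_like_yi(text):
    # normalizers.Replace(pattern=" ", content="▁"): substitution only,
    # no marker is prepended to the first word.
    return text.replace(" ", "▁")

def pretokenize_like_llama(text, add_prefix_space=True):
    # Metaspace-style: replace spaces AND prepend the marker, so the first
    # word is tokenized the same way as a word in the middle of a sentence.
    text = text.replace(" ", "▁")
    if add_prefix_space and not text.startswith("▁"):
        text = "▁" + text
    return text

print(normalize_like_yi("hello world"))       # → "hello▁world"
print(pretokenize_like_llama("hello world"))  # → "▁hello▁world"
```

The extra leading "▁" changes which vocabulary pieces match the first word, so the same special-token swap on top of llama's script can still disagree with the native Yi tokenizer around whitespace.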

ericzhou571 avatar ericzhou571 commented on May 30, 2024

tokenizer.json @Liangdi Could you help me test whether this works with candle? I gave it a quick try and the current test cases match expectations, but I'm not familiar with candle, so I can't test it very thoroughly.

One more question: when loading with transformers' fast tokenizer, which class should be used? Just PreTrainedTokenizerFast?
🥹


loofahcus avatar loofahcus commented on May 30, 2024

I will close this issue, feel free to reopen this issue or start a new one if you need any further assistance.

