
Comments (11)

loofahcus avatar loofahcus commented on May 30, 2024 3
# Imports as used by transformers' convert_slow_tokenizer.py:
from packaging import version
from tokenizers import AddedToken, Tokenizer, decoders, normalizers
from tokenizers.models import BPE, Unigram
from transformers.convert_slow_tokenizer import SentencePieceExtractor, SpmConverter


class YiConverter(SpmConverter):
    handle_byte_fallback = True

    def decoder(self, replacement, add_prefix_space):
        # Undo the "▁" word-boundary marker, decode byte-fallback pieces
        # such as "<0xE4>" back into raw bytes, then fuse into one string.
        return decoders.Sequence(
            [
                decoders.Replace("▁", " "),
                decoders.ByteFallback(),
                decoders.Fuse(),
            ]
        )

    def tokenizer(self, proto):
        model_type = proto.trainer_spec.model_type
        vocab_scores = self.vocab(proto)
        if model_type == 1:  # Unigram
            import tokenizers

            # byte_fallback support was added to the Unigram model in tokenizers 0.14.0
            if version.parse(tokenizers.__version__) < version.parse("0.14.0"):
                tokenizer = Tokenizer(Unigram(vocab_scores, 0))
            else:
                tokenizer = Tokenizer(Unigram(vocab_scores, 0, byte_fallback=True))

        elif model_type == 2:  # BPE
            _, merges = SentencePieceExtractor(self.original_tokenizer.vocab_file).extract(vocab_scores)
            bpe_vocab = {word: i for i, (word, _score) in enumerate(vocab_scores)}
            tokenizer = Tokenizer(
                BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
            )
            tokenizer.add_special_tokens(
                [
                    AddedToken("<unk>", normalized=False, special=True),
                    AddedToken("<|startoftext|>", normalized=False, special=True),
                    AddedToken("<|endoftext|>", normalized=False, special=True),
                ]
            )
        else:
            raise Exception(
                "You're trying to run a `Unigram` model but your file was trained with a different algorithm"
            )

        return tokenizer

    def normalizer(self, proto):
        return normalizers.Sequence([normalizers.Replace(pattern=" ", content="▁")])

    def pre_tokenizer(self, replacement, add_prefix_space):
        return None

@Liangdi @ericzhou571 For reference, thanks.

from yi.
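As an aside, the effect of the `decoder` pipeline in the snippet above can be illustrated with a small pure-Python sketch. This is a simplified stand-in, not the `tokenizers` library's actual implementation:

```python
# Simplified sketch of decoders.Sequence([Replace("▁", " "), ByteFallback(), Fuse()]):
# replace the SentencePiece "▁" marker with a space, turn byte-fallback pieces
# like "<0xE4>" back into raw bytes, and fuse everything into one string.
import re

def decode_pieces(pieces):
    buf = bytearray()
    for piece in pieces:
        m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
        if m:
            # ByteFallback: collect the raw byte this piece encodes
            buf.append(int(m.group(1), 16))
        else:
            # Replace("▁", " "), then keep the piece's UTF-8 bytes
            buf.extend(piece.replace("▁", " ").encode("utf-8"))
    # Fuse: decode the accumulated bytes as one final string
    return buf.decode("utf-8")

# "你" (UTF-8 bytes E4 BD A0) split into three byte-fallback pieces:
print(decode_pieces(["▁Hello", "<0xE4>", "<0xBD>", "<0xA0>"]))  # → " Hello你"
```

Byte-fallback pieces let a tokenizer whose vocabulary doesn't cover a character still round-trip arbitrary UTF-8 text, which is why `handle_byte_fallback = True` matters for Chinese input.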

loofahcus avatar loofahcus commented on May 30, 2024 2

tokenizer.json
@Liangdi Could you help me test whether this works with candle? I gave it a quick try and the current test cases match expectations, but I'm not familiar with candle, so I can't test it very thoroughly.


loofahcus avatar loofahcus commented on May 30, 2024 1

I've been using candle recently and want to add support for the Yi series. candle uses the https://github.com/huggingface/tokenizers library, which requires a tokenizer.json; the Yi series doesn't ship this file, while some other models, such as https://huggingface.co/bert-base-chinese and https://huggingface.co/Salesforce/blip-image-captioning-large, do. Looking at the transformers docs, this appears to be handled by the fast-tokenizers module: https://huggingface.co/docs/transformers/fast_tokenizers

When I previously asked about ChatGLM, the candle side replied as follows; I wonder whether the Yi series can be supported? candle issue: huggingface/candle#1177 (comment)

Some related code in transformers: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py

Below is the convert_slow_tokenizer.py that candle modified to support marian-mt: https://github.com/huggingface/candle/blob/main/candle-examples/examples/marian-mt/convert_slow_tokenizer.py#L1262C32-L1262C32

Let me look into the tokenizer.json issue, one moment~
Thanks

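For context, a tokenizer.json bundles the normalizer, pre-tokenizer, model, and decoder into one file that the `tokenizers` library (and hence candle) can load directly. An abridged, illustrative sketch of its top-level shape (the field values here are placeholders, not Yi's actual configuration):

```json
{
  "version": "1.0",
  "added_tokens": [
    { "id": 0, "content": "<unk>", "special": true }
  ],
  "normalizer": { "type": "Sequence", "normalizers": [] },
  "pre_tokenizer": null,
  "post_processor": null,
  "decoder": { "type": "Sequence", "decoders": [] },
  "model": {
    "type": "BPE",
    "vocab": { "<unk>": 0 },
    "merges": []
  }
}
```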

Liangdi avatar Liangdi commented on May 30, 2024 1

I will close this issue, feel free to reopen this issue or start a new one if you need any further assistance.

Thanks, we've already started working on candle support on our side. You could commit the corresponding tokenizer.json to the hf and modelscope repositories so that other developers can use it directly.


ZhaoFancy avatar ZhaoFancy commented on May 30, 2024

I'm not sure how easy this is to support; it needs internal discussion (the last link is very useful).


Liangdi avatar Liangdi commented on May 30, 2024

I'm not sure how easy this is to support; it needs internal discussion (the last link is very useful).

Looking forward to this being supported. By the way, is there a WeChat technical discussion group for users in China?


ZhaoFancy avatar ZhaoFancy commented on May 30, 2024

Is there a WeChat technical discussion group for users in China?

Not at the moment; you can vote here: #51


Liangdi avatar Liangdi commented on May 30, 2024

tokenizer.json @Liangdi Could you help me test whether this works with candle? I gave it a quick try and the current test cases match expectations, but I'm not familiar with candle, so I can't test it very thoroughly.

@loofahcus I tested it with various Chinese and English inputs, and the results match Python's. Awesome! Could you publish the conversion script? I'll go try adapting Yi-6B with candle.


ericzhou571 avatar ericzhou571 commented on May 30, 2024

Could you explain what changes this converter makes compared to llama's Converter? https://github.com/huggingface/transformers/blob/04ab5605fbb4ef207b10bf2772d88c53fc242e83/src/transformers/convert_slow_tokenizer.py#L1098
We started from llama's conversion script and changed the special tokens to Yi's; the tokenization results for Chinese characters are all correct, but they still differ from the native Yi tokenizer when whitespace is involved.

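One plausible source of the whitespace mismatch, sketched in pure Python under an assumption (not a confirmed diagnosis): the llama converter pre-tokenizes with Metaspace, which can prepend a "▁" marker to the input, while the YiConverter above only applies `Replace(" ", "▁")` in the normalizer and returns `None` from `pre_tokenizer`. The two pipelines then present leading whitespace differently to the model:

```python
# Simplified stand-ins for the two whitespace-handling strategies; these are
# illustrative sketches, not the actual tokenizers-library implementations.

def normalize_like_yi(text):
    # normalizers.Replace(pattern=" ", content="▁"): substitution only,
    # no marker is prepended to the first word.
    return text.replace(" ", "▁")

def pretokenize_like_llama(text, add_prefix_space=True):
    # Metaspace-style: replace spaces AND prepend the marker, so the first
    # word is tokenized the same way as a word in the middle of a sentence.
    text = text.replace(" ", "▁")
    if add_prefix_space and not text.startswith("▁"):
        text = "▁" + text
    return text

print(normalize_like_yi("hello world"))       # → "hello▁world"
print(pretokenize_like_llama("hello world"))  # → "▁hello▁world"
```

The extra leading "▁" changes which vocabulary pieces match the first word, so the same special-token swap on top of llama's script can still disagree with the native Yi tokenizer around whitespace.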

ericzhou571 avatar ericzhou571 commented on May 30, 2024

tokenizer.json @Liangdi Could you help me test whether this works with candle? I gave it a quick try and the current test cases match expectations, but I'm not familiar with candle, so I can't test it very thoroughly.

One more question: when loading with transformers' fast tokenizer, which class should be used? Just PreTrainedTokenizerFast?
🥹


loofahcus avatar loofahcus commented on May 30, 2024

I will close this issue, feel free to reopen this issue or start a new one if you need any further assistance.

