Code Monkey home page Code Monkey logo

Comments (2)

RangiLyu avatar RangiLyu commented on July 18, 2024

Please install the correct transformers version as described in the readme and requirements file:
https://github.com/InternLM/InternLM#usages

transformers>=4.34

Lower versions of transformers cannot correctly identify the id set in added_tokens_decoder

from internlm.

Li-Qingyun avatar Li-Qingyun commented on July 18, 2024

yes, the final reason is added_tokens_decoder. i resolved the problem by modifying tokenization_internlm2.py

class InternLM2Tokenizer(PreTrainedTokenizer):
    """
    Construct a InternLM2 tokenizer. Based on byte-level Byte-Pair-Encoding.
    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    model_input_names = ["input_ids", "attention_mask"]
    _auto_class = "AutoTokenizer"

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token="</s>",
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        add_bos_token=True,
        add_eos_token=False,
        decode_with_prefix_space=False,
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        self.vocab_file = vocab_file
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
        self.decode_with_prefix_space = decode_with_prefix_space
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)
        self._no_prefix_space_tokens = None
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )
        
        # If a `added_tokens_decoder` is passed, we are loading from a saved tokenizer, we overwrite
        # Modified from https://github.com/huggingface/transformers/blob/132852203a02e320049457316a63cffb64968aa1/src/transformers/tokenization_utils.py#L358-L360
        added_tokens_decoder = {int(k):v["content"] for k, v in kwargs.pop("added_tokens_decoder", {}).items()}
        added_tokens_encoder = {k:v for v, k in added_tokens_decoder.items()}
        self.added_tokens_decoder = added_tokens_decoder
        self.added_tokens_encoder = added_tokens_encoder

Thanks for reply!

from internlm.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.