Describe the bug <a href="https://huggingface.co/internlm/internlm

[Bug] Special tokens are still mismatched. about internlm HOT 2 CLOSED

Li-Qingyun commented on July 18, 2024 1

[Bug] Special tokens are still mismatched.

from internlm.

Comments (2)

RangiLyu commented on July 18, 2024

Please install the correct transformers version as described in the readme and requirements file:
https://github.com/InternLM/InternLM#usages

InternLM/requirements.txt

Line 3 in bd57ff3

transformers>=4.34

Lower versions of transformers cannot correctly identify the id set in added_tokens_decoder

from internlm.

Li-Qingyun commented on July 18, 2024

yes, the final reason is added_tokens_decoder. i resolved the problem by modifying tokenization_internlm2.py

class InternLM2Tokenizer(PreTrainedTokenizer):
    """
    Construct a InternLM2 tokenizer. Based on byte-level Byte-Pair-Encoding.
    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    model_input_names = ["input_ids", "attention_mask"]
    _auto_class = "AutoTokenizer"

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token="</s>",
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        add_bos_token=True,
        add_eos_token=False,
        decode_with_prefix_space=False,
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        self.vocab_file = vocab_file
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
        self.decode_with_prefix_space = decode_with_prefix_space
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)
        self._no_prefix_space_tokens = None
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )
        
        # If a `added_tokens_decoder` is passed, we are loading from a saved tokenizer, we overwrite
        # Modified from https://github.com/huggingface/transformers/blob/132852203a02e320049457316a63cffb64968aa1/src/transformers/tokenization_utils.py#L358-L360
        added_tokens_decoder = {int(k):v["content"] for k, v in kwargs.pop("added_tokens_decoder", {}).items()}
        added_tokens_encoder = {k:v for v, k in added_tokens_decoder.items()}
        self.added_tokens_decoder = added_tokens_decoder
        self.added_tokens_encoder = added_tokens_encoder

Thanks for reply!

from internlm.

Recommend Projects