Roberta-large uses byte-level Byte-Pair-Encoding . It avoids the commo

Roberta-large using BPE tokenizer generates multi tokens. (pet issue, closed)

caidongqi commented on July 29, 2024

Roberta-large using BPE tokenizer generates multi tokens.

from pet.

Comments (6)

caidongqi commented on July 29, 2024

Could anyone do me a favor plz...

huchinlp commented on July 29, 2024

You can try another API: tokenizer.convert_tokens_to_ids(YOUR_TOKEN).
Since Roberta is case-sensitive, you may also try lowercase "society".

caidongqi commented on July 29, 2024

Thanks for answering!
But using lowercase doesn't work for me 😭
Bug still exists: Verbalization "society" does not correspond to a single token, got ['soc', 'iety']

For your first suggestion, I still don't know how it works yet. Here is the related code.

kwargs = {'add_prefix_space': True} if isinstance(tokenizer, GPT2Tokenizer) else {}
ids = tokenizer.encode(word, add_special_tokens=False, **kwargs)
if not force_single_token:
    return ids
assert (
    len(ids) == 1
), f'Verbalization "{word}" does not correspond to a single token, got {tokenizer.convert_ids_to_tokens(ids)}'

The Roberta tokenizer splits this one word into two tokens (each with its own id), but vanilla PET can only handle a verbalization that is a single token.
So the assertion sees two ids and raises an error.

Could you please explain more explicitly how to modify it, at your convenience?

Anyway, your suggestion was a great help to me, thanks again. Best wishes.

huchinlp commented on July 29, 2024

Hi,

GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a "Ġ".
Actually, "Society" is not a token in the vocab but "ĠSociety" is a valid one.
You can call tokenizer.convert_tokens_to_ids("ĠSociety") and the result is 3930.
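
To see where the "Ġ" comes from, here is a minimal, self-contained sketch of the GPT-2 style byte-to-character table that byte-level BPE tokenizers use. It is a re-implementation for illustration, not the library's own code; the helper name to_byte_level is made up for this example.

```python
def bytes_to_unicode():
    """Map every byte 0..255 to a printable Unicode character (GPT-2 style)."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    printable_set = set(printable)
    byte_values = list(printable)
    code_points = list(printable)
    shift = 0
    for b in range(256):
        if b not in printable_set:
            # Non-printable bytes (including the space, 0x20) are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def to_byte_level(text):
    """Hypothetical helper: show how a string looks before BPE merges are applied."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


print(to_byte_level(" Society"))  # ĠSociety
print(to_byte_level("Society"))   # Society
```

The space is byte 32, the 33rd non-printable byte, so it lands on code point 256 + 32 = 288, which is "Ġ". That is why a word in the middle of a sentence is stored in the vocab with the "Ġ" prefix.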

The only thing you need to do is replace "tokenizer.encode(xxxxx)" with the following lines:

# Look the word up directly; if it is unknown, retry with the space-prefixed form.
token_id = tokenizer.convert_tokens_to_ids(word)
if token_id == tokenizer.unk_token_id:
    token_id = tokenizer.convert_tokens_to_ids("Ġ" + word)
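
The fallback above can be wrapped in a small helper that also fails loudly when neither form is in the vocab. This is only a sketch: ToyTokenizer is a hypothetical stand-in that exposes just convert_tokens_to_ids and unk_token_id so the example runs without downloading a model, and its vocab is invented except for the id 3930 quoted in this thread. With a real tokenizer you would pass that object instead.

```python
class ToyTokenizer:
    """Hypothetical stand-in with only the two members the lookup needs."""

    def __init__(self, vocab, unk_token="<unk>"):
        self.vocab = vocab
        self.unk_token_id = vocab[unk_token]

    def convert_tokens_to_ids(self, token):
        # Unknown tokens map to the unk id, as in the thread's check.
        return self.vocab.get(token, self.unk_token_id)


def single_token_id(word, tokenizer):
    """Return the id of `word`, trying the bare form first and then "Ġ" + word."""
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        token_id = tokenizer.convert_tokens_to_ids("Ġ" + word)
    if token_id == tokenizer.unk_token_id:
        raise ValueError(f'Verbalization "{word}" is not a single token in the vocab')
    return token_id


# Illustrative vocab; only the 3930 value comes from the thread.
toy = ToyTokenizer({"<unk>": 3, "ĠSociety": 3930, "the": 10})
print(single_token_id("Society", toy))  # 3930
```

Raising an error in the last step keeps the behaviour of the original assertion: a word that really does need several tokens is still rejected instead of silently mapping to the unknown token.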

Refer to this thread for more details:
https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante
Best.

caidongqi commented on July 29, 2024

That works! Thanks for the solution and reference.🥳
