Comments (6)
Could anyone do me a favor plz...
from pet.
You can try another API: tokenizer.convert_tokens_to_ids(YOUR_TOKEN).
Since Roberta is case-sensitive, you may also try lowercase "society".
from pet.
Thanks for answering!
But using lowercase doesn't work for me.
The error is still there: Verbalization "society" does not correspond to a single token, got ['soc', 'iety']
For your first suggestion, I still don't understand how to apply it. Here is the related code.
kwargs = {'add_prefix_space': True} if isinstance(tokenizer, GPT2Tokenizer) else {}
ids = tokenizer.encode(word, add_special_tokens=False, **kwargs)
if not force_single_token:
    return ids
assert (
    len(ids) == 1
), f'Verbalization "{word}" does not correspond to a single token, got {tokenizer.convert_ids_to_tokens(ids)}'
The RoBERTa tokenizer splits this one word into two tokens (each with its own id), but vanilla PET can only handle a single-token verbalization.
So the assertion sees two ids and raises an error.
Could you please explain more explicitly how to modify it, at your convenience?
Anyway, your suggestion is a great help to me. Thanks again and best wishes.
from pet.
Hi,
GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a "Ġ".
Actually, "Society" is not a token in the vocab but "ĠSociety" is a valid one.
You can call tokenizer.convert_tokens_to_ids("ĠSociety") and the result is 3930.
The only thing you need to do is replace "tokenizer.encode(xxxxx)" with the following lines:
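token_id = tokenizer.convert_tokens_to_ids(word)
if token_id == tokenizer.unk_token_id:
    # Not in the vocab as-is: retry with the "Ġ" marker that encodes a leading space.
    token_id = tokenizer.convert_tokens_to_ids("Ġ" + word)
ids = [token_id]  # keep the list shape that the rest of the function expects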
Refer to this thread for more details:
https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante
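For anyone who wants to try the fallback logic without loading a model, here is a minimal sketch of the same idea. ToyTokenizer and single_token_id are hypothetical names made up for this example, and the toy vocab is invented apart from the "ĠSociety" -> 3930 entry quoted above. It assumes, as in the snippet above, that convert_tokens_to_ids returns unk_token_id for a token that is not in the vocab.

```python
from typing import Dict


class ToyTokenizer:
    """Hypothetical stand-in exposing only the two members used below."""

    unk_token_id = 3  # assumed id of the unknown token in this toy vocab

    def __init__(self, vocab: Dict[str, int]) -> None:
        self._vocab = vocab

    def convert_tokens_to_ids(self, token: str) -> int:
        # Tokens that are not in the vocab map to the unknown-token id.
        return self._vocab.get(token, self.unk_token_id)


def single_token_id(word: str, tokenizer) -> int:
    """Look up `word` as-is, then with the "Ġ" (leading space) prefix."""
    for candidate in (word, "Ġ" + word):
        token_id = tokenizer.convert_tokens_to_ids(candidate)
        if token_id != tokenizer.unk_token_id:
            return token_id
    raise ValueError(
        f'Verbalization "{word}" is not a single token, with or without "Ġ"'
    )


# Toy vocab: only the "ĠSociety" id comes from this thread, the rest is made up.
toy = ToyTokenizer({"ĠSociety": 3930, "soc": 10, "iety": 11})
print(single_token_id("Society", toy))  # prints 3930
```

Raising an error when both lookups fail keeps the behaviour of the original assertion: a word that really is split into several pieces (like lowercase "society" here) is still rejected instead of silently becoming the unknown token.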
Best.
from pet.
That works! Thanks for the solution and reference.
from pet.