Comments (7)
Hi, sorry I haven't gotten back to you about this; I wanted to check myself but haven't found the time.
So as not to keep you waiting longer: I think it is because I used a percentage of those files as the validation and test sets (which are the numbers reported in the paper). However, this data is not supervised, and I wouldn't use those splits for evaluation, at least not as they are. If you are using the dataset for training, I suggest using a small portion of the validation data for early stopping and for monitoring during training.
I will get back to you once I have checked the data myself.
from rebel.
Hi @David-Lee-1990, care to share more information? Which stats did you get?
Hi, on the test set I get 172,007 instances and 451,623 triples (far more than the 43,506 reported in your paper) with the following code.
Further, I recalculated the predicate statistics of en_train.jsonl with
and found that the most frequent predicate, "county", shows up 1.28M times.
Please help clarify the preprocessing steps, many thanks!
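The counting code above did not survive the scrape. A minimal sketch of how such counts might be computed is below; the field names (`triples`, `predicate`, `surfaceform`) are assumptions about the REBEL jsonl dump format, not something confirmed in this thread.

```python
import json
from collections import Counter

def dataset_stats(path):
    """Count instances, triples, and predicate frequencies in a
    REBEL-style jsonl file, one JSON document per line.
    Field names are assumed, not confirmed by the authors."""
    n_instances = 0
    n_triples = 0
    predicates = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            n_instances += 1
            for triple in doc.get("triples", []):
                n_triples += 1
                predicates[triple["predicate"]["surfaceform"]] += 1
    return n_instances, n_triples, predicates
```

`predicates.most_common(10)` would then show whether "county" really dominates the training split.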
It would be great if you could explain how the number of instances for REBEL (full) is 2,754,387 whereas that for REBEL (sent.) is 784,202.
Am I correct in interpreting an instance as being a unique sentence?
In particular, could you please elaborate on what kind of processing was done on REBEL (full) to get to REBEL (sent.)?
I think the REBEL dataset here has over 3 million instances in the training set, and it is not clear to me how to get to 784,202 instances from there.
Hi @debraj135, (sent.) stands for sentence-level instances, so the paragraphs (Wikipedia abstracts) are split into sentences. However, the number in the paper (784,202) is for instances that contain one of the 220 relation types selected to pretrain REBEL; so if you use the dataset here, keep only the instances with those 220 relation types, as in
Lines 102 to 103 in 837983b
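The referenced lines from the repo were not captured in this scrape. A rough sketch of that filtering step is below; the jsonl field names and the idea of matching on predicate surface forms are assumptions for illustration, so check lines 102 to 103 in commit 837983b for the actual implementation.

```python
import json

def filter_by_relations(in_path, out_path, allowed_relations):
    """Keep only instances containing at least one triple whose
    predicate is in `allowed_relations` (e.g. the 220 relation
    types used to pretrain REBEL). Field names are assumed."""
    kept = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            preds = {t["predicate"]["surfaceform"]
                     for t in doc.get("triples", [])}
            if preds & allowed_relations:
                fout.write(line)
                kept += 1
    return kept
```

Running this over the sentence-level training file with the 220-type set should land near the 784,202 figure from the paper, if the assumed format matches.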
I can confirm that the numbers in the paper are for 10% of the validation and test files, since those were the numbers used for early stopping and for reporting performance on the silver data. If you need to replicate the same splits, try setting val_percent_check to 0.1 and the seed to 42.
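As an illustration only, a seeded 10% subsample could be drawn as below. Note that the paper's splits came from PyTorch Lightning's `val_percent_check` option, which limits validation batches rather than sampling lines, so this sketch is not guaranteed to reproduce the exact subset.

```python
import random

def take_fraction(items, fraction=0.1, seed=42):
    """Illustrative only: draw a deterministic random fraction of a
    list of instances. Not guaranteed to match the splits produced
    by PyTorch Lightning's val_percent_check."""
    rng = random.Random(seed)
    k = int(len(items) * fraction)
    return rng.sample(items, k)
```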
Nevertheless, while those splits were useful for the pre-training of REBEL, as mentioned before, they are quite noisy and unbalanced. To assess final performance, it is better to use a gold-standard dataset, such as the ones it was finetuned on.
@LittlePea13 Thank you for getting back to me. So is my understanding correct that each item in the training set here corresponds to a sentence obtained after splitting the abstracts? There are approximately 3 million items in the training set. Are there paragraphs as well among the 3 million items?
And that the 784,202 can be obtained by keeping only the items with at least one of the 220 relation types?