
Comments (7)

LittlePea13 commented on May 30, 2024

Hi, sorry I haven't gotten back to you about this; I wanted to check it myself but haven't found the time.

So as not to make you wait any longer: I think the discrepancy is due to the fact that I used a percentage of those files as the validation and test sets (which are the numbers reported in the paper). However, this data was not supervised, and I wouldn't use those splits for evaluation, at least not as they are. If you are using the dataset for training, I suggest you use a small portion of the validation data for early stopping and sanity checks during training.

I will get back to you once I have checked the data myself.


LittlePea13 commented on May 30, 2024

Hi @David-Lee-1990, care to share more information? Which stats did you get?


David-Lee-1990 commented on May 30, 2024

Hi, on the test set I get 172,007 instances and 451,623 triples (far more than the 43,506 reported in your paper) with the following code.

[screenshot: code used to count instances and triples]
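(The screenshot is not reproduced here; a counting script along these lines, with the file name and the "triples" field assumed rather than taken from the screenshot, would produce such statistics.)

import json

# Assumed file name and schema: one JSON object per line, each carrying a
# list of extracted triples under the "triples" key.
num_instances, num_triples = 0, 0
with open('en_test.jsonl') as f:
    for line in f:
        doc = json.loads(line)
        num_instances += 1
        num_triples += len(doc.get('triples', []))

print(num_instances, num_triples)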

Further, I recalculated the statistics of the predicates in en_train.jsonl with

[screenshot: predicate-counting code]

and I find that the most frequent predicate, "county", shows up about 1.28M times.

[screenshot: predicate frequency output]
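(Again the screenshot is missing; a rough sketch of such a predicate count, assuming each triple stores its predicate surface form under triple['predicate']['surfaceform'], could look like this.)

import json
from collections import Counter

# Assumed schema: each line holds one document with a "triples" list, and each
# triple records its predicate's surface form.
predicate_counts = Counter()
with open('en_train.jsonl') as f:
    for line in f:
        doc = json.loads(line)
        for triple in doc.get('triples', []):
            predicate_counts[triple['predicate']['surfaceform']] += 1

print(predicate_counts.most_common(10))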

Please help clarify the preprocessing steps, many thanks!


debraj135 commented on May 30, 2024

It would be great if you could explain how the number of instances for REBEL (full) is 2,754,387 whereas that for REBEL (sent.) is 784,202.

Am I correct in interpreting an instance as being a unique sentence?

In particular, could you please elaborate on what kind of processing was done on REBEL (full) to get to REBEL (sent.)?

I think the REBEL dataset here has over 3 million instances in the training set, and it is not clear to me how to get to 784,202 instances from that.


LittlePea13 commented on May 30, 2024

Hi @debraj135, (sent.) stands for sentence-level instances, i.e. the paragraphs (Wikipedia abstracts) are split into sentences. However, the number in the paper (784,202) is for instances that contain one of the 220 relation types selected to pretrain REBEL. Hence, if you use the dataset here but only keep the instances with those 220 relation types, as done here:

# Load the 220 relation types kept for pre-training from the TSV in the repo config
relations_df = pd.read_csv(self.config.data_files['relations'], header=None, sep='\t')
relations = list(relations_df[0])
then you should obtain the numbers in the paper.
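(As an illustration only, not the repository's exact code: with the relations file path and the JSONL field names assumed, filtering the sentence-level training file to instances whose triples use one of those 220 relations could be sketched like this.)

import json

import pandas as pd

# Assumed paths and schema: the TSV lists the 220 relation types in its first
# column, and each JSONL line has a "triples" list with predicate surface forms.
relations = set(pd.read_csv('relations_count.tsv', header=None, sep='\t')[0])

kept = 0
with open('en_train.jsonl') as f:
    for line in f:
        doc = json.loads(line)
        predicates = {t['predicate']['surfaceform'] for t in doc.get('triples', [])}
        if predicates & relations:
            kept += 1

print(kept)  # count of instances restricted to the 220 relation types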


LittlePea13 commented on May 30, 2024

I can confirm that the numbers in the paper are for 10% of the validation and test files, since those were the splits used for early stopping and for reporting performance on the silver data. If you need to replicate the same splits, try setting val_percent_check to 0.1 and the seed to 42.
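(For reference, a minimal sketch of those two settings with PyTorch Lightning; val_percent_check was the flag name in older Lightning releases and was later renamed to limit_val_batches, so adjust to the version pinned by the repository.)

import pytorch_lightning as pl

# Fix the random seed mentioned above
pl.seed_everything(42)

# Evaluate on only 10% of the validation data, matching the paper's splits
# (older flag name; newer Lightning versions use limit_val_batches instead)
trainer = pl.Trainer(val_percent_check=0.1)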

Nevertheless, while they were useful for the pre-training of REBEL, as mentioned before, they are quite noisy and unbalanced. To assess final performance, it is better to use a gold-standard dataset such as the ones it was fine-tuned on.


debraj135 commented on May 30, 2024

@LittlePea13 Thank you for getting back to me. So is my understanding correct that each item in the training set here corresponds to a sentence obtained after splitting the abstracts? There are approximately 3 million items in the training set. Are there paragraphs as well among the 3 million items?

And that the 784,202 can be obtained by keeping only the items with at least one of the 220 relation types?

