
Comments (7)

LittlePea13 commented on May 30, 2024

Hi, sorry I haven't gotten back to you about this; I wanted to check it myself but haven't found the time.

So as not to make you wait any longer: I think the discrepancy is due to the fact that I used a percentage of those files as the validation and test sets (which are the numbers reported in the paper). However, this data was not supervised, and I wouldn't use those splits for evaluation, at least not as they are. If you are using the dataset for training, I suggest you use a small portion of the validation data for early stopping and sanity checks during training.

I will get back to you once I have checked the data myself.


LittlePea13 commented on May 30, 2024

Hi @David-Lee-1990, care to share more information? Which stats did you get?


David-Lee-1990 commented on May 30, 2024

Hi, on the test set I get 172,007 instances and 451,623 triples (far more than the 43,506 reported in your paper) with the following code.

[screenshot: code used to count instances and triples]
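(The screenshot is not reproduced here; a counting script along these lines, with the file name and the "triples" field assumed rather than taken from the screenshot, would produce such statistics.)

import json

# Assumed file name and schema: one JSON object per line, each carrying a
# list of extracted triples under the "triples" key.
num_instances, num_triples = 0, 0
with open('en_test.jsonl') as f:
    for line in f:
        doc = json.loads(line)
        num_instances += 1
        num_triples += len(doc.get('triples', []))

print(num_instances, num_triples)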

Further, I recalculated the statistics of the predicates in en_train.jsonl with

[screenshot: predicate-counting code]

and I find that the most frequent predicate, "county", shows up about 1.28M times.

[screenshot: predicate frequency output]
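(Again the screenshot is missing; a rough sketch of such a predicate count, assuming each triple stores its predicate surface form under triple['predicate']['surfaceform'], could look like this.)

import json
from collections import Counter

# Assumed schema: each line holds one document with a "triples" list, and each
# triple records its predicate's surface form.
predicate_counts = Counter()
with open('en_train.jsonl') as f:
    for line in f:
        doc = json.loads(line)
        for triple in doc.get('triples', []):
            predicate_counts[triple['predicate']['surfaceform']] += 1

print(predicate_counts.most_common(10))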

Please help clarify the preprocessing steps, many thanks!


debraj135 commented on May 30, 2024

It would be great if you could explain how the number of instances for REBEL (full) is 2,754,387 whereas that for REBEL (sent.) is 784,202.

Am I correct in interpreting an instance as being a unique sentence?

In particular, could you please elaborate on what kind of processing was done on REBEL (full) to get to REBEL (sent.)?

I think the REBEL dataset here has over 3 million instances in the training set, and it is not clear to me how to get to 784,202 instances from that.


LittlePea13 commented on May 30, 2024

Hi @debraj135, (sent.) stands for sentence-level instances, i.e. the paragraphs (Wikipedia abstracts) are split into sentences. However, the number in the paper (784,202) is for instances that contain one of the 220 relation types selected to pretrain REBEL. Hence, if you use the dataset here but only keep the instances with those 220 relation types, as done here:

# Load the 220 relation types kept for pre-training from the TSV in the repo config
relations_df = pd.read_csv(self.config.data_files['relations'], header=None, sep='\t')
relations = list(relations_df[0])
then you should obtain the numbers in the paper.
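(As an illustration only, not the repository's exact code: with the relations file path and the JSONL field names assumed, filtering the sentence-level training file to instances whose triples use one of those 220 relations could be sketched like this.)

import json

import pandas as pd

# Assumed paths and schema: the TSV lists the 220 relation types in its first
# column, and each JSONL line has a "triples" list with predicate surface forms.
relations = set(pd.read_csv('relations_count.tsv', header=None, sep='\t')[0])

kept = 0
with open('en_train.jsonl') as f:
    for line in f:
        doc = json.loads(line)
        predicates = {t['predicate']['surfaceform'] for t in doc.get('triples', [])}
        if predicates & relations:
            kept += 1

print(kept)  # count of instances restricted to the 220 relation types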


LittlePea13 commented on May 30, 2024

I can confirm that the numbers in the paper are for 10% of the validation and test files, since those were the splits used for early stopping and for reporting performance on the silver data. If you need to replicate the same splits, try setting val_percent_check to 0.1 and the seed to 42.
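(For reference, a minimal sketch of those two settings with PyTorch Lightning; val_percent_check was the flag name in older Lightning releases and was later renamed to limit_val_batches, so adjust to the version pinned by the repository.)

import pytorch_lightning as pl

# Fix the random seed mentioned above
pl.seed_everything(42)

# Evaluate on only 10% of the validation data, matching the paper's splits
# (older flag name; newer Lightning versions use limit_val_batches instead)
trainer = pl.Trainer(val_percent_check=0.1)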

Nevertheless, while they were useful for the pre-training of REBEL, as mentioned before, they are quite noisy and unbalanced. To assess final performance, it is better to use a gold-standard dataset such as the ones it was fine-tuned on.


debraj135 commented on May 30, 2024

@LittlePea13 Thank you for getting back to me. So is my understanding correct that each item in the training set here corresponds to a sentence obtained after splitting the abstracts? There are approximately 3 million items in the training set. Are there paragraphs as well among the 3 million items?

And that the 784,202 can be obtained by keeping only the items with at least one of the 220 relation types?

