
ake-datasets's Introduction

Benchmark datasets for keyphrase extraction

This repository contains a large, curated set of benchmark datasets for evaluating automatic keyphrase extraction algorithms. These datasets are all pre-processed using the Stanford CoreNLP suite and are available in XML format.

Dataset format

All datasets are stored according to the following common structure:

dataset/
       /test/       <- test documents
       /train/      <- training documents (if available)
       /dev/        <- validation documents (if available)
       /src/        <- everything used to build the dataset
       /references/ <- reference keyphrases in json format

Larger datasets (such as KP20k and KPTimes) must be downloaded and preprocessed using the material in the dataset's src/ directory.
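As a rough illustration of the layout described above, the following sketch lists the files of one split and the reference files of a dataset directory. The function names are hypothetical and not part of this repository; the only things taken from the text are the sub-directory names and the fact that train/ and dev/ may be missing.

```python
from pathlib import Path

# Sub-directory names taken from the layout described above.
SPLITS = ("test", "train", "dev")


def list_split_files(dataset_dir, split):
    """Return the sorted files of one split, or [] if the split is absent."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    split_dir = Path(dataset_dir) / split
    if not split_dir.is_dir():
        return []  # train/ and dev/ exist only "if available"
    return sorted(p for p in split_dir.iterdir() if p.is_file())


def list_reference_files(dataset_dir):
    """Return the sorted JSON reference files found under references/."""
    ref_dir = Path(dataset_dir) / "references"
    if not ref_dir.is_dir():
        return []
    return sorted(ref_dir.glob("*.json"))
```

Returning an empty list for a missing split, rather than raising, keeps calling code uniform across datasets that ship only a test set.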

Reference (gold annotation) format

Reference keyphrases, used for evaluating automatic keyphrase extraction algorithms, are available in JSON format and are named according to the following pattern: [split].[annotator].[stem]?.json

where

  • split corresponds to the dataset split: test, train, dev or valid
  • annotator is the type of annotation: author, reader, editor, combined, contr (controlled vocabulary), uncontr (free annotation)
  • stem (optional) indicates that stemming (using the NLTK Porter stemmer) has been applied to the reference keyphrases.
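The naming rules above can be checked mechanically. The sketch below is one possible reading of the pattern and is not code from this repository: it assumes the optional component is the literal string "stem", and parse_reference_name is a hypothetical helper.

```python
import re

# Pattern for [split].[annotator].[stem]?.json, using the values listed above.
# Assumption: the optional third component is the literal word "stem".
REFERENCE_NAME = re.compile(
    r"^(?P<split>test|train|dev|valid)"
    r"\.(?P<annotator>author|reader|editor|combined|contr|uncontr)"
    r"(?P<stem>\.stem)?"
    r"\.json$"
)


def parse_reference_name(filename):
    """Split a reference file name into its parts, or return None if it does not match."""
    match = REFERENCE_NAME.match(filename)
    if match is None:
        return None
    return {
        "split": match.group("split"),
        "annotator": match.group("annotator"),
        "stemmed": match.group("stem") is not None,
    }
```

For instance, parse_reference_name("test.author.stem.json") reports the test split, author annotation and stemmed keyphrases, while a name outside the listed values yields None.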

Below is an example of the reference file format:

{
    "doc-1": [
        [
            "target detect"
        ],
        [
            "number of sensor",
            "sensor number"
        ]
    ],
    ...
}
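To make the structure concrete, here is a minimal sketch that loads a reference in this format and scores a list of predicted keyphrases for one document. It rests on two assumptions that the text does not state: that each inner list groups variants of a single reference keyphrase, and that exact string matching against any variant counts as a hit. The function score_document and the sample predictions are hypothetical.

```python
import json

# Same structure as the example above, without the elided entries.
EXAMPLE = """
{
    "doc-1": [
        ["target detect"],
        ["number of sensor", "sensor number"]
    ]
}
"""


def score_document(predicted, reference_groups):
    """Return (precision, recall, f1) for one document.

    Assumption: each inner list holds variants of one reference keyphrase,
    and a group is matched when any of its variants was predicted.
    """
    predicted = list(dict.fromkeys(predicted))  # drop duplicates, keep order
    predicted_set = set(predicted)
    matched = sum(1 for group in reference_groups if predicted_set & set(group))
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference_groups) if reference_groups else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


references = json.loads(EXAMPLE)
scores = score_document(["sensor number", "wireless network"], references["doc-1"])
```

When evaluating against a stemmed reference file, the predictions would need to be stemmed the same way before comparison.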

Available datasets

Dataset                | Lang  | Nature      | Train   | Dev    | Test     | Annotation  | #kp (test) | #words (test)
CSTR [1]               | en    | Full papers | 130     | -      | 500      | A           | 5.4        | 11501.4
NUS [3]                | en    | Full papers | -       | -      | 211      | A+R         | 11.0       | 8398.3
PubMed [5]             | en    | Full papers | -       | -      | 1320     | A           | 5.4        | 5322.9
ACM [6]                | en    | Full papers | -       | -      | 2304     | A           | 5.3        | 9197.6
Citeulike-180 [13]     | en    | Full papers | -       | -      | 182      | R           | 5.4        | 8589.7
SemEval-2010 [10]      | en    | Full papers | 144     | -      | 100      | A+R         | 14.7       | 7961.2
KP20k [15]             | en    | Abstracts   | 527,090 | 20,000 | 20,000   | A           | 5.3        | 176
Inspec [2]             | en    | Abstracts   | 1000    | 500    | 500      | I (uncontr) | 9.8        | 134.6
TALN-Archives [14]     | en/fr | Abstracts   | -       | -      | 521/1207 | A           | 4.0/4.1    | 123.1/141.0
KDD [9]                | en    | Abstracts   | -       | -      | 755      | A           | 4.1        | 190.7
WWW [9]                | en    | Abstracts   | -       | -      | 1330     | A           | 4.8        | 163.5
TermITH-Eval [11]      | fr    | Abstracts   | -       | -      | 400      | I           | 11.8       | 164.7
KPTimes [16]           | en    | News        | 259,923 | 10,000 | 20,000   | E           | 5.0        | 921
DUC-2001 [4]           | en    | News        | -       | -      | 308      | R           | 8.1        | 847.2
500N-KPCrowd [7]       | en    | News        | 450     | -      | 50       | R           | 46.2       | 465.3
110-PT-BN-KP [12]      | pt    | News        | 100     | -      | 10       | R           | 27.6       | 439.4
Wikinews-Keyphrase [8] | fr    | News        | -       | -      | 100      | R           | 9.7        | 313.6

Gold keyphrases are annotated by authors (A), readers (R), editors (E) or professional indexers (I).

References

  1. KEA: Practical automatic keyphrase extraction. Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., & Nevill-Manning, C. G. In Proceedings of the fourth ACM conference on Digital libraries. p. 254-255. 1999.

  2. Improved automatic keyword extraction given more linguistic knowledge. Anette Hulth. In Proceedings of EMNLP 2003. p. 216-223.

  3. Keyphrase Extraction in Scientific Publications. Thuy Dung Nguyen and Min-Yen Kan. In Proceedings of International Conference on Asian Digital Libraries 2007. p. 317-326.

  4. Single Document Keyphrase Extraction Using Neighborhood Knowledge. Xiaojun Wan and Jianguo Xiao. In Proceedings of AAAI 2008. pp. 855-860.

  5. Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods. Alexander Thorsten Schutz. Master's thesis, National University of Ireland (2008).

  6. Large dataset for keyphrases extraction. Krapivin, M., Autaeu, A., & Marchese, M. (2009). University of Trento.

  7. Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization. Marujo, L., Gershman, A., Carbonell, J., Frederking, R., & Neto, J. P. In Proceedings of LREC 2012.

  8. TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction. Adrien Bougouin, Florian Boudin, Béatrice Daille. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), 2013.

  9. Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach. Cornelia Caragea, Florin Bulgarov, Andreea Godea and Sujatha Das Gollapalli. In Proceedings of EMNLP 2014. pp. 1435-1446.

  10. How Document Pre-processing affects Keyphrase Extraction Performance. Florian Boudin, Hugo Mougard and Damien Cram. COLING 2016 Workshop on Noisy User-generated Text (WNUT).

  11. TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation. Adrien Bougouin, Sabine Barreaux, Laurent Romary, Florian Boudin and Béatrice Daille. Language Resources and Evaluation Conference (LREC), 2016.

  12. Keyphrase Cloud Generation of Broadcast News. Luis Marujo, Márcio Viveiros, João Paulo da Silva Neto. In Proceedings of Interspeech 2011.

  13. Human-competitive tagging using automatic keyphrase extraction. O. Medelyan, E. Frank, I. H. Witten. In Proceedings of EMNLP 2009.

  14. TALN Archives: a digital archive of French research articles in Natural Language Processing. Florian Boudin. In Proceedings of TALN 2013.

  15. Deep Keyphrase Generation. R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky and Y. Chi. In Proceedings of ACL 2017.

  16. KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents. Y. Gallina, F. Boudin and B. Daille. In Proceedings of INLG 2019.

ake-datasets's People

Contributors

boudinfl, ygorg


ake-datasets's Issues

Question about evaluation

Hi, thanks for the awesome repo!
I am new to keyphrase extraction and would like to know whether I should evaluate my own model on test.author.json, test.readers.json or test.combined.json for NUS, SemEval-2010, etc. Inspec, on the other hand, has test.contr.json and test.uncontr.json. In your MultipartiteRank paper, you used NUS and SemEval-2010. Which JSON file did you use to evaluate your model in the paper?

Thanks.
