Keyword-Extraction-Datasets

This repository contains seven annotated datasets for the automatic keyword extraction task. Each dataset consists of documents (.txt or .abstr files) paired with their gold-standard keyword lists (.key or .uncontr files). We used these datasets in our studies of supervised and unsupervised keyword extraction; our published works are linked below.

  1. sCAKE: Semantic Connectivity Aware Keyword Extraction

DOI:10.1016/j.ins.2018.10.034

  2. Complex Network based Supervised Keyword Extractor

DOI:10.1016/j.eswa.2019.112876
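
The document/keyword file pairing described in the introduction can be read with a few lines of Python. This is a minimal sketch, not the authors' loader. It assumes that a document and its keyword file share a base name within one folder, and that a keyword file lists keyphrases either one per line or separated by semicolons. The names `load_pairs` and `parse_keywords` are hypothetical.

```python
from pathlib import Path

DOC_EXTS = (".txt", ".abstr")      # document extensions named in this README
KEY_EXTS = (".key", ".uncontr")    # keyword-list extensions named in this README


def parse_keywords(raw: str) -> list[str]:
    """Split keyword-file content into keyphrases.

    Assumption: phrases are separated by semicolons if any semicolon is
    present, otherwise by newlines. Internal whitespace is collapsed.
    """
    sep = ";" if ";" in raw else "\n"
    return [" ".join(part.split()) for part in raw.split(sep) if part.strip()]


def load_pairs(folder: str) -> dict[str, tuple[str, list[str]]]:
    """Map each base file name to (document text, gold-standard keyphrases)."""
    pairs = {}
    for doc_path in sorted(Path(folder).iterdir()):
        if doc_path.suffix not in DOC_EXTS:
            continue
        for ext in KEY_EXTS:
            key_path = doc_path.with_suffix(ext)
            if key_path.exists():
                text = doc_path.read_text(encoding="utf-8", errors="replace")
                raw_keys = key_path.read_text(encoding="utf-8", errors="replace")
                pairs[doc_path.stem] = (text, parse_keywords(raw_keys))
                break  # use the first keyword file found for this document
    return pairs
```

Documents without a matching keyword file are skipped. If a dataset folder stores documents and keyword lists in separate directories, the lookup of `key_path` would need to be adapted.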

The datasets and their download sources are listed below; the original papers that proposed them are cited at the end.

  1. Hulth2003: Abstracts from the Inspec dataset. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
  2. WWW and KDD: CS abstracts from the KDD and WWW conferences. We kept only those documents that contain at least two sentences and at least one gold-standard keyword. Originally downloaded from https://www.dropbox.com/s/3c57qar1b0xseob/kpshare.tgz?dl=0 (this link is no longer available). The full dataset can be downloaded from https://github.com/LIAAD/KeywordExtractor-Datasets/tree/master/datasets.
  3. Marujo2012: News articles. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
  4. Krapivin2009: ACM full papers. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
  5. SemEval2010: ACM full papers. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
  6. NLM500: PubMed documents. Originally downloaded from https://github.com/zelandiya/keyword-extraction-datasets. Created for the abstractive keyword extraction (KE) task.
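
The filtering rule stated for WWW and KDD (keep a document only if it has at least two sentences and at least one gold-standard keyword) could be sketched as follows. The sentence splitter used for the original filtering is not documented here, so the regex-based `count_sentences` below is an assumption, and both function names are hypothetical.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def count_sentences(text: str) -> int:
    """Naive sentence count; abbreviations such as 'e.g.' will be over-counted."""
    chunks = _SENTENCE_END.split(text.strip())
    return sum(1 for chunk in chunks if chunk.strip())


def keep_document(text: str, gold_keywords: list[str]) -> bool:
    """Apply the README's rule: at least two sentences and at least one keyword."""
    return count_sentences(text) >= 2 and len(gold_keywords) >= 1
```

A proper sentence tokenizer would give more reliable counts on abstracts that contain abbreviations or decimal numbers.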

Dataset details and collection statistics

Dataset       |D|   Lavg  Navg  Kavg  KPavg  Description
Hulth2003     1500   129    23    10  90.07  Abstracts from the Inspec dataset
WWW           1248   174     9     5  64.97  Abstracts from CS articles published in the WWW conference
KDD            704   204     8     4  68.12  Abstracts from CS articles published in the KDD conference
Marujo2012     450   427    69    48  99.31  Online news articles
Krapivin2009  2304  7961    11     5  96.91  Full scientific articles from ACM
SemEval2010    244  8085    34    16  95.89  Full scientific articles from ACM, created for SemEval2010 Task 5
NLM500         500  4854    27    14  71.35  Full papers from the PubMed database

|D|: Number of documents. Lavg: Average document length, in words. Navg: Average number of gold-standard keywords (unigrams) assigned per document. Kavg: Average number of gold-standard keyphrases (n-grams) assigned per document. KPavg: Average percentage of keyphrases present in the text.
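
The five statistics defined above can be computed from (text, keyphrases) pairs roughly as follows. This is a sketch under stated assumptions and is not guaranteed to reproduce the table: words are split on whitespace, unigrams are the distinct words of a document's keyphrases, and a keyphrase counts as present when it occurs as a case-insensitive substring of the text (no stemming). The function name `collection_stats` is hypothetical.

```python
def collection_stats(docs: list[tuple[str, list[str]]]) -> dict[str, float]:
    """Compute |D|, Lavg, Navg, Kavg and KPavg for a list of (text, keyphrases)."""
    n = len(docs)
    if n == 0:
        raise ValueError("empty collection")

    total_len = total_unigrams = total_phrases = 0
    total_present_pct = 0.0
    for text, phrases in docs:
        words = text.lower().split()
        normalized = " ".join(words)
        phrases = [" ".join(p.lower().split()) for p in phrases]
        # Assumption: Navg counts distinct words across a document's keyphrases.
        unigrams = {w for p in phrases for w in p.split()}

        total_len += len(words)
        total_unigrams += len(unigrams)
        total_phrases += len(phrases)
        if phrases:
            # Assumption: plain substring match; documents with no keyphrases add 0%.
            present = sum(1 for p in phrases if p in normalized)
            total_present_pct += 100.0 * present / len(phrases)

    return {
        "|D|": n,
        "Lavg": total_len / n,
        "Navg": total_unigrams / n,
        "Kavg": total_phrases / n,
        "KPavg": total_present_pct / n,
    }
```

Substring matching is the simplest choice but can over-count short phrases that occur inside longer words; token-level or stemmed matching would change KPavg.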

Citations:

The citations for the original papers follow.

Hulth2003

@inproceedings{hulth2003improved,
title = "Improved Automatic Keyword Extraction given more Linguistic Knowledge",
author = "Hulth, Anette",
booktitle = "Proceedings of the 2003 Conference on EMNLP",
pages = "216--223",
year = "2003",
organization = "ACL"
}

Krapivin2009

@article{krapivin2009large,
title = "Large Dataset for Keyphrases Extraction",
author = "Krapivin, Mikalai and Autaeu, Aliaksandr and Marchese, Maurizio",
journal = "Technical Report DISI-09-055",
year = "2009",
publisher = "University of Trento"
}

NLM500

@inproceedings{aronson2000nlm,
title = "The NLM Indexing Initiative",
author = "Aronson and others",
booktitle = "Proceedings of the AMIA Symposium",
pages = "17",
year = "2000",
organization = "American Medical Informatics Association"
}

SemEval2010

@inproceedings{kim2010semeval,
title = "Semeval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles",
author = "Kim, Su Nam and Medelyan, Olena and Kan, Min-Yen and Baldwin, Timothy",
booktitle = "Proceedings of the 5th International Workshop on Semantic Evaluation",
pages = "21--26",
year = "2010",
organization = "Association for Computational Linguistics"
}

Marujo2012

@inproceedings{marujo2012supervised,
title = "Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization",
author = "Marujo, Lu{\'\i}s and Gershman, Anatole and Carbonell, Jaime and Frederking, Robert and Neto, Jo{\~a}o P",
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)",
year = "2012"
}

WWW and KDD

@inproceedings{gollapalli2014extracting,
title = "Extracting keyphrases from research papers using citation networks",
author = "Gollapalli, Sujatha Das and Caragea, Cornelia",
booktitle = "Twenty-Eighth AAAI Conference on Artificial Intelligence",
year = "2014"
}
