
Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

Authors: Ed Collins, Nikolai Rozanov, Bingbing Zhang

Contact: [email protected]

In the paper of the same name, we describe how we used an evolutionary algorithm to discover which statistics about a text classification dataset most accurately represent how difficult that dataset is likely to be for machine learning models to learn. The paper presents the difficulty measure we discovered; this Python package calculates it.

Installation

This code is pip-installable, so you can install it on your machine by running:

pip3 install edm

The code requires Python 3 and NumPy.

It is recommended that you install this code in a virtualenv:

$ mkdir myvirtualenv/
$ virtualenv -p python3 myvirtualenv/
$ source myvirtualenv/bin/activate
(myvirtualenv) $ pip3 install edm

Running

To calculate the difficulty of a text classification dataset, you will need to provide two lists: one of sentences and one of labels. The two lists must be the same length, i.e. every sentence has exactly one label. Each sentence should be an untokenized string and each label a string.

>>> sents, labels = your_own_loading_function(PATH_TO_DATA_FILE)
>>> sents
["this is a positive sentence", "this is a negative sentence", ...]
>>> labels
["positive", "negative", ...]
>>> len(sents) == len(labels)
True
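
The input requirements above can be checked before running the report. The following is a minimal sketch, not part of the edm package: the helper name `check_dataset` is hypothetical, and it only enforces what this README states (equal lengths, string sentences, string labels).

```python
from typing import Sequence


def check_dataset(sents: Sequence[str], labels: Sequence[str]) -> None:
    """Raise ValueError if the inputs do not match the format described above.

    Hypothetical helper for illustration; edm does not provide it.
    """
    # Every sentence needs exactly one label.
    if len(sents) != len(labels):
        raise ValueError(
            f"Expected one label per sentence, got {len(sents)} sentences "
            f"and {len(labels)} labels."
        )
    # Sentences must be untokenized strings and labels must be strings.
    for i, (sent, label) in enumerate(zip(sents, labels)):
        if not isinstance(sent, str) or not isinstance(label, str):
            raise ValueError(f"Item {i} must be a (str, str) pair.")
```

Calling `check_dataset(sents, labels)` returns nothing when the data is well formed and raises a `ValueError` otherwise, for example when a sentence has already been tokenized into a list.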

This package does not load data files (e.g. CSV files) into memory; you will need to do this separately.
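
As one possible way to do that loading, here is a minimal sketch using only the Python standard library. It is not part of edm. It assumes a CSV file with a header row, and the function name `load_csv_dataset` and the column names `text` and `label` are hypothetical, so adjust them to match your own file.

```python
import csv
from typing import List, Tuple


def load_csv_dataset(
    path: str,
    text_column: str = "text",    # assumed column name, change as needed
    label_column: str = "label",  # assumed column name, change as needed
) -> Tuple[List[str], List[str]]:
    """Read a CSV file with a header row into parallel lists of sentences and labels."""
    sents: List[str] = []
    labels: List[str] = []
    # newline="" lets the csv module handle quoted fields containing line breaks.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            sents.append(row[text_column])
            labels.append(row[label_column])
    return sents, labels
```

The two returned lists are always the same length, because each CSV row contributes one sentence and one label, and both are plain strings as the package expects.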

Once you have loaded your dataset into memory, you can obtain a "difficulty report" by running the code as follows:

from edm import report

sents, labels = your_own_loading_function(PATH_TO_DATA_FILE)

print(report.get_difficulty_report(sents, labels))

Note that if your dataset is very large, counting its words may take several minutes. For example, the Amazon Reviews dataset from "Character-level Convolutional Networks for Text Classification" (Xiang Zhang, Junbo Zhao and Yann LeCun, 2015), which contains 3.6 million Amazon reviews, takes approximately 15 minutes to process and produce the difficulty report. A loading bar is displayed while the words are counted.

Citation

This is the official citation from CoNLL 2018 in Brussels, Belgium. Please use it when citing this work:

@inproceedings{collins-etal-2018-evolutionary,
    title = "Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks",
    author = "Collins, Edward  and
      Rozanov, Nikolai  and
      Zhang, Bingbing",
    booktitle = "Proceedings of the 22nd Conference on Computational Natural Language Learning",
    month = oct,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/K18-1037",
    doi = "10.18653/v1/K18-1037",
    pages = "380--391",
    abstract = "Classification tasks are usually analysed and improved through new model architectures or hyperparameter optimisation but the underlying properties of datasets are discovered on an ad-hoc basis as errors occur. However, understanding the properties of the data is crucial in perfecting models. In this paper we analyse exactly which characteristics of a dataset best determine how difficult that dataset is for the task of text classification. We then propose an intuitive measure of difficulty for text classification datasets which is simple and fast to calculate. We empirically prove that this measure generalises to unseen data by comparing it to state-of-the-art datasets and results. This measure can be used to analyse the precise source of errors in a dataset and allows fast estimation of how difficult a dataset is to learn. We searched for this measure by training 12 classical and neural network based models on 78 real-world datasets, then use a genetic algorithm to discover the best measure of difficulty. Our difficulty-calculating code and datasets are publicly available.",
}
