Benchmarking of text collections from Solange, Ricardo and Rafael (BTCSR2)

TextCollectionsLibrary

A library for loading the text collections presented in the article Benchmarking Text Collections for Classification and Clustering Tasks. If you use any part of this code in your research, please cite it with the following BibTeX entry:

@article{ref:Rossi2013,
  title={Benchmarking text collections for classification and clustering tasks},
  author={Rossi, Rafael Geraldeli and Marcacini, Ricardo Marcondes and Rezende, Solange Oliveira},
  year={2013}
}

How to use

!pip install git+https://github.com/GoloMarcos/BTCSR2/

from TextCollectionsLibrary import datasets

datasets_dictionary = datasets.load()

Indexing the dictionary with a collection name, datasets_dictionary[base_name], returns a pandas DataFrame (see the usage sketch below).
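
A minimal end-to-end sketch, assuming the dictionary keys match the collection names listed in the Datasets section below ('CSTR' here is such an assumption; check the printed keys for the real names):

# Minimal usage sketch -- assumes the package was installed as above.
from TextCollectionsLibrary import datasets

datasets_dictionary = datasets.load()

# Inspect which collection names are available as keys.
print(list(datasets_dictionary.keys()))

# 'CSTR' is an assumed key; use any name printed above.
df = datasets_dictionary['CSTR']
print(df.shape)
print(df.head())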

Datasets

  • CSTR
  • Classic4
  • SyskillWebert
  • Webkb-parsed
  • Review_polarity
  • Re8
  • NSF
  • Industry Sector
  • Dmoz-Sports
  • Dmoz-Science
  • Dmoz-Health
  • Dmoz-Computers

Columns of the DataFrame

  • Text
  • Embedding from BERT
  • Embedding from DistilBERT
  • Embedding from Multilingual DistilBERT
  • Embedding from RoBERTa
  • Document class

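As a sketch of how these columns might feed a classification task, the snippet below builds a feature matrix and label vector. The column names 'Embedding from BERT' and 'Document class' are assumptions based on the list above, and each embedding cell is assumed to hold one vector per document; verify against df.columns first:

from TextCollectionsLibrary import datasets
import numpy as np

df = datasets.load()['CSTR']  # 'CSTR' is an assumed key, as above

# Hypothetical column names -- check df.columns for the real ones.
X = np.vstack(df['Embedding from BERT'].to_numpy())  # embedding matrix, one row per document
y = df['Document class'].to_numpy()                  # class labels

print(X.shape, y.shape)
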
We obtained the embeddings with the sentence_transformers library, using the following pretrained models:

  • BERT model: bert-large-nli-stsb-mean-tokens
    • Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
  • DistilBERT model: distilbert-base-nli-stsb-mean-tokens
    • Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
  • RoBERTa model: roberta-large-nli-stsb-mean-tokens
    • Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692 (2019).
  • DistilBERT Multilingual model: distiluse-base-multilingual-cased
    • Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).

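For reference, here is a sketch of how embeddings like these could be reproduced with sentence_transformers. It shows the general recipe, not necessarily the authors' exact preprocessing; any of the four model names listed above can be substituted:

from sentence_transformers import SentenceTransformer

# One of the pretrained models listed above.
model = SentenceTransformer('bert-large-nli-stsb-mean-tokens')

texts = ['a first example document', 'a second example document']
embeddings = model.encode(texts)  # one fixed-size vector per input text

print(embeddings.shape)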