Library for using the text collections presented in the article: Benchmarking Text Collections for Classification and Clustering Tasks. If you use any part of this code in your research, please cite it using the following BibTeX entry:
```bibtex
@article{ref:Rossi2013,
  title={Benchmarking text collections for classification and clustering tasks},
  author={Rossi, Rafael Geraldeli and Marcacini, Ricardo Marcondes and Rezende, Solange Oliveira},
  year={2013}
}
```
Install the library directly from the repository:

```
!pip install git+https://github.com/GoloMarcos/BTCSR2/
```

Load all collections into a dictionary:

```python
from TextCollectionsLibrary import datasets

datasets_dictionary = datasets.load()
```
`datasets_dictionary['<dataset name>']` returns a pandas DataFrame for the chosen collection (see the usage sketch after the list below). The available collections are:
- CSTR
- Classic4
- SyskillWebert
- Webkb-parsed
- Review_polarity
- Re8
- NSF
- Industry Sector
- Dmoz-Sports
- Dmoz-Science
- Dmoz-Health
- Dmoz-Computers
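A minimal usage sketch, assuming the dictionary keys match the collection names listed above (e.g. `'CSTR'`):

```python
from TextCollectionsLibrary import datasets

# load every benchmark collection as a dictionary of pandas DataFrames
datasets_dictionary = datasets.load()

# select one collection by name (key assumed to match the list above)
cstr = datasets_dictionary['CSTR']

print(cstr.shape)    # number of documents x number of columns
print(cstr.columns)  # text, embedding, and class columns
```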
Each returned DataFrame contains the following columns (used in the clustering sketch after this list):
- Text
- Embedding from BERT
- Embedding from DistilBERT
- Embedding from Multilingual DistilBERT
- Embedding from RoBERTa
- Document class
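As an illustration, the embedding and class columns can feed a clustering or classification experiment. The sketch below is only an example under assumptions: the column labels `'Embedding from BERT'` and `'Document class'` are taken from the list above and may need to be adjusted to the actual DataFrame column names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

from TextCollectionsLibrary import datasets

df = datasets.load()['CSTR']

# stack the per-document embedding vectors into a single feature matrix
# (column labels assumed to follow the list above)
X = np.vstack(df['Embedding from BERT'].to_numpy())
y = df['Document class'].to_numpy()

# simple clustering baseline: one cluster per known class
kmeans = KMeans(n_clusters=len(set(y)), random_state=0).fit(X)
print(adjusted_rand_score(y, kmeans.labels_))
```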
The embedding columns correspond to the following pre-trained models (a sketch for encoding new documents follows the list):
- BERT model: bert-large-nli-stsb-mean-tokens
  - Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
- DistilBERT model: distilbert-base-nli-stsb-mean-tokens
  - Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
- RoBERTa model: roberta-large-nli-stsb-mean-tokens
  - Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692 (2019).
- Multilingual DistilBERT model: distiluse-base-multilingual-cased
  - Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
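The model identifiers above match pre-trained models from the sentence-transformers package, so new documents can presumably be embedded in the same way. A minimal sketch under that assumption:

```python
from sentence_transformers import SentenceTransformer

# model identifier taken from the list above
model = SentenceTransformer('bert-large-nli-stsb-mean-tokens')

new_docs = ["A short example document about text classification."]
embeddings = model.encode(new_docs)  # one embedding vector per document

print(embeddings.shape)
```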