thilinarajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI

Home Page: https://simpletransformers.ai/

License: Apache License 2.0

Python 99.90% Makefile 0.08% Shell 0.02%

transformers text-classification named-entity-recognition question-answering conversational-ai information-retrival

simpletransformers's Introduction

Simple Transformers

This library is based on the Transformers library by HuggingFace. Simple Transformers lets you quickly train and evaluate Transformer models. Only 3 lines of code are needed to initialize, train, and evaluate a model.

Supported Tasks:

Information Retrieval (Dense Retrieval)
(Large) Language Models (Training, Fine-tuning, and Generation)
Encoder Model Training and Fine-tuning
Sequence Classification
Token Classification (NER)
Question Answering
Language Generation
T5 Model
Seq2Seq Tasks
Multi-Modal Classification
Conversational AI.

Setup

With Conda

Install Anaconda or Miniconda Package Manager from here
Create a new virtual environment and install packages.

$ conda create -n st python pandas tqdm
$ conda activate st

Using CUDA:

$ conda install "pytorch>=1.6" cudatoolkit=11.0 -c pytorch

Without using CUDA:

$ conda install pytorch cpuonly -c pytorch

Install simpletransformers.

$ pip install simpletransformers

Optional

Install Weights and Biases (wandb) for tracking and visualizing training in a web browser.

$ pip install wandb

Usage

All documentation is now live at simpletransformers.ai

Simple Transformers models are built with a particular Natural Language Processing (NLP) task in mind. Each model comes equipped with features and functionality designed to best fit the task it is intended to perform. The high-level process of using Simple Transformers models follows the same pattern:

Initialize a task-specific model
Train the model with train_model()
Evaluate the model with eval_model()
Make predictions on (unlabelled) data with predict()

However, the models necessarily differ so that each suits its intended task. The key differences are typically the input/output data formats and any task-specific features or configuration options, all of which are covered in the documentation section for each task.

The currently implemented task-specific Simple Transformer models, along with their task, are given below.

Task	Model
Binary and multi-class text classification	`ClassificationModel`
Conversational AI (chatbot training)	`ConvAIModel`
Language generation	`LanguageGenerationModel`
Language model training/fine-tuning	`LanguageModelingModel`
Multi-label text classification	`MultiLabelClassificationModel`
Multi-modal classification (text and image data combined)	`MultiModalClassificationModel`
Named entity recognition	`NERModel`
Question answering	`QuestionAnsweringModel`
Regression	`ClassificationModel`
Sentence-pair classification	`ClassificationModel`
Text Representation Generation	`RepresentationModel`
Document Retrieval	`RetrievalModel`

Please refer to the relevant section in the docs for more information on how to use these models.
Example scripts can be found in the examples directory.
See the Changelog for up-to-date changes to the project.

A quick example

from simpletransformers.classification import ClassificationModel, ClassificationArgs
import pandas as pd
import logging


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Preparing train data
train_data = [
    ["Aragorn was the heir of Isildur", 1],
    ["Frodo was the heir of Isildur", 0],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["text", "labels"]

# Preparing eval data
eval_data = [
    ["Theoden was the king of Rohan", 1],
    ["Merry was the king of Rohan", 0],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["text", "labels"]

# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=1)

# Create a ClassificationModel
model = ClassificationModel(
    "roberta", "roberta-base", args=model_args
)

# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(eval_df)

# Make predictions with the model
predictions, raw_outputs = model.predict(["Sam was a Wizard"])

Experiment Tracking with Weights and Biases

Weights and Biases makes it incredibly easy to keep track of all your experiments. Check it out on Colab here:

Current Pretrained Models

For a list of pretrained models, see Hugging Face docs.

The model_types available for each task can be found under their respective section. Any pretrained model of that type found in the Hugging Face docs should work. To use one, pass the matching model_type and model_name when creating the model, as in the quick example above.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

hawktang 💻	Mabu Manaileng 💻	Ali Hamdi Ali Fadel 💻	Tovly Deutsch 💻	hlo-world 💻	huntertl 💻	Yann Defretin 💻 📖 💬 🤔
Manuel 📖 💻	Gilles Jacobs 📖	shasha79 💻	Mercedes Garcia 💻	Hammad Hassan Tarar 💻 📖	Todd Cook 💻	Knut O. Hellan 💻 📖
nagenshukla 💻	flaviussn 💻 📖	Marc Torrellas 🚧	Adrien Renaud 💻	jacky18008 💻	Matteo Senese 💻	sarthakTUM 📖 💻
djstrong 💻	Hyeongchan Kim 📖	Pradhy729 💻 🚧	Iknoor Singh 📖	Gabriel Altay 💻	flozi00 📖 💻 🚧	alexysdussier 💻
Jean-Louis Queguiner 📖	aced125 💻	Laksh1997 💻	Changlin_NLP 💻	jpotoniec 💻	fcggamou 💻 📖	guy-mor 🐛 💻
Cahya Wirawan 💻	BjarkePedersen 💻	tekkkon 💻	Amit Garg 💻	caprone 🐛	Ather Fawaz 💻	Santiago Castro 📖
taranais 💻	Pablo N. Marino 💻 📖	Anton Kiselev 💻 📖	Alex 💻	Karthik Ganesan 💻	Zhylko Dima 💻	Jonatan Kłosko 💻
sarapapi 💻 💬	Abdul 💻	James Milliman 📖	Suraj Parmar 📖	KwanHong Lee 💬	Erik Fäßler 💻	Thomas Søvik 💬
Gagandeep Singh 💻 📖	Andrea Esuli 💻	DM2493 💻	Nick Doiron 💻	Abhinav Gupta 💻	Martin H. Normark 📖	Mossad Helali 💻
calebchiam 💻	Daniele Sartiano 💻	tuner007 📖	xia jiang 💻	Hendrik Buschmeier 📖	Mana Borwornpadungkitti 📖	rayline 💻
Mehdi Heidari 💻	William Roe 💻	Álvaro Abella Bascarán 💻	Brett Fazio 📖	Viet-Tien 💻	Bisola Olasehinde 💻 📖	William Chen 📖
Reza Ebrahimi 📖	gabriben 📖	Prashanth Kurella 💻	dopc 💻	Tanish Tyagi 📖 💻	kongyurui 💻	Andrew Lensen 💻
jinschoi 💻	Le Nguyen Khang 💻	Jordi Mas 📖	mxa 💻	MichelBartels 💻	Luke Tudge 📖	Saint 💻
deltaxrg 💻 📖	Fortune Adekogbe 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

If you should be on this list but you aren't, or you are on the list but don't want to be, please don't hesitate to contact me!

How to Contribute

How to Update Docs

The latest version of the docs is hosted on GitHub Pages. If you want to help document Simple Transformers, follow the steps below to edit the docs. The docs are built with Jekyll; refer to the Jekyll website for a detailed explanation of how it works.

Install Jekyll: Run the command gem install bundler jekyll
Visualizing the docs on your local computer: In your terminal, cd into the docs directory of this repo, e.g. cd simpletransformers/docs. From the docs directory, run bundle exec jekyll serve to serve the Jekyll docs locally. Then browse to http://localhost:4000, or whatever URL appears in the console, to view the docs.
Edit and visualize changes: All the section pages of the docs are under the docs/_docs directory. Edit any file in Markdown format and refresh the browser tab to see the changes.

Acknowledgements

None of this would have been possible without the hard work by the HuggingFace team in developing the Transformers library.

<div>Icon for the Social Media Preview made by <a href="https://www.flaticon.com/authors/freepik" title="Freepik">Freepik</a> from <a href="https://www.flaticon.com/" title="Flaticon">www.flaticon.com</a></div>

simpletransformers's Issues

How to transform raw model outputs into probabilities, or 0/1?

Hi, I followed your tutorial and went for a multi-label classification task.

This task has 4 labels, and the submission format looks like [1, 0, 1, 1] or [1, 1, 0, 0] for each text.

First, the test data is a list of strings of length 20000, so when executing

predictions, raw_outputs = model.predict([test])

predictions is a numpy array of length 20000, but I expected 20000 * 4 = 80000 values
because each text should have 4 values. How can I convert it to the submission format?

Secondly, I do not understand why the "raw model outputs" can be [ 4.078125 , 3.2167969, -2.8378906, -2.8203125]; I expected them to be lists of probabilities.

So which should I convert to the submission format, predictions or raw_outputs?

array([[ 4.078125 , 3.2167969, -2.8378906, -2.8203125],
[ 4.4921875, 1.8496094, -2.3378906, -2.4355469],
[ 1.796875 , 4.3945312, -2.7089844, -2.4433594],
...,
[ 1.5478516, 4.5976562, -2.5039062, -2.515625 ],
[ 3.4179688, 3.4179688, -2.7265625, -2.7734375],
[ 2.9882812, 3.9726562, -2.7285156, -2.8398438]], dtype=float32)
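
The values in the array above look like logits (unnormalised scores), one per label. For a multi-label task, a common way to turn them into independent per-label probabilities is a sigmoid, followed by a threshold to get 0/1 values. The following is a dependency-free sketch under that assumption; the function names and the 0.5 threshold are illustrative and not part of the library.

```python
import math

def sigmoid(x):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logits_to_multi_hot(rows, threshold=0.5):
    """Turn rows of per-label logits into probabilities and 0/1 labels."""
    probabilities = [[sigmoid(v) for v in row] for row in rows]
    labels = [[1 if p >= threshold else 0 for p in row] for row in probabilities]
    return probabilities, labels

# First two rows of the raw outputs quoted above.
raw_outputs = [
    [4.078125, 3.2167969, -2.8378906, -2.8203125],
    [4.4921875, 1.8496094, -2.3378906, -2.4355469],
]
probabilities, labels = logits_to_multi_hot(raw_outputs)
print(labels)  # [[1, 1, 0, 0], [1, 1, 0, 0]]
```

Each row of `labels` then has one 0/1 value per label, which is the shape the submission format asks for.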

ValueError: too many dimensions 'str' for Multilabel-classification

Hi @ThilinaRajapakse

I ran through your tutorial on Medium and found this bug. My DataFrame is structured as you described, but has 11 labels instead of 6. I also created two additional columns called "labels" and "text" for the train_data DataFrame.

/work/vnhh/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
I1118 08:14:10.053511 140433564665600 file_utils.py:39] PyTorch version 1.3.0 available.
I1118 08:14:10.217245 140433564665600 modeling_xlnet.py:194] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
I1118 08:14:10.847334 140433564665600 tokenization_utils.py:374] loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at /work/vnhh/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
I1118 08:14:10.847580 140433564665600 tokenization_utils.py:374] loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt from cache at /work/vnhh/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
I1118 08:14:11.280796 140433564665600 configuration_utils.py:151] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json from cache at /work/vnhh/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.9dad9043216064080cf9dd3711c53c0f11fe2b09313eaa66931057b4bdcaf068
I1118 08:14:11.282683 140433564665600 configuration_utils.py:168] Model config {
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 11,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 1,
  "use_bfloat16": false,
  "vocab_size": 50265
}

I1118 08:14:11.620726 140433564665600 modeling_utils.py:337] loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-pytorch_model.bin from cache at /work/vnhh/.cache/torch/transformers/228756ed15b6d200d7cb45aaef08c087e2706f54cb912863d2efe07c89584eb7.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e
I1118 08:14:15.541236 140433564665600 modeling_utils.py:405] Weights of RobertaForMultiLabelSequenceClassification not initialized from pretrained model: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
I1118 08:14:15.541777 140433564665600 modeling_utils.py:408] Weights from pretrained model not used in RobertaForMultiLabelSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
Features loaded from cache at cache_dir/cached_train_roberta_512_binary
Traceback (most recent call last):
  File "nn.py", line 54, in <module>
    test_predictions = train_and_predict(train_data, test_data)
  File "nn.py", line 20, in train_and_predict
    model.train_model(train_data)
  File "/work/vnhh/anaconda3/lib/python3.6/site-packages/simpletransformers/classification/multi_label_classification_model.py", line 106, in train_model
    return super().train_model(train_df, multi_label=multi_label, output_dir=output_dir, show_running_loss=show_running_loss, args=args)
  File "/work/vnhh/anaconda3/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 173, in train_model
    train_dataset = self.load_and_cache_examples(train_examples)
  File "/work/vnhh/anaconda3/lib/python3.6/site-packages/simpletransformers/classification/multi_label_classification_model.py", line 115, in load_and_cache_examples
    return super().load_and_cache_examples(examples, evaluate=evaluate, no_cache=no_cache, multi_label=multi_label)
  File "/work/vnhh/anaconda3/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 458, in load_and_cache_examples
    all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
ValueError: too many dimensions 'str'
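
The last frame builds a tensor from the label values, and `torch.tensor` raises this error when it is handed strings instead of numbers. One plausible cause, which is an assumption and not confirmed in the report, is that the labels column holds strings such as "[1, 0, 1]", for example after a round trip through CSV. A minimal sketch of converting such values back to lists of ints (the helper name is hypothetical):

```python
import ast

def parse_multi_hot(value, num_labels):
    """Return a list of ints from a label list or its string form."""
    if isinstance(value, str):
        value = ast.literal_eval(value)  # evaluates Python literals only
    labels = [int(v) for v in value]
    if len(labels) != num_labels:
        raise ValueError(f"expected {num_labels} labels, got {len(labels)}")
    return labels

# One label stored as a string, one already stored as a list.
rows = ["[1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]", [0] * 11]
cleaned = [parse_multi_hot(r, num_labels=11) for r in rows]
print(cleaned[0])
```

The log also shows features being loaded from a cache, so a cache written before the labels were fixed may need to be rebuilt as well.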

Calling successive eval_model returns the same value

Hi, I am trying to do binary classification.
The model seems to train well, but when I call eval_model on one DataFrame and then on another, it keeps returning the values from the first evaluation.
For example,

_, output, _ = model.eval_model(df_train)
_, output2, _ = model.eval_model(df_train.head())

returns equal outputs, both in value and length.
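
One possible explanation, offered as an assumption, is that the features built for the first evaluation were cached on disk and reused for the second call. Other examples on this page pass a `reprocess_input_data` option in the args dictionary; a sketch of setting it looks like this (the model type and name in the comment are placeholders):

```python
# Args dictionary; the key name is taken from other examples on this page.
model_args = {
    "reprocess_input_data": True,  # rebuild features instead of loading a cached copy
}

# Assumed usage, not run here:
# model = ClassificationModel("roberta", "roberta-base", args=model_args)
print(model_args)
```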

How to do validation after each training epoch?

I would like to select a validation dataset and validate on it after each training epoch. Is it possible to support this feature? Thanks!
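
For readers on a later release: the sketch below shows how this is usually configured there. The option names `evaluate_during_training` and `evaluate_each_epoch`, and the `eval_df` keyword of `train_model()`, are assumptions based on later versions of the library; check the documentation for the installed version before relying on them.

```python
# Assumed option names from later releases; verify against the installed version.
model_args = {
    "evaluate_during_training": True,  # run evaluation while training
    "evaluate_each_epoch": True,       # including at the end of every epoch
}

# Assumed usage, not run here:
# model = ClassificationModel("roberta", "roberta-base", args=model_args)
# model.train_model(train_df, eval_df=eval_df)
print(sorted(model_args))
```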

"TypeError: not a sequence" when running "Minimal Start for Multilabel Classification"

Describe the bug
I wanted to test this package for multilabel so I tried the example code for "Minimal Start for Multilabel Classification".

To Reproduce

Copy example code from README.md to clipboard (reproduced here) and paste into file multilabelmve.py:

from simpletransformers.classification import MultiLabelClassificationModel
import pandas as pd


# Train and Evaluation data needs to be in a Pandas Dataframe containing at least two columns, a 'text' and a 'labels' column. The `labels` column should contain multi-hot encoded lists.
train_data = [['Example sentence 1 for multilabel classification.', [1, 1, 1, 1, 0, 1]]] + [['This is another example sentence. ', [0, 1, 1, 0, 0, 0]]]
train_df = pd.DataFrame(train_data, columns=['text', 'labels'])
train_df = pd.DataFrame(train_data)

eval_data = [['Example eval sentence for multilabel classification.', [1, 1, 1, 1, 0, 1]], ['Another example eval sentence.', 0], ['Example eval senntence belonging to class 2', [0, 1, 1, 0, 0, 0]]]
eval_df = pd.DataFrame(eval_data)

# Create a MultiLabelClassificationModel
model = MultiLabelClassificationModel('roberta', 'roberta-base', num_labels=6, args={'reprocess_input_data': True, 'overwrite_output_dir': True, 'num_train_epochs': 5})
print(train_df.head())

# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)
print(model_outputs)

predictions, raw_outputs = model.predict(['This thing is entirely different from the other thing. '])
print(predictions)
print(raw_outputs)

From active simpletransformers conda environment run: python multilabelmve.py
The model trains fine but fails on evaluation @ line 21 model.eval_model(eval_df)
Error trace:

Traceback (most recent call last):
  File "multilabelmve.py", line 21, in <module>
    result, model_outputs, wrong_predictions = model.eval_model(eval_df)
  File "/home/gilles/repos/simpletransformers/simpletransformers/classification/multi_label_classification_model.py", line 103, in eval_model
    return super().eval_model(eval_df, output_dir=output_dir, multi_label=multi_label, verbose=verbose, **kwargs)
  File "/home/gilles/repos/simpletransformers/simpletransformers/classification/classification_model.py", line 307, in eval_model
    result, model_outputs, wrong_preds = self.evaluate(eval_df, output_dir, multi_label=multi_label, **kwargs)
  File "/home/gilles/repos/simpletransformers/simpletransformers/classification/multi_label_classification_model.py", line 106, in evaluate
    return super().evaluate(eval_df, output_dir, multi_label=multi_label, prefix=prefix, **kwargs)
  File "/home/gilles/repos/simpletransformers/simpletransformers/classification/classification_model.py", line 337, in evaluate
    eval_dataset = self.load_and_cache_examples(eval_examples, evaluate=True)
  File "/home/gilles/repos/simpletransformers/simpletransformers/classification/multi_label_classification_model.py", line 109, in load_and_cache_examples
    return super().load_and_cache_examples(examples, evaluate=evaluate, no_cache=no_cache, multi_label=multi_label)
  File "/home/gilles/repos/simpletransformers/simpletransformers/classification/classification_model.py", line 446, in load_and_cache_examples
    all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
TypeError: not a sequence

Expected behavior
Evaluation in minimal example for multilabel classification works.
I figured out that the value of
[f.label_id for f in features] is [[1, 1, 1, 1, 0, 1], 0, [0, 1, 1, 0, 0, 0]], which is probably not correct, because the second input is not a multi-hot encoded list but the plain int 0.
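
A check like the one below, run before calling eval_model, would point at the offending row directly. It is a sketch with a hypothetical helper name, and it assumes every label must be a list of 0/1 values with one entry per class.

```python
def find_bad_labels(labels, num_labels):
    """Return (index, value) pairs for rows that are not multi-hot lists."""
    bad = []
    for i, value in enumerate(labels):
        ok = (
            isinstance(value, (list, tuple))
            and len(value) == num_labels
            and all(v in (0, 1) for v in value)
        )
        if not ok:
            bad.append((i, value))
    return bad

# Label values from the eval data in this report.
eval_labels = [[1, 1, 1, 1, 0, 1], 0, [0, 1, 1, 0, 0, 0]]
bad_rows = find_bad_labels(eval_labels, num_labels=6)
print(bad_rows)  # [(1, 0)]
```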

Desktop (please complete the following information):

Ubuntu 18.04
All requirements except Apex installed following README.md

RuntimeError: CUDA error: device-side assert triggered

My code:

from simpletransformers.classification import ClassificationModel
import pandas as pd
train_df = pd.read_csv('data/train.csv', header=None)
eval_df = pd.read_csv('data/test.csv', header=None)
train_df[0] = (train_df[0] == 2).astype(int)
eval_df[0] = (eval_df[0] == 2).astype(int)
train_df = pd.DataFrame({
    'text': train_df[1].replace(r'\n', ' ', regex=True),
    'label': train_df[0]
})
eval_df = pd.DataFrame({
    'text': eval_df[1].replace(r'\n', ' ', regex=True),
    'label': eval_df[0]
})
model = ClassificationModel('xlm', 'model/', args=({'fp16': False}))
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)

Error:

Features loaded from cache at cache_dir/cached_train_xlm_128_binary
Epoch: 0%| | 0/1 [00:00<?, ?it/s/opt/conda/conda-bld/pytorch_1570710853631/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1570710853631/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1570710853631/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
File "run1.py", line 24, in
model.train_model(train_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 162, in train_model
global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss, eval_df=eval_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 235, in train
print("\rRunning loss: %f" % loss, end="")
RuntimeError: CUDA error: device-side assert triggered

Could you please help me figure out how to fix this? Thank you!
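
The assertion in the log, `t >= 0 && t < n_classes`, fires when a target label falls outside the range the classification head expects. Checking the labels on the CPU before training gives a much clearer message than the CUDA assert. This is a sketch with a hypothetical helper name, assuming integer class labels; which label or setting is wrong in this particular report is not known.

```python
def out_of_range_labels(labels, num_labels):
    """Return the sorted label values that fall outside 0..num_labels-1."""
    return sorted({int(l) for l in labels if not 0 <= int(l) < num_labels})

in_range = out_of_range_labels([0, 1, 1, 0], num_labels=2)
too_large = out_of_range_labels([1, 2, 2, 1], num_labels=2)
print(in_range)   # []
print(too_large)  # [2]
```

An empty result means every label is valid for the given number of classes; anything else lists the values that would trigger the assert.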

Asking for multi-label classification function

Great job! Thanks a lot for making it so easy! When will you support multi-label classification?

model_output in eval_model are not probabilities

from simpletransformers.classification import ClassificationModel
import pandas as pd

train_data = [['Example sentence belonging to class 1', 1], ['Example sentence belonging to class 0', 0]]
train_df = pd.DataFrame(train_data)

eval_data = [['Example eval sentence belonging to class 1', 1], ['Example eval sentence belonging to class 0', 0]]
eval_df = pd.DataFrame(eval_data)

model = ClassificationModel('roberta', 'roberta-base', use_cuda=False, args={'reprocess_input_data': True, 'overwrite_output_dir': True})

model.train_model(train_df)

result, model_outputs, wrong_predictions = model.eval_model(eval_df)

print(result)

print(model_outputs)

{'mcc': 0.0, 'tp': 1, 'tn': 0, 'fp': 1, 'fn': 0, 'eval_loss': 0.697429895401001}

[[-0.2112433 -0.01764951]
[-0.2094903 -0.01733721]]
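
The two columns printed above are per-class scores (logits), not probabilities. For a single-label task, a softmax over each row converts them into probabilities that sum to 1. A dependency-free sketch, assuming the outputs are logits:

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The model_outputs printed in this report.
model_outputs = [
    [-0.2112433, -0.01764951],
    [-0.2094903, -0.01733721],
]
probabilities = [softmax(row) for row in model_outputs]
print(probabilities)
```

Both rows come out at roughly 0.45 for class 0 and 0.55 for class 1, which matches the reported result of one true positive and one false positive.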

Error when trying to finetune roberta-large-mnli model

The following error appears when trying to finetune roberta-large-mnli:

File "/usr/local/lib/python3.6/dist-packages/simpletransformers/model.py", line 66, in __init__
    self.model = model_class.from_pretrained(model_name, num_labels=num_labels)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_utils.py", line 411, in from_pretrained
    model.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for RobertaForSequenceClassification:
	size mismatch for classifier.out_proj.weight: copying a param with shape torch.Size([3, 1024]) from checkpoint, the shape in current model is torch.Size([2, 1024]).
	size mismatch for classifier.out_proj.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).
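
The message says the checkpoint's classifier has 3 output rows while the model being built has 2. The first dimension of `classifier.out_proj.weight` is the number of labels, so the usual remedy is to construct the model with the checkpoint's label count; other examples on this page pass it as `num_labels`. The sketch below only reads the two counts off the message, using a hypothetical helper name.

```python
import re

def label_counts_from_mismatch(message):
    """Return (checkpoint_labels, model_labels) parsed from a size-mismatch line."""
    sizes = re.findall(r"torch\.Size\(\[(\d+)", message)
    if len(sizes) < 2:
        raise ValueError("no size mismatch found in message")
    return int(sizes[0]), int(sizes[1])

line = (
    "size mismatch for classifier.out_proj.weight: copying a param with shape "
    "torch.Size([3, 1024]) from checkpoint, the shape in current model is "
    "torch.Size([2, 1024])."
)
checkpoint_labels, model_labels = label_counts_from_mismatch(line)
print(checkpoint_labels, model_labels)  # 3 2
# Assumed fix: pass num_labels=3 when constructing the model.
```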

Taking long time for minimal example on GPU

The bug
Hello, really nice work!! I do have one question though. Is there some reason why
model.train_model(train_df) gets stuck for a long time for the minimal example you have provided in the README?

To Reproduce

from simpletransformers.classification import ClassificationModel
import pandas as pd


# Train and Evaluation data needs to be in a Pandas Dataframe containing at least two columns. If the Dataframe has a header, it should contain a 'text' and a 'labels' column. If no header is present, the Dataframe should contain at least two columns, with the first column is the text with type str, and the second column in the label with type int.
train_data = [['Example sentence belonging to class 1', 1], ['Example sentence belonging to class 0', 0], ['Example eval senntence belonging to class 2', 2]]
train_df = pd.DataFrame(train_data)

eval_data = [['Example eval sentence belonging to class 1', 1], ['Example eval sentence belonging to class 0', 0], ['Example eval senntence belonging to class 2', 2]]
eval_df = pd.DataFrame(eval_data)

# Create a ClassificationModel
model = ClassificationModel('bert', 'bert-base-cased', num_labels=3, args={'reprocess_input_data': True, 'overwrite_output_dir': True}) 
# You can set class weights by using the optional weight argument

# Train the model
model.train_model(train_df)

## GETS STUCK HERE

What's the average time it takes on a GPU? Could it be because Apex is not installed properly?

I ask because I get these warnings:

I1125 10:26:54.747091 140576747587328 file_utils.py:39] PyTorch version 1.0.0 available.
I1125 10:26:54.805834 140576747587328 modeling_xlnet.py:194] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .

Potential conflict with sklearn metrics?

Any update? It seems that evaluation has some conflict with sklearn metrics.

On Mon, 7 Oct 2019, 10:02 pm Thilina Rajapakse wrote:

@ThilinaRajapakse commented on this pull request.

I just noticed this and fixed it before I saw your PR. Thanks!

Originally posted by @hawktang in #3 (comment)

Prediction always returns 0

First of all, thank you very much for creating this cool library!

I tried the Yelp example and it works fine. However, when I replace the Yelp data with my own, it evaluates to only TN and FN (in other words, only the class 0 is predicted). In my dataset, the label 0 occurs approximately 80% of the time and the label 1 about 20%.

I removed the output and cache folder every time and I also tried to vary some parameters, like batch size.

I am using the most recent simpletransformers version from github.

I can do classification without any problems with the same dataset using custom NNs or classifiers from sklearn.

Am I missing something? I appreciate any pointers that you can give.
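
With an 80/20 split, a model that always predicts class 0 already reaches 80% accuracy, so class imbalance is one possible contributor (an assumption; the learning rate or too little training could equally be responsible). Another example on this page mentions an optional weight argument for class weights. A sketch of computing inverse-frequency weights, with a hypothetical helper name:

```python
from collections import Counter

def inverse_frequency_weights(labels, num_labels):
    """Weight each class by total / (num_labels * count), so rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    # Raises ZeroDivisionError if a class never occurs in labels.
    return [total / (num_labels * counts[c]) for c in range(num_labels)]

labels = [0] * 80 + [1] * 20  # roughly the 80/20 split described above
weights = inverse_frequency_weights(labels, num_labels=2)
print(weights)  # [0.625, 2.5]
```

The resulting list has one weight per class, which is the form a weighted classification loss normally expects.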

Explanation for prediction

I am using this library for sentiment analysis. I have trained the model and now I want to predict on the test dataset. When I use model.predict(['sentence comes here']) to predict an unknown new sentence, I get the following output:

(array([0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,
1, 0, 0, 1, 0]), array([[ 0.05918646, -0.2312517 ],
[-0.15851352, -0.0500554 ],
[ 0.13287637, -0.25292102],
[ 0.13287637, -0.25292102],
[-0.11805943, -0.09430733],
[-0.19722672, 0.13679701],
[-0.27268714, 0.13469158],
[ 0.71184117, -0.83837014],
[ 0.04452557, -0.21931541],
[ 0.1598964 , -0.33018294],
[-0.21872135, 0.11754164],
[ 0.46354538, -0.52351034],
[ 0.00256112, -0.19036555],
[-0.27268714, 0.13469158],
[-0.27268714, 0.13469158],
[ 0.00256112, -0.19036555],
[-0.19722672, 0.13679701],
[-0.17320369, 0.00332226],
[ 0.1598964 , -0.33018294],
[-0.10593811, -0.06245855],
[ 0.04452557, -0.21931541],
[ 0.1598964 , -0.33018294],
[ 0.31071872, -0.46716788],
[-0.15851352, -0.0500554 ],
[ 0.05918646, -0.2312517 ],
[-0.42780194, 0.4040406 ],
[ 0.00256112, -0.19036555],
[-0.19722672, 0.13679701],
[-0.17320369, 0.00332226],
[ 0.1598964 , -0.33018294],
[ 0.74134415, -0.89872265],
[-0.37553215, 0.2911712 ],
[-0.19722672, 0.13679701],
[-0.27268714, 0.13469158],
[-0.21872135, 0.11754164],
[-0.62663805, 0.67718726],
[ 0.1598964 , -0.33018294],
[ 1.263364 , -1.7161529 ],
[-0.37553215, 0.2911712 ],
[ 0.74134415, -0.89872265],
[ 0.74134415, -0.89872265],
[ 0.71184117, -0.83837014],
[-0.11805943, -0.09430733],
[ 0.1598964 , -0.33018294],
[ 1.263364 , -1.7161529 ],
[ 0.00256112, -0.19036555],
[-0.04172399, -0.13767837],
[-0.37553215, 0.2911712 ],
[ 0.05918646, -0.2312517 ],
[ 0.1598964 , -0.33018294],
[ 1.263364 , -1.7161529 ],
[ 0.05918646, -0.2312517 ],
[-0.37553215, 0.2911712 ],
[-0.21872135, 0.11754164],
[ 0.46354538, -0.52351034],
[-0.10593811, -0.06245855],
[-0.24102493, 0.12610589],
[-0.19722672, 0.13679701],
[-0.11805943, -0.09430733],
[ 0.04452557, -0.21931541],
[ 0.1598964 , -0.33018294],
[ 1.263364 , -1.7161529 ],
[ 0.13287637, -0.25292102],
[-0.37553215, 0.2911712 ],
[ 0.49057966, -0.6595632 ],
[ 0.49057966, -0.6595632 ],
[-0.11805943, -0.09430733],
[-0.21872135, 0.11754164],
[-0.27268714, 0.13469158],
[ 0.1598964 , -0.33018294],
[ 1.263364 , -1.7161529 ],
[-0.21872135, 0.11754164],
[ 0.25952443, -0.29697695],
[-0.37553215, 0.2911712 ],
[-0.17320369, 0.00332226],
[-0.21872135, 0.11754164],
[ 0.25952443, -0.29697695],
[-0.37553215, 0.2911712 ],
[-0.17320369, 0.00332226],
[-0.21872135, 0.11754164],
[ 0.25952443, -0.29697695],
[-0.37553215, 0.2911712 ],
[-0.17320369, 0.00332226],
[ 0.1598964 , -0.33018294],
[ 0.46354538, -0.52351034],
[-0.27268714, 0.13469158],
[-0.27268714, 0.13469158],
[ 0.74134415, -0.89872265],
[ 0.49325183, -0.62816525],
[ 1.1978749 , -1.6458939 ],
[ 1.1978749 , -1.6458939 ],
[ 0.00256112, -0.19036555],
[-0.19722672, 0.13679701],
[-0.21872135, 0.11754164],
[-0.27268714, 0.13469158],
[-0.37553215, 0.2911712 ],
[-0.17320369, 0.00332226],
[ 0.13287637, -0.25292102],
[-0.62663805, 0.67718726],
[-0.37553215, 0.2911712 ],
[-0.10593811, -0.06245855],
[ 1.1978749 , -1.6458939 ],
[ 0.74134415, -0.89872265],
[ 1.1978749 , -1.6458939 ],
[-0.5022752 , 0.5205793 ],
[-0.5022752 , 0.5205793 ],
[-0.5210591 , 0.31223089],
[-0.25475317, 0.12043235],
[-0.38624087, 0.32818356],
[ 0.23968135, -0.35400856],
[-0.5210591 , 0.31223089],
[ 0.19725166, -0.36717686],
[ 0.16669363, -0.35203722],
[-0.24102493, 0.12610589],
[ 1.1978749 , -1.6458939 ]], dtype=float32))

Can you please help me understand this output? I trained on 0 or 1 sentiment, so I was expecting a single integer as the output.

Thank you
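
The output above is a tuple of two arrays, which the README's quick example unpacks as predictions and raw_outputs. The first array holds the predicted class for each input, and the second holds the per-class scores those predictions come from. The sketch below checks this on the first three rows by taking the index of the larger score:

```python
def argmax(row):
    """Index of the largest score in a row."""
    return max(range(len(row)), key=lambda i: row[i])

# First three rows of the second array quoted above.
raw_outputs = [
    [0.05918646, -0.2312517],
    [-0.15851352, -0.0500554],
    [0.13287637, -0.25292102],
]
predictions = [argmax(row) for row in raw_outputs]
print(predictions)  # [0, 1, 0], matching the start of the first array
```

One prediction is returned per input item, so 115 rows suggests that 115 inputs reached the model. If a bare string were passed instead of a list of strings, each character would be treated as a separate input; that is only a guess at the cause here.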

Load a checkpoint after training

Is there a way to load the checkpoint after training the model? I looked into the code and I didn't find anything.

Error while loading model

Hello, I trained a ClassificationModel model with 125 labels. When I try to reload this model from the output folder with:

model = ClassificationModel('xlm', 'outputs/')

I got this Error message:

RuntimeError: Error(s) in loading state_dict for XLMForSequenceClassification:
size mismatch for sequence_summary.summary.weight: copying a param with shape torch.Size([125, 1024]) from checkpoint, the shape in current model is torch.Size([2, 1024]).
size mismatch for sequence_summary.summary.bias: copying a param with shape torch.Size([125]) from checkpoint, the shape in current model is torch.Size([2]).

Is there multiclass compatibility?

I am trying to do a multiclass classification, but I am obtaining this message:

RuntimeError: CUDA error: device-side assert triggered

I can't think of anything I am doing differently, aside from the fact that my problem involves multiple classes.

expected torch.cuda.FloatTensor

Here is the error when I try to run the basic example

Found param roberta.embeddings.word_embeddings.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you need to provide a model with parameters
located on a CUDA device before passing it no matter what optimization level
you chose. Use model.to('cuda') to use the default device.

multi class is not working

Hi,

I am getting the following error while trying to train a multiclass model. Any help is much appreciated!

My df_train looks like the following:

    text  id  label alpha
0  text1   0      2     a
1  text2   1      2     a
2  text3   2      3     a
3  text4   3      2     a
4  text5   4      2     a

df_train.label.value_counts()
3 212925
2 71273
0 9883
1 5920
Name: label, dtype: int64

model = TransformerModel('bert', 'bert-base-cased', num_labels=4, args={'reprocess_input_data': True, 'overwrite_output_dir': True})

model.train_model(df_train)
Converting to features started.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300001/300001 [02:41<00:00, 1855.38it/s]
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Epoch: 0%| | 0/1 [00:00<?, ?it/s/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [4,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [6,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=110 error=710 : device-side assert triggered
Current iteration: 0%| | 0/37501 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 1, in
File "/home/jbabu/.local/lib/python3.7/site-packages/simpletransformers/model.py", line 142, in train_model
global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss)
File "/home/jbabu/.local/lib/python3.7/site-packages/simpletransformers/model.py", line 367, in train
outputs = model(**inputs)
File "/home/jbabu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in call
result = self.forward(*input, **kwargs)
File "/home/jbabu/.local/lib/python3.7/site-packages/transformers/modeling_bert.py", line 913, in forward
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
File "/home/jbabu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in call
result = self.forward(*input, **kwargs)
File "/home/jbabu/.local/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 916, in forward
ignore_index=self.ignore_index, reduction=self.reduction)
File "/usr/local/lib/python3.7/dist-packages/apex/amp/wrap.py", line 28, in wrapper
return orig_fn(*new_args, **kwargs)
File "/home/jbabu/.local/lib/python3.7/site-packages/torch/nn/functional.py", line 2009, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/usr/local/lib/python3.7/dist-packages/apex/amp/wrap.py", line 28, in wrapper
return orig_fn(*new_args, **kwargs)
File "/home/jbabu/.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1838, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:110
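
The assertion `t >= 0 && t < n_classes` in the log fires when a target label falls outside `0 .. num_labels - 1`. One possible cause here is that the `id` column is being read as the label. That assumes this version takes the text from the first column and the label from the second, as another report on this page describes. A minimal sketch of a check to run before training follows; `find_bad_labels` is a hypothetical helper, not part of the library.

```python
def find_bad_labels(labels, num_labels):
    """Return the sorted set of labels outside the range 0 .. num_labels - 1."""
    return sorted({label for label in labels if not 0 <= label < num_labels})

# Rows mirror the frame above: (text, id, label, alpha).
rows = [
    ("text1", 0, 2, "a"),
    ("text2", 1, 2, "a"),
    ("text3", 2, 3, "a"),
    ("text4", 3, 2, "a"),
    ("text5", 4, 2, "a"),
]

second_column = [row[1] for row in rows]   # the `id` column
label_column = [row[2] for row in rows]    # the intended labels

print(find_bad_labels(second_column, num_labels=4))  # ids 4 and above are invalid
print(find_bad_labels(label_column, num_labels=4))
```

If the check flags the second column, passing a frame that contains only the text and label columns, in that order, would be the thing to try.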

Multiclassification task: process_count error

I have set process count to 2 in the code after reading through a closed issue here. But I am still facing this error, with every model.

AttributeError                            Traceback (most recent call last)
<ipython-input-32-19c200c6bb33> in <module>()
      1 t = TransformerModel('bert', 'bert-base-uncased')
----> 2 t.train_model(train_df)
      3 result, model_outputs, wrong_predictions = t.eval_model(dev_df)

5 frames
<ipython-input-29-4f5794731e33> in train_model(self, train_df, output_dir, show_running_loss, args)
    121         train_examples = [InputExample(i, text, None, label) for i, (text, label) in enumerate(zip(train_df.iloc[:, 0], train_df.iloc[:, 1]))]
    122 
--> 123         train_dataset = self.load_and_cache_examples(train_examples)
    124         global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss)
    125 

<ipython-input-29-4f5794731e33> in load_and_cache_examples(self, examples, evaluate, no_cache)
    260                                                     pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
    261                                                     pad_token_segment_id=4 if args['model_type'] in ['xlnet'] else 0,
--> 262                                                     process_count=process_count)
    263 
    264             if not no_cache:

<ipython-input-28-1f50b5aee09a> in convert_examples_to_features(examples, max_seq_length, tokenizer, output_mode, cls_token_at_end, sep_token_extra, pad_on_left, cls_token, sep_token, pad_token, sequence_a_segment_id, sequence_b_segment_id, cls_token_segment_id, pad_token_segment_id, mask_padding_with_zero, process_count)
    152 
    153     with Pool(process_count) as p:
--> 154         features = list(tqdm(p.imap(convert_example_to_feature, examples, chunksize=500), total=len(examples)))
    155 
    156     return features

/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py in __iter__(self)
    977 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
    978 
--> 979             for obj in iterable:
    980                 yield obj
    981                 # Update and possibly print the progressbar.

/usr/lib/python3.6/multiprocessing/pool.py in <genexpr>(.0)
    318                     result._set_length
    319                 ))
--> 320             return (item for chunk in result for item in chunk)
    321 
    322     def imap_unordered(self, func, iterable, chunksize=1):

/usr/lib/python3.6/multiprocessing/pool.py in next(self, timeout)
    733         if success:
    734             return value
--> 735         raise value
    736 
    737     __next__ = next                    # XXX

AttributeError: 'int' object has no attribute 'split'

Please provide a solution to this.
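
The message `'int' object has no attribute 'split'` suggests that a value reaching the tokenizer is an integer rather than a string, for example when the label column comes first or the text column contains numeric entries. That diagnosis is an assumption based on the traceback, which reads the text from `train_df.iloc[:, 0]`. A minimal sketch of a check follows; `non_string_rows` is a hypothetical helper.

```python
def non_string_rows(texts):
    """Return (index, value) pairs for entries that are not strings."""
    return [(i, value) for i, value in enumerate(texts) if not isinstance(value, str)]

# Hypothetical first-column values; 42 stands in for a numeric entry.
first_column = ["good movie", 42, "terrible plot"]

problems = non_string_rows(first_column)
print(problems)

# One way to repair the column is to cast every value to str.
cleaned = [str(value) for value in first_column]
```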

Obtaining Embeddings

Is there a way I can obtain the embeddings for the text when I use any of the pre-trained models?

error using evaluate_during_training

I am currently using version 0.7.9. After setting evaluate_during_training to True, I tried to run model.train_model(train, eval_df=valid), yet got an error message saying there is no eval_df option in train_model function. Any idea? Thanks!

NER >> 'InputFeatures' object has no attribute 'label_ids'

I created a custom label list and passed it to the NERModel, but I got the following error:

/anaconda3/lib/python3.6/site-packages/simpletransformers/ner/ner_model.py in train_model(self, train_data, output_dir, show_running_loss, args)
144 self._move_model_to_device()
145
--> 146 train_dataset = self.load_and_cache_examples(train_data)
147
148 global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss)

/anaconda3/lib/python3.6/site-packages/simpletransformers/ner/ner_model.py in load_and_cache_examples(self, data, evaluate, no_cache, to_predict)
524 all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
525 all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
--> 526 all_label_ids = torch.tensor([f.label_ids for f in features], dtype=torch.long)
527
528 dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)

/anaconda3/lib/python3.6/site-packages/simpletransformers/ner/ner_model.py in (.0)
524 all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
525 all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
--> 526 all_label_ids = torch.tensor([f.label_ids for f in features], dtype=torch.long)
527
528 dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)

AttributeError: 'InputFeatures' object has no attribute 'label_ids'

Loss calculation fluctuation

Is there a specific reason the running loss fluctuates so much? The training output is below.

Current iteration: 70%|███████ | 1321/1875 [09:50<04:06, 2.25it/s]Running loss: 0.496334
Current iteration: 71%|███████ | 1322/1875 [09:51<04:06, 2.24it/s]Running loss: 0.607332
Current iteration: 71%|███████ | 1323/1875 [09:51<04:06, 2.24it/s]Running loss: 0.981335
Current iteration: 71%|███████ | 1324/1875 [09:52<04:05, 2.24it/s]Running loss: 0.564045
Current iteration: 71%|███████ | 1325/1875 [09:52<04:06, 2.23it/s]Running loss: 0.813926
Current iteration: 71%|███████ | 1326/1875 [09:53<04:05, 2.24it/s]Running loss: 0.495253
Current iteration: 71%|███████ | 1327/1875 [09:53<04:04, 2.24it/s]Running loss: 0.877046
Current iteration: 71%|███████ | 1328/1875 [09:54<04:04, 2.23it/s]Running loss: 0.935980
Current iteration: 71%|███████ | 1329/1875 [09:54<04:03, 2.24it/s]Running loss: 1.029763
Current iteration: 71%|███████ | 1330/1875 [09:54<04:03, 2.24it/s]Running loss: 0.401378
Current iteration: 71%|███████ | 1331/1875 [09:55<04:02, 2.25it/s]Running loss: 0.995074
Current iteration: 71%|███████ | 1332/1875 [09:55<04:01, 2.25it/s]Running loss: 1.740216
Current iteration: 71%|███████ | 1333/1875 [09:56<04:01, 2.24it/s]Running loss: 0.489634
Current iteration: 71%|███████ | 1334/1875 [09:56<04:02, 2.23it/s]Running loss: 0.216885
Current iteration: 71%|███████ | 1335/1875 [09:57<04:00, 2.24it/s]Running loss: 0.847103
Current iteration: 71%|███████▏ | 1336/1875 [09:57<03:59, 2.25it/s]Running loss: 0.480067
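
The values printed here appear to be the loss of each individual batch, which is typically noisy with small batches. That reading is an assumption, since the thread does not say how the running loss is computed. A minimal sketch of smoothing the printed values with an exponential moving average to see the trend follows; the smoothing factor 0.9 is an arbitrary choice.

```python
def exponential_moving_average(values, beta=0.9):
    """Smooth a sequence; higher beta gives a smoother, slower-moving curve."""
    smoothed = []
    average = values[0]  # start from the first observation
    for value in values:
        average = beta * average + (1 - beta) * value
        smoothed.append(average)
    return smoothed

# Running-loss values copied from the log above.
losses = [0.496334, 0.607332, 0.981335, 0.564045, 0.813926, 0.495253,
          0.877046, 0.935980, 1.029763, 0.401378, 0.995074, 1.740216,
          0.489634, 0.216885, 0.847103, 0.480067]

smoothed = exponential_moving_average(losses)
print(f"raw range:      {min(losses):.3f} to {max(losses):.3f}")
print(f"smoothed range: {min(smoothed):.3f} to {max(smoothed):.3f}")
```

Comparing the smoothed curve across a few hundred steps is usually more informative than comparing consecutive batch losses.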

Feature request: list predictions for NER

Hey there.
I've started using this package recently, and I think it is awesome!

However, I've noticed one possible issue with the NER prediction format. It uses a dict whose keys are the words themselves, but the same word (or token) within one text may have different labels depending on the context.

Consider the following example:

Bobby found 5 apples by 5 PM

Here the first 5 token may have the N_FRUITS label, while the second one may have the TIME_HOUR tag. But the predictions will have only one dict entry with the key 5.

I would propose an output format that can be described as List[Tuple[str, str]]:

predictions = [
    ('Bobby', 'O'),
    ('found', 'O'),
    ('5', 'N_FRUITS'),
    ('apples', 'O'),
    ('by', 'O'),
    ('5', 'TIME_HOUR'),
    ('PM', ''),
]

It would be trivial to change the current dict format into something like that, but we should take care to preserve backward compatibility.

What do you think?
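
The collision described in this request can be reproduced with plain Python, independent of the library. The sketch below uses the tokens from the example; it only illustrates the data-structure issue and is not the library's prediction code.

```python
words = ["Bobby", "found", "5", "apples", "by", "5", "PM"]
# Labels copied from the post; the empty label for "PM" is kept as written there.
labels = ["O", "O", "N_FRUITS", "O", "O", "TIME_HOUR", ""]

as_dict = dict(zip(words, labels))   # the later "5" overwrites the earlier one
as_list = list(zip(words, labels))   # keeps one entry per token, in order

print(as_dict["5"])
print([label for word, label in as_list if word == "5"])
```

The dict keeps six entries for seven tokens, while the list of tuples preserves both occurrences of `5` with their separate labels.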

Running it on Google Colab gives the following error : ValueError: Number of processes must be at least 1

Thanks for this library. I'm trying to run it on Google Colab and I'm getting the following error:
ValueError: Number of processes must be at least 1. I think it's because of the process_count argument in convert_examples_to_features.

def convert_examples_to_features(examples, max_seq_length,
                                 tokenizer, output_mode,
                                 cls_token_at_end=False, sep_token_extra=False, pad_on_left=False,
                                 cls_token='[CLS]', sep_token='[SEP]', pad_token=0,
                                 sequence_a_segment_id=0, sequence_b_segment_id=1,
                                 cls_token_segment_id=1, pad_token_segment_id=0,
                                 mask_padding_with_zero=True,
                                 process_count=cpu_count() - 2):

The entire stack trace is:

ValueError Traceback (most recent call last)
in ()
----> 1 model.train_model(train_split)

4 frames
/usr/local/lib/python3.6/dist-packages/simpletransformers/model.py in train_model(self, train_df, output_dir)
125 train_examples = [InputExample(i, text, None, label) for i, (text, label) in enumerate(zip(train_df.iloc[:, 0], train_df.iloc[:, 1]))]
126
--> 127 train_dataset = self.load_and_cache_examples(train_examples)
128 global_step, tr_loss = self.train(train_dataset, output_dir)
129

/usr/local/lib/python3.6/dist-packages/simpletransformers/model.py in load_and_cache_examples(self, examples, evaluate)
272 pad_token=tokenizer.convert_tokens_to_ids(
273 [tokenizer.pad_token])[0],
--> 274 pad_token_segment_id=4 if args['model_type'] in ['xlnet'] else 0)
275
276 torch.save(features, cached_features_file)

/usr/local/lib/python3.6/dist-packages/simpletransformers/utils.py in convert_examples_to_features(examples, max_seq_length, tokenizer, output_mode, cls_token_at_end, sep_token_extra, pad_on_left, cls_token, sep_token, pad_token, sequence_a_segment_id, sequence_b_segment_id, cls_token_segment_id, pad_token_segment_id, mask_padding_with_zero, process_count)
170 examples = [(example, max_seq_length, tokenizer, output_mode, cls_token_at_end, cls_token, sep_token, cls_token_segment_id, pad_on_left, pad_token_segment_id, sep_token_extra) for example in examples]
171
--> 172 with Pool(process_count) as p:
173 features = list(tqdm(p.imap(convert_example_to_feature, examples, chunksize=500), total=len(examples)))
174

/usr/lib/python3.6/multiprocessing/context.py in Pool(self, processes, initializer, initargs, maxtasksperchild)
117 from .pool import Pool
118 return Pool(processes, initializer, initargs, maxtasksperchild,
--> 119 context=self.get_context())
120
121 def RawValue(self, typecode_or_type, *args):

/usr/lib/python3.6/multiprocessing/pool.py in init(self, processes, initializer, initargs, maxtasksperchild, context)
165 processes = os.cpu_count() or 1
166 if processes < 1:
--> 167 raise ValueError("Number of processes must be at least 1")
168
169 if initializer is not None and not callable(initializer):

ValueError: Number of processes must be at least 1
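
The default `process_count=cpu_count() - 2` shown in this report evaluates to 0 on a machine with two CPU cores, which would explain the error if the Colab runtime reports two cores (an assumption; the thread does not state the core count). A minimal sketch of a clamped default follows; `safe_process_count` is a hypothetical helper.

```python
from multiprocessing import cpu_count

def safe_process_count(available=None, reserve=2):
    """Leave `reserve` cores free but never return fewer than 1 process."""
    if available is None:
        available = cpu_count()
    return max(1, available - reserve)

print(safe_process_count(available=2))
print(safe_process_count(available=8))
```

The clamp keeps the original intent of leaving a couple of cores free on larger machines while still working on small ones.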

regex not used

It looks like the regex module is not used anywhere, but it's still specified as a dependency. Would it be possible to remove it from setup.py? It triggers a whole build process on Linux, since the author only released a Python wheel for Windows.

Question - Is there a way to change the number of iterations?

The model.train_model(train_df) line starts 70,000 iterations. I can't find a parameter to change that.

Is it at all possible?

Thanks.
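
If the progress bar counts batches, the number of iterations per epoch is the number of training examples divided by the batch size, rounded up. That interpretation is an assumption, but it matches two logs elsewhere on this page: 300001 examples with 37501 iterations (consistent with a batch size of 8), and 127656 examples with a `train_batch_size` of 2 and 63828 iterations. A sketch of the arithmetic:

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to cover every example once."""
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(300001, 8))
print(iterations_per_epoch(127656, 2))
```

Under this assumption, a larger batch size or a smaller training set reduces the iteration count, and the number of epochs multiplies the total.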

EmptyHeaderError

I'm running into this exception when executing .train_model(). At first I thought the problem came from empty text strings in my train_df, but the error persists after cleaning. Do you have any idea?
Thank you very much for your work!

EmptyHeaderError Traceback (most recent call last)
~/anaconda3/envs/transformers/lib/python3.7/tarfile.py in next(self)
2288 try:
-> 2289 tarinfo = self.tarinfo.fromtarfile(self)
2290 except EOFHeaderError as e:

~/anaconda3/envs/transformers/lib/python3.7/tarfile.py in fromtarfile(cls, tarfile)
1094 buf = tarfile.fileobj.read(BLOCKSIZE)
-> 1095 obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
1096 obj.offset = tarfile.fileobj.tell() - BLOCKSIZE

~/anaconda3/envs/transformers/lib/python3.7/tarfile.py in frombuf(cls, buf, encoding, errors)
1030 if len(buf) == 0:
-> 1031 raise EmptyHeaderError("empty header")
1032 if len(buf) != BLOCKSIZE:

EmptyHeaderError: empty header

During handling of the above exception, another exception occurred:

ReadError Traceback (most recent call last)
~/anaconda3/envs/transformers/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module, **pickle_load_args)
594 try:
--> 595 return legacy_load(f)
596 except tarfile.TarError:

~/anaconda3/envs/transformers/lib/python3.7/site-packages/torch/serialization.py in legacy_load(f)
505
--> 506 with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar,
507 mkdtemp() as tmpdir:

~/anaconda3/envs/transformers/lib/python3.7/tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
1590 raise CompressionError("unknown compression type %r" % comptype)
-> 1591 return func(name, filemode, fileobj, **kwargs)
1592

~/anaconda3/envs/transformers/lib/python3.7/tarfile.py in taropen(cls, name, mode, fileobj, **kwargs)
1620 raise ValueError("mode must be 'r', 'a', 'w' or 'x'")
-> 1621 return cls(name, mode, fileobj, **kwargs)
1622

~/anaconda3/envs/transformers/lib/python3.7/tarfile.py in init(self, name, mode, fileobj, format, tarinfo, dereference, ignore_zeros, encoding, errors, pax_headers, debug, errorlevel, copybufsize)
1483 self.firstmember = None
-> 1484 self.firstmember = self.next()
1485

~/anaconda3/envs/transformers/lib/python3.7/tarfile.py in next(self)
2303 if self.offset == 0:
-> 2304 raise ReadError("empty file")
2305 except TruncatedHeaderError as e:

ReadError: empty file

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
in
3
4 # Train the model
----> 5 model.train_model(train_df)
6
7 # # Evaluate the model

~/anaconda3/envs/transformers/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in train_model(self, train_df, multi_label, output_dir, show_running_loss, args, eval_df)
159
160
--> 161 train_dataset = self.load_and_cache_examples(train_examples)
162 global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss, eval_df=eval_df)
163

~/anaconda3/envs/transformers/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in load_and_cache_examples(self, examples, evaluate, no_cache, multi_label)
410
411 if os.path.exists(cached_features_file) and not args["reprocess_input_data"] and not no_cache:
--> 412 features = torch.load(cached_features_file)
413 print(f"Features loaded from cache at {cached_features_file}")
414 else:

~/anaconda3/envs/transformers/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
424 if sys.version_info >= (3, 0) and 'encoding' not in pickle_load_args.keys():
425 pickle_load_args['encoding'] = 'utf-8'
--> 426 return _load(f, map_location, pickle_module, **pickle_load_args)
427 finally:
428 if new_fd:

~/anaconda3/envs/transformers/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module, **pickle_load_args)
595 return legacy_load(f)
596 except tarfile.TarError:
--> 597 if _is_zipfile(f):
598 # .zip is used for torch.jit.save and will throw an un-pickling error here
599 raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))

~/anaconda3/envs/transformers/lib/python3.7/site-packages/torch/serialization.py in _is_zipfile(f)
73 match = True
74 for magic_byte, read_byte in zip(magic_number, read_bytes):
---> 75 if ord(magic_byte) != ord(read_byte):
76 match = False
77 break

TypeError: ord() expected a character, but string of length 0 found
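
The exception chain includes `ReadError: empty file`, raised while `torch.load` reads `cached_features_file`. This suggests the cached features file exists but is zero bytes long, for example after an interrupted run. That is an inference from the traceback, not a confirmed diagnosis. A minimal sketch of detecting and removing such a file so the features are rebuilt follows; `remove_if_empty` is a hypothetical helper.

```python
import os
import tempfile

def remove_if_empty(path):
    """Delete `path` if it exists and is zero bytes; return True if removed."""
    if os.path.exists(path) and os.path.getsize(path) == 0:
        os.remove(path)
        return True
    return False

# Demonstration with a throwaway zero-byte file standing in for the cache.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "cached_features")
    open(cache_path, "wb").close()          # create an empty file
    removed = remove_if_empty(cache_path)
    still_exists = os.path.exists(cache_path)

print(removed, still_exists)
```

The traceback also shows that the cache is only read when `reprocess_input_data` is false, so setting that option to true should skip the cached file as well.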

Couldn't import ClassificationModel

Couldn't import ClassificationModel due to latest update

<ipython-input-1-1fa568d2355c> in <module>()
      2 get_ipython().system('pip install simpletransformers')
      3 
----> 4 from simpletransformers.classification import ClassificationModel

1 frames
/usr/local/lib/python3.6/dist-packages/simpletransformers/classification/classification_model.py in <module>()
     43 )
     44 
---> 45 from simpletransformers.classification.transformer_models.bert_model import BertForSequenceClassification
     46 from simpletransformers.classification.transformer_models.roberta_model import RobertaForSequenceClassification
     47 from simpletransformers.classification.transformer_models.xlm_model import XLMForSequenceClassification

ModuleNotFoundError: No module named 'simpletransformers.classification.transformer_models'

Multi class classification error

When I try to run sentiment analysis as a multi-class classification problem, I get the following error.

RuntimeError Traceback (most recent call last)
in ()
1 model = TransformerModel('bert', 'bert-base-cased',args={'fp16':False})
----> 2 model.train_model(train_df)

6 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in convert(t)
428
429 def convert(t):
--> 430 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
431
432 return self._apply(convert)

RuntimeError: CUDA error: device-side assert triggered

Instructions to use pre-trained models

Given the current documentation, it is not clear to me how to use a pre-trained model.
My understanding is that using the following will not download the pre-trained model:

model = ClassificationModel('bert', 'bert-base-multilingual-cased')

If this understanding is correct, how can I download a pre-trained model and use it in simpletransformers? Clear instructions on how to do that would be most welcome.

Error while installing simpletransformers

I get an error while installing simpletransformers. The command I ran and the error are given below. Could you suggest a solution? I am also getting an error while installing apex.

pip install simpletransformers
ERROR: Could not find a version that satisfies the requirement simpletransformers (from versions: none)
ERROR: No matching distribution found for simpletransformers

Early stopping

Hi,

Thanks for this great library. It works nicely. Is there any chance you could add an "early stopping" feature? At the moment I struggle to monitor my training to avoid overfitting, with only the running loss being displayed.

Thanks!

can you support albert_zh?

Many students do not have a large GPU, so they use a small model (e.g. ALBERT). I hope you will support loading ALBERT. Thanks so much!

File write error while training

I received this error while training:

FileNotFoundError Traceback (most recent call last)
in
6
7 # Train the model
----> 8 model.train_model(train_df)
9
10 # Evaluate the model

~\Miniconda3\envs\transformers\lib\site-packages\simpletransformers\model.py in train_model(self, train_df, output_dir, show_running_loss, args)
140
141 train_dataset = self.load_and_cache_examples(train_examples)
--> 142 global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss)
143
144 if not os.path.exists(output_dir):

~\Miniconda3\envs\transformers\lib\site-packages\simpletransformers\model.py in train(self, train_dataset, output_dir, show_running_loss)
405 # Take care of distributed/parallel training
406 model_to_save = model.module if hasattr(model, "module") else model
--> 407 model_to_save.save_pretrained(output_dir)
408
409 return global_step, tr_loss / global_step

~\Miniconda3\envs\transformers\lib\site-packages\transformers\modeling_utils.py in save_pretrained(self, save_directory)
202 # If we save using the predefined names, we can load using from_pretrained
203 output_model_file = os.path.join(save_directory, WEIGHTS_NAME)
--> 204 torch.save(model_to_save.state_dict(), output_model_file)
205 logger.info("Model weights saved in {}".format(output_model_file))
206

~\Miniconda3\envs\transformers\lib\site-packages\torch\serialization.py in save(obj, f, pickle_module, pickle_protocol)
222 >>> torch.save(x, buffer)
223 """
--> 224 return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
225
226

~\Miniconda3\envs\transformers\lib\site-packages\torch\serialization.py in _with_file_like(f, mode, body)
145 (sys.version_info[0] == 3 and isinstance(f, pathlib.Path)):
146 new_fd = True
--> 147 f = open(f, mode)
148 try:
149 return body(f)

FileNotFoundError: [Errno 2] No such file or directory: 'outputs/checkpoint-2000\checkpoint-4000\checkpoint-6000\checkpoint-8000\checkpoint-10000\checkpoint-12000\checkpoint-14000\checkpoint-16000\checkpoint-18000\checkpoint-20000\checkpoint-22000\checkpoint-24000\pytorch_model.bin'

It looks like it's creating another subdirectory inside the previous checkpoint directory every 2000 iterations. I'm assuming the path eventually becomes too long for Windows and it fails. Any advice?
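
The path in the error is what results when each checkpoint directory is built from the previous checkpoint directory rather than from the base output directory. Whether that is what the library does is an assumption based only on the path shown. A minimal sketch of the two patterns:

```python
import os

base_output_dir = "outputs"
steps = [2000, 4000, 6000]

# Pattern that nests: the variable holding the output directory is overwritten each time.
output_dir = base_output_dir
nested = []
for step in steps:
    output_dir = os.path.join(output_dir, f"checkpoint-{step}")
    nested.append(output_dir)

# Pattern that stays flat: always join against the unchanged base directory.
flat = [os.path.join(base_output_dir, f"checkpoint-{step}") for step in steps]

print(nested[-1])
print(flat[-1])
```

With the flat pattern the path length stays constant regardless of how many checkpoints are written.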

Problems with Apex

Getting an error when training:

utils line 196, in
"binary": BinaryProcessor
NameError: name 'BinaryProcessor' is not defined

MultiLabel-ClassificationModel : shape is invalid for input of size

I'm trying to reproduce the example "Toxic Comments Dataset - Multilabel Classification", and an error occurs at model.train_model(train_df_0).

Note:
model = MultiLabelClassificationModel('roberta', 'roberta-base', num_labels=6, args={'train_batch_size':2, 'gradient_accumulation_steps':16, 'learning_rate': 3e-5, 'num_train_epochs': 3, 'max_seq_length': 512, 'reprocess_input_data': True})

train_df_0.shape = (127656, 7)

Here is the message:
Converting to features started.
HBox(children=(IntProgress(value=0, max=127656), HTML(value='')))
HBox(children=(IntProgress(value=0, description='Epoch', max=3, style=ProgressStyle(description_width='initial…
HBox(children=(IntProgress(value=0, description='Current iteration', max=63828, style=ProgressStyle(descriptio…
Traceback (most recent call last):
File "C:\Users\GPU\Anaconda3\envs\tf2_simple\lib\site-packages\simpletransformers\custom_models\models.py", line 114, in forward
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1, self.num_labels))

RuntimeError: shape '[-1, 6]' is invalid for input of size 2
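
The message `shape '[-1, 6]' is invalid for input of size 2` is consistent with each example carrying a single label value (2 values for a batch of 2) rather than a list of 6 values. Given that `train_df_0` has 7 columns, one possible cause is that the six label columns were not combined into one list-valued label column. This is an assumption about the data. A minimal sketch of the combination step on plain rows follows; the column layout and `combine_labels` are hypothetical.

```python
# Each row: text followed by six 0/1 label columns (hypothetical layout).
rows = [
    ("you are great", 0, 0, 0, 0, 0, 0),
    ("some toxic comment", 1, 0, 1, 0, 1, 0),
]

def combine_labels(row, num_labels=6):
    """Return (text, [label_1, ..., label_n]) from a flat row."""
    text, *labels = row
    if len(labels) != num_labels:
        raise ValueError(f"expected {num_labels} labels, got {len(labels)}")
    return text, list(labels)

examples = [combine_labels(row) for row in rows]
print(examples[1])
```

After this step each example has exactly two fields, the text and a list of six labels.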

runtime error torch.cuda.FloatTensor Torch.FloatTensor mismatch

Hi, I have set up apex, and I am trying to run the example code, but there appears to be a conflict.

RuntimeError: Found param roberta.embeddings.word_embeddings.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you need to provide a model with parameters
located on a CUDA device before passing it no matter what optimization level
you chose. Use model.to('cuda') to use the default device.

Problem while trying to load saved models

I get the following error after trying to load the classification model with:

model = ClassificationModel('bert', 'outputs/', use_cuda=False)

RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
size mismatch for classifier.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).
size mismatch for classifier.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).

I'm doing multi-class classification with BERT base uncased (3 classes).

Mismatched lists length when using predict function

I tried to use the model.predict() function. The length of the input list (of strings) is 400, yet the length of predictions is only 320. May I know what is wrong? Besides, would you please add accuracy and f1 score as default metrics too? Thanks!

Explanation of different models

Hi, after looking deep into Hugging Face's documentation, I noticed that there are BertModel and BertForSequenceClassification, and also XLMModel and XLMForSequenceClassification. What is the difference between ...Model and ...ForSequenceClassification? If I just want to do multi-label classification, which is better for me to use?

Originally posted by @QAQOAO in #30 (comment)

Moved to new issue since it's unrelated to the original thread

error when running the code with multilingual BERT

Is mBERT supported?
Traceback (most recent call last):
File "run.py", line 21, in
model = ClassificationModel('bert', 'model/', args=({'fp16': False}))
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 69, in init
self.model = model_class.from_pretrained(model_name, num_labels=num_labels)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/modeling_utils.py", line 345, in from_pretrained
state_dict = torch.load(resolved_archive_file, map_location='cpu')
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 426, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 603, in _load
magic_number = pickle_module.load(f, **pickle_load_args)
MemoryError

I just downloaded the mBERT model (uncased_L-12_H-768_A-12: bert_config.json; bert_model.ckpt.data-00000-of-00001; bert_model.ckpt.index; bert_model.ckpt.meta; vocab.txt) and put the files in the model/ directory. The data set and the rest of the code are the same. How can I deal with this error? Thanks a lot!

getting error while making model

Hello
I am getting this error while creating model

ValueError: 'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False.

support for multitask learning

Do you support multitask learning? How could I accomplish this, using your library?

Can the user choose which columns are used as text and label?

By default, the framework treats the first column as the text and the second as the label. Could you please provide a parameter to select which column plays which role? Thanks a lot!
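
Until such a parameter exists, one workaround is to reorder the data before training so that the text comes first and the label second. The sketch below does this with the standard library only. The function name `to_text_label_rows`, the column names, and the sample data are all hypothetical, and it assumes the labels are integers stored in a CSV with a header row.

```python
import csv
import io


def to_text_label_rows(csv_text, text_column, label_column):
    """Return [text, label] rows (text first) from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [[row[text_column], int(row[label_column])] for row in reader]


# Hypothetical data: the label comes before the text and there is an extra "id" column.
raw = "id,sentiment,review\n1,1,Great film\n2,0,Too long\n"
rows = to_text_label_rows(raw, text_column="review", label_column="sentiment")
print(rows)
```

The reordered rows can be wrapped in a pandas DataFrame (for example `pandas.DataFrame(rows)`) so that the first column holds the text and the second the label.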

Unable to import XLNet or DistilBERT models for NER training

A KeyError is thrown when trying to load either XLNet or DistilBERT. It runs fine with any BERT (cased/uncased) model.

# Code
from simpletransformers.ner import NERModel

model = NERModel('xlnet', 'xlnet-large-cased', use_cuda=True)
model = NERModel('distilbert', 'distilbert-base-uncased-distilled-squad', use_cuda=True)

Ran on Google Colab.

Getting an error while executing code

Hi,
I executed the code and got the error below.

I have attached the code file that I ran.

ValueError Traceback (most recent call last)
in
----> 1 model.train_model(train_df)

~\Anaconda3\envs\simpletransformers\lib\site-packages\simpletransformers\classification\multi_label_classification_model.py in train_model(self, train_df, multi_label, output_dir, show_running_loss, args)
98
99 def train_model(self, train_df, multi_label=True, output_dir=None, show_running_loss=True, args=None):
--> 100 return super().train_model(train_df, multi_label=multi_label, output_dir=output_dir, show_running_loss=show_running_loss, args=args)
101
102 def eval_model(self, eval_df, multi_label=True, output_dir=None, verbose=False, **kwargs):

~\Anaconda3\envs\simpletransformers\lib\site-packages\simpletransformers\classification\classification_model.py in train_model(self, train_df, multi_label, output_dir, show_running_loss, args, eval_df)
159
160
--> 161 train_dataset = self.load_and_cache_examples(train_examples)
162 global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss, eval_df=eval_df)
163

~\Anaconda3\envs\simpletransformers\lib\site-packages\simpletransformers\classification\multi_label_classification_model.py in load_and_cache_examples(self, examples, evaluate, no_cache, multi_label)
107
108 def load_and_cache_examples(self, examples, evaluate=False, no_cache=False, multi_label=True):
--> 109 return super().load_and_cache_examples(examples, evaluate=evaluate, no_cache=no_cache, multi_label=multi_label)
110
111 def compute_metrics(self, preds, labels, eval_examples, multi_label=True, **kwargs):

~\Anaconda3\envs\simpletransformers\lib\site-packages\simpletransformers\classification\classification_model.py in load_and_cache_examples(self, examples, evaluate, no_cache, multi_label)
444
445 if output_mode == "classification":
--> 446 all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
447 elif output_mode == "regression":
448 all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.float)

ValueError: too many dimensions 'str'
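
The last frame shows torch.tensor being built from the label_id values, and PyTorch reports "too many dimensions 'str'" when it is handed strings. One possible cause, and it is only an assumption here since the attached code is not shown, is that the label column holds strings such as "[1, 0, 1]" (for example after a round trip through a CSV file) rather than lists of integers. A minimal sketch of converting such values is below; `parse_label` is a hypothetical helper, not a simpletransformers function.

```python
import ast


def parse_label(value):
    """Turn a label such as "[1, 0, 1]" or "1" into ints; leave real numbers and lists usable."""
    if isinstance(value, str):
        # literal_eval only parses Python literals, so it is safe on untrusted text.
        value = ast.literal_eval(value.strip())
    if isinstance(value, (list, tuple)):
        return [int(v) for v in value]
    return int(value)


# Hypothetical label column as it might look after being read back from a CSV file.
labels = ["[1, 0, 1]", "[0, 0, 1]"]
clean = [parse_label(v) for v in labels]
print(clean)
```

Labels that are names rather than numbers (for example "positive") are not handled by this sketch; they would need an explicit name-to-integer mapping instead.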

Segment mask

I cannot encode segment masks for two-sentence input. How can I do that?
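
For reference, a sketch of how BERT-style segment ids (also called token type ids) are laid out for a sentence pair: the first sentence with its [CLS] and [SEP] gets 0, the second sentence with its closing [SEP] gets 1. This shows the layout only and is not the library's own code; `build_pair_inputs` is a hypothetical name, and the pre-split word lists stand in for real WordPiece tokens. Other architectures place their special tokens differently.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Return (tokens, segment_ids) for a BERT-style sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + sentence A + first [SEP]; segment 1 covers sentence B + final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Hypothetical example pair, already split into tokens.
tokens, segment_ids = build_pair_inputs(["how", "are", "you"], ["fine", "thanks"])
for token, segment in zip(tokens, segment_ids):
    print(segment, token)
```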

Simple Transformers for QnA tasks

How can I customize the simpletransformers library for question answering tasks? I have a dataset in SQuAD format. Please help.

Can I use custom pre-trained models?

I want to do classification on Chinese texts. Although a Chinese BERT pre-trained model is available by default, it is quite outdated. I would like to select different pre-trained models, such as wwm BERT, XLNet, and RoBERTa, which were trained on Chinese corpora. They have already been converted to the pytorch-transformers format. Can I use them with simpletransformers? Thanks!