thilinarajapakse / pytorch-transformers-classification

Based on the Pytorch-Transformers library by HuggingFace. To be used as a starting point for employing Transformer models in text classification tasks. Contains code to easily train BERT, XLNet, RoBERTa, and XLM models for text classification.

License: Apache License 2.0

Jupyter Notebook 85.27% Python 14.42% Shell 0.31%
pytorch-transformers text-classification natural-language-processing transformer-models huggingface

pytorch-transformers-classification's Introduction

This repository is now deprecated. Please use Simple Transformers instead.

Update Notice

The underlying Pytorch-Transformers library by HuggingFace has been updated substantially since this repo was created. As such, this repo might not be compatible with the current version of the Hugging Face Transformers library. This repo will not be updated further.

I recommend using Simple Transformers (based on the updated Hugging Face Transformers library) as it is regularly maintained, feature rich, and (much) easier to use.

Pytorch-Transformers-Classification

This repository is based on the Pytorch-Transformers library by HuggingFace. It is intended as a starting point for anyone who wishes to use Transformer models in text classification tasks.

Please refer to this Medium article for further information on how this project works.

Check out the new library simpletransformers for one line training and evaluating!

Table of contents

  • Simple Transformers - Ready to use library
  • Quickstart using Colab
  • Setup
  • Usage
  • Current Pretrained Models
  • Custom Datasets
  • Evaluation Metrics
  • Acknowledgements

Simple Transformers - Ready to use library

If you want to go directly to training, evaluating, and predicting with Transformer models, take a look at the Simple Transformers library. It's the easiest way to use Transformers for text classification, with only 3 lines of code required. It's based on this repo but is designed to let you use Transformers without having to worry about the low-level details. However, this ease of use comes at the cost of less control (and visibility) over how everything works.
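As a rough illustration, those three lines look something like the following (a minimal sketch, assuming a pandas DataFrame with a text column and an integer label column; check the Simple Transformers docs for the exact, current API):

    from simpletransformers.classification import ClassificationModel
    import pandas as pd

    # Toy data; in practice this would be your own DataFrame of (text, label) pairs.
    train_df = pd.DataFrame(
        [["Example sentence belonging to class 1", 1], ["Example sentence belonging to class 0", 0]],
        columns=["text", "labels"],
    )

    model = ClassificationModel("roberta", "roberta-base", use_cuda=False)  # any supported model_type / model_name
    model.train_model(train_df)                                             # fine-tune
    result, model_outputs, wrong_predictions = model.eval_model(train_df)   # evaluate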

Quickstart using Colab

Try this Google Colab Notebook for a quick preview. You can run all cells without any modifications to see how everything works. However, due to the 12-hour time limit on Colab instances, the dataset has been undersampled from 500,000 samples to about 5,000 samples. With such a tiny sample size, everything should complete in about 10 minutes.

Setup

With Conda

  1. Install the Anaconda or Miniconda package manager from here.
  2. Create a new virtual environment and install the packages.
    conda create -n transformers python pandas tqdm jupyter
    conda activate transformers
    If using CUDA:
    conda install pytorch cudatoolkit=10.0 -c pytorch
    else:
    conda install pytorch cpuonly -c pytorch
    conda install -c anaconda scipy
    conda install -c anaconda scikit-learn
    pip install pytorch-transformers
    pip install tensorboardX
  3. Clone the repo: git clone https://github.com/ThilinaRajapakse/pytorch-transformers-classification.git
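As an optional sanity check (not part of the original instructions), you can confirm that PyTorch sees the GPU and that pytorch-transformers imports cleanly before moving on:

    import torch
    import pytorch_transformers

    print("torch:", torch.__version__)
    print("pytorch-transformers:", pytorch_transformers.__version__)
    print("CUDA available:", torch.cuda.is_available())  # True only with the cudatoolkit build and a working GPU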

Usage

Yelp Demo

This demonstration uses the Yelp Reviews dataset.

Linux users can execute data_download.sh to download and set up the data files.

If you are doing it manually:

  1. Download Yelp Reviews Dataset.
  2. Extract train.csv and test.csv and place them in the directory data/.

Once the download is complete, you can run the data_prep.ipynb notebook to get the data ready for training.

Finally, you can run the run_model.ipynb notebook to fine-tune a Transformer model on the Yelp Dataset and evaluate the results.

Current Pretrained Models

The table below shows the currently available model types and their models. You can use any of these by setting the model_type and model_name in the args dictionary (an example sketch follows the table). For more information about pretrained models, see the HuggingFace docs.

| Architecture | Model Type | Model Name | Details |
|--------------|------------|------------|---------|
| BERT | bert | bert-base-uncased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased English text. |
| BERT | bert | bert-large-uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text. |
| BERT | bert | bert-base-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased English text. |
| BERT | bert | bert-large-cased | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text. |
| BERT | bert | bert-base-multilingual-uncased | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased text in the top 102 languages with the largest Wikipedias. |
| BERT | bert | bert-base-multilingual-cased | (New, recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased text in the top 104 languages with the largest Wikipedias. |
| BERT | bert | bert-base-chinese | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Chinese Simplified and Traditional text. |
| BERT | bert | bert-base-german-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by Deepset.ai. |
| BERT | bert | bert-large-uncased-whole-word-masking | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text using Whole-Word-Masking. |
| BERT | bert | bert-large-cased-whole-word-masking | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text using Whole-Word-Masking. |
| BERT | bert | bert-large-uncased-whole-word-masking-finetuned-squad | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-uncased-whole-word-masking model fine-tuned on SQuAD. |
| BERT | bert | bert-large-cased-whole-word-masking-finetuned-squad | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-cased-whole-word-masking model fine-tuned on SQuAD. |
| BERT | bert | bert-base-cased-finetuned-mrpc | 12-layer, 768-hidden, 12-heads, 110M parameters. The bert-base-cased model fine-tuned on MRPC. |
| XLNet | xlnet | xlnet-base-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model. |
| XLNet | xlnet | xlnet-large-cased | 24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English model. |
| XLM | xlm | xlm-mlm-en-2048 | 12-layer, 2048-hidden, 16-heads. XLM English model. |
| XLM | xlm | xlm-mlm-ende-1024 | 6-layer, 1024-hidden, 8-heads. XLM English-German multi-language model. |
| XLM | xlm | xlm-mlm-enfr-1024 | 6-layer, 1024-hidden, 8-heads. XLM English-French multi-language model. |
| XLM | xlm | xlm-mlm-enro-1024 | 6-layer, 1024-hidden, 8-heads. XLM English-Romanian multi-language model. |
| XLM | xlm | xlm-mlm-xnli15-1024 | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM on the 15 XNLI languages. |
| XLM | xlm | xlm-mlm-tlm-xnli15-1024 | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM + TLM on the 15 XNLI languages. |
| XLM | xlm | xlm-clm-enfr-1024 | 12-layer, 1024-hidden, 8-heads. XLM English model trained with CLM (Causal Language Modeling). |
| XLM | xlm | xlm-clm-ende-1024 | 6-layer, 1024-hidden, 8-heads. XLM English-German multi-language model trained with CLM (Causal Language Modeling). |
| RoBERTa | roberta | roberta-base | 125M parameters. RoBERTa using the BERT-base architecture. |
| RoBERTa | roberta | roberta-large | 24-layer, 1024-hidden, 16-heads, 355M parameters. RoBERTa using the BERT-large architecture. |
| RoBERTa | roberta | roberta-large-mnli | 24-layer, 1024-hidden, 16-heads, 355M parameters. roberta-large fine-tuned on MNLI. |
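For example, switching the notebook to XLNet only requires changing the relevant entries in the args dictionary (a minimal sketch; run_model.ipynb defines many more keys, which are omitted here):

    args = {
        'model_type': 'xlnet',             # one of: bert, xlnet, xlm, roberta
        'model_name': 'xlnet-base-cased',  # any matching model name from the table above
        # ... the rest of the training arguments from run_model.ipynb ...
    }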

Custom Datasets

When working with your own datasets, you can create a script or notebook similar to data_prep.ipynb that converts the dataset to a Pytorch-Transformers-ready format.

The data needs to be in tsv format, with four columns, and no header.

This is the required structure (a pandas sketch for producing it follows the list).

  • guid: An ID for the row.
  • label: The label for the row (should be an int).
  • alpha: A column of the same letter for all rows. Not used in classification but still expected by the DataProcessor.
  • text: The sentence or sequence of text.
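A minimal sketch of producing such a file with pandas (the raw input file and its column names below are illustrative assumptions, not part of this repo):

    import pandas as pd

    # Assumed raw input: a CSV with 'text' and 'label' columns (illustrative names).
    raw = pd.read_csv("data/my_raw_train.csv")

    train_df = pd.DataFrame({
        "guid": range(len(raw)),                               # an ID for each row
        "label": raw["label"].astype(int),                     # integer class label
        "alpha": ["a"] * len(raw),                             # dummy column expected by the DataProcessor
        "text": raw["text"].str.replace(r"\s+", " ", regex=True),  # keep each sequence on one line
    })

    # Four columns, tab-separated, no header.
    train_df[["guid", "label", "alpha", "text"]].to_csv(
        "data/train.tsv", sep="\t", index=False, header=False
    )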

Evaluation Metrics

The evaluation process in the run_model.ipynb notebook outputs the confusion matrix and the Matthews correlation coefficient (MCC). If you wish to add any more evaluation metrics, simply edit the get_eval_reports() function in the notebook. This function takes the predictions and the ground-truth labels as parameters, so you can add any custom metric calculations to the function as required (see the sketch below).
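For instance, adding accuracy and F1 alongside the existing outputs might look like this (a minimal sketch; the exact name and signature of the function in the notebook may differ slightly):

    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, matthews_corrcoef

    def get_eval_reports(labels, preds):
        # labels: ground-truth label ids, preds: predicted label ids
        tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
        return {
            "mcc": matthews_corrcoef(labels, preds),
            "tp": tp, "tn": tn, "fp": fp, "fn": fn,
            # Custom additions -- any metric that takes (labels, preds) can go here.
            "acc": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds),
        }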

Acknowledgements

None of this would have been possible without the hard work by the HuggingFace team in developing the Pytorch-Transformers library.

pytorch-transformers-classification's People

Contributors

thilinarajapakse


pytorch-transformers-classification's Issues

TypeError: convert_examples_to_features() got an unexpected keyword argument 'sep_token_extra'

I am getting this error when I run:

if args['do_train']:
    train_dataset = load_and_cache_examples(task, tokenizer)
    global_step, tr_loss = train(train_dataset, model, tokenizer)
    logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

The error is raised at the pad_token_segment_id=4 if args['model_type'] in ['xlnet'] else 0) line of the load_and_cache_examples(task, tokenizer, evaluate) function. Do you have any idea what is wrong?
I am just running the cells in the run_model.ipynb.

Guidance on Model Checkpointing and saving models

I am new to PyTorch, and it seems to have simple model save and reload functions. On the other hand, pytorch_transformers has this model.save_pretrained() method.

I am looking to ensemble a few models, so I need to be able to save the last version of each model and load them later. You do this in the eval function - but is there a simpler way that is more PyTorch-like? Also, I do not need checkpoints per se (so I set the flag to false), so how does the eval function figure out which .bin to go after?
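For reference, the pytorch-transformers save/load pattern being discussed looks roughly like this (the paths are illustrative; model_class and tokenizer_class are the classes selected in the notebook):

    # Save a fine-tuned model and its tokenizer to a directory (path is illustrative).
    model.save_pretrained("outputs/")
    tokenizer.save_pretrained("outputs/")

    # Load them back later, e.g. for ensembling or inference.
    model = model_class.from_pretrained("outputs/")
    tokenizer = tokenizer_class.from_pretrained("outputs/")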

Minor error- and some questions

Hi there - thanks for putting this together. Minor issue - you have the train file set to dev.tsv in the data_prep and the colab ipynb.

Second, this might be specific to Windows folk - but using pooling will create issues if the function is called outside an if __name__ == '__main__' guard - you might need to add that.

Third, when and if you get a chance - could you shed some light on the number of layers being fine-tuned, and whether we can specify that - or does this simply add a dense layer with 2 units for binary classification?

Fourth - XLM seemed to fail out - is there a ready reference for models and vocabs, say bert and bert-base-cased? Never mind - here is a short list: https://github.com/huggingface/pytorch-transformers. And the full list: https://huggingface.co/pytorch-transformers/pretrained_models.html

Thanks - you are a champ for doing this.

RuntimeError: Trying to create tensor with negative dimension -1: [-1, 768]

model = model_class.from_pretrained(args['model_name'])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-2a8eccfdb8d2> in <module>
----> 1 model = model_class.from_pretrained(args['model_name'])

C:\Python37\Lib\site-packages\pytorch_transformers\modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    534 
    535         # Instantiate model.
--> 536         model = cls(config, *model_args, **model_kwargs)
    537 
    538         if state_dict is None and not from_tf:

C:\Python37\Lib\site-packages\pytorch_transformers\modeling_xlnet.py in __init__(self, config)
   1108         self.num_labels = config.num_labels
   1109 
-> 1110         self.transformer = XLNetModel(config)
   1111         self.sequence_summary = SequenceSummary(config)
   1112         self.logits_proj = nn.Linear(config.d_model, config.num_labels)

C:\Python37\Lib\site-packages\pytorch_transformers\modeling_xlnet.py in __init__(self, config)
    729         self.n_layer = config.n_layer
    730 
--> 731         self.word_embedding = nn.Embedding(config.n_token, config.d_model)
    732         self.mask_emb = nn.Parameter(torch.FloatTensor(1, 1, config.d_model))
    733         self.layer = nn.ModuleList([XLNetLayer(config) for _ in range(config.n_layer)])

C:\Python37\Lib\site-packages\torch\nn\modules\sparse.py in __init__(self, num_embeddings, embedding_dim, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse, _weight)
     95         self.scale_grad_by_freq = scale_grad_by_freq
     96         if _weight is None:
---> 97             self.weight = Parameter(torch.Tensor(num_embeddings, embedding_dim))
     98             self.reset_parameters()
     99         else:

RuntimeError: Trying to create tensor with negative dimension -1: [-1, 768]

How to run utils.py?

Is there any specific parameter that we should pass to run utils.py? Running the file by itself seems to do nothing. It does not create the InputFeature objects or anything. What should we expect utils.py to return or create?
I have already done the data preprocessing part and generated the train.tsv and dev.tsv files. They are both in my data/ folder.

Extensions

Thanks to your help - I have added custom losses, special initialization and a bunch of other things as extensions.

I am now trying to mess with the sentence classification model itself. It is a linear layer on top of the bert model. What I would like to do is a) freeze all of bert. b) add a cnn over and above. https://github.com/Shawn1993/cnn-text-classification-pytorch/blob/master/model.py

I want to compare results with a frozen and an unfrozen BERT. Any pointers would be most appreciated.

'math' is not defined

In run_model.ipynb, import math is missing from the first cell, which leads to a later error in the train function.

Regarding number of train test samples

  • My train file has 8000 sentences, but when I ran this code it shows number of samples = 817

INFO:main:Creating features from dataset file at data/
8000
817
100%|██████████| 817/817 [00:01<00:00, 537.09it/s]
INFO:main:Saving features into cached file data/cached_train_bert-base-multilingual-cased_128_binary
INFO:main:***** Running training *****
INFO:main: Num examples = 817
INFO:main: Num Epochs = 35
INFO:main: Total train batch size = 8
INFO:main: Gradient Accumulation steps = 1
INFO:main: Total optimization steps = 3605

  • Similarly, my test file has 2000 sentences, but when the evaluation code was executed it showed num examples = 18

INFO:main:Evaluate the following checkpoints: ['outputs/checkpoint-2000', 'outputs']
INFO:main:Creating features from dataset file at data/
2000
18
100%|██████████| 18/18 [00:00<00:00, 148.27it/s]
INFO:main:Saving features into cached file data/cached_dev_bert-base-multilingual-cased_128_binary
INFO:main:***** Running evaluation 2000 *****
INFO:main: Num examples = 18
INFO:main: Batch size = 8
Evaluating
100% 3/3 [00:00<00:00, 7.02it/s]
INFO:main:***** Eval results 2000 *****
INFO:main: fn = 4
INFO:main: fp = 3
INFO:main: mcc = 0.20385887657505022
INFO:main: tn = 7
INFO:main: tp = 4

INFO:main:Loading features from cached file data/cached_dev_bert-base-multilingual-cased_128_binary
INFO:main:***** Running evaluation outputs *****
INFO:main: Num examples = 18
INFO:main: Batch size = 8
Evaluating
100% 3/3 [00:00<00:00, 7.51it/s]

INFO:main:***** Eval results outputs *****
INFO:main: fn = 4
INFO:main: fp = 2
INFO:main: mcc = 0.31622776601683794
INFO:main: tn = 8
INFO:main: tp = 4

  • also the final output obtained is

{'fn_2000': 4,
'fn_outputs': 4,
'fp_2000': 3,
'fp_outputs': 2,
'mcc_2000': 0.20385887657505022,
'mcc_outputs': 0.31622776601683794,
'tn_2000': 7,
'tn_outputs': 8,
'tp_2000': 4,
'tp_outputs': 4}

  • But in total my test_df has 2000 sentences; when I add tp+tn+fp+fn I only get 18. Could you please explain this?

where is the positional embedding in the Bert model inputs

First thanks for sharing the code, it's really helpful!!

I have a question from when I tried to use the pretrained BERT on my dataset for sentence classification. I realize that in BERT, the input features should consist of the token embeddings, segment embeddings and position embeddings. But I'm not seeing the positional embedding in your code. In run_model:

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'token_type_ids': batch[2] if args['model_type'] in ['bert', 'xlnet'] else None,  # XLM don't use segment_ids
                  'labels':         batch[3]}
        outputs = model(**inputs)

Or I might miss this detail, could you please tell me whether you implement this, and if so where exactly?

Thanks again and looking forward to your reply!

AttributeError: module 'torch.nn.functional' has no attribute 'one_hot'

Hi, I downloaded and ran your program, and got a training error as above. I have no GPU, so I changed the setup to fp16 = 'false' (xlnet left as your demo choice).

What's the problem?

DarrellWong
code:

if args['do_train']:
    train_dataset = load_and_cache_examples(task, tokenizer)
    global_step, tr_loss = train(train_dataset, model, tokenizer)
    logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
------------------------------------------------------------ output window----
INFO:main:Creating features from dataset file at data/
100%|████████████████████████████████| 560000/560000 [05:07<00:00, 1823.33it/s]
INFO:main:Saving features into cached file data/cached_train_xlnet-base-cased_128_binary
INFO:main:***** Running training *****
INFO:main: Num examples = 560000
INFO:main: Num Epochs = 1
INFO:main: Total train batch size = 8
INFO:main: Gradient Accumulation steps = 1
INFO:main: Total optimization steps = 70000
Epoch: 0%| | 0/1 [00:00<?, ?it/s]

HBox(children=(IntProgress(value=0, description='Iteration', max=70000, style=ProgressStyle(description_width=…
-----------------and then error messages --------------------

AttributeError Traceback (most recent call last)
in
1 if args['do_train']:
2 train_dataset = load_and_cache_examples(task, tokenizer)
----> 3 global_step, tr_loss = train(train_dataset, model, tokenizer)
4 logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

in train(train_dataset, model, tokenizer)
43 'token_type_ids': batch[2] if args['model_type'] in ['bert', 'xlnet'] else None, # XLM don't use segment_ids
44 'labels': batch[3]}
---> 45 outputs = model(**inputs)
46 loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
47 print("\r%f" % loss, end='')

~\AppData\Local\Continuum\anaconda3\envs\transformers\lib\site-packages\torch\nn\modules\module.py in call(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

~\AppData\Local\Continuum\anaconda3\envs\transformers\lib\site-packages\pytorch_transformers\modeling_xlnet.py in forward(self, input_ids, token_type_ids, input_mask, attention_mask, mems, perm_mask, target_mapping, labels, head_mask)
1120 input_mask=input_mask, attention_mask=attention_mask,
1121 mems=mems, perm_mask=perm_mask, target_mapping=target_mapping,
-> 1122 head_mask=head_mask)
1123 output = transformer_outputs[0]
1124

~\AppData\Local\Continuum\anaconda3\envs\transformers\lib\site-packages\torch\nn\modules\module.py in call(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

~\AppData\Local\Continuum\anaconda3\envs\transformers\lib\site-packages\pytorch_transformers\modeling_xlnet.py in forward(self, input_ids, token_type_ids, input_mask, attention_mask, mems, perm_mask, target_mapping, head_mask)
920 # 1 indicates not in the same segment [qlen x klen x bsz]
921 seg_mat = (token_type_ids[:, None] != cat_ids[None, :]).long()
--> 922 seg_mat = F.one_hot(seg_mat, num_classes=2).to(dtype_float)
923 else:
924 seg_mat = None

AttributeError: module 'torch.nn.functional' has no attribute 'one_hot'

SummaryWriter Import Missing from Gist

Hello! The following import line is missing from the Gist you provided in your article. Discovered it was missing when I got an error running the code. I found it in your ipynb instead.

from tensorboardX import SummaryWriter

Could you add it to your Gist? It would be helpful to anyone who is looking through your article and doing good old copy-and-paste. Thanks!

WARMUP_PROPORTION equivalent

Hi,

In the BERT_binary_text_classification repo we used a parameter called WARMUP_PROPORTION (set to 0.1). What is the equivalent in this repo?

Thanks.

Detected call of `lr_scheduler.step()` before `optimizer.step()`

When I train a model here global_step, tr_loss = train(train_dataset, model, tokenizer) I get this warning:

0.767207
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:82: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

This is the screenshot, just in case it helps (the warning is next to the epoch counter).

Ability to pass custom pytorch loss function

I am trying to figure out if I can pass a custom loss to the underlying BERT model. Is it something I can do from your code, or do I need to mess with the BERT models in pytorch_transformers? The issue is I can locate the next sentence, masked label and LM modules, but I'm not sure which one accesses the binary label model. Any tips / suggestions? It would be useful functionality for biased samples...

AttributeError: module 'torch.nn.functional' has no attribute 'one_hot'

if args['do_train']:
    train_dataset = load_and_cache_examples(task, tokenizer)
    global_step, tr_loss = train(train_dataset, model, tokenizer)
    logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

I still get this error when running this step despite having torch 1.3.0 and torchvision 0.4.1. Has anyone found a solution for this error?

Model performance degrades when moved to Multi-GPU

Hi,

When I run your code on multi-GPU, performance degrades severely (compared to the single-GPU version). To make the code multi-GPU compatible, I've only added 2 lines of code:

  • model = torch.nn.DataParallel(model) between your model = model_class.from_pretrained(args['model_name']) and model.to(device) calls

  • loss = loss.mean() after the loss = outputs[0] line in the train function. Do you have any idea how I can get the same (or similar) performance in a multi-GPU setting?

These are the results I got with these two settings:

  • With Multi-GPU training:
    evaluate_loss: = 0.3928874781464829
    fn = 116
    fp = 81
    mcc = 0.5114751200090137
    tn = 1291
    tp = 136

  • With Single-GPU Training:
    evaluate_loss: = 0.39542119007776766
    fn = 82
    fp = 126
    mcc = 0.5465463104769824
    tn = 1246
    tp = 170

Although avg loss values are similar, there are big differences in other metrics.

Validating the model

Hi,

I would like to know if the model over-fits, and also the optimum number of epochs, by plotting accuracy and loss as shown here. Would it be possible to do this using this repo without making too many changes (maybe using the evaluation results as validation)?

Thanks.

Running Inference

Hello Thilina, thank you for this repo.

Question: After I've fine-tuned a roberta model for sentence classification, how do I run inference on a sample sentence in real-time? Is there a specific function in your code base you can point me to?

can I have more columns in the train set

Other than the specified format below, can I have more columns as features?
guid: An ID for the row.
label: The label for the row (should be an int).
alpha: A column of the same letter for all rows. Not used in classification but still expected by the DataProcessor.
text: The sentence or sequence of text.

Eval results outputs

I am trying to reproduce the code with a smaller subset: 100K train and 5K dev examples.
Getting this result:

INFO:main:***** Eval results outputs *****
INFO:main: fn = 0
INFO:main: fp = 2529
INFO:main: mcc = 0.0
INFO:main: tn = 0
INFO:main: tp = 2471

How should I interpret this? What could have gone wrong? Thank you.

Update prerequisite package info: transformers and apex?

  1. Replacing every pytorch_transformers with transformers fixes the problem of creating tensors with negative dimensions (see the sketch after this list).
  2. fp16 requires NVIDIA apex, but can you write something to note that it is not the pip apex package, and point to a way to install it from NVIDIA, or just leave the default fp16=False to circumvent the issue for beginners?
    Thanks
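A minimal sketch of the package rename described in point 1 (assuming the newer transformers package, which keeps compatible class names for the models used here):

    # Before (deprecated package name):
    # from pytorch_transformers import BertForSequenceClassification, BertTokenizer

    # After (renamed package):
    from transformers import BertForSequenceClassification, BertTokenizer

    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")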

passing args

Hi,

I am trying to set reprocess_input_data to True so as to avoid overwriting the features. I do it using the following commands but it still creates a new feature file every time I run the file:

python filename.py reprocess_input_data=True [or]
python filename.py --reprocess_input_data=True [or]
python filename.py reprocess_input_data True [or]
python filename.py --reprocess_input_data True

Thanks!

How to use other models without loading them from internet?

I have a terrible internet connection, so I would like to ask what modifications I should make to use models that I have already downloaded, for example the xlm-mlm-tlm-xnli15-1024 model. Should I create a 'cache' folder in the same directory and put the .bin file there? And what part of the code should I modify to use that model? Thanks

Minor Issue 2 - Reading input files

The data processor function identifies the labels and text by column position.

def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
        guid = "%s-%s" % (set_type, i)
        text_a = line[3]
        label = line[1]
        examples.append(
            InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

This is a problem because pandas is used to generate the tsv files, and across 0.24 and 0.25 there is a difference in the order in which the columns are saved. It might be better to save the column names and directly name the label column. I ran into this issue as I had to operate on two different machines, and on the second machine it would crash, saying label_id was used before being assigned.

File path problem


FileNotFoundError Traceback (most recent call last)
in
1 if args['do_train']:
----> 2 train_dataset = load_and_cache_examples(task, tokenizer)
3 global_step, tr_loss = train(train_dataset, model, tokenizer)
4 logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

in load_and_cache_examples(task, tokenizer, evaluate)
13 logger.info("Creating features from dataset file at %s", args['data_dir'])
14 label_list = processor.get_labels()
---> 15 examples = processor.get_dev_examples(args['data_dir']) if evaluate else processor.get_train_examples(args['data_dir'])
16
17 if __name__ == "__main__":

~\data\utils.py in get_train_examples(self, data_dir)
98 """See base class."""
99 return self._create_examples(
--> 100 self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
101
102 def get_dev_examples(self, data_dir):

~\data\utils.py in _read_tsv(cls, input_file, quotechar)
82 def _read_tsv(cls, input_file, quotechar=None):
83 """Reads a tab separated value file."""
---> 84 with open(input_file, "r", encoding="utf-8-sig") as f:
85 reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
86 lines = []

FileNotFoundError: [Errno 2] No such file or directory: 'data/train.tsv'

I have the train.tsv file under this file path but the code from this step

if args['do_train']:
    train_dataset = load_and_cache_examples(task, tokenizer)
    global_step, tr_loss = train(train_dataset, model, tokenizer)
    logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

is giving me the error above. How can I edit the code so that this error isn't appearing?

task_name

In the args dictionary we have this entry: 'task_name': 'binary'.

Later it is used in here:

task = args['task_name']
processor = processors[task]()

I have tried to change this name (e.g. to yelp) but it gives me an error (not much info in the error, it only shows the name of the task I wrote). With binary it works well. Is it the name for the task (i.e. a description) or maybe the type of text classification task?

Not An issue- Adding metadata

Hello again - I was wondering if you might have pointers on how to incorporate metadata with the text. I think I am good with adding a custom layer on top of BERT. I think I need to figure out how to generate each example so that a part of it goes to BERT and the rest to the other layers on top of BERT. Any ideas? Thanks as always.

AttributeError: 'tuple' object has no attribute 'items'

I got this error after activating the option to evaluate during training:


I have tracked the results variable but I couldn't find the origin. Supposedly it is a dictionary type, as it has been initialized with results = {}.

Available languages

Hi,

Is there an up-to-date place where we can check the pre-trained languages available for the Transformer models (XLNet, RoBERTa, and XLM)?

Thanks

Evaluation results by batch?

Hi,

I am running the roberta model again, and I just realized that the evaluation results are printed several times with different numbers, maybe one per evaluation batch? Is the last one the average?

How to avoid CUDA out of memory error for large batch sizes?

I have two GPUs (2 x NVIDIA Tesla V100) and I'm running the code in run_model.ipynb on Google Cloud. I get a CUDA out of memory exception when I run with a sequence length longer than 128 or with larger batch sizes.

I wonder if I need to make any changes to the code to make it runnable using multiple GPUs? I think I shouldn't get the out of memory error considering the number of GPUs I have and their memory (please correct me if I'm wrong.)

UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`.

Hi,
I am getting this warning:

/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:122: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate

I do not know if this is related to your code. Just wanted to let you know.

How to make predictions

Can you tell me how to make predictions after the model is trained? Does it come with a built-in method like .predict()?

OpenSSL.SSL.Error when pulling xlm model

Not sure if it is a pytorch-transformers issue.
Environment: python 3.7.3, requests 2.22.0, urllib3 1.24.1,

Stack trace:
INFO:pytorch_transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-pytorch_model.bin not found in cache, downloading to /tmp/tmp10cg6rcl
25%|█████████████████████▊ | 676553728/2668510627 [01:07<03:12, 10331611.59B/s]Traceback (most recent call last):
File "transformer-compliance.py", line 99, in
model = model_class.from_pretrained(args['model_name'])
File "/home/anaconda3/lib/python3.7/site-packages/pytorch_transformers/modeling_utils.py", line 452, in from_pretrained
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir)
File "/home/anaconda3/lib/python3.7/site-packages/pytorch_transformers/file_utils.py", line 114, in cached_path
return get_from_cache(url_or_filename, cache_dir)
File "/home/anaconda3/lib/python3.7/site-packages/pytorch_transformers/file_utils.py", line 240, in get_from_cache
http_get(url, temp_file)
File "/home/anaconda3/lib/python3.7/site-packages/pytorch_transformers/file_utils.py", line 180, in http_get
for chunk in req.iter_content(chunk_size=1024):
File "/home/anaconda3/lib/python3.7/site-packages/requests/models.py", line 750, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/home/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 494, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/home/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 442, in read
data = self._fp.read(amt)
File "/home/anaconda3/lib/python3.7/http/client.py", line 447, in read
n = self.readinto(b)
File "/home/anaconda3/lib/python3.7/http/client.py", line 491, in readinto
n = self.fp.readinto(b)
File "/home/anaconda3/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/home/anaconda3/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 294, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/home/anaconda3/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1822, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/home/anaconda3/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1647, in _raise_ssl_error
_raise_current_error()
File "/home/anaconda3/lib/python3.7/site-packages/OpenSSL/_util.py", line 54, in exception_from_error_queue
raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'decryption failed or bad record mac')]

Has anyone encountered this issue? Thanks.
