kaushaltrivedi / bert-toxic-comments-multilabel

Multilabel classification for Toxic comments challenge using Bert

Jupyter Notebook 100.00%

bert-toxic-comments-multilabel's Issues

AttributeError: 'BertForMultiLabelSequenceClassification' object has no attribute 'module'

When I run the notebook up to

model.module.unfreeze_bert_encoder()

I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-48-e5502767395c> in <module>
----> 1 model.module.unfreeze_bert_encoder()

c:\users\jiang\.conda\envs\python3.6\lib\site-packages\torch\nn\modules\module.py in __getattr__(self, name)
    946                 return modules[name]
    947         raise AttributeError("'{}' object has no attribute '{}'".format(
--> 948             type(self).__name__, name))
    949 
    950     def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:

AttributeError: 'BertForMultiLabelSequenceClassification' object has no attribute 'module'

What am I missing?

FileNotFoundError: [Errno 2] No such file or directory: 'BERT\\pytorch_model.bin'

I tried to rerun the code and got the FileNotFoundError above. Where can I download pytorch_model.bin, or what method do I need to use to obtain it? I can't find anything helpful online.

I tried to download the model with "BertModel.from_pretrained('bert-base-uncased')", but couldn't find pytorch_model.bin.

What is in the classes.txt file?

Great work, thanks for sharing, @kaushaltrivedi. I am trying to run this code but am having issues with the input data. I downloaded the dataset from Kaggle, but some files, such as classes.txt, appear to be missing. Could you please explain their format?

Thanks.
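
Judging from the get_labels code quoted in the "Wrong number of labels" issue on this page, classes.txt is read with header=None and its first column is used as the label list, which suggests one label name per line and no header row. A minimal sketch of writing and reading such a file, assuming the labels are the six label columns of the Kaggle Toxic Comment train.csv; the helper names and the temporary directory are hypothetical and not part of the notebook:

```python
import os
import tempfile

# Label columns of the Kaggle Toxic Comment train.csv (assumed to be the targets).
TOXIC_LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def write_classes_file(data_dir, labels):
    """Write one label per line, without a header row."""
    path = os.path.join(data_dir, "classes.txt")
    with open(path, "w", encoding="utf-8") as f:
        for label in labels:
            f.write(label + "\n")
    return path


def read_classes_file(path):
    """Read the labels back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Hypothetical data directory; replace with the notebook's data_dir.
data_dir = tempfile.mkdtemp()
classes_path = write_classes_file(data_dir, TOXIC_LABELS)
labels = read_classes_file(classes_path)
print(len(labels), labels)
```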

Re: Installation of apex

Hi,

I am having trouble installing apex.

As given in the blog, I tried the following commands:

!git clone https://github.com/NVIDIA/apex
cd apex
!pip install -v --no-cache-dir --global-option="--pyprof" --global-option="--cpp_ext" --global-option="--cuda_ext"

But the installation ends with this message:

Kindly help as I am stuck.

Thanks,
Deepti

Unable to freeze/unfreeze

model.module.freeze_bert_encoder() and model.module.unfreeze_bert_encoder() produce an error. Calling those methods from model works fine.
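
One likely explanation, offered as an assumption rather than something confirmed in this thread: the `.module` attribute exists only on a wrapper such as torch.nn.DataParallel, which keeps the wrapped model there. When the model is not wrapped (for example on a single GPU or on CPU), the methods live on the model itself, which matches the observation above. A minimal sketch of a helper that handles both cases; `unwrap_model` is a hypothetical name, and plain Python stand-in classes are used so the snippet runs without PyTorch:

```python
def unwrap_model(model):
    """Return model.module if the model is wrapped, otherwise the model itself."""
    return getattr(model, "module", model)


class FakeBertClassifier:
    """Stand-in for BertForMultiLabelSequenceClassification (illustration only)."""

    def __init__(self):
        self.encoder_trainable = False

    def unfreeze_bert_encoder(self):
        self.encoder_trainable = True


class FakeDataParallel:
    """Stand-in for a wrapper that stores the real model in .module."""

    def __init__(self, module):
        self.module = module


plain = FakeBertClassifier()
wrapped = FakeDataParallel(FakeBertClassifier())

# The same call works whether or not the model is wrapped.
unwrap_model(plain).unfreeze_bert_encoder()
unwrap_model(wrapped).unfreeze_bert_encoder()
print(plain.encoder_trainable, wrapped.module.encoder_trainable)
```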

Tips how to use this for multi-class multi-label classification

Dear @kaushaltrivedi,

Apart from the loss function you mentioned in the blog, what are your suggestions for using this implementation for multi-class multi-label classification?

Thanks.

What is in the 'val.csv' file?

Hi, I am trying to run the BERT multilabel classification and was wondering what the 'val.csv' file contains. Thanks :)

Does anyone know how to fix the Segmentation fault (core dumped)?

I run the code in an IPython notebook one cell at a time. When it reaches the fit() function, training starts, but then it fails with Segmentation fault (core dumped). Does anyone know how to solve it? Thanks a lot!

Change `PreTrainedBertModel` to `BertPreTrainedModel`

Hi, thanks for sharing this wonderful work.
I recently tried to re-run the code and could not import from pytorch_pretrained_bert correctly.
I figured out that the class in huggingface/pytorch-pretrained-BERT has been renamed from PreTrainedBertModel to BertPreTrainedModel. Just a reminder for anyone facing the same issue.
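
Code that has to run against both the old and the new class name can look the class up by trying each name in turn. A minimal sketch, assuming the class lives in `pytorch_pretrained_bert.modeling` under one of the two names mentioned above; `first_available` is a hypothetical helper, and the runnable demonstration uses a standard-library module so the snippet does not require the BERT package:

```python
import importlib


def first_available(module_name, candidate_names):
    """Return the first attribute from candidate_names that the module defines."""
    module = importlib.import_module(module_name)
    for name in candidate_names:
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(
        "none of {} found in {}".format(candidate_names, module_name)
    )


# Intended use (requires pytorch_pretrained_bert to be installed):
# BertPreTrainedModel = first_available(
#     "pytorch_pretrained_bert.modeling",
#     ["BertPreTrainedModel", "PreTrainedBertModel"],
# )

# Demonstration with a standard-library module so the snippet runs anywhere.
ordered_dict_cls = first_available("collections", ["NoSuchName", "OrderedDict"])
print(ordered_dict_cls.__name__)
```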

Wrong number of labels

The function get_labels is used to get the labels from the source csv files, and the length of this is used to get the size of the last layer in the model. However, the first column is the ID of the document, not a label, so using this results in a size mismatch in the model, which is unable to train.

if self.labels == None: self.labels = list(pd.read_csv(os.path.join(self.data_dir, "classes.txt"),header=None)[0].values)

Removing the first value (or setting num_labels = len(labels) - 1) fixes this problem.
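
An alternative way to avoid counting the ID column is to derive the label names from the CSV header and exclude the non-label columns before computing num_labels. A minimal sketch, assuming the Kaggle Toxic Comment column layout (`id`, `comment_text`, then the label columns); `label_columns` is a hypothetical helper, not part of the notebook:

```python
import csv
import io

# Columns of the Kaggle train.csv that are not labels (assumed layout).
NON_LABEL_COLUMNS = ("id", "comment_text")


def label_columns(csv_file, non_label_columns=NON_LABEL_COLUMNS):
    """Return the header columns that are labels, in file order."""
    header = next(csv.reader(csv_file))
    return [name for name in header if name not in non_label_columns]


# In-memory sample standing in for the first lines of train.csv.
sample = io.StringIO(
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "0001,some comment,0,0,0,0,0,0\n"
)
labels = label_columns(sample)
num_labels = len(labels)  # size of the classifier's output layer
print(num_labels, labels)
```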

FileNotFoundError: File val.csv does not exist

While running the notebook I'm stuck at the above mentioned error. The code is:

# Eval Fn

eval_examples = processor.get_dev_examples(args['data_dir'], size=args['val_size'])
def eval():
......

Error:
FileNotFoundError Traceback (most recent call last)
in
1 # Eval Fn
----> 2 eval_examples = processor.get_dev_examples(args['data_dir'], size=args['val_size'])
3 def eval():
4 args['output_dir'].mkdir(exist_ok=True)
5

in get_dev_examples(self, data_dir, size)
22 filename = 'val.csv'
23 if size == -1:
---> 24 data_df = pd.read_csv(os.path.join(data_dir, filename))
25 # data_df['comment_text'] = data_df['comment_text'].apply(cleanHtml)
26 return self._create_examples(data_df, "dev")

/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
676 skip_blank_lines=skip_blank_lines)
677
--> 678 return _read(filepath_or_buffer, kwds)
679
680 parser_f.__name__ = name

/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
438
439 # Create the parser.
--> 440 parser = TextFileReader(filepath_or_buffer, **kwds)
441
442 if chunksize or iterator:

/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
785 self.options['has_index_names'] = kwds['has_index_names']
786
--> 787 self._make_engine(self.engine)
788
789 def close(self):

/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
1012 def _make_engine(self, engine='c'):
1013 if engine == 'c':
-> 1014 self._engine = CParserWrapper(self.f, **self.options)
1015 else:
1016 if engine == 'python':

/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1706 kwds['usecols'] = self.usecols
1707
-> 1708 self._reader = parsers.TextReader(src, **kwds)
1709
1710 passed_names = self.names is None

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: File b'kaggle_data/toxic_comments/tmp/val.csv' does not exist
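
The traceback shows that get_dev_examples reads val.csv from the data directory. One way to produce that file, assuming val.csv is meant to be a held-out slice of the Kaggle train.csv with the same columns (an assumption; this thread does not confirm it), is to split train.csv yourself. A minimal standard-library sketch; `split_train_val`, the 10% validation fraction, and the fixed seed are illustrative choices:

```python
import csv
import random


def split_train_val(source_path, train_path, val_path, val_fraction=0.1, seed=42):
    """Shuffle the rows of source_path and write train and validation CSVs.

    Both output files keep the header row of the source file.
    Returns the number of rows written to each file.
    """
    # newline="" lets the csv module handle comments that contain line breaks.
    with open(source_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * val_fraction)
    val_rows, train_rows = rows[:n_val], rows[n_val:]

    for path, subset in ((train_path, train_rows), (val_path, val_rows)):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)
    return len(train_rows), len(val_rows)
```

It could be called as split_train_val("kaggle_train.csv", "train.csv", "val.csv"), where the paths are placeholders for the downloaded Kaggle file and the files inside the notebook's data directory.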

Confused details about label_ids

When reading the code, I found that you pass label_ids as a list. Although you define a dict named 'label_map', you don't convert label_ids to floats. Is anything wrong with that?

Target size is not matching with the input size

I loaded the data (Toxic dataset) and tried to run the model using batch_size_per_gpu = 4,
but I am getting the error below.

ValueError: Target size (torch.Size([0, 6])) must be the same as input size (torch.Size([4, 6]))

Could you please help?

CUDA out of memory. What can I do to improve model performance?

I have a Tesla GPU with only 16 GB of memory, much less than what you used for the experiment described in the Medium article. As a result, I had to reduce the max sequence length from 512 to 128 and the batch size from 32 to 16. After 4 epochs, the validation accuracies for the various toxic comment categories were around 0.6 to 0.65. I wonder whether increasing the number of epochs would improve performance.

In addition, is there a way to continue training a model? Say the validation results are not good after 4 epochs: can I continue training rather than restart with a larger number of epochs? Is it sufficient to just rerun `fit()`?

Thanks !

Is there a way to do the fitting on an already trained model?

I would like to split my data and train the model in small chunks. After one training run, once the model is saved, I would like to load that model and, instead of making predictions, train it further. Is that possible?

@kaushaltrivedi thanks in advance

Problem Loading Bert Weights

I am having trouble with this line:

model = BertForMultiLabelSequenceClassification.from_pretrained(bert_model_path, num_labels=num_labels)

Where bert_model_path is a path to a pytorch_model.bin.tar.gz file.

First, I get a complaint that the bert_config.json file (in the same folder) is not in the new temp folder. If I move it there manually, I get an error (an INFO message really) saying:

Weights of MultiLabelBert not initialized from pretrained model

Is this a bug or am I missing something?

Issue in importing Apex

I am having issues importing apex. I get an error similar to the ones posted in the run_classifier repository.

in
----> 1 import apex
2 import pandas as pd
3 import numpy as np
4 import torch
5

/anaconda3/lib/python3.7/site-packages/apex/__init__.py in
16 from apex.exceptions import (ApexAuthSecret,
17 ApexSessionSecret)
---> 18 from apex.interfaces import (ApexImplementation,
19 IApex)
20 from apex.lib.libapex import (groupfinder,

/anaconda3/lib/python3.7/site-packages/apex/interfaces.py in
8 pass
9
---> 10 class ApexImplementation(object):
11 """ Class so that we can tell if Apex is installed from other
12 applications

/anaconda3/lib/python3.7/site-packages/apex/interfaces.py in ApexImplementation()
12 applications
13 """
---> 14 implements(IApex)

/anaconda3/lib/python3.7/site-packages/zope/interface/declarations.py in implements(*interfaces)
481 # the coverage for this block there. :(
482 if PYTHON3:
--> 483 raise TypeError(_ADVICE_ERROR % 'implementer')
484 _implements("implements", interfaces, classImplements)
485

TypeError: Class advice impossible in Python3. Use the @Implementer class decorator instead.

Could anyone help me with this?
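
The traceback passes through apex/interfaces.py and zope.interface, which suggests that the `apex` package being imported is an unrelated package of the same name rather than NVIDIA's mixed-precision library from https://github.com/NVIDIA/apex. This is an inference from the traceback, not something confirmed in this thread. A minimal sketch for inspecting which distribution is installed without importing it, using importlib.metadata (standard library from Python 3.8; on the Python 3.7 shown in the traceback, the importlib_metadata backport would be needed); `describe_distribution` is a hypothetical helper:

```python
from importlib import metadata


def describe_distribution(name):
    """Return version and summary of an installed distribution, or None."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    return {"version": dist.version, "summary": dist.metadata.get("Summary")}


# Prints None if no distribution named "apex" is installed.
print(describe_distribution("apex"))
```

If the summary describes an unrelated project, removing that package and installing from the NVIDIA repository may resolve the import error.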