
hedwig's Introduction

This repo contains PyTorch deep learning models for document classification, implemented by the Data Systems Group at the University of Waterloo.

Models

Each model directory has a README.md with further details.

Setting up PyTorch

Hedwig is designed for Python 3.6 and PyTorch 0.4. PyTorch recommends Anaconda for managing your environment. We'd recommend creating a custom environment as follows:

$ conda create --name castor python=3.6
$ source activate castor

And installing PyTorch as follows:

$ conda install pytorch=0.4.1 cuda92 -c pytorch

Other Python packages we use can be installed via pip:

$ pip install -r requirements.txt

The code depends on data from NLTK (e.g., stopwords), so you'll need to download it. Run the Python interpreter and type these commands:

>>> import nltk
>>> nltk.download()
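
If you prefer a non-interactive setup, the individual resources can be fetched directly; a minimal sketch, assuming only the stopwords list and the punkt tokenizer models are needed (check the dataset classes for any additional resources):

>>> nltk.download('stopwords')
>>> nltk.download('punkt')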

Datasets

There are two ways to download the Reuters, AAPD, and IMDB datasets, along with word2vec embeddings:

Option 1. Our Wasabi-hosted mirror:

$ wget http://nlp.rocks/hedwig -O hedwig-data.zip
$ unzip hedwig-data.zip

Option 2. Our school-hosted repository, hedwig-data:

$ git clone https://github.com/castorini/hedwig.git
$ git clone https://git.uwaterloo.ca/jimmylin/hedwig-data.git

Next, organize your directory structure as follows:

.
├── hedwig
└── hedwig-data

After cloning the hedwig-data repo, you need to unzip the embeddings and run the preprocessing script:

$ cd hedwig-data/embeddings/word2vec
$ tar -xvzf GoogleNews-vectors-negative300.tgz

If you are an internal Hedwig contributor using the machines in the lab, follow the instructions here.

hedwig's People

Contributors

achyudh, ashutosh-adhikari, bazingagin, d1shs0ap, daemon, gauravbaruah, hatianzhang, impavidity, likicode, lintool, meng-f, mikhail-tsir, rosequ, salman1993, tuzhucheng, victor0118

hedwig's Issues

UnpicklingError with different BERT models

Hi, I'm trying to perform document classification for the Hindi language. I want to use BERT models adapted to Hindi and other Indian languages, like muril-base-cased and muril-large-cased.

In order to load them, I downloaded the models into the hedwig-data/models/bert_pretrained directory and added these lines to constants.py:

PRETRAINED_MODEL_ARCHIVE_MAP = {
    ...
    'muril-large-cased': os.path.join(MODEL_DATA_DIR, 'bert_pretrained', 'muril-large-cased'),
    'muril-base-cased': os.path.join(MODEL_DATA_DIR, 'bert_pretrained', 'muril-base-cased')

}
PRETRAINED_VOCAB_ARCHIVE_MAP = {
    ...
    'muril-large-cased': os.path.join(MODEL_DATA_DIR, 'bert_pretrained', 'muril-large-cased', 'vocab.txt'),
    'muril-base-cased': os.path.join(MODEL_DATA_DIR, 'bert_pretrained', 'muril-base-cased', 'vocab.txt')
}

I'm getting the following UnpicklingError, which I think is caused by the version of the transformers package:

.../hedwig$ python -m models.bert --dataset MFIN --model muril-base-cased --max-seq-length 256 --batch-size 8 --lr 2e-5 --epochs 1
Device: CUDA
Number of GPUs: 2
FP16: False
Traceback (most recent call last):
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/twbgmy/play/MFIN/hedwig/models/bert/__main__.py", line 87, in <module>
    model = BertForSequenceClassification.from_pretrained(pretrained_model_path, num_labels=args.num_labels)
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/site-packages/transformers/modeling_utils.py", line 345, in from_pretrained
    state_dict = torch.load(resolved_archive_file, map_location='cpu')
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/site-packages/torch/serialization.py", line 532, in _load
    magic_number = pickle_module.load(f)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.

Am I doing something wrong? I'd appreciate any guidance.

Correct save location for trained model

It seems that the save path is incorrectly documented.

In the args, the default directory is model_checkpoints, while the README.md says models/bert/saves/Reuters/best_model.pt. I tested it on SST-2, and the trained model was saved under model_checkpoints.

Or are these two different things?

See:

parser.add_argument('--save-path', type=str, default=os.path.join('model_checkpoints', 'bert'))

Getting impossible predicted labels (all zeroes) from custom data

Hi, I created a dataset with the following categories:

classDict = {"text/dokujo-tsushin": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001",
"text/it-life-hack": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010",
"text/kaden-channel": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100",
"text/livedoor-homme": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000",
"text/movie-enter": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000",
"text/peachy": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000",
"text/smax": "000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000",
"text/sports-watch": "000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000",
"text/topic-news": "000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000"}

I made sure (using grep) that the train, dev, and test TSV files contain only these arrays of zeros and ones. However, the predicted labels are all zeros. Can you tell me what is likely to cause this strange result? Thanks.

How to feed the document data into DocBert

Hi, I am wondering how you feed documents into BERT. Did you treat a document as one sentence, i.e. [CLS] document1 [SEP]? Or did you split documents into separate sentences? Thank you!

AttributeError: 'Tensor' object has no attribute 'uniform'

Env

  • Windows 10 + Anaconda
  • pytorch 0.4.1

Everything installed as specified in the README.

Error

When executing:

python -m models.han --dataset Reuters --mode rand --batch-size 32 --lr 0.01 --epochs 30 --seed 3435

I get:

(castor) PS C:\Users\piercarlo\Documents\workspace\personal\hedwig\hedwig> python -m models.han --dataset Reuters --mode rand --batch-size 32 --lr 0.01 --epochs 30 --seed 3435
Note: You are using GPU for training
Dataset: Reuters
No. of target classes: 90
No. of train instances 5827
No. of dev instances 1943
No. of test instances 3019
Traceback (most recent call last):
  File "C:\tools\Anaconda3\envs\castor\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\tools\Anaconda3\envs\castor\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\piercarlo\Documents\workspace\personal\hedwig\hedwig\models\han\__main__.py", line 117, in <module>
    model = HAN(config)
  File "C:\Users\piercarlo\Documents\workspace\personal\hedwig\hedwig\models\han\model.py", line 13, in __init__
    self.word_attention_rnn = WordLevelRNN(config)
  File "C:\Users\piercarlo\Documents\workspace\personal\hedwig\hedwig\models\han\word_level_rnn.py", line 15, in __init__
    rand_embed_init = torch.Tensor(words_num, words_dim).uniform(-0.25, 0.25)
AttributeError: 'Tensor' object has no attribute 'uniform'
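
For reference, PyTorch tensors expose only the in-place initializer uniform_ (with a trailing underscore), so a one-character change in word_level_rnn.py is a plausible fix; a minimal sketch, assuming the intent is a random initialization in [-0.25, 0.25]:

# In models/han/word_level_rnn.py: use the in-place uniform_ method,
# which exists on torch.Tensor, instead of the non-existent uniform.
rand_embed_init = torch.Tensor(words_num, words_dim).uniform_(-0.25, 0.25)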

DocRoBERTa

I am a student of NLP and I am studying the castorini/hedwig implementation of DocBERT.

I would like to try using RoBERTa. My question is about the implementation of convert_examples_to_features (in abstract_processor.py) for this goal; I think RoBERTa adds the special tokens in a different way.

After simply swapping in the RoBERTa classes and model data, the Transformers modeling code for RoBERTa warns (see below) that the special tokens are not applied. The code that throws that warning checks whether the first token is 0; currently, it is 3 for [CLS].

Warning: "A sequence with no special tokens has been passed to the RoBERTa model. This model requires special tokens in order to work. Please specify add_special_tokens=True in your tokenize.encode()or tokenizer.convert_tokens_to_ids()."

If I modify that method, do you think it's simply a matter of adding 0 at the beginning of input_ids (after adjusting the length for this additional token) to make it work correctly? I tried it, but it does not get a good score compared to DocBERT.
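
For context rather than an authoritative answer: with huggingface/transformers, the RoBERTa tokenizer can insert its own special tokens, which avoids hand-editing input_ids in convert_examples_to_features. A minimal sketch, assuming the roberta-base checkpoint and the transformers RobertaTokenizer (not part of the current Hedwig code):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# encode() with add_special_tokens=True wraps the sequence as <s> ... </s>,
# so the first id is tokenizer.cls_token_id (0 for roberta-base) rather than
# the BERT-style [CLS] id.
input_ids = tokenizer.encode("a short document", add_special_tokens=True)
assert input_ids[0] == tokenizer.cls_token_id
assert input_ids[-1] == tokenizer.sep_token_id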

Use NLTK sent_tokenize and word_tokenize

We should replace our primitive regex-based tokenization with NLTK's tokenize module in the dataset pre-processing classes (after creating a snapshot release of this repository for the camera-ready).

Code duplication could also be reduced if the pre-processing methods were moved to a util module rather than kept in each dataset class.
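
As a rough illustration of the proposed change (a sketch only, reusing the existing split_sents name but not matching the current implementation):

from nltk.tokenize import sent_tokenize, word_tokenize

def split_sents(text):
    # Hypothetical replacement for the regex-based splitter: sentence-split
    # first, then word-tokenize each sentence. Requires the NLTK punkt models.
    return [word_tokenize(sent) for sent in sent_tokenize(text)]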

Bert Inference Issue

Thanks for the repo. Can you please look into the error below?

$ python -m models.bert --dataset Reuters --model bert-base-uncased --max-seq-length 256 --batch-size 16 --lr 2e-5 --epochs 30 --trained-model model_checkpoints/bert/Reuters/2020-06-21_10-03-33.pt


Device: CUDA
Number of GPUs: 4
FP16: False
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/vivek/hedwig/models/bert/main.py", line 122, in
warmup_steps=args.warmup_proportion * num_train_optimization_steps)
TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'


Please check

DocBERT

I tried to run DocBERT with

python -m models.bert --dataset Reuters --model bert-base-uncased --max-seq-length 256 --batch-size 16 --lr 2e-5 --epochs 30

but got the following error:

r66xu@hedwig:~/RX/hedwig$ python -m models.bert --dataset Reuters --model bert-base-uncased --max-seq-length 256 --batch-size 16 --lr 2e-5 --epochs 30
Device: CUDA
Number of GPUs: 1
FP16: False
Traceback (most recent call last):
  File "/jet/var/python/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/jet/var/python/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/r66xu/RX/hedwig/models/bert/__main__.py", line 75, in <module>
    tokenizer = BertTokenizer.from_pretrained(pretrained_vocab_path)
  File "/jet/var/python/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 282, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/jet/var/python/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 346, in _from_pretrained
    list(cls.vocab_files_names.values())))
OSError: Model name '../hedwig-data/models/bert_pretrained/bert-base-uncased-vocab.txt' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). We assumed '../hedwig-data/models/bert_pretrained/bert-base-uncased-vocab.txt' was a path or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

To fix it, I downloaded the pre-trained BERT model and moved it into hedwig-data; after that, it was able to run. Is this the correct way to fix it?

Memory error with Reuters and models.reg_lstm

Env

  • Ubuntu 18.04.3 (Digital Ocean)
  • 32 GB RAM
  • no GPU
  • Everything installed as per the README

Memory Error

Command executed:

python -m models.reg_lstm --dataset Reuters --mode static --batch-size 32 --lr 0.01 --epochs 30 --bidirectional --num-layers 1 --hidden-dim 512 --wdrop 0.1 --embed-droprate 0.2 --dropout 0.5 --beta-ema 0.99 --seed 3435

It reaches 100% in the progress bar, and then gives "Memory error" as follows:

(castor) root@hedwig:~/hedwig# python -m models.reg_lstm --dataset Reuters --mode rand --batch-size 32 --lr 0.01 --epochs 10 --bidirectional --num-layers 1 --hidden-dim 512 --wdrop 0.1 --embed-droprate 0.2 --dropout 0.5 --beta-ema 0.99 --seed 3435
  0%|                                                                                                 | 0/3000001 [00:00<?, ?it/s]Skipping token 3000000 with 1-dimensional vector ['300']; likely a header
100%|█████████████████████████████████████████████████████████████████████████████████| 3000001/3000001 [05:06<00:00, 9782.60it/s]
Traceback (most recent call last):
  File "/root/anaconda3/envs/castor/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/castor/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/hedwig/models/reg_lstm/__main__.py", line 95, in <module>
    unk_init=UnknownWordVecCache.unk)
  File "/root/hedwig/datasets/reuters.py", line 92, in iters
    vectors = Vectors(name=vectors_name, cache=vectors_cache, unk_init=unk_init)
  File "/root/anaconda3/envs/castor/lib/python3.6/site-packages/torchtext/vocab.py", line 236, in __init__
    self.cache(name, cache, url=url)
  File "/root/anaconda3/envs/castor/lib/python3.6/site-packages/torchtext/vocab.py", line 327, in cache
    self.vectors = torch.Tensor(vectors).view(-1, dim)
MemoryError

Is 32 GB of RAM really not enough for this task? If so, do you have an idea of the minimum RAM requirement?

Note: before getting the memory error, I was getting another error, explained and solved (?) as follows:

TypeError: __init__() got an unexpected keyword argument 'dtype'

Error log:

(castor) root@hedwig:~/hedwig# python -m models.reg_lstm --dataset Reuters --mode rand --batch-size 32 --lr 0.01 --epochs 10 --bidirectional --num-layers 1 --hidden-dim 512 --wdrop 0.1 --embed-droprate 0.2 --dropout 0.5 --beta-ema 0.99 --seed 3435
Traceback (most recent call last):
  File "/root/anaconda3/envs/castor/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/castor/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/hedwig/models/reg_lstm/__main__.py", line 11, in <module>
    from datasets.aapd import AAPD
  File "/root/hedwig/datasets/aapd.py", line 9, in <module>
    from datasets.reuters import clean_string, split_sents
  File "/root/hedwig/datasets/reuters.py", line 126, in <module>
    class ReutersTFIDF(Reuters):
  File "/root/hedwig/datasets/reuters.py", line 128, in ReutersTFIDF
    TEXT_FIELD = Field(sequential=False, use_vocab=False, batch_first=True, preprocessing=load_json, dtype=torch.float)
TypeError: __init__() got an unexpected keyword argument 'dtype'

I simply removed the dtype parameter; I have no idea whether this might be the cause of the memory error above, or whether it otherwise harms training.

Reg-LSTM

To whom can this model be attributed? I would really appreciate a quick response because I am finishing a project report.

Replace model caching mechanism

Due to intermittent network connectivity in the compute clusters, I have been facing issues when fine-tuning BERT models. To avoid this, we should move the pre-trained models to hedwig-data and have the driver method load them from that location rather than downloading them from AWS.

Early stopping criteria

Can someone point me to where I can look to see under what conditions early stopping kicks in? Thanks.

Add type hints

Type hints explicitly constrain arguments' types, aiding the readability and maintainability of a project. AllenNLP is a great example of a clean, research-ready codebase that uses type hints; I think we should follow in their footsteps.
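
As a tiny illustration of the style (the function and its signature are hypothetical, not taken from the codebase):

from typing import List

def batch_lengths(documents: List[List[str]], max_len: int = 256) -> List[int]:
    # With type hints, a reader immediately sees that documents is a list of
    # tokenized documents and that the function returns per-document lengths.
    return [min(len(doc), max_len) for doc in documents]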

Paper of HBERT

Hi,

I was wondering what the original paper for HBERT is.

Many thanks

Getting AttributeError: Can't get attribute 'gelu' on <module 'transformers.modeling_bert'> in /hedwig/models/bert/__main__.py

Hi, I cloned your work and changed the tokenizer and BERT model to Japanese ones. It runs fine on Colab, but when I move it to a local server and load trained models, I get the following error:

File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in run_code
exec(code, run_globals)
File "/docBertJa/models/bert/main.py", line 125, in
model
= torch.load(args.trained_model, map_location=lambda storage, loc: storage)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
AttributeError: Can't get attribute 'gelu' on <module 'transformers.modeling_bert' from '/usr/local/lib/python3.8/dist-packages/transformers/modeling_bert.py'>

Googling leads me to the following page:

pytorch/pytorch#28944

But I still have no idea what to fix. Can you help?
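
Not an official fix, but this class of error usually means the checkpoint was created with torch.save(model, ...) under one transformers version and unpickled under another, where module-level symbols such as transformers.modeling_bert.gelu have moved. A common workaround, shown only as a sketch (the model name and label count are stand-ins), is to persist the state_dict so the checkpoint stores tensors rather than references to library code:

import torch
from transformers import BertForSequenceClassification

MODEL_NAME = 'bert-base-multilingual-cased'  # stand-in for the Japanese model path
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=9)

# Saving: persist only the weights, not a pickle of the whole module graph.
torch.save(model.state_dict(), 'checkpoint.pt')

# Loading (possibly under a different transformers version): rebuild the
# architecture first, then restore the tensors.
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=9)
model.load_state_dict(torch.load('checkpoint.pt', map_location='cpu'))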

Killed message on trying models.reg_lstm

On executing this command:

python -m models.reg_lstm --dataset Reuters --mode static --batch-size 32 --lr 0.01 --epochs 30 --bidirectional --num-layers 1 --hidden-dim 512 --wdrop 0.1 --embed-droprate 0.2 --dropout 0.5 --beta-ema 0.99 --seed 3435

I get the following response:

  0%|         | 0/3000001 [00:00<?, ?it/s]Skipping token 3000000 with 1-dimensional vector ['300']; likely a header
100%██████████████████| 3000001/3000001 [03:33<00:00, 14061.26it/s]
Killed

The process seems to have completed (shows 100%) but there is a Killed message at the end.

Reproducing Setup and README steps

Playing around with the package, I managed to run Kim-CNN but had some minor hiccups:

README

  • The Conv-RNN and Char-CNN links are dead
  • I needed to do conda activate castor instead of source activate castor
  • The README wants me to do nltk.download(), but I don't think that's needed? I didn't do it because I didn't know which datasets to use, and so far it worked...
  • There's no get_trec_eval.sh in utils/

Running a Model

  • The model READMEs tell me to run python -m modelname ..., but I need to run python -m models.modelname ... in the Hedwig directory.
  • I had to manually install tensorboardX to run Kim-CNN (used in common/trainers/classification_trainer.py)
  • The models try to load a GoogleNews-vectors-negative300.txt word2vec file, but there's only a binary one in ../Castor-data/embeddings/word2vec/, so I had to convert it manually (see the sketch after this list). Afterwards I realized there's an arg for that; maybe I could have changed --word-vectors-file to point to the .bin...
  • python -m kim_cnn --trained_model ... should be ... --trained-model ...
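
For the word2vec conversion mentioned above, one possible approach (a sketch using gensim, which Hedwig itself does not require) is:

from gensim.models import KeyedVectors

# Load the binary Google News vectors and re-save them in the plain-text
# word2vec format that the models look for by default.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
vectors.save_word2vec_format(
    'GoogleNews-vectors-negative300.txt', binary=False)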

Remove tensorboardX logging

TensorboardX has changed its API, and this is causing a couple of issues when upgrading to newer versions of the library. Since no one is currently using this feature, we could just remove it.

Migrate from pytorch-pretrained-bert to transformers

Currently, we use huggingface/pytorch-pretrained-bert for our DocBERT implementation. Migrating to huggingface/transformers would give Hedwig access to a wider range of models, such as XLNet and TransformerXL. Document classification is a very good use case for these models, as they solve the fixed-length context problem we have with BERT. Further, we have duplicate copies of huggingface code just sitting around in Hedwig, and it's quite messy.

Custom dataset error

Hi,

I am using a custom dataset and running the reg_lstm model, and I am getting the error below. The other models run fine. Can anybody help me out?

RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0
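
Not a maintainer answer, but this particular RuntimeError comes from PyTorch's pack_padded_sequence, which is typically used in the LSTM path and not in the other models; it fires when some document ends up with length 0 after tokenization. A quick sanity check on a custom TSV (a sketch only; the assumption that the document text is the last tab-separated field may not match your layout):

# Count rows whose text field is empty after stripping; these would become
# zero-length sequences when tokenized.
empty = 0
with open('train.tsv', encoding='utf-8') as f:
    for line in f:
        text = line.rstrip('\n').split('\t')[-1]
        if not text.strip():
            empty += 1
print(f'{empty} empty documents')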

[Question] What are the "robust" datasets?

Hello,

First of all, thanks for these models and datasets; great work, this is very useful!
Second, I am familiar with almost all the datasets you are using except robust04, robust05, and robust45. I was wondering if you could give some more information about those datasets?

Also, if you could give more info on all the datasets (where you got the original data, and some statistics such as the number of training and testing samples), that would be even better!

Thanks a lot

IMDB dataset

Can you specify the source of the IMDB dataset used in this repository?
I have read the HAN paper and found that they used an IMDB dataset from another paper, whose authors constructed the dataset themselves. However, there are other versions of IMDB on the internet (e.g., the Stanford dataset presented in 2011), so I want to know where this IMDB dataset comes from.

fasttext

Is this the Facebook fastText model?

DocBERT model weights

Dear maintainers and authors :)

I just found the DocBERT paper on arXiv and would really like to know if you plan to share/release the weights and config file. I would really like to load them with pytorch-pretrained-BERT and play around with the model ❤️

Thanks in advance,

Stefan

Fine-tuning BERT for a new dataset

Hi,
Is it possible to use the current DocBERT implementation with my own custom dataset, or would that require internal code modifications in Hedwig?
Thanks!

setup

Never mind. Got it to work.

Scaled embedded dropout mask

Hello,

Why is the embedded dropout mask scaled by 1 / (1 - dropout)?

mask = embed.weight.data.new().resize_((embed.weight.size(0), 1)).bernoulli_(1 - dropout).expand_as(embed.weight) / (1 - dropout)

I looked into the references in the corresponding paper, but Gal and Ghahramani (2016) do not use it, and Merity et al. (2018) just state that they do it. So I wonder what the idea behind this step is.
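
For readers with the same question: dividing by (1 - dropout) is the standard inverted-dropout rescaling. Each mask entry is 0 with probability dropout and 1 / (1 - dropout) otherwise, so its expected value is 1, and the expected magnitude of the masked embeddings matches the unmasked ones; no rescaling is then needed at evaluation time. A small numerical check (a standalone sketch, independent of the Hedwig code):

import torch

p = 0.2  # dropout probability
mask = torch.empty(1_000_000).bernoulli_(1 - p) / (1 - p)

# Each entry is 0 with probability p and 1/(1-p) otherwise, so its expectation
# is (1 - p) * 1/(1 - p) = 1, i.e. E[mask * embedding] = embedding.
print(mask.mean())  # ~ 1.0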
