
hedwig's Introduction

This repo contains PyTorch deep learning models for document classification, implemented by the Data Systems Group at the University of Waterloo.

Models

Each model directory has a README.md with further details.

Setting up PyTorch

Hedwig is designed for Python 3.6 and PyTorch 0.4. PyTorch recommends Anaconda for managing your environment. We'd recommend creating a custom environment as follows:

$ conda create --name castor python=3.6
$ source activate castor

And installing PyTorch as follows:

$ conda install pytorch=0.4.1 cuda92 -c pytorch

Other Python packages we use can be installed via pip:

$ pip install -r requirements.txt

The code depends on data from NLTK (e.g., stopwords), so you'll need to download it. Run the Python interpreter and type these commands:

>>> import nltk
>>> nltk.download()
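
If you prefer a non-interactive setup, the individual resources can be fetched directly; a minimal sketch, assuming only the stopwords list and the punkt tokenizer models are needed (check the dataset classes for any additional resources):

>>> nltk.download('stopwords')
>>> nltk.download('punkt')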

Datasets

There are two ways to download the Reuters, AAPD, and IMDB datasets, along with word2vec embeddings:

Option 1. Our Wasabi-hosted mirror:

$ wget http://nlp.rocks/hedwig -O hedwig-data.zip
$ unzip hedwig-data.zip

Option 2. Our school-hosted repository, hedwig-data:

$ git clone https://github.com/castorini/hedwig.git
$ git clone https://git.uwaterloo.ca/jimmylin/hedwig-data.git

Next, organize your directory structure as follows:

.
├── hedwig
└── hedwig-data

After cloning the hedwig-data repo, you need to unzip the embeddings and run the preprocessing script:

$ cd hedwig-data/embeddings/word2vec
$ tar -xvzf GoogleNews-vectors-negative300.tgz

If you are an internal Hedwig contributor using the machines in the lab, follow the instructions here.

hedwig's People

Contributors

achyudh, ashutosh-adhikari, bazingagin, d1shs0ap, daemon, gauravbaruah, hatianzhang, impavidity, likicode, lintool, meng-f, mikhail-tsir, rosequ, salman1993, tuzhucheng, victor0118

hedwig's Issues

UnpicklingError with different BERT models

Hi, I'm trying to perform document classification for the Hindi language. I want to use BERT models adapted to Hindi and other Indian languages, like muril-base-cased and muril-large-cased.

In order to load them, I downloaded the models into the hedwig-data/models/bert_pretrained directory and added these lines to constants.py:

PRETRAINED_MODEL_ARCHIVE_MAP = {
    ...
    'muril-large-cased': os.path.join(MODEL_DATA_DIR, 'bert_pretrained', 'muril-large-cased'),
    'muril-base-cased': os.path.join(MODEL_DATA_DIR, 'bert_pretrained', 'muril-base-cased')

}
PRETRAINED_VOCAB_ARCHIVE_MAP = {
    ...
    'muril-large-cased': os.path.join(MODEL_DATA_DIR, 'bert_pretrained', 'muril-large-cased', 'vocab.txt'),
    'muril-base-cased': os.path.join(MODEL_DATA_DIR, 'bert_pretrained', 'muril-base-cased', 'vocab.txt')
}

I'm getting the following UnpicklingError, which I think is caused by the version of the transformers package:

.../hedwig$ python -m models.bert --dataset MFIN --model muril-base-cased --max-seq-length 256 --batch-size 8 --lr 2e-5 --epochs 1
Device: CUDA
Number of GPUs: 2
FP16: False
Traceback (most recent call last):
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/twbgmy/play/MFIN/hedwig/models/bert/__main__.py", line 87, in <module>
    model = BertForSequenceClassification.from_pretrained(pretrained_model_path, num_labels=args.num_labels)
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/site-packages/transformers/modeling_utils.py", line 345, in from_pretrained
    state_dict = torch.load(resolved_archive_file, map_location='cpu')
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/twbgmy/anaconda3/envs/hindiclass/lib/python3.6/site-packages/torch/serialization.py", line 532, in _load
    magic_number = pickle_module.load(f)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.

Am I doing something wrong? I'd appreciate any guidance.

Correct save location for trained model

It seems that the save path is incorrectly documented.

In the args, the default directory is model_checkpoints, while the README.md says models/bert/saves/Reuters/best_model.pt. I tested it on SST-2, and the trained model was saved under model_checkpoints.

Or are these two different things?

See:

parser.add_argument('--save-path', type=str, default=os.path.join('model_checkpoints', 'bert'))

Getting impossible predicted labels (all zeroes) from custom data

Hi, I created a dataset with the following categories:

classDict = {"text/dokujo-tsushin": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001",
"text/it-life-hack": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010",
"text/kaden-channel": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100",
"text/livedoor-homme": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000",
"text/movie-enter": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000",
"text/peachy": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000",
"text/smax": "000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000",
"text/sports-watch": "000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000",
"text/topic-news": "000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000"}

I made sure (using grep) that the train, dev, and test TSV files contain only these arrays of zeros and ones. However, the predicted labels are all zeros. Can you tell me what is likely to cause this strange result? Thanks.

How to feed the document data into DocBert

Hi, I am wondering how you feed documents into BERT. Did you treat a document as one sentence, i.e. [CLS] document1 [SEP]? Or did you split documents into separate sentences? Thank you!

AttributeError: 'Tensor' object has no attribute 'uniform'

Env

  • Windows 10 + Anaconda
  • pytorch 0.4.1

Everything installed as specified in the README.

Error

When executing:

python -m models.han --dataset Reuters --mode rand --batch-size 32 --lr 0.01 --epochs 30 --seed 3435

I get:

(castor) PS C:\Users\piercarlo\Documents\workspace\personal\hedwig\hedwig> python -m models.han --dataset Reuters --mode rand --batch-size 32 --lr 0.01 --epochs 30 --seed 3435
Note: You are using GPU for training
Dataset: Reuters
No. of target classes: 90
No. of train instances 5827
No. of dev instances 1943
No. of test instances 3019
Traceback (most recent call last):
  File "C:\tools\Anaconda3\envs\castor\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\tools\Anaconda3\envs\castor\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\piercarlo\Documents\workspace\personal\hedwig\hedwig\models\han\__main__.py", line 117, in <module>
    model = HAN(config)
  File "C:\Users\piercarlo\Documents\workspace\personal\hedwig\hedwig\models\han\model.py", line 13, in __init__
    self.word_attention_rnn = WordLevelRNN(config)
  File "C:\Users\piercarlo\Documents\workspace\personal\hedwig\hedwig\models\han\word_level_rnn.py", line 15, in __init__
    rand_embed_init = torch.Tensor(words_num, words_dim).uniform(-0.25, 0.25)
AttributeError: 'Tensor' object has no attribute 'uniform'
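
For reference, PyTorch tensors expose only the in-place initializer uniform_ (with a trailing underscore), so a one-character change in word_level_rnn.py is a plausible fix; a minimal sketch, assuming the intent is a random initialization in [-0.25, 0.25]:

# In models/han/word_level_rnn.py: use the in-place uniform_ method,
# which exists on torch.Tensor, instead of the non-existent uniform.
rand_embed_init = torch.Tensor(words_num, words_dim).uniform_(-0.25, 0.25)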

DocRoBERTa

I am a student of NLP and I am studying the castorini/hedwig implementation of DocBERT.

I would like to try using RoBERTa. My question is about the implementation of convert_examples_to_features (in abstract_processor.py) for this goal; I think RoBERTa adds the special tokens in a different way.

After simply swapping in the RoBERTa classes and model data, the Transformers modeling code for RoBERTa warns (see below) that the special tokens are not applied. The code that throws that warning checks whether the first token is 0; currently, it is 3 for [CLS].

Warning: "A sequence with no special tokens has been passed to the RoBERTa model. This model requires special tokens in order to work. Please specify add_special_tokens=True in your tokenize.encode()or tokenizer.convert_tokens_to_ids()."

If I modify that method, do you think it's simply a matter of adding 0 at the beginning of input_ids (after adjusting the length for this additional token) to make it work correctly? I tried it, but it does not get a good score compared to DocBERT.
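
For context rather than an authoritative answer: with huggingface/transformers, the RoBERTa tokenizer can insert its own special tokens, which avoids hand-editing input_ids in convert_examples_to_features. A minimal sketch, assuming the roberta-base checkpoint and the transformers RobertaTokenizer (not part of the current Hedwig code):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# encode() with add_special_tokens=True wraps the sequence as <s> ... </s>,
# so the first id is tokenizer.cls_token_id (0 for roberta-base) rather than
# the BERT-style [CLS] id.
input_ids = tokenizer.encode("a short document", add_special_tokens=True)
assert input_ids[0] == tokenizer.cls_token_id
assert input_ids[-1] == tokenizer.sep_token_id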

Use NLTK sent_tokenize and word_tokenize

We should replace our primitive regex-based tokenization with NLTK's tokenize module in the dataset pre-processing classes (after creating a snapshot release of this repository for the camera-ready).

Code duplication could also be reduced if the pre-processing methods were moved to a util module rather than kept in each dataset class.
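
As a rough illustration of the proposed change (a sketch only, reusing the existing split_sents name but not matching the current implementation):

from nltk.tokenize import sent_tokenize, word_tokenize

def split_sents(text):
    # Hypothetical replacement for the regex-based splitter: sentence-split
    # first, then word-tokenize each sentence. Requires the NLTK punkt models.
    return [word_tokenize(sent) for sent in sent_tokenize(text)]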

Bert Inference Issue

Thanks for the repo. Can you please look into the error below?

$ python -m models.bert --dataset Reuters --model bert-base-uncased --max-seq-length 256 --batch-size 16 --lr 2e-5 --epochs 30 --trained-model model_checkpoints/bert/Reuters/2020-06-21_10-03-33.pt


Device: CUDA
Number of GPUs: 4
FP16: False
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/vivek/hedwig/models/bert/main.py", line 122, in
warmup_steps=args.warmup_proportion * num_train_optimization_steps)
TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'


Please check

DocBERT

I tried to run DocBERT with

python -m models.bert --dataset Reuters --model bert-base-uncased --max-seq-length 256 --batch-size 16 --lr 2e-5 --epochs 30

but got the following error:

r66xu@hedwig:~/RX/hedwig$ python -m models.bert --dataset Reuters --model bert-base-uncased --max-seq-length 256 --batch-size 16 --lr 2e-5 --epochs 30
Device: CUDA
Number of GPUs: 1
FP16: False
Traceback (most recent call last):
  File "/jet/var/python/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/jet/var/python/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/r66xu/RX/hedwig/models/bert/__main__.py", line 75, in <module>
    tokenizer = BertTokenizer.from_pretrained(pretrained_vocab_path)
  File "/jet/var/python/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 282, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/jet/var/python/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 346, in _from_pretrained
    list(cls.vocab_files_names.values())))
OSError: Model name '../hedwig-data/models/bert_pretrained/bert-base-uncased-vocab.txt' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). We assumed '../hedwig-data/models/bert_pretrained/bert-base-uncased-vocab.txt' was a path or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

To fix it, I downloaded the pre-trained BERT model and moved it into hedwig-data; after that, it was able to run. Is this the correct way to fix it?

Memory error with Reuters and models.reg_lstm

Env

  • Ubuntu 18.04.3 (Digital Ocean)
  • 32 GB RAM
  • no GPU
  • Everything installed as per the README

Memory Error

Command executed:

python -m models.reg_lstm --dataset Reuters --mode static --batch-size 32 --lr 0.01 --epochs 30 --bidirectional --num-layers 1 --hidden-dim 512 --wdrop 0.1 --embed-droprate 0.2 --dropout 0.5 --beta-ema 0.99 --seed 3435

It reaches 100% in the progress bar, and then gives "Memory error" as follows:

(castor) root@hedwig:~/hedwig# python -m models.reg_lstm --dataset Reuters --mode rand --batch-size 32 --lr 0.01 --epochs 10 --bidirectional --num-layers 1 --hidden-dim 512 --wdrop 0.1 --embed-droprate 0.2 --dropout 0.5 --beta-ema 0.99 --seed 3435
  0%|                                                                                                 | 0/3000001 [00:00<?, ?it/s]Skipping token 3000000 with 1-dimensional vector ['300']; likely a header
100%|█████████████████████████████████████████████████████████████████████████████████| 3000001/3000001 [05:06<00:00, 9782.60it/s]
Traceback (most recent call last):
  File "/root/anaconda3/envs/castor/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/castor/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/hedwig/models/reg_lstm/__main__.py", line 95, in <module>
    unk_init=UnknownWordVecCache.unk)
  File "/root/hedwig/datasets/reuters.py", line 92, in iters
    vectors = Vectors(name=vectors_name, cache=vectors_cache, unk_init=unk_init)
  File "/root/anaconda3/envs/castor/lib/python3.6/site-packages/torchtext/vocab.py", line 236, in __init__
    self.cache(name, cache, url=url)
  File "/root/anaconda3/envs/castor/lib/python3.6/site-packages/torchtext/vocab.py", line 327, in cache
    self.vectors = torch.Tensor(vectors).view(-1, dim)
MemoryError

Is 32 GB of RAM really not enough for this task? If so, do you have an idea of the minimum RAM requirement?

Note: before getting the memory error, I was getting another error, explained and solved (?) as follows:

TypeError: __init__() got an unexpected keyword argument 'dtype'

Error log:

(castor) root@hedwig:~/hedwig# python -m models.reg_lstm --dataset Reuters --mode rand --batch-size 32 --lr 0.01 --epochs 10 --bidirectional --num-layers 1 --hidden-dim 512 --wdrop 0.1 --embed-droprate 0.2 --dropout 0.5 --beta-ema 0.99 --seed 3435
Traceback (most recent call last):
  File "/root/anaconda3/envs/castor/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/castor/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/hedwig/models/reg_lstm/__main__.py", line 11, in <module>
    from datasets.aapd import AAPD
  File "/root/hedwig/datasets/aapd.py", line 9, in <module>
    from datasets.reuters import clean_string, split_sents
  File "/root/hedwig/datasets/reuters.py", line 126, in <module>
    class ReutersTFIDF(Reuters):
  File "/root/hedwig/datasets/reuters.py", line 128, in ReutersTFIDF
    TEXT_FIELD = Field(sequential=False, use_vocab=False, batch_first=True, preprocessing=load_json, dtype=torch.float)
TypeError: __init__() got an unexpected keyword argument 'dtype'

I simply removed the dtype parameter; I have no idea whether this might be the cause of the memory error above, or whether it otherwise harms training.

Reg-LSTM

To whom can this model be attributed? I would really appreciate a quick response because I am finishing a project report.

Replace model caching mechanism

Due to intermittent network connectivity in the compute clusters, I have been facing issues when fine-tuning BERT models. To avoid this, we should move the pre-trained models to hedwig-data and have the driver method load them from that location rather than downloading them from AWS.

Early stopping criteria

Can someone point me to where I can look to see under what conditions early stopping kicks in? Thanks.

Add type hints

Type hints explicitly constrain arguments' types, aiding the readability and maintainability of a project. AllenNLP is a great example of a clean, research-ready codebase that uses type hints; I think we should follow in their footsteps.
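
As a tiny illustration of the style (the function and its signature are hypothetical, not taken from the codebase):

from typing import List

def batch_lengths(documents: List[List[str]], max_len: int = 256) -> List[int]:
    # With type hints, a reader immediately sees that documents is a list of
    # tokenized documents and that the function returns per-document lengths.
    return [min(len(doc), max_len) for doc in documents]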

Paper of HBERT

Hi,

I was wondering what the original paper for HBERT is.

Many thanks

Getting AttributeError: Can't get attribute 'gelu' on <module 'transformers.modeling_bert'> in /hedwig/models/bert/__main__.py

Hi, I cloned your work and changed the tokenizer and BERT model to Japanese ones. It runs fine on Colab, but when I move it to a local server and load trained models, I get the following error:

File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in run_code
exec(code, run_globals)
File "/docBertJa/models/bert/main.py", line 125, in
model
= torch.load(args.trained_model, map_location=lambda storage, loc: storage)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
AttributeError: Can't get attribute 'gelu' on <module 'transformers.modeling_bert' from '/usr/local/lib/python3.8/dist-packages/transformers/modeling_bert.py'>

Googling leads me to the following page:

pytorch/pytorch#28944

But I still have no idea what to fix. Can you help?
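
Not an official fix, but this class of error usually means the checkpoint was created with torch.save(model, ...) under one transformers version and unpickled under another, where module-level symbols such as transformers.modeling_bert.gelu have moved. A common workaround, shown only as a sketch (the model name and label count are stand-ins), is to persist the state_dict so the checkpoint stores tensors rather than references to library code:

import torch
from transformers import BertForSequenceClassification

MODEL_NAME = 'bert-base-multilingual-cased'  # stand-in for the Japanese model path
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=9)

# Saving: persist only the weights, not a pickle of the whole module graph.
torch.save(model.state_dict(), 'checkpoint.pt')

# Loading (possibly under a different transformers version): rebuild the
# architecture first, then restore the tensors.
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=9)
model.load_state_dict(torch.load('checkpoint.pt', map_location='cpu'))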

Killed message on trying models.reg_lstm

On executing this command:

python -m models.reg_lstm --dataset Reuters --mode static --batch-size 32 --lr 0.01 --epochs 30 --bidirectional --num-layers 1 --hidden-dim 512 --wdrop 0.1 --embed-droprate 0.2 --dropout 0.5 --beta-ema 0.99 --seed 3435

I get the following response:

  0%|         | 0/3000001 [00:00<?, ?it/s]Skipping token 3000000 with 1-dimensional vector ['300']; likely a header
100%██████████████████| 3000001/3000001 [03:33<00:00, 14061.26it/s]
Killed

The process seems to have completed (shows 100%) but there is a Killed message at the end.

Reproducing Setup and README steps

Playing around with the package, I managed to run Kim-CNN but had some minor hiccups:

README

  • The Conv-RNN and Char-CNN links are dead
  • I needed to do conda activate castor instead of source activate castor
  • The README wants me to do nltk.download(), but I don't think that's needed? I didn't do it because I didn't know which datasets to use, and so far it worked...
  • There's no get_trec_eval.sh in utils/

Running a Model

  • The model READMEs tell me to run python -m modelname ..., but I need to run python -m models.modelname ... in the Hedwig directory.
  • I had to manually install tensorboardX to run Kim-CNN (used in common/trainers/classification_trainer.py)
  • The models try to load a GoogleNews-vectors-negative300.txt word2vec file, but there's only a binary one in ../Castor-data/embeddings/word2vec/, so I had to convert it manually (see the sketch after this list). Afterwards I realized there's an arg for that; maybe I could have changed --word-vectors-file to point to the .bin...
  • python -m kim_cnn --trained_model ... should be ... --trained-model ...
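
For the word2vec conversion mentioned above, one possible approach (a sketch using gensim, which Hedwig itself does not require) is:

from gensim.models import KeyedVectors

# Load the binary Google News vectors and re-save them in the plain-text
# word2vec format that the models look for by default.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
vectors.save_word2vec_format(
    'GoogleNews-vectors-negative300.txt', binary=False)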

Remove tensorboardX logging

TensorboardX has changed its API, and this is causing a couple of issues when upgrading to newer versions of the library. Since no one is currently using this feature, we could just remove it.

Migrate from pytorch-pretrained-bert to transformers

Currently, we use huggingface/pytorch-pretrained-bert for our DocBERT implementation. Migrating to huggingface/transformers would give Hedwig access to a wider range of models, such as XLNet and TransformerXL. Document classification is a very good use case for these models, as they solve the fixed-length context problem we have with BERT. Further, we have duplicate copies of huggingface code just sitting around in Hedwig, and it's quite messy.

Custom dataset error

Hi,

I am using a custom dataset and running the reg_lstm model, and I am getting the error below. The other models run fine. Can anybody help me out?

RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0
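
Not a maintainer answer, but this particular RuntimeError comes from PyTorch's pack_padded_sequence, which is typically used in the LSTM path and not in the other models; it fires when some document ends up with length 0 after tokenization. A quick sanity check on a custom TSV (a sketch only; the assumption that the document text is the last tab-separated field may not match your layout):

# Count rows whose text field is empty after stripping; these would become
# zero-length sequences when tokenized.
empty = 0
with open('train.tsv', encoding='utf-8') as f:
    for line in f:
        text = line.rstrip('\n').split('\t')[-1]
        if not text.strip():
            empty += 1
print(f'{empty} empty documents')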

[Question] What are the "robust" datasets?

Hello,

First of all, thanks for these models and datasets; great work, this is very useful!
Second, I am familiar with almost all the datasets you are using except robust04, robust05, and robust45. I was wondering if you could give some more information about those datasets?

Also, if you could give more info on all the datasets (where you got the original data, and some statistics such as the number of training and testing samples), that would be even better!

Thanks a lot

IMDB dataset

Can you specify the source of the IMDB dataset used in this repository?
I have read the HAN paper and found that they used an IMDB dataset from another paper, whose authors constructed the dataset themselves. However, there are other versions of IMDB on the internet (e.g., the Stanford dataset presented in 2011), so I want to know where this IMDB dataset comes from.

fasttext

Is this the Facebook fastText model?

DocBERT model weights

Dear maintainers and authors :)

I just found the DocBERT paper on arXiv and would really like to know if you plan to share/release the weights and config file. I would really like to load them with pytorch-pretrained-BERT and play around with the model ❤️

Thanks in advance,

Stefan

Fine-tuning BERT for a new dataset

Hi,
Is it possible to use the current DocBERT implementation with my own custom dataset, or would that require internal code modifications in Hedwig?
Thanks!

setup

Never mind. Got it to work.

Scaled embedded dropout mask

Hello,

Why is the embedded dropout mask scaled by 1 / (1 - dropout)?

mask = embed.weight.data.new().resize_((embed.weight.size(0), 1)).bernoulli_(1 - dropout).expand_as(embed.weight) / (1 - dropout)

I looked into the references in the corresponding paper, but Gal and Ghahramani (2016) do not use it, and Merity et al. (2018) just state that they do it. So I wonder what the idea behind this step is.
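
For readers with the same question: dividing by (1 - dropout) is the standard inverted-dropout rescaling. Each mask entry is 0 with probability dropout and 1 / (1 - dropout) otherwise, so its expected value is 1, and the expected magnitude of the masked embeddings matches the unmasked ones; no rescaling is then needed at evaluation time. A small numerical check (a standalone sketch, independent of the Hedwig code):

import torch

p = 0.2  # dropout probability
mask = torch.empty(1_000_000).bernoulli_(1 - p) / (1 - p)

# Each entry is 0 with probability p and 1/(1-p) otherwise, so its expectation
# is (1 - p) * 1/(1 - p) = 1, i.e. E[mask * embedding] = embedding.
print(mask.mean())  # ~ 1.0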
