
emnlp2017-bilstm-cnn-crf's Introduction

BiLSTM-CNN-CRF Implementation for Sequence Tagging

This repository contains a BiLSTM-CRF implementation that can be used for NLP sequence tagging (for example POS tagging, chunking, or named entity recognition). The implementation is based on Keras 2.2.0 and can be run with TensorFlow 1.8.0 as backend. It was optimized for Python 3.5 / 3.6. It does not work with Python 2.7.

The architecture is described in our papers (see the Citation section below).

The implementation is highly configurable, so you can tune the different hyperparameters easily. You can use it for Single Task Learning as well as different options for Multi-Task Learning. You can also use it for Multilingual Learning by using multilingual word embeddings.

This code can be used to run the systems proposed in these papers.

The implementation was optimized for speed: by grouping sentences of the same length together, this implementation is several times faster than the systems of Ma et al. or Lample et al.

Training the network is simple, and it can easily be trained on new datasets. For an example, see Train_POS.py.

Trained models can be stored and loaded for inference. Simply execute python RunModel.py models/modelname.h5 input.txt. Pretrained models for some sequence tagging tasks using this LSTM-CRF implementation are provided in Pretrained Models.

This implementation can be used for Multi-Task Learning, i.e. learning several tasks simultaneously with non-overlapping datasets. The file Train_MultiTask.py shows an example of how the LSTM-CRF network can be used to learn POS tagging and chunking simultaneously. The number of tasks is not limited. Tasks can be supervised at the same output level or at different output levels.

Implementation with ELMo representations

The repository elmo-bilstm-cnn-crf contains an extension of this architecture that works with the ELMo representations from AllenNLP (from the paper Peters et al., 2018, Deep contextualized word representations). ELMo representations are computationally expensive, but they usually improve the performance by about 1-5 percentage points F1-measure.

Citation

If you find the implementation useful, please cite the following paper: Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

@InProceedings{Reimers:2017:EMNLP,
  author    = {Reimers, Nils and Gurevych, Iryna},
  title     = {{Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging}},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month     = {09},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  pages     = {338--348},
  url       = {http://aclweb.org/anthology/D17-1035}
}

Contact person: Nils Reimers, [email protected]

https://www.ukp.tu-darmstadt.de/ https://www.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Setup

In order to run the code, I recommend Python 3.5 or higher. The code is based on Keras 2.2.0, and as backend I recommend TensorFlow 1.8.0. I cannot ensure that the code works with different versions of Keras / TensorFlow or with different backends for Keras. The code does not work with Python 2.7.

Setup with virtual environment (Python 3)

Set up a Python virtual environment (optional):

virtualenv --system-site-packages -p python3 env
source env/bin/activate

Install the requirements:

env/bin/pip3 install -r requirements.txt

If everything works well, you can run python3 Train_POS.py to train a deep POS tagger for the POS tagset of Universal Dependencies.

Setup with docker

See the docker folder for more information on how to run these scripts in a Docker container.

Running a stored model

If enabled during the training process, models are stored in the 'models' folder. Those models can be loaded and used to tag new data. An example is implemented in RunModel.py:

python RunModel.py models/modelname.h5 input.txt

This script will read the model models/modelname.h5 as well as the text file input.txt. The text will be split into sentences and tokenized using NLTK. The tagged output will be written in CoNLL format to standard output.
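For illustration, the output has one token per line together with its predicted tag, with an empty line between sentences. For a POS model, the output for a short sentence might look roughly like this (the exact columns and tagset depend on the trained model):

This    DET
is      VERB
a       DET
test    NOUN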

Training

See Train_POS.py for a simple example how to train the model. More details can be found in docs/Training.md.

For training, you specify the datasets you want to train on:

datasets = {
    'unidep_pos':                            #Name of the dataset
        {'columns': {1:'tokens', 3:'POS'},   #CoNLL format for the input data. Column 1 contains tokens, column 3 contains POS information
         'label': 'POS',                     #Which column we like to predict
         'evaluate': True,                   #Should we evaluate on this task? Set true always for single task setups
         'commentSymbol': None}              #Lines in the input data starting with this string will be skipped. Can be used to skip comments
}

And you specify the path to a pre-trained word embedding file:

embeddingsPath = 'komninos_english_embeddings.gz'
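The embeddings file is expected to be a plain text file (optionally gzip-compressed) with one word per line followed by its vector components, in the common word2vec/GloVe text format (an assumption based on how util/preprocessing.py reads the file line by line):

the 0.418 0.24968 -0.41242 0.1217 ...
and 0.26818 0.14346 -0.27877 0.016257 ...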

The util/preprocessing.py file contains some methods to read your dataset (from the data folder) and to store a pickle file in the pkl folder.

You can then train the network in the following way:

params = {'classifier': ['CRF'], 'LSTM-Size': [100], 'dropout': (0.25, 0.25)}
model = BiLSTM(params)
model.setMappings(mappings, embeddings)
model.setDataset(datasets, data)
model.modelSavePath = "models/[ModelName]_[DevScore]_[TestScore]_[Epoch].h5"
model.fit(epochs=25)
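Here, mappings, embeddings, and data come from the preprocessing step. A minimal sketch of that step, assuming the function names and return values used in Train_POS.py (perpareDataset and loadDatasetPickle; exact signatures may differ between versions):

from util.preprocessing import perpareDataset, loadDatasetPickle

# Preprocess the dataset and the word embeddings into a pickle file in the pkl/ folder
pickleFile = perpareDataset(embeddingsPath, datasets)

# Load the preprocessed data: embedding matrix, string-to-index mappings, dataset matrices
embeddings, mappings, data = loadDatasetPickle(pickleFile)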

Multi-Task-Learning

Multi-Task Learning can simply be done by specifying multiple datasets (see Train_MultiTask.py):

datasets = {
    'unidep_pos':
        {'columns': {1:'tokens', 3:'POS'},
         'label': 'POS',
         'evaluate': True,
         'commentSymbol': None},
    'conll2000_chunking':
        {'columns': {0:'tokens', 2:'chunk_BIO'},
         'label': 'chunk_BIO',
         'evaluate': True,
         'commentSymbol': None},
}

Here, the network trains jointly on the POS task (unidep_pos) and the chunking task (conll2000_chunking).

You can also train tasks at different output levels. For details, see docs/Training_MultiTask.md.

LSTM-Hyperparameters

The parameters of the LSTM-CRF network can be configured by passing a parameter dictionary to the BiLSTM constructor: BiLSTM(params).

The following parameters exist:

  • dropout: Set to 0 for no dropout. For naive dropout, set it to a real value between 0 and 1. For variational dropout, set it to a two-dimensional tuple or list, with the first entry corresponding to the output dropout and the second entry to the recurrent dropout. Default value: [0.5, 0.5]
  • classifier: Set to Softmax to use a softmax classifier or to CRF to use a CRF classifier as the last layer of the network. Default value: Softmax
  • LSTM-Size: List of integers with the number of recurrent units for the stacked LSTM network. The list [100,75,50] would create 3 stacked BiLSTM layers with 100, 75, and 50 recurrent units. Default value: [100]
  • optimizer: Available optimizers: SGD, AdaGrad, AdaDelta, RMSProp, Adam, Nadam. Default value: nadam
  • earlyStopping: Early stopping after a certain number of epochs if no improvement on the development set was achieved. Default value: 5
  • miniBatchSize: Size (number of sentences) for mini-batch training. Default value: 32
  • charEmbeddings: Available options: [None, 'CNN', 'LSTM']. If set to None, no character-based representations will be used. With 'CNN', the approach of Ma & Hovy using a CNN will be used. With 'LSTM', an LSTM network will be used to derive the character-based representation (Lample et al.). Default value: None
    • charEmbeddingsSize: The dimension of the character embeddings, if the character-based representation is enabled. Default value: 30
    • charFilterSize: If the CNN approach is used, this parameter defines the filter size, i.e. the output dimension of the convolution. Default: 30
    • charFilterLength: If the CNN approach is used, this parameter defines the filter length. Default: 3
    • charLSTMSize: If the LSTM approach is used, this parameter defines the size of the recurrent units. Default: 25
  • clipvalue: If non-zero, the gradient will be clipped elementwise to this value. Default: 0
  • clipnorm: If non-zero, gradients whose norm exceeds this value will be rescaled to this norm. Default: 1
  • featureNames: Which features the network should use. You can specify additional features, for example POS tags. See Train_Custom_Features.py for an example. Default: ['tokens', 'casing']
  • addFeatureDimensions: Size of the embedding matrix for all features except 'tokens'. Default: 10

For multi-task learning scenarios, the following additional parameters exist:

  • customClassifier: A dictionary that maps each dataset to an individual classifier. For example, the POS task could use a Softmax classifier while the chunking dataset is trained with a CRF classifier; a configuration sketch is shown below.
  • useTaskIdentifier: Include a task ID as an input feature. Default: False
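As a sketch, a multi-task parameter dictionary could look like this (dataset names taken from the example above; whether each option is available may depend on the repository version):

params = {'classifier': ['CRF'], 'LSTM-Size': [100], 'dropout': (0.25, 0.25),
          # Illustrative assumption: softmax for the POS task, CRF for the chunking task
          'customClassifier': {'unidep_pos': ['Softmax'], 'conll2000_chunking': ['CRF']},
          'useTaskIdentifier': False}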

Documentation

Further documentation can be found in the docs folder, e.g. docs/Training.md and docs/Training_MultiTask.md.

Acknowledgments

This code uses the CRF implementation by Philipp Gross from Keras pull request #4621. Thank you for contributing this to the community.

emnlp2017-bilstm-cnn-crf's People

Contributors

apmoore1, nreimers

emnlp2017-bilstm-cnn-crf's Issues

EOFError

python3.4 train_pos_mai.py

Using TensorFlow backend.
Generate new embeddings files for a dataset
Read file: komninos_english_embeddings.gz
Traceback (most recent call last):
  File "Train_POS.py", line 48, in <module>
    pickleFile = perpareDataset(embeddingsPath, datasets)
  File "/home/Ankur_JRF/Backup_Ubuntu/LSTM/bilstm/util/preprocessing.py", line 42, in perpareDataset
    embeddings, word2Idx = readEmbeddings(embeddingsPath, datasets, frequencyThresholdUnknownTokens, reducePretrainedEmbeddings)
  File "/home/Ankur_JRF/Backup_Ubuntu/LSTM/bilstm/util/preprocessing.py", line 135, in readEmbeddings
    for line in embeddingsIn:
  File "/usr/lib64/python3.4/gzip.py", line 389, in read1
    while self.extrasize <= 0 and self._read():
  File "/usr/lib64/python3.4/gzip.py", line 449, in _read
    self._read_eof()
  File "/usr/lib64/python3.4/gzip.py", line 482, in _read_eof
    crc32, isize = struct.unpack("<II", self._read_exact(8))
  File "/usr/lib64/python3.4/gzip.py", line 286, in _read_exact
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

Transfer Learning

Hi,

Can I use this model for transfer learning, i.e. training on one corpus and then using that pretrained model for a new corpus? How can I do that? Or is that not possible with this implementation?

Docker fails due to lack of space. Segmentation fault

Hi,

This is the first time I am trying to run a project through Docker.
By default the container runs on the root partition. At run time, after the first epoch, the code fails due to a segmentation error. Can you instruct me on how to use this Docker container on another partition and run the process there? I have a 1 TB partition at my disposal. I am housing the project in this partition and also building the container there. However, on the run step I can see that the app by default gets mounted on the /usr partition.

Thanks in Advance
Debayan

Where to modify the random seed value?

I read the paper Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging and want to change the seed value. In the code I found np.random.uniform(). How can I get different results for different seed values?
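A generic sketch of how seeds are usually fixed in a NumPy/Keras setup (not specific to this repository; place this at the top of the training script, before the network is built):

import random
import numpy as np
import tensorflow as tf

seed = 42  # change this value to obtain a different run
random.seed(seed)
np.random.seed(seed)
tf.set_random_seed(seed)  # if the TensorFlow 1.x backend is used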

Errors occur when running RunModel.py

I have installed the dependencies in requirements.txt, but an error occurs when using the command "python RunModel.py models/EN_NER.h5 input.txt".

When using TensorFlow as backend, the error is as follows:
Traceback (most recent call last):
  File "RunModel.py", line 22, in <module>
    lstmModel.loadModel(modelPath)
  File "/home/pinkee/Desktop/vulnerability/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 574, in loadModel
    model = keras.models.load_model(modelPath, custom_objects=create_custom_objects())
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/models.py", line 176, in load_model
    model.model._make_train_function()
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 760, in _make_train_function
    self.total_loss)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/optimizers.py", line 562, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/optimizers.py", line 85, in get_gradients
    grads = [clip_norm(g, self.clipnorm, norm) for g in grads]
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/optimizers.py", line 14, in clip_norm
    g = K.switch(n >= c, g * c / n, g)
TypeError: unsupported operand type(s) for *: 'IndexedSlices' and 'int'

When using Theano as backend, the error is as follows:
Traceback (most recent call last):
  File "RunModel.py", line 34, in <module>
    lstmModel.loadModel(modelPath)
  File "/home/pinkee/Desktop/vulnerability/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 574, in loadModel
    model = keras.models.load_model(modelPath, custom_objects=create_custom_objects())
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/models.py", line 142, in load_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/models.py", line 193, in model_from_config
    return layer_from_config(config, custom_objects=custom_objects)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/utils/layer_utils.py", line 42, in layer_from_config
    return layer_class.from_config(config['config'])
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/models.py", line 1091, in from_config
    model.add(layer)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/models.py", line 332, in add
    output_tensor = layer(self.outputs[0])
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 572, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 635, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 166, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/home/pinkee/Desktop/vulnerability/emnlp2017-bilstm-cnn-crf/neuralnets/keraslayers/ChainCRF.py", line 360, in call
    y_pred = viterbi_decode(x, self.U, self.b_start, self.b_end, mask)
  File "/home/pinkee/Desktop/vulnerability/emnlp2017-bilstm-cnn-crf/neuralnets/keraslayers/ChainCRF.py", line 163, in viterbi_decode
    y = _backward(gamma, mask)
  File "/home/pinkee/Desktop/vulnerability/emnlp2017-bilstm-cnn-crf/neuralnets/keraslayers/ChainCRF.py", line 220, in _backward
    go_backwards=True)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 1136, in rnn
    go_backwards=go_backwards)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan.py", line 773, in scan
    condition, outputs, updates = scan_utils.get_updates_and_outputs(fn(*args))
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 1124, in _step
    output, new_states = step_function(input, states)
  File "/home/pinkee/Desktop/vulnerability/emnlp2017-bilstm-cnn-crf/neuralnets/keraslayers/ChainCRF.py", line 213, in _backward_step
    y_t = batch_gather(gamma_t, y_tm1)
  File "/home/pinkee/Desktop/vulnerability/emnlp2017-bilstm-cnn-crf/neuralnets/keraslayers/ChainCRF.py", line 39, in batch_gather
    indices = tf.pack([tf.range(batch_size), indices], axis=1)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1086, in range
    limit = ops.convert_to_tensor(limit, dtype=dtype, name="limit")
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 669, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 165, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", line 441, in make_tensor_proto
    tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
  File "/home/pinkee/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
    (bytes_or_text,))
TypeError: Expected binary or unicode string, got Subtensor{int64}.0

How to solve the errors? Thank you!

EOFError while training NER model

Hi @reckart, I am new to Python, so I need your help. I tried to train the model using an English dataset and GloVe (glove.840B.300d) word embeddings by running Train_NER_German.py. Since the original file for glove.840B.300d was a rar file and the script takes a .gz or .txt file as input, I extracted a .txt file from the rar. I got the following error.

run Train_NER_German.py
Using Theano backend.
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GT 620M (CNMeM is enabled with initial size: 85.0% of memory, cuDNN not available)
Using existent pickle file: pkl/GermEval_glove.840B.300d.pkl
Traceback (most recent call last):

  File "E:\New-Code\New-emnlp2017-bilstm-cnn-crf-master\Runable code\emnlp2017-bilstm-cnn-crf-master\emnlp2017-bilstm-cnn-crf-master\Train_NER_German.py", line 68, in <module>
    embeddings, word2Idx, datasets = loadDatasetPickle(pickleFile)

  File "util\preprocessing.py", line 357, in loadDatasetPickle
    pklObjects = pkl.load(f)

EOFError

Can you please help me solve this issue? I am using Python 2.7 on Windows 10.

Using more than one label column

Hi @nreimers , thank you for sharing your code, really nice work!
Some datasets like GermEval have more than one label column. However, judging from the training files (e.g. Train_NER_German.py), it seems that you can only use one column at a time for training a model, specified by the variable "labelKey". So is there no way to train a model on more than one column?

Performance difference

Hi there,

Thank you for uploading your implementation of the NER tagger! Can you please tell me with which settings it is possible to replicate the performance of glample's NER tagger on the German CoNLL data while using the original embeddings? In 100 epochs, the highest value I get is around 71% (with the Theano backend for BiLSTM-CNN-CRF v1.2.2) or 70% (with the TensorFlow 1.8 backend for BiLSTM-CNN-CRF v2.2.0) while using the original configuration

params = {'classifier': 'CRF', 'LSTM-Size': [64], 'dropout': (0.5), 'charEmbeddings': 'LSTM', 'charEmbeddingsSize':'30', 'maxCharLength': 50, 'optimizer': 'sgd', 'earlyStopping': 30}

and further using the IOB tagging scheme. Do you know how to solve this issue?

Thanks!

Request for scripts to run CoNLL 2003 NER

Hi, @nreimers

Can you share a script for CoNLL 2003 NER as well?
I tried to modify Train_NER_German.py to fit the CoNLL 2003 NER data, but the performance is bad:

---- epoch 14 -- 23.82

Do you have any idea about this? Thank you.

Scores from epoch with best dev-scores:

Hi,

I am just trying to understand: what does the sentence "Scores from epoch with best dev-scores:" mean?

It reports the scores of the test and dev sets for each epoch, so what does "best dev score" imply? Does it have to do with mini-batches?

AssertionError when trying to run the model

Hi,

I am facing the following error when trying to run the model. I am using the standard encodings.
I could not understand why the computation is failing for the IOBES encoding. Is there a way to fix this without changing the encoding? Also, in case I want to run an experiment with labels in the IO encoding, what should I do?

    model.evaluate(50)
  File "/data/Debayan/Experiment_Emnlp/BILSTM_CNN_CRF_TEST/neuralnets/BiLSTM.py", line 402, in evaluate
    dev_score, test_score = self.computeScores(devMatrix, testMatrix)
  File "/data/Debayan/Experiment_Emnlp/BILSTM_CNN_CRF_TEST/neuralnets/BiLSTM.py", line 454, in computeScores
    return self.computeF1Scores(devMatrix, testMatrix)
  File "/data/Debayan/Experiment_Emnlp/BILSTM_CNN_CRF_TEST/neuralnets/BiLSTM.py", line 459, in computeF1Scores
    dev_pre, dev_rec, dev_f1 = self.computeF1(devMatrix, 'dev')
  File "/data/Debayan/Experiment_Emnlp/BILSTM_CNN_CRF_TEST/neuralnets/BiLSTM.py", line 526, in computeF1
    pre, rec, f1 = BIOF1Validation.compute_f1(predLabels, correctLabels, self.idx2Label, 'O', encodingScheme)
  File "/data/Debayan/Experiment_Emnlp/BILSTM_CNN_CRF_TEST/util/BIOF1Validation.py", line 69, in compute_f1
    checkBIOEncoding(label_pred, correctBIOErrors)
  File "/data/Debayan/Experiment_Emnlp/BILSTM_CNN_CRF_TEST/util/BIOF1Validation.py", line 207, in checkBIOEncoding
    assert(False) #Should never be reached
AssertionError

Non-GPU TensorFlow compatible version

Thanks for making this implementation available! But I do not have access to a GPU. Can I still try your code? I have TensorFlow 1.5.0 and Keras 2.0.2.
I made a couple of changes in neuralnets/BiLSTM.py (lines 285 and 296): I set mask_zero=False and used Concatenate, as Merge is not supported.
Please let me know what additional changes I need to make to use the code in my use case.
I am getting the following error when I try to use the code:

File "Train_NER_E2ESL.py", line 85, in <module>
    model.evaluate(50)
  File "/data/venv/BILSTM_CNN_CRF_TEST/neuralnets/BiLSTM.py", line 394, in evaluate
    self.trainModel()
  File "/data/venv/BILSTM_CNN_CRF_TEST/neuralnets/BiLSTM.py", line 93, in trainModel
    self.buildModel()
  File "/data/venv/BILSTM_CNN_CRF_TEST/neuralnets/BiLSTM.py", line 362, in buildModel
    model.summary()
  File "/data/venv/lib/python3.5/site-packages/keras/engine/network.py", line 1263, in summary
    'This model has never been called, this its weights '
ValueError: This model has never been called, this its weights have not yet been created, so no summary can be displayed. Build the model first (e.g. by calling it on some test data). 
The model is not getting saved: the training is happening, but the model isn't saved and hence the testing isn't done.

Thanks in advance!

TypeError: unsupported operand type(s) for *: 'IndexedSlices' and 'int'

python Train_NER_German.py
Using TensorFlow backend.
Generate new embeddings files for a dataset: pkl/GermEval_2014_tudarmstadt_german_50mincount.vocab.pkl
Read file: 2014_tudarmstadt_german_50mincount.vocab.gz
Added words: 81
Unknown-Tokens: 3.73%
Unknown-Tokens: 3.89%
Unknown-Tokens: 3.73%
DONE - Embeddings file saved: pkl/GermEval_2014_tudarmstadt_german_50mincount.vocab.pkl
Dataset: GermEval
['NER_IOBES', 'NER_IOB', 'NER_BIO', 'tokens', 'casing', 'characters', 'NER_class']
Label key: NER_BIO
Train Sentences: 24000
Dev Sentences: 2200
Test Sentences: 5100
BiLSTM model initialized with parameters: {'clipnorm': 1, 'optimizer': 'nadam', 'dropout': [0.25, 0.25], 'miniBatchSize': 32, 'earlyStopping': 5, 'addFeatureDimensions': 10, 'charFilterLength': 3, 'charLSTMSize': 25, 'charFilterSize': 30, 'classifier': 'CRF', 'clipvalue': 0, 'charEmbeddingsSize': 30, 'charEmbeddings': 'CNN', 'LSTM-Size': [100, 75]}
24000 train sentences
2200 dev sentences
5100 test sentences
--------- Epoch 1 -----------


Layer (type)                    Output Shape          Param #    Connected to
================================================================================
token_emd (Embedding)           (None, None, 100)     64854500
casing_emd (Embedding)          (None, None, 8)       64
char_emd (TimeDistributed)      (None, None, 51, 30)  2850
char_cnn (TimeDistributed)      (None, None, 51, 30)  2730
char_pooling (TimeDistributed)  (None, None, 30)      0
varLSTM_1 (Bidirectional)       (None, None, 200)     191200     merge_1[0][0]
varLSTM_2 (Bidirectional)       (None, None, 150)     165600     varLSTM_1[0][0]
hidden_layer (TimeDistributed)  (None, None, 25)      3775       varLSTM_2[0][0]
chaincrf_1 (ChainCRF)           (None, None, 25)      675        hidden_layer[0][0]
================================================================================
Total params: 65,221,394
Trainable params: 366,830
Non-trainable params: 64,854,564


Traceback (most recent call last):
  File "Train_NER_German.py", line 86, in <module>
    model.evaluate(50)
  File "/Users/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 391, in evaluate
    self.trainModel()
  File "/Users/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 107, in trainModel
    self.model.train_on_batch(nnInput, labels)
  File "/Library/Python/2.7/site-packages/keras/models.py", line 766, in train_on_batch
    class_weight=class_weight)
  File "/Library/Python/2.7/site-packages/keras/engine/training.py", line 1319, in train_on_batch
    self._make_train_function()
  File "/Library/Python/2.7/site-packages/keras/engine/training.py", line 760, in _make_train_function
    self.total_loss)
  File "/Library/Python/2.7/site-packages/keras/optimizers.py", line 562, in get_updates
    grads = self.get_gradients(loss, params)
  File "/Library/Python/2.7/site-packages/keras/optimizers.py", line 85, in get_gradients
    grads = [clip_norm(g, self.clipnorm, norm) for g in grads]
  File "/Library/Python/2.7/site-packages/keras/optimizers.py", line 14, in clip_norm
    g = K.switch(n >= c, g * c / n, g)
TypeError: unsupported operand type(s) for *: 'IndexedSlices' and 'int'

Using handcrafted Numerical and Boolean features along with Text

Can you please help me figure out how to feed in numerical and Boolean features along with the text? These features should be used as-is.

So instead of putting in just the text like this:

Licence other
No. other
: other
DL-8388568791506 B-id_no
(P) other
N other

I can provide OCR-localization-based features along with the text, like:

Licence None No. None XXXXXXX 1 7 9.667 0 0.002 0.014 0.019 0.0 0.0 0.269 0 1 0 0 7 517 518 0.167 6 0 0 other
No. Licence : None XXX 1 7 9.667 9.667 0.002 0.014 0.019 0.019 0.2 0.269 4 0 0 0 3 517 518 0.333 6 0 0 other
: No. DL-4941170078518 Licence X 1 7 9.667 9.667 0.002 0.014 0.019 0.019 0.333 0.269 5 0 0 0 1 517 518 0.5 6 0 0 other
DL-8388568791506 : (P) No. XXX0000000000000 1 7 9.667 9.667 0.002 0.014 0.019 0.019 0.056 0.269 7 0 0 0 16 517 518 0.667 6 0 0 B-id_no
(P) DL-4941170078518 N : XXX 1 7 9.667 9.667 0.002 0.014 0.019 0.019 0.2 0.269 13 0 0 0 3 517 518 0.833 6 0 0 other
N (P) None DL-4941170078518 X 1 7 33 9.667 0.002 0.014 0.064 0.019 0.185 0.269 14 1 0 0 1 517 518 1.0 6 0 0 other

Any help or suggestion would be highly appreciated. Thanks

Error when I run the Train_NER_German.py

Hi, @nreimers
when I run Train_NER_German.py, I got the following error. Please have a look at the issue. Thanks.
Using Theano backend.

Using gpu device 1: GeForce GTX TITAN X (CNMeM is enabled with initial size: 40.0% of memory, cuDNN 6021)
Using existent pickle file: pkl/GermEval_2014_tudarmstadt_german_50mincount.vocab.pkl
Dataset: GermEval
['NER_IOBES', 'NER_IOB', 'NER_BIO', 'tokens', 'casing', 'characters', 'NER_class']
Label key: NER_BIO
Train Sentences: 24000
Dev Sentences: 2200
Test Sentences: 5100
BiLSTM model initialized with parameters: {'clipnorm': 1, 'optimizer': 'nadam', 'dropout': [0.25, 0.25], 'miniBatchSize': 32, 'earlyStopping': 5, 'addFeatureDimensions': 10, 'charFilterLength': 3, 'charLSTMSize': 25, 'charFilterSize': 30, 'classifier': 'CRF', 'clipvalue': 0, 'charEmbeddingsSize': 30, 'charEmbeddings': None, 'LSTM-Size': [100, 75]}
24000 train sentences
2200 dev sentences
5100 test sentences
--------- Epoch 1 -----------
Traceback (most recent call last):
  File "Train_NER_German.py", line 86, in <module>
    model.evaluate(50)
  File "/home/xidb/Desktop/tf/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 392, in evaluate
    self.trainModel()
  File "/home/xidb/Desktop/tf/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 93, in trainModel
    self.buildModel()
  File "/home/xidb/Desktop/tf/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 247, in buildModel
    tokens.add(Embedding(input_dim=embeddings.shape[0], output_dim=embeddings.shape[1], weights=[embeddings], trainable=False, name='token_emd'))
  File "/home/xidb/anaconda2/lib/python2.7/site-packages/keras/models.py", line 294, in add
    layer.create_input_layer(batch_input_shape, input_dtype)
  File "/home/xidb/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 398, in create_input_layer
    self(x)
  File "/home/xidb/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 543, in __call__
    self.build(input_shapes[0])
  File "/home/xidb/anaconda2/lib/python2.7/site-packages/keras/layers/embeddings.py", line 101, in build
    self.set_weights(self.initial_weights)
  File "/home/xidb/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 966, in set_weights
    str(weights)[:50] + '...')
ValueError: You called set_weights(weights) on layer "token_emd" with a weight list of length 1, but the layer was expecting 0 weights. Provided weights: [array([[ 0. , 0. , 0. , .....


Getting an error in CoNLL.py when running Train_Chunking.py

Using TensorFlow backend.
Generate new embeddings files for a dataset
Read file: komninos_english_embeddings.gz
Added words: 3
:: Transform agac_chunking dataset ::
Traceback (most recent call last):
  File "Train_Chunking.py", line 47, in <module>
    pickleFile = perpareDataset(embeddingsPath, datasets)
  File "/public/home/zcyu/ref/NLP/emnlp2017-bilstm-cnn-crf/util/preprocessing.py", line 57, in perpareDataset
    pklObjects['data'][datasetName] = createPklFiles(paths, mappings, datasetColumns, commentSymbol, valTransformations, padOneTokenSentence)
  File "/public/home/zcyu/ref/NLP/emnlp2017-bilstm-cnn-crf/util/preprocessing.py", line 318, in createPklFiles
    trainSentences = readCoNLL(datasetFiles[0], cols, commentSymbol, valTransformation)
  File "/public/home/zcyu/ref/NLP/emnlp2017-bilstm-cnn-crf/util/CoNLL.py", line 48, in readCoNLL
    val = splits[colIdx]
IndexError: list index out of range

Question regarding F1 score

Hi,
Calculating F1 score in NER is popular but seems to be less intuitive as compared to the standard binary classification scenario. E.g. one must decide how to handle cases like partial n-gram matches etc.
Maybe I overlooked something, but in your papers that are referenced on the readme.md of this repository, I could not find an explanation of how you (or the community in general) handle those cases.
Could you kindly point me in the right direction, maybe provide a reference? Thank you very much!

Using fastText embeddings

Hi,
Thanks for offering this great implementation.
I was wondering if the scripts support using fastText embeddings. If not, what should we do to make fastText embeddings work with the current implementation?
Thanks in advance!

Train MTL model for two tasks using one dataset only.

I would like to train a multi-task learning model for two tasks: POS tagging and language identification. I have one dataset that is annotated for both POS tags and language IDs. My question is: is it possible to train an MTL model that learns these two tasks jointly using this one dataset?

Thanks in advance for your response.

Transfer Learning using Save_and_Load.py

Hi @nreimers

I have tried to use Save_and_Load.py for pre-training on a bigger dataset and, after reloading, training on a smaller set. But it seems that Save_and_Load.py only works for the same dataset: you train for a few epochs on a dataset, save the model, and then continue training the loaded model. If I use a new dataset after loading the pre-existing model, it throws a KeyError. Is this how it is supposed to work?

Use a pre-trained model for evaluating a new test set?

Hi,

How can I use one of the pre-trained models from the 'models' folder to evaluate on a new test set, just to check the generalization capability of the model?
The test set is already in the BIO tagging scheme and does not need to be split and tokenized. I just want to do the predictions and the evaluation in terms of F1-score.

Thanks
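One possible approach (a sketch adapted from the snippet in the issue "Different score using different compute f1 functions" further below; it assumes a CoNLL file with tokens in column 0 and NER_BIO labels in column 1, and the file and model paths are placeholders):

from util.preprocessing import readCoNLL, createMatrices, addCharInformation, addCasingInformation
from neuralnets.BiLSTM import BiLSTM

# Read the BIO-tagged test file and add character/casing information
sentences = readCoNLL('test.conll', {0: 'tokens', 1: 'NER_BIO'})
addCharInformation(sentences)
addCasingInformation(sentences)

# Load a pre-trained model and map the sentences to matrices
lstmModel = BiLSTM.loadModel('models/EN_NER.h5')
dataMatrix = createMatrices(sentences, lstmModel.mappings, True)

# Compute precision, recall, and F1 on the new test set
print(lstmModel.computeF1(list(lstmModel.models.keys())[0], dataMatrix))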

UnicodeDecodeError: 'gbk' codec can't decode

I ran "Train_POS" on win10, and tried to load komninos_english_embeddings.gz. The error appeared as following. But when I ran it on Ubuntu, the code is all right. I want to figure out why the decoding process goes wrong on Win10.
Traceback (most recent call last):
  File "C:/Users/46312/Desktop/python-practice/my-test/emnlp2017-bilstm-cnn-crf-master/Train_POS.py", line 48, in <module>
    pickleFile = perpareDataset(embeddingsPath, datasets)
  File "C:\Users\46312\Desktop\python-practice\my-test\emnlp2017-bilstm-cnn-crf-master\util\preprocessing.py", line 42, in perpareDataset
    embeddings, word2Idx = readEmbeddings(embeddingsPath, datasets, frequencyThresholdUnknownTokens, reducePretrainedEmbeddings)
  File "C:\Users\46312\Desktop\python-practice\my-test\emnlp2017-bilstm-cnn-crf-master\util\preprocessing.py", line 134, in readEmbeddings
    for line in embeddingsIn:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa2 in position 2127: illegal multibyte sequence
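A likely explanation (an assumption, not verified against this exact code): on Windows, opening the gzip file in text mode without an explicit encoding falls back to the locale's default codec (GBK on a Chinese-locale Windows), while Ubuntu defaults to UTF-8. A minimal sketch of a fix is to pass the encoding explicitly where the embeddings file is opened in util/preprocessing.py:

import gzip

# Open the gzipped embeddings in text mode with an explicit UTF-8 encoding,
# instead of relying on the platform's default locale encoding
with gzip.open('komninos_english_embeddings.gz', 'rt', encoding='utf-8') as embeddingsIn:
    for line in embeddingsIn:
        pass  # process each embedding line as before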

Best results of Multitask

Hi Dear UKP lab team,

Thanks for your nice implementation of this model.
I am working on a multi-task sequence tagging project: do you still remember the best results of your model and the corresponding hyperparameters?
I want to use them as a baseline.

Sincerely,

Peng

Multi-task POS/NER parallel data

Hello @nreimers,

Thank you for making available this useful implementation. I have collected a number of datasets in Portuguese and Spanish where some datasets have POS tags, some have NER tags, and the rest have both POS and NER tags in parallel. All these datasets use the CoNLL format, and I have standardized both the POS and NER tagsets. I am interested in training a single network to jointly perform POS and NER tagging for the datasets of each language, where the network architecture should use 2 CRF layers, one for each task. My main question here is how I could achieve training for datasets with both tasks. Is it possible to specify that a mini-batch can have two (or more) tasks without copying the same parallel data to multiple folders?

Thank you for your time in answering these questions.

Can't save the CRF model

I tried to run the code below:

from chain_crf import ChainCRF
from keras.models import Sequential, Model
from keras.layers import Embedding, TimeDistributed, Input
import numpy as np
import json

vocab_size = 20
session_len = 2
n_classes = 4
batch_size, maxlen = 10, 4

layer = Sequential()
layer.add(Embedding(vocab_size, n_classes))
crf = ChainCRF()
layer.add(crf)
sequential_layer = TimeDistributed(layer)
input = Input(shape=(session_len, maxlen,))
output = sequential_layer(input)
sequential_model = Model(input, output)
sequential_model.compile(loss=crf.loss, optimizer='sgd')
print(sequential_model.summary())
sequential_model.save('a.txt')  # doesn't work
json.dump(sequential_model.to_json(), open('a.txt', 'w'))  # doesn't work

But the last two lines throw the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2506, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 106, in save_model
    'config': model.get_config()
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2322, in get_config
    layer_config = layer.get_config()
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/wrappers.py", line 80, in get_config
    'config': self.layer.get_config()}}
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 1206, in get_config
    'config': layer.get_config()})
  File "chain_crf.py", line 386, in get_config
    config = {'init': self.init.__name__,
AttributeError: 'VarianceScaling' object has no attribute '__name__'

Do you have any idea of how to fix this issue? Thanks!

Error when running NER with Theano

Exception: ('The following error happened while compiling the node', GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0}), '\n', 'nvcc return status', 2, 'for cmd', '/usr/local/cuda/bin/nvcc -shared -O3 -Xlinker -rpath,/usr/local/cuda/lib64 -use_fast_math -arch=sm_61 -m64 -Xcompiler -fno-math-errno,-Wno-unused-label,-Wno-unused-variable,-Wno-write-strings,-DCUDA_NDARRAY_CUH=c72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden -Xlinker -rpath,/home/fzuir/external/.theano/compiledir_Linux-4.2--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.13-64/cuda_ndarray -I/home/fzuir/external/.theano/compiledir_Linux-4.2--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.13-64/cuda_ndarray -I/usr/local/cuda/include -I/home/fzuir/czz/git/emnlp2017-bilstm-cnn-crf/.env/lib/python2.7/site-packages/theano/sandbox/cuda -I/home/fzuir/czz/git/emnlp2017-bilstm-cnn-crf/.env/lib/python2.7/site-packages/numpy/core/include -I/home/fzuir/anaconda2/include/python2.7 -I/home/fzuir/czz/git/emnlp2017-bilstm-cnn-crf/.env/lib/python2.7/site-packages/theano/gof -L/home/fzuir/external/.theano/compiledir_Linux-4.2--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.13-64/cuda_ndarray -L/home/fzuir/anaconda2/lib -o /home/fzuir/external/.theano/compiledir_Linux-4.2--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.13-64/tmpeIWWZR/ea4e203b6529466794536f8a1bfa77ae.so mod.cu -lcudart -lcublas -lcuda_ndarray -lcudnn -lpython2.7', "[GpuDnnConv{algo='small', inplace=True}(<CudaNdarrayType(float32, (False, False, False, True))>, <CudaNdarrayType(float32, 4D)>, <CudaNdarrayType(float32, 4D)>, <CDataType{cudnnConvolutionDescriptor_t}>, Constant{1.0}, Constant{0.0})]")

Multi-task settings

Hi @nreimers

For the multi-task framework, does it always have to be POS tagging and chunking, or could it be any sequence labelling tasks?

Tips for training MTL on large dataset

Are there any tips on how to train an MTL model on large datasets with millions of trainable parameters? I am trying to train this on a machine with 1 TB of memory but still run into the memory limit.

Thanks.

Getting an error in BiLSTM.py when running Train_POS.py

[root@sks bilstm_mai]# python3.6 Train_POS.py
Using TensorFlow backend.
Using existent pickle file: pkl/unidep_pos_cc.mai.300.vec.pkl
--- unidep_pos ---
2298 train sentences
1 dev sentences
1 test sentences
Traceback (most recent call last):
  File "Train_POS.py", line 70, in <module>
    model.fit(epochs=25)
  File "/home/Ankur_JRF/Backup_Ubuntu/LSTM/bilstm_mai/neuralnets/BiLSTM.py", line 381, in fit
    self.buildModel()
  File "/home/Ankur_JRF/Backup_Ubuntu/LSTM/bilstm_mai/neuralnets/BiLSTM.py", line 105, in buildModel
    tokens = Embedding(input_dim=self.embeddings.shape[0], output_dim=self.embeddings.shape[1], weights=[self.embeddings], trainable=False, name='word_embeddings')(tokens_input)
IndexError: tuple index out of range

Question about tag scheme

Hi UKP team,

I want to try different tag schemes in my work.
Could you tell me whether there is any difference between IOBES and BIOES in your implementation?

Thanks,

Peng

Learning rate

Where can we change the learning rate of the Nadam optimizer? And which optimizer is chosen if we don't specify one in the params dictionary?

Wrong transitions in CRF when doing a sequence labeling task

I use ChainCRF.py as the CRF layer in my model for a sequence labeling task with OBIE tags, but I have met a problem: there are some unexpected transitions in the predictions, like E to I, which do not show up in the training data.
The Keras version is 2.2.2 and TensorFlow is 1.10.0.
The code:

from keras.preprocessing import text, sequence
from keras.layers import *
from keras.models import *
from keras.callbacks import EarlyStopping,ModelCheckpoint
from ChainCRF import ChainCRF
from keras import backend as K

def Bilstm_CNN_Crf(maxlen,nb_words,class_label_count,embedding_weights=None,is_train=True):
    word_input=Input(shape=(maxlen,),dtype='int32',name='word_input')
    word_emb=Embedding(nb_words+1,output_dim=100,\
                    input_length=maxlen,\
                    embeddings_initializer = 'uniform',
                    name='word_emb')(word_input)
    # bilstm
    bilstm=Bidirectional(LSTM(64,return_sequences=True))(word_emb)
    bilstm_d=Dropout(0.1)(bilstm)

    # cnn
    half_window_size=2
    padding_layer=ZeroPadding1D(padding=half_window_size)(word_emb)
    conv=Conv1D(nb_filter=50,filter_length=2*half_window_size+1,\
            padding='valid')(padding_layer)
    conv_d=Dropout(0.1)(conv)
    dense_conv=TimeDistributed(Dense(50))(conv_d)

    # merge
    rnn_cnn_merge=concatenate([bilstm_d,dense_conv])
    dense=TimeDistributed(Dense(class_label_count))(rnn_cnn_merge)

    # crf
    crf = ChainCRF(name='CRF_Layer')
    crf_output=crf(dense)

    # build model
    model=Model(inputs=[word_input],outputs=[crf_output])

    model.compile(loss=crf.loss,optimizer='adam',metrics=['accuracy'])

    # model.summary()

    return model

model = Bilstm_CNN_Crf(maxlen, nb_words, 5)
earlystop = EarlyStopping(monitor='val_acc',patience=2,verbose=1)
checkpoint = ModelCheckpoint('best_model.hdf5',monitor='val_acc',verbose=1,save_best_only=True,period=1,save_weights_only=True)
model.fit(x_train_1, y, epochs=epochs, batch_size=64, verbose=1,validation_data=(x_train_1,y),callbacks=[earlystop,checkpoint])
model.load_weights('best_model.hdf5')
pred_prob = model.predict(x_train_1)
pred = np.argmax(pred_prob, axis=2)

Is there something wrong with the model? Or is there some bad case that I didn't find in the data?
Any help is appreciated! Thanks!

Additional Features

What are these additional features?

I have been trying to find them in the code.

Question about Backend

The doc says that if we want to use character embeddings, we have to use Theano as the backend. I'm not very familiar with Keras, but I don't see anywhere that we can set the backend. Could you elaborate on that a little bit?

Also, does that mean we cannot use the ELMo embedding, as it only supports PyTorch and TensorFlow at this point?

Thanks.

util.preprocessing.perpareDataset() reducePretrainedEmbeddings==True causes error

Within the util.preprocessing.perpareDataset() function, setting the reducePretrainedEmbeddings argument to True causes the following error:

Traceback (most recent call last):
  File "test.py", line 25, in <module>
    reducePretrainedEmbeddings=True)
  File "/home/andrew/Documents/another/emnlp2017-bilstm-cnn-crf/util/preprocessing.py", line 42, in perpareDataset
    embeddings, word2Idx = readEmbeddings(embeddingsPath, datasets, frequencyThresholdUnknownTokens, reducePretrainedEmbeddings)
  File "/home/andrew/Documents/another/emnlp2017-bilstm-cnn-crf/util/preprocessing.py", line 118, in readEmbeddings
    dataColumnsIdx = {y: x for x, y in dataset['cols'].items()}
TypeError: string indices must be integers

To re-create this error I have provided the Python code below. You must ensure within the util.preprocessing.perpareDataset() function that you are not reusing the pickle file cached by lines 37-39, or else it will return the cached pickle, which (I believe) does not contain the reduced embeddings, and therefore no error occurs.

Example code to re-create error:

import os
import logging
import sys
from neuralnets.BiLSTM import BiLSTM
from util.preprocessing import perpareDataset, loadDatasetPickle
abspath = os.path.abspath(__file__)
dname = os.path.dirname(abspath)
os.chdir(dname)
datasets = {
    'unidep_pos':                            #Name of the dataset
        {'columns': {1:'tokens', 3:'POS'},   #CoNLL format for the input data. Column 1 contains tokens, column 3 contains POS information
         'label': 'POS',                     #Which column we like to predict
         'evaluate': True,                   #Should we evaluate on this task? Set true always for single task setups
         'commentSymbol': None}              #Lines in the input data starting with this string will be skipped. Can be used to skip comments
}
embeddingsPath = 'komninos_english_embeddings.gz'
pickleFile = perpareDataset(embeddingsPath, datasets, 
                            reducePretrainedEmbeddings=True)

Weighted Loss Functions

Hi @nreimers,
Thanks for this amazing code. Do you have any experience with assigning a different weight to the loss function of each task in a multi-task learning setup? I was thinking to add, for example, loss_weights=0.2 to the line model.compile(loss=lossFct, optimizer=opt) in BiLSTM.py.
Is that the correct way to do it?
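For reference, in plain Keras per-output loss weights are passed to compile() alongside the losses. A generic, self-contained sketch (output names and layer sizes are illustrative, not taken from BiLSTM.py):

from keras.models import Model
from keras.layers import Input, Dense

inp = Input(shape=(10,))
pos_output = Dense(5, activation='softmax', name='pos_output')(inp)
chunk_output = Dense(3, activation='softmax', name='chunk_output')(inp)
model = Model(inputs=inp, outputs=[pos_output, chunk_output])

# Weight the POS loss lower than the chunking loss
model.compile(optimizer='nadam',
              loss='sparse_categorical_crossentropy',
              loss_weights={'pos_output': 0.2, 'chunk_output': 1.0})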

TypeError when training with Train_NER_German.py

In BiLSTM.py, after nnInput is sent into the train_on_batch function (line 107, when calling self.model.train_on_batch()), it causes a TypeError:
unsupported operand type(s) for *: 'IndexedSlices' and 'int'

This is with the GermEval dataset. Our environment:

Keras==1.2.2
nltk==3.2.2
numpy==1.14.1
scipy==1.0.0
theano==0.9.0
h5py==2.6.0
tensorflow==0.12.1

In Epoch 1 it shows:
--------- Epoch 1 -----------


Layer (type)                    Output Shape          Param #    Connected to
================================================================================
token_emd (Embedding)           (None, None, 100)     64854500
casing_emd (Embedding)          (None, None, 8)       64
char_emd (TimeDistributed)      (None, None, 51, 30)  2850
char_cnn (TimeDistributed)      (None, None, 51, 30)  2730
char_pooling (TimeDistributed)  (None, None, 30)      0
varLSTM_1 (Bidirectional)       (None, None, 200)     191200     merge_1[0][0]
varLSTM_2 (Bidirectional)       (None, None, 150)     165600     varLSTM_1[0][0]
hidden_layer (TimeDistributed)  (None, None, 26)      3926       varLSTM_2[0][0]
chaincrf_1 (ChainCRF)           (None, None, 26)      728        hidden_layer[0][0]
================================================================================
Total params: 65,221,598
Trainable params: 367,034
Non-trainable params: 64,854,564

Could you please tell me how to correct this, and how this error occurred?

About the attention mechanism

Nice work, and highly configurable.
Are there plans to add an implementation of an attention mechanism?

Running python3 Train_POS.py and getting an error

Hi,
I am facing the following error when trying to run python3 Train_POS.py. How could I fix this?
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:1114: calling reduce_max (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating: keep_dims is deprecated, use keepdims instead
Traceback (most recent call last):
  File "Train_Chunking.py", line 69, in <module>
    model.fit(epochs=25)
  File "/root/codebase/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 381, in fit
    self.buildModel()
  File "/root/codebase/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 250, in buildModel
    model.compile(loss=lossFct, optimizer=opt)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 910, in compile
    sample_weight, mask)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 436, in weighted
    score_array = fn(y_true, y_pred)
  File "/root/codebase/emnlp2017-bilstm-cnn-crf/neuralnets/keraslayers/ChainCRF.py", line 355, in sparse_loss
    mask = self._fetch_mask()
  File "/root/codebase/emnlp2017-bilstm-cnn-crf/neuralnets/keraslayers/ChainCRF.py", line 300, in _fetch_mask
    if self._inbound_nodes:
AttributeError: 'ChainCRF' object has no attribute '_inbound_nodes'

Different score using different compute f1 functions

Hi,

This code snippet (adapted from RunModel_CoNLL_Format.py) produces different outputs for the last two lines. However, shouldn't we expect the same result from both?

#!/usr/bin/python
from __future__ import print_function
from util.preprocessing import readCoNLL, createMatrices, addCharInformation, addCasingInformation
from neuralnets.BiLSTM import BiLSTM
import sys
import logging

if len(sys.argv) < 3:
    print("Usage: python RunModel.py modelPath inputPathToConllFile")
    exit()

modelPath = sys.argv[1]
inputPath = sys.argv[2]
inputColumns = {0: "tokens", 1: 'NER_BIO'}

# :: Prepare the input ::
sentences = readCoNLL(inputPath, inputColumns)
addCharInformation(sentences)
addCasingInformation(sentences)

# :: Load the model ::
lstmModel = BiLSTM.loadModel(modelPath)
dataMatrix = createMatrices(sentences, lstmModel.mappings, True)

from util.BIOF1Validation import compute_f1_token_basis
print(compute_f1_token_basis(list(lstmModel.tagSentences(dataMatrix).values())[0], [s['NER_BIO'] for s in sentences], 'O'))
print(lstmModel.computeF1(list(lstmModel.models.keys())[0], dataMatrix))
