Looks like you are not using pos information for training the model for NER. can y

Can you please share the script for converting IOB to BIO encoding?

Train data description for NER training, how to train NER model about emnlp2017-bilstm-cnn-crf

ukplab commented on August 22, 2024

Train data description for NER training, how to train NER model

from emnlp2017-bilstm-cnn-crf.

Comments (4)

nreimers commented on August 22, 2024

Sure, here is the code. It assumes that the folder contains train.txt, dev.txt and test.txt with IOB encoding. It then creates train.txt.bio ...

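"""
Converts the IOB encoding from CoNLL 2003 to BIO encoding
"""

filenames = ['train.txt', 'dev.txt', 'test.txt']

for filename in filenames:
    # The with-block closes (and flushes) both files when the loop body ends
    with open(filename, 'r') as fIn, open(filename + '.bio', 'w') as fOut:
        # Previous chunk / NER tag; initialized here in case the file
        # does not start with a -DOCSTART- line
        lastChunk = 'O'
        lastNER = 'O'

        for line in fIn:
            # Document markers are dropped and reset the previous tags
            if line.startswith('-DOCSTART-'):
                lastChunk = 'O'
                lastNER = 'O'
                continue

            # An empty line ends a sentence: reset the tags, keep the separator
            if len(line.strip()) == 0:
                lastChunk = 'O'
                lastNER = 'O'
                fOut.write("\n")
                continue

            splits = line.strip().split()

            chunk = splits[2]
            ner = splits[3]

            # An I- tag that does not continue the previous span starts a new one
            if chunk[0] == 'I' and chunk[1:] != lastChunk[1:]:
                chunk = 'B' + chunk[1:]

            if ner[0] == 'I' and ner[1:] != lastNER[1:]:
                ner = 'B' + ner[1:]

            splits[2] = chunk
            splits[3] = ner

            fOut.write("\t".join(splits))
            fOut.write("\n")

            lastChunk = chunk
            lastNER = ner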
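
As a quick check of the conversion rule, here is a minimal sketch that applies the same logic to a single tag sequence in memory instead of to files. The helper name `iob_to_bio` and the example tags are made up for illustration; they are not part of the repository.

```python
def iob_to_bio(tags):
    """Convert the IOB tags of one sentence to BIO tags."""
    converted = []
    last = 'O'  # the position before the sentence counts as outside any span
    for tag in tags:
        # An I- tag whose type differs from the previous tag (or follows 'O')
        # opens a new span, and in BIO every span must begin with B-.
        if tag[0] == 'I' and tag[1:] != last[1:]:
            tag = 'B' + tag[1:]
        converted.append(tag)
        last = tag
    return converted


example = ['I-PER', 'I-PER', 'O', 'I-LOC', 'B-LOC', 'I-ORG']
print(iob_to_bio(example))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-LOC', 'B-ORG']
```

Existing B- tags are left untouched, so running the conversion on data that is already in BIO encoding should not change it.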

nreimers commented on August 22, 2024

No, I didn't use the POS information. I would also recommend not to use it, because at inference you would need a POS tagger to detect the named entities in a sentence. Further, adding POS information does not improve the performance of the classifier.

Training on CoNLL 2003 NER is rather straightforward. Due to copyright issues, I sadly cannot share the dataset, but here are the steps:

1. Convert the strange IOB encoding in the original CoNLL 2003 dataset to a BIO encoding. See issue #22 for why this is needed. If you need a script to convert from IOB to BIO, let me know.
2. The CoNLL 2003 NER files contain lines that start with -DOCSTART-. Remove these lines; they are metadata from the dataset that indicate the start of a new document.
3. Training is similar to Train_Chunking.py. Only change the dataset description:

datasets = {
    'conll2003_ner':                            
        {'columns': {0:'tokens', 3:'NER_BIO'},   
         'label': 'NER_BIO',                     
         'evaluate': True,                   
         'commentSymbol': None}              
}
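
To illustrate what the 'columns' mapping refers to, here is a rough sketch of reading such a file by hand: column 0 holds the token, column 3 holds the BIO NER tag, and -DOCSTART- lines are skipped. The function `read_columns` is hypothetical and is not the repository's own reader; it only assumes whitespace-separated columns and blank lines between sentences.

```python
def read_columns(lines, columns):
    """Group CoNLL-style lines into sentences, keeping only the mapped columns."""
    sentences = []
    current = {name: [] for name in columns.values()}
    for line in lines:
        line = line.strip()
        if line.startswith('-DOCSTART-'):
            continue  # document marker, not a token
        if not line:
            # A blank line ends the current sentence (if it has any tokens)
            if any(current.values()):
                sentences.append(current)
                current = {name: [] for name in columns.values()}
            continue
        splits = line.split()
        for idx, name in columns.items():
            current[name].append(splits[idx])
    # Keep the last sentence if the input does not end with a blank line
    if any(current.values()):
        sentences.append(current)
    return sentences


sample = [
    '-DOCSTART- -X- O O',
    '',
    'EU NNP B-NP B-ORG',
    'rejects VBZ B-VP O',
    '',
    'Peter NNP B-NP B-PER',
]
sentences = read_columns(sample, {0: 'tokens', 3: 'NER_BIO'})
print(sentences)
```

Columns 1 and 2 (POS and chunk tags) are simply never read, which matches the advice not to rely on POS information.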

pramod2157 commented on August 22, 2024

Thanks.

pramod2157 commented on August 22, 2024

Can you please share the script for converting IOB to BIO encoding?
