Comments (25)
The following is the code I used (I just added the line index to Xuezhe's code) for converting the original CoNLL2003 files to the format used by run_ner_crf.sh,
which yielded an F1 score of 91.36% in the best case (consistent with the paper).
def transform(ifile, ofile):
    """
    Transform the original CoNLL2003 format to BIO format for the named entity column (last column) only.
    :param ifile: input file name (an original CoNLL2003 data file)
    :param ofile: output file name
    """
    with open(ifile, 'r') as reader, open(ofile, 'w') as writer:
        prev = 'O'
        line_idx = 1
        for line in reader:
            line = line.strip()
            if len(line) == 0:
                # sentence boundary: reset the word index and previous label
                line_idx = 1
                prev = 'O'
                writer.write('\n')
                continue
            tokens = line.split()
            label = tokens[-1]
            # IOB1 -> BIO: a label opens a new entity (B- prefix) if the
            # previous label was 'O' or belonged to a different entity type
            if label != 'O' and label != prev:
                if prev == 'O':
                    label = 'B-' + label[2:]
                elif label[2:] != prev[2:]:
                    label = 'B-' + label[2:]
            tokens.insert(0, str(line_idx))
            writer.write(" ".join(tokens[:-1]) + " " + label)
            writer.write('\n')
            prev = tokens[-1]
            line_idx += 1
transform("eng.train", "eng.train.bio.conll")
transform("eng.testa", "eng.dev.bio.conll")
transform("eng.testb", "eng.test.bio.conll")
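For intuition, the relabeling rule inside transform can be illustrated on a bare label sequence. This is a standalone sketch of the same IOB1-to-BIO logic; the helper name iob1_to_bio is my own, not from the repository:

```python
def iob1_to_bio(labels):
    """Convert an IOB1 label sequence to BIO: open a B- tag whenever an
    entity starts after 'O' or after an entity of a different type."""
    out, prev = [], 'O'
    for label in labels:
        if label != 'O' and label != prev and (prev == 'O' or label[2:] != prev[2:]):
            out.append('B-' + label[2:])
        else:
            out.append(label)
        prev = label
    return out

print(iob1_to_bio(['I-ORG', 'O', 'I-MISC', 'I-MISC', 'B-MISC']))
# -> ['B-ORG', 'O', 'B-MISC', 'I-MISC', 'B-MISC']
```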
from neuronlp2.
How about the index of the "DOCSTART" ? 0?
"DOCSTART" in my data sets is placed in a separate sentence, like
1 -DOCSTART- -X- O O
But as it provides no useful information, you can remove it from your data.
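If you prefer to drop those -DOCSTART- sentences programmatically, here is a minimal sketch (the helper name strip_docstart and the sample lines are just illustrative):

```python
def strip_docstart(lines):
    """Remove -DOCSTART- sentences: the marker line plus the blank line after it."""
    out, skip_blank = [], False
    for line in lines:
        if skip_blank:
            skip_blank = False
            if not line.strip():
                continue
        if '-DOCSTART-' in line:
            skip_blank = True
            continue
        out.append(line)
    return out

sample = ["1 -DOCSTART- -X- O O", "", "1 EU NNP I-NP I-ORG", "2 rejects VBZ I-VP O"]
print(strip_docstart(sample))
```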
I get it, thanks for your reply !
Thanks for your explanation on the data format, but I am still confused about the word embedding format or standard you used, can you give me some details on this?
The detailed information about word embeddings is given in Ma's paper (Ma X, Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. 2016). The paper reports that Stanford's GloVe 100-dimensional embeddings achieve the best results.
Thanks a lot, I had checked that out a few minutes after sending the comment. Sorry for the bother, and thanks for your reply again!
I am still not clear about the format. Is the index per word within each sentence, or does it increment across all words?
Also, I am getting this error:
$ bash ./examples/run_ner_crf.sh
loading embedding: glove from data/glove/glove.6B/glove.6B.100d.gz
2018-05-17 16:56:01,917 - NERCRF - INFO - Creating Alphabets
2018-05-17 16:56:01,922 - Create Alphabets - INFO - Word Alphabet Size (Singleton): 48 (0)
2018-05-17 16:56:01,922 - Create Alphabets - INFO - Character Alphabet Size: 35
2018-05-17 16:56:01,922 - Create Alphabets - INFO - POS Alphabet Size: 19
2018-05-17 16:56:01,922 - Create Alphabets - INFO - Chunk Alphabet Size: 9
2018-05-17 16:56:01,922 - Create Alphabets - INFO - NER Alphabet Size: 125
2018-05-17 16:56:01,923 - NERCRF - INFO - Word Alphabet Size: 48
2018-05-17 16:56:01,923 - NERCRF - INFO - Character Alphabet Size: 35
2018-05-17 16:56:01,923 - NERCRF - INFO - POS Alphabet Size: 19
2018-05-17 16:56:01,923 - NERCRF - INFO - Chunk Alphabet Size: 9
2018-05-17 16:56:01,923 - NERCRF - INFO - NER Alphabet Size: 125
2018-05-17 16:56:01,923 - NERCRF - INFO - Reading Data
Reading data from data/conll2003/english/eng.train.bioes.conll
Traceback (most recent call last):
  File "examples/NERCRF.py", line 248, in
    main()
  File "examples/NERCRF.py", line 110, in main
    data_train = conll03_data.read_data_to_variable(train_path, word_alphabet, char_alphabet, pos_alphabet, chunk_alphabet, ner_alphabet, use_gpu=use_gpu)
  File "./neuronlp2/io/conll03_data.py", line 313, in read_data_to_variable
    max_size=max_size, normalize_digits=normalize_digits)
  File "./neuronlp2/io/conll03_data.py", line 157, in read_data
    inst = reader.getNext(normalize_digits)
  File "./neuronlp2/io/reader.py", line 165, in getNext
    pos_ids.append(self.__pos_alphabet.get_index(pos))
  File "./neuronlp2/io/alphabet.py", line 64, in get_index
    raise KeyError("instance not found: %s" % instance)
KeyError: u'instance not found: NNP'
Is it possible for you to share your data files for the NER task?
@nrasiwas sorry for the late response.
Here is a clearer example of the data format.
The following is the correct format for your examples:
1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O

1 Peter NNP I-NP I-PER
2 Blackburn NNP I-NP I-PER
3 BRUSSELS NNP I-NP I-LOC
4 1996-08-22 CD I-NP O
The index is per sentence: it restarts at 1 for each sentence.
And make sure to remove the alphabets folder in 'data/' when you use a different data set or a different version of a data set; otherwise, the program will load the old vocabulary from disk.
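To sanity-check a converted file, something like the following sketch could be used. It assumes the five-column layout described above (ID FORM POSTAG CHUNK NERTAG) with blank lines between sentences; the function name read_ner_sentences is my own:

```python
def read_ner_sentences(lines):
    """Group five-column NER lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        idx, form, pos, chunk, ner = line.split()
        current.append((int(idx), form, pos, chunk, ner))
    if current:
        sentences.append(current)
    # verify the word index restarts at 1 for every sentence
    for sent in sentences:
        assert [tok[0] for tok in sent] == list(range(1, len(sent) + 1))
    return sentences

sample = [
    "1 EU NNP I-NP I-ORG",
    "2 rejects VBZ I-VP O",
    "",
    "1 Peter NNP I-NP I-PER",
    "2 Blackburn NNP I-NP I-PER",
]
sents = read_ner_sentences(sample)
print(len(sents))  # 2 sentences
```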
@XuezheMax, here's a script for adding the starting indexes. Do you think it's ok?
def add_starting_index(ifile, ofile):
    with open(ifile, 'r') as reader, open(ofile, 'w') as writer:
        prev = None
        skip_next = False
        for line in reader:
            if skip_next:
                # skip the blank line that follows a -DOCSTART- marker
                skip_next = False
                continue
            line = line.strip()
            docstart = line.startswith('-DOCSTART-')
            if docstart:
                skip_next = True
            if len(line) == 0 or docstart:
                # sentence boundary: reset the word index
                prev = None
                if not docstart:
                    writer.write('\n')
                continue
            tokens = line.split()
            prev = 1 if prev is None else prev + 1
            indexed_tokens = [str(prev)] + tokens
            writer.write(" ".join(indexed_tokens))
            writer.write('\n')
Could you give a more detailed explanation on the data format for dependency parsing?
You have already provided an example, but I am still not clear what each column means.
(The second column is _ for everything: what does it mean? Shouldn't it be something related to lemma as it is the case in conllu?)
Plus, does the format you use include annotation lines? For example, the conllu format typically has two lines starting with #, indicating the sentence id and the raw text.
Thanks in advance!
The second column is reserved for the lemma, the same as in conllu. But our model does not use lemma information, so the second column can be filled with anything.
Our format does not include the lines starting with #.
@XuezheMax Could you share the data used for POS tagging? Thanks in advance!
Hi, the data is under the PTB license. If that is not an issue for you, I am happy to send you the data. Can you give me your email?
@XuezheMax I've sent you an email. Thank you very much!
Hi, thanks for your code and the data format notes. But I am still confused about the data format, so I am not sure that I am using it correctly. Could you give information about the whole schema of your CoNLL-X format and NER data format? Or could you share your data with me? Thanks in advance.
I guess the schema of the CoNLL format is:
( ID, FORM, LEMMA, POSTAG1, POSTAG2, CPOSTAG, HEAD, DEPREL, PHEAD, PDEPREL )
and the NER data format:
( ID, FORM, POSTAG, CHUNK, NERTAG )
Is that the right schema?
For CoNLL-x format, the schema is:
ID, FORM, LEMMA, CPOSTAG, POSTAG, MORPH-FEATURES, HEAD, DEPREL, PHEAD, PDEPREL
For NER data, the schema is:
ID, FORM, POSTAG, CHUNK, NERTAG
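As an illustration of the CoNLL-X schema, here is a minimal sketch that parses one line into named fields. The helper name and the sample line are made up; it assumes tab-separated columns:

```python
from collections import namedtuple

# the ten columns of the CoNLL-X format, as listed above
ConllxToken = namedtuple(
    "ConllxToken",
    "id form lemma cpostag postag feats head deprel phead pdeprel")

def parse_conllx_line(line):
    cols = line.strip().split('\t')
    assert len(cols) == 10, "CoNLL-X lines have exactly 10 tab-separated columns"
    return ConllxToken(*cols)

tok = parse_conllx_line("1\tEconomic\t_\tJJ\tJJ\t_\t2\tamod\t_\t_")
print(tok.form, tok.head, tok.deprel)  # Economic 2 amod
```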
Thank you for your reply!
Hi
How do I get the Penn Treebank datasets?
POS-penn/wsj
Thanks,
Sankar
Hi, thanks for your code and the data format notes, but I am still confused about the datasets.
I want to know how to get 'data/POS-penn/wsj/'.
Thanks in advance!
For the POS tagging dataset, you need to get it from Penn Treebank.
Hi, I'm very interested in your nice work, and I'd love to build my new model upon yours.
However, I cannot find appropriate data to reproduce your work. Could you please share the conllx-style dependency parsing data you used so I can reproduce your results?
Looking forward to your reply @XuezheMax ~
Hey @YuxianMeng,
For the dependency parsing data, please provide your email so that I can send it to you.
Since the data come from the PTB corpus, please make sure the license is not an issue for you.
@XuezheMax Hi, the license is not an issue for me. Actually, we have downloaded and processed PTB already; I just want to double-check our data :). My email is [email protected], and thanks again~