shangjingbo1226 / autoner

Learning Named Entity Tagger from Domain-Specific Dictionary

Home Page: https://shangjingbo1226.github.io/AutoNER/

License: Apache License 2.0

ner named-entity-recognition distant-supervision dictionary domain-specific data-driven sequence-labeling

autoner's Introduction

AutoNER

Check Our New NER Toolkit🚀🚀🚀

Inference:
- LightNER: inference w. models pre-trained / trained w. any following tools, efficiently.
Training:
- LD-Net: train NER models w. efficient contextualized representations.
- VanillaNER: train vanilla NER models w. pre-trained embedding.
Distant Training:
- AutoNER: train NER models w.o. line-by-line annotations and get competitive performance.

AutoNER trains named entity taggers with distant supervision, without requiring line-by-line annotations.

Details about AutoNER can be accessed at: https://arxiv.org/abs/1809.03599

Model notes
Benchmarks
Training
Citation

Model Notes

Benchmarks

Method	Precision	Recall	F1
Supervised Benchmark	88.84	85.16	86.96
Dictionary Match	93.93	58.35	71.98
Fuzzy-LSTM-CRF	88.27	76.75	82.11
AutoNER	88.96	81.00	84.80
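
The F1 column is the harmonic mean of the precision and recall columns. A minimal sketch for recomputing it follows; the function name `f1_score` is chosen here for illustration and is not part of the repository. Because the published precision and recall are themselves rounded, the recomputed value can differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "Supervised Benchmark": (88.84, 85.16),
    "Dictionary Match": (93.93, 58.35),
    "Fuzzy-LSTM-CRF": (88.27, 76.75),
    "AutoNER": (88.96, 81.00),
}

for method, (p, r) in rows.items():
    print(f"{method}\t{f1_score(p, r):.2f}")
```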

Training

Required Inputs

Tokenized Raw Texts
- Example: data/BC5CDR/raw_text.txt
  - One token per line.
  - An empty line means the end of a sentence.
Two Dictionaries
- Core Dictionary w/ Type Info
  - Example: data/BC5CDR/dict_core.txt
    - Two columns (i.e., Type, Tokenized Surface) per line.
    - Tab separated.
  - How to obtain?
    - From domain-specific dictionaries.
- Full Dictionary w/o Type Info
  - Example: data/BC5CDR/dict_full.txt
    - One tokenized high-quality phrase per line.
  - How to obtain?
    - From domain-specific dictionaries.
    - By applying a high-quality phrase mining tool to a domain-specific corpus.
      - AutoPhrase
Pre-trained word embeddings
- Train your own or download from the web.
- The example run uses embedding/bio_embedding.txt, which can be downloaded from our group's server, for example with curl http://dmserv4.cs.illinois.edu/bio_embedding.txt -o embedding/bio_embedding.txt. Because the embedding encoding step consumes a lot of memory, we also provide the encoded file through autoner_train.sh.
[Optional] Development & Test Sets.
- Example: data/BC5CDR/truth_dev.ck and data/BC5CDR/truth_test.ck
  - Three columns (i.e., token, Tie or Break label, entity type).
  - I is Break.
  - O is Tie.
  - Two special tokens <s> and <eof> mean the start and end of the sentence.
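
The input formats listed above can be read with a few lines of Python. The following is a minimal sketch, not the repository's own loader: the function names are hypothetical, and it assumes UTF-8 text files and whitespace-separated columns in the .ck files.

```python
from typing import Iterable, List, Tuple


def read_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Group one-token-per-line input into sentences split on empty lines."""
    sentences, current = [], []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(current)
            current = []
    if current:  # the file may not end with an empty line
        sentences.append(current)
    return sentences


def read_core_dict(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Parse 'Type<TAB>Tokenized Surface' lines into (type, tokens) pairs."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        entity_type, surface = line.split("\t")
        entries.append((entity_type, surface.split()))
    return entries


def read_ck(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Parse 'token label type' lines (I = Break, O = Tie) into tuples."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            rows.append((parts[0], parts[1], parts[2]))
    return rows


# Example usage with the paths named above:
# with open("data/BC5CDR/raw_text.txt", encoding="utf-8") as f:
#     sentences = read_sentences(f)
```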

Dependencies

This project requires python>=3.6. The required packages are listed below:

numpy==1.13.1
tqdm
torch-scope>=0.5.0
pytorch==0.4.1

Command

To train an AutoNER model, please run

./autoner_train.sh

To apply the trained AutoNER model, please run

./autoner_test.sh

You can specify the parameters in the bash files. The variable names are self-explanatory.

Citation

Please cite the following two papers if you are using our tool. Thanks!

Jingbo Shang*, Liyuan Liu*, Xiaotao Gu, Xiang Ren, Teng Ren and Jiawei Han, "Learning Named Entity Tagger using Domain-Specific Dictionary", in Proc. of 2018 Conf. on Empirical Methods in Natural Language Processing (EMNLP'18), Brussels, Belgium, Oct. 2018. (* Equal Contribution)
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, Jiawei Han, "Automated Phrase Mining from Massive Text Corpora", accepted by IEEE Transactions on Knowledge and Data Engineering, Feb. 2018.

@inproceedings{shang2018learning,
  title = {Learning Named Entity Tagger using Domain-Specific Dictionary},
  author = {Shang, Jingbo and Liu, Liyuan and Gu, Xiaotao and Ren, Xiang and Ren, Teng and Han, Jiawei},
  booktitle = {EMNLP},
  year = 2018,
}

@article{shang2018automated,
  title = {Automated phrase mining from massive text corpora},
  author = {Shang, Jingbo and Liu, Jialu and Jiang, Meng and Ren, Xiang and Voss, Clare R and Han, Jiawei},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year = {2018},
  publisher = {IEEE}
}

autoner's Issues

Does it work for single label?

I was trying to run the model for a single label, i.e. marking out all the technical terms, and created the relevant datasets and dictionaries. But during training the model neither shows a progress bar nor trains.
Finally, it reports an F1 of -inf on dev.

Does this model work for single label NER?

What is the format of bio_embedding.txt?

It is hard for us to download bio_embedding.txt because the file is nearly 10GB.

We hope to construct one ourselves, so what is the basic format of bio_embedding.txt?

Can you show us a simple screenshot? Thank you very much.
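
This thread does not state the file format. As a sketch under an explicit assumption, the reader below handles a GloVe-style plain-text layout: one word per line followed by its space-separated vector components. That layout is an assumption here, not something confirmed for bio_embedding.txt, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List


def read_text_embeddings(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse 'word v1 v2 ... vd' lines (ASSUMED GloVe-style layout)."""
    embeddings: Dict[str, List[float]] = {}
    dim = None
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(values)
        elif len(values) != dim:
            raise ValueError(f"inconsistent dimension for {word!r}")
        embeddings[word] = values
    return embeddings
```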

Can autoner be applied to Chinese NER?

First, I appreciate your work very much. I would like to run some experiments on Chinese NER, so my questions are:
1. Is it possible or not?
2. If yes, what changes should be made?
Thank you in advance.

bio_embedding.txt link not working

Hey Prof. Shang,

The link to the bio_embedding.txt is broken. Do you have a new place to host that file? Or could you explain the format of the embedding file? Is it similar to Glove's embedding file?

Thanks!

Loading dataset error

When I train the model (./autoner_train.sh), an error occurs:
Traceback (most recent call last):
  File "train_partial_ner.py", line 66, in <module>
    dataset = pickle.load(open(args.eval_dataset, 'rb'))
FileNotFoundError: [Errno 2] No such file or directory: './models/BC5CDR/encoded_data/test.pk'

Where can I find test.pk?

I also find that the folder './models/BC5CDR/encoded_data/' is empty, so train_0.pk is missing too.
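
An empty encoded_data folder suggests that an earlier step of the training script did not complete. A small pre-flight check can make that visible before training starts. This is a sketch with a hypothetical helper name; the test.pk path comes from the traceback above, and the train_0.pk path is inferred from the report.

```python
import os
from typing import Iterable, List


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the subset of paths that do not exist as regular files."""
    return [path for path in paths if not os.path.isfile(path)]


expected = [
    "./models/BC5CDR/encoded_data/test.pk",
    # Inferred location; the report only says train_0.pk is missing.
    "./models/BC5CDR/encoded_data/train_0.pk",
]

for path in missing_files(expected):
    print(f"missing: {path} (check the output of the encoding step)")
```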

How much memory does your machine have?

It appears to get stuck on a machine with 16GB of memory while running 'preprocess_partial_ner/save_emb.py'.

About the C++ code

Hi, I am reading the C++ code in the repo. I am not an expert in C++, so I only have a rough sense that the code annotates the raw texts. But the output format of the C++ code (annotation.ck) differs a bit from truth_dev.ck and truth_test.ck, as annotation.ck has a fourth column.

<s> O None S
( I None S
Sch O None D
) O None D
was I None S
administered I None S
i.v I None S
. I None S
<eof> I None S

Is the fourth column used for another model (Fuzzy-LSTM-CRF) in your paper? Does it use IOBES tagging?

I am going to translate the C++ code into Python in order to prepare data for AutoNER, so I don't need the fourth column, right? In addition, could you please give more insight into the C++ code? What algorithm do you use (e.g. a trie)? What exactly does the code do?

To be more specific, in the example I posted above, does the line "Sch O None D" indicate that Sch is tied with the previous token "("? Similarly, on the next line, is ")" tied with the previous token Sch? In summary, is ( Sch ) an entity detected with type None?

Moreover, why isn't there any Unknown tagging in the annotation? According to your paper, if at least one of the tokens belongs to an unknown-typed high-quality phrase, the tokens would be tagged as Unknown.
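
The reading proposed in this question can be checked mechanically. The sketch below groups tokens into chunks under the assumption that each token's label describes the boundary between it and the previous token (I = Break starts a new chunk, O = Tie extends the current one). That interpretation comes from the README's "I is Break, O is Tie" and from the question itself, not from a confirmed reading of the C++ code; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def group_by_tie_break(rows: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group (token, label) pairs into chunks: 'I' breaks, 'O' ties."""
    chunks: List[List[str]] = []
    for token, label in rows:
        if label == "I" or not chunks:
            chunks.append([token])  # Break, or the very first token
        else:
            chunks[-1].append(token)  # Tie with the previous token
    return chunks


# First two columns of the annotation.ck excerpt quoted above.
excerpt = [
    ("<s>", "O"), ("(", "I"), ("Sch", "O"), (")", "O"), ("was", "I"),
    ("administered", "I"), ("i.v", "I"), (".", "I"), ("<eof>", "I"),
]
print(group_by_tie_break(excerpt))
```

Under this assumption the three tokens "(", "Sch" and ")" end up in a single chunk, which matches the interpretation suggested in the question.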

Could not get the bio_embedding.txt file

Hello, the link to bio_embedding.txt does not open, where can I get this file?

Questions about the unknown type high quality phrases.

Hi, the original paper says

In our AutoNER model, these “unknown” positions have undefined boundary and type losses, because (1) they make the boundary labels unclear; and (2) they have no type labels. Therefore, they are skipped.

Does that mean high-quality phrases should not have the entity types that we are trying to identify? Otherwise, the model will predict them as Entity Type: None, as shown in Figure 2 for 8GB RAM. And if AutoNER is applied to the example of Figure 1, can it and should it identify prostaglandin synthesis as a named entity?

Thanks.

multi tag?

	label mapping: check --> 1
	label mapping: disease --> 2
	label mapping: food --> 3
	label mapping: drug --> 4
	label mapping: symptom --> 5
	label mapping: body --> 6
	label mapping: S --> 7
	label mapping: operation --> 8
	label mapping: D --> 9
	label mapping: check,drug --> 10
	label mapping: drug,operation --> 11

What does S stand for?
What does D stand for?
Do the last two mean multi-tag?

Where to download the pre-trained embeddings 'embedding/bio_embedding.txt'?

Or can it be replaced by glove.6B.300.txt?

Difference in the performance of autoNER when the high-quality phrases are not involved?

In Table 4, the high-quality phrases seem to have a large effect on performance.
And when you compare it with the Fuzzy CRF, the Fuzzy CRF seems to outperform the Tie-or-Break mechanism if the high-quality phrases are not involved.
Could you elaborate more on that?

_pickle.UnpicklingError: pickle data was truncated ---on bio_embedding.pk

mldl@ub1604:~/ub16_prj/AutoNER$ md5sum models/BC5CDR/bio_embedding.pk
dd549629b7ea9cf97d7df62cd16c0e9f models/BC5CDR/bio_embedding.pk

mldl@ub1604:/ub16_prj/AutoNER$ python3.6 preprocess_partial_ner/encode_folder.py --input_train models/BC5CDR/annotations.ck --input_testa data/BC5CDR/truth_dev.ck --input_testb data/BC5CDR/truth_test.ck --pre_word_emb models/BC5CDR/embedding.pk --output_folder models/BC5CDR/encoded_data
args.pre_word_emb is models/BC5CDR/embedding.pk
Traceback (most recent call last):
  File "preprocess_partial_ner/encode_folder.py", line 263, in <module>
    w_emb = pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated
mldl@ub1604:/ub16_prj/AutoNER$
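
A "pickle data was truncated" error generally means the file ends before the pickle stream is complete, for example after an interrupted download or write. The sketch below (helper names are hypothetical) computes a checksum, in the spirit of the md5sum call above, and tests whether a file unpickles cleanly. Note that loading a large embedding pickle this way needs as much memory as the training script does.

```python
import hashlib
import pickle


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def loads_cleanly(path: str) -> bool:
    """True if the file unpickles without a truncation-style error."""
    try:
        with open(path, "rb") as f:
            pickle.load(f)
        return True
    except (pickle.UnpicklingError, EOFError):
        return False


# Example usage with the path passed as --pre_word_emb above:
# print(md5_of("models/BC5CDR/embedding.pk"))
# print(loads_cleanly("models/BC5CDR/embedding.pk"))
```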

How does Fuzzy CRF work during decoding?

Hi, I'm curious about the decoding process of Fuzzy CRF.
The paper says "For inference, we apply the Viterbi algorithm to maximize the score." But the original Viterbi algorithm selects only the single path with the highest score. In a multi-label classification problem there may be multiple valid paths, so how is the number of paths decided? Is there a threshold or something similar in the decoding process? Since I am unfamiliar with C++, I can't find the decoding process in the repository.
Looking forward to your answer, thank you!
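
For reference, the textbook Viterbi algorithm mentioned in the question can be sketched as follows. This is the standard single-best-path decoder over emission and transition scores, written here for illustration; it is not taken from the repository, and it does not describe how the paper's Fuzzy-LSTM-CRF handles multiple candidate labels.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the single highest-scoring label sequence.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_labels):
            best_i = max(range(num_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    best = max(range(num_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path
```

As the question notes, this decoder always returns exactly one path; any mechanism for returning several would have to be added on top of it.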

Where can I find the file "test.pk"?

Traceback (most recent call last):
  File "train_partial_ner.py", line 66, in <module>
    dataset = pickle.load(open(args.eval_dataset, 'rb'))
FileNotFoundError: [Errno 2] No such file or directory: './models/BC5CDR/encoded_data/test.pk'

subprocess.CalledProcessError: Command '['tput', 'cols']'

This problem appears at "from torch_scope import wrapper", in line 47: "COLS = int(subprocess.check_output(shlex.split('tput cols')))". Is there anything I didn't install?
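
The quoted line shells out to `tput cols` to obtain the terminal width. One possible cause of the failure (an assumption, since the full error is not shown) is that the command is unavailable or has no usable terminal in the current environment. For comparison, the Python standard library offers a way to query the width that falls back to a default value; the sketch below is an illustration, not torch_scope's code, and the helper name is hypothetical.

```python
import shutil


def terminal_columns(default: int = 80) -> int:
    """Terminal width in columns, with a fallback when it cannot be detected."""
    return shutil.get_terminal_size(fallback=(default, 24)).columns


print(terminal_columns())
```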

Dictionaries used in the paper and other bio medical dataset

Hi, thanks for sharing your implementation. I loved reading your paper and have some doubts.
In your paper you mentioned using

MeSH database and the CTD Chemical and Disease vocabularies. The dictionary contains 322,882 Chemical and Disease entity surfaces.

Whereas the dictionary provided contains much fewer terms.

Did you use the provided dictionary for the results presented in the paper?
Also, is the same provided dictionary used for NCBI-disease dataset?
Would it be possible to share the datasets and dictionaries used for other datasets?

Fuzzy CRF

Hi,
Is there any code related to Fuzzy CRF experiments that you reported in your paper?
Thanks.

dataset generate error

=== Generating Distant Supervision ===
loading KB...
generate: src/annotation.h:187: void Annotation::loadKBForMatching(const string&, const string&): Assertion `tokens.size() == 2' failed.

dict_core.txt is not separated by \t
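
The failed assertion expects exactly two columns per line, which matches the tab-separated "Type, Tokenized Surface" format described in the README. A short check such as the following sketch (hypothetical helper name) can locate the offending lines before running the C++ generator.

```python
from typing import Iterable, List, Tuple


def find_bad_dict_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for lines without exactly two tab-separated, non-empty columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        columns = text.split("\t")
        if len(columns) != 2 or not all(col.strip() for col in columns):
            bad.append((number, text))
    return bad


# Example usage with the path from the README:
# with open("data/BC5CDR/dict_core.txt", encoding="utf-8") as f:
#     for number, text in find_bad_dict_lines(f):
#         print(number, repr(text))
```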

Question about the data processing

Thanks for the work.
When I dug into the code, I found that in the output of the function 'read_noisy_corpus', every start token is marked as 'O' but the end token is not. Is there a specific reason for this?

TypeError: __init__() got an unexpected keyword argument 'encoding'

(py3env) gpuws@gpuws32g:/ub16_prj/AutoNER$ CUDA_VISIBLE_DEVICES=0 python3 train_partial_ner.py --cp_root models/BC5CDR/checkpoint/ --checkpoint_name autoner
[2018-11-05 07:42:13,004] Checkpoint Folder Already Exists: models/BC5CDR/checkpoint/autoner
[2018-11-05 07:42:13,004] Input 'yes' to confirm deleting this folder; or 'no' to exit.
yes for delete or no for exit: yes
[2018-11-05 07:42:16,398] Saving system environemnt and python packages
[2018-11-05 07:42:16,709] It's recommended to set CUDA_DEVICE_ORDERto be PCI_BUS_ID by export CUDA_DEVICE_ORDER=PCI_BUS_ID;otherwise, it's not guaranteed that the gpu index frompytorch to be consistent the nvidia-smi results.
Traceback (most recent call last):
  File "train_partial_ner.py", line 61, in <module>
    gpu_index = pw.auto_device() if 'auto' == args.gpu else int(args.gpu)
  File "/home/gpuws/py3env/lib/python3.5/site-packages/torch_scope/wrapper.py", line 530, in auto_device
    return basic_wrapper.auto_device(metrics = metrics, logger = self.logger, use_logger = use_logger, required_minimal = required_minimal, wait_time = wait_time)
  File "/home/gpuws/py3env/lib/python3.5/site-packages/torch_scope/wrapper.py", line 250, in auto_device
    memory_list = basic_wrapper.nvidia_memory_map(logger = logger)
  File "/home/gpuws/py3env/lib/python3.5/site-packages/torch_scope/wrapper.py", line 175, in nvidia_memory_map
    '--format=csv,noheader'], encoding='utf-8')
  File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 693, in run
    with Popen(*popenargs, **kwargs) as process:
TypeError: __init__() got an unexpected keyword argument 'encoding'
(py3env) gpuws@gpuws32g:/ub16_prj/AutoNER$
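
The paths in this traceback show Python 3.5, while the README asks for python>=3.6. The `encoding` keyword of `subprocess.check_output` was added in Python 3.6, which explains the TypeError. The sketch below illustrates the difference with a hypothetical helper; it is not a patch for torch_scope, and upgrading the interpreter is the more direct route.

```python
import subprocess
import sys
from typing import List


def run_text(command: List[str]) -> str:
    """Run a command and return its output as text on Python 3.5 and later."""
    if sys.version_info >= (3, 6):
        # The encoding keyword is accepted from Python 3.6 onwards.
        return subprocess.check_output(command, encoding="utf-8")
    # On Python 3.5, decode the returned bytes manually.
    return subprocess.check_output(command).decode("utf-8")


print(run_text([sys.executable, "--version"]).strip())
```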

Question about the results for LaptopReview

Hi,
Could you please let me know how you ran the experiments on the LaptopReview dataset?
What were the gold and distant supervision datasets? The gold set has 3,845 sentences, and in Figure 3.c it seems that you used another dataset (raw text) for the distantly supervised annotation.

Also, would it be possible to give me the final scores of AutoNER-Gold-DistantSupervision for Figures 3.a and 3.c, where it uses all of the training set?

Thanks
Farhad

Could not replicate the Dictionary match results

Hello,

How to replicate the "DictionaryMatch" results on the BC5CDR/NCBI-disease/LaptopReview datasets?
I followed the "autoner_train.sh" script till "Generating Distant Supervision" to annotate the "test data" using the "dictionary_core". However, the precisions are a lot lower than the numbers mentioned in the paper.

Could you please help me figure out how to get the results of "DictionaryMatch" as mentioned in the paper?

Is there a separate script for "DictionaryMatch"?

Thanks :)
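
To make the idea of dictionary matching concrete, here is a minimal sketch of greedy longest-match lookup over tokens using a trie. It assumes exact token matching against (type, tokens) entries like those in the core dictionary; the repository's C++ matcher may behave differently (its log mentions stopword cleaning, for instance), so this sketch is not expected to reproduce the paper's numbers. All names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.entity_type: Optional[str] = None  # set where an entry ends


def build_trie(entries: Iterable[Tuple[str, List[str]]]) -> TrieNode:
    """Insert (entity_type, tokens) dictionary entries into a token trie."""
    root = TrieNode()
    for entity_type, tokens in entries:
        node = root
        for token in tokens:
            node = node.children.setdefault(token, TrieNode())
        node.entity_type = entity_type
    return root


def longest_match(tokens: List[str], root: TrieNode) -> List[Tuple[int, int, str]]:
    """Greedy left-to-right longest match; returns (start, end_exclusive, type)."""
    spans = []
    i = 0
    while i < len(tokens):
        node, best_end, best_type = root, None, None
        j = i
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.entity_type is not None:
                best_end, best_type = j, node.entity_type
        if best_end is None:
            i += 1  # no dictionary entry starts at this token
        else:
            spans.append((i, best_end, best_type))
            i = best_end  # continue after the matched span
    return spans
```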

_pickle.UnpicklingError: pickle data was truncated

(autoner) A@7420:~/AutoNER-master$ ./autoner_train.sh
=== Compilation ===
mkdir -p bin
g++ -std=c++11 -Wall -O3 -msse2 -fopenmp -I.. -pthread -lm -Wno-unused-result -Wno-sign-compare -Wno-unused-variable -Wno-parentheses -Wno-format -o bin/generate src/generate.cpp
=== Generating Distant Supervision ===
loading KB...
core dict inserted
full dict marked
cleaning stopwords...
initialized! # of trie nodes = 23819
=== Encoding Dataset ===
Traceback (most recent call last):
  File "preprocess_partial_ner/encode_folder.py", line 262, in <module>
    w_emb = pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated

For reproducing the experimental results with LaptopReview dataset

Hello,
Thank you for sharing your code!

I would like to follow your experiments. I can do the same experiments with BC5CDR (according to Readme) and NCBI-Disease (according to the descriptions in your paper).
But I cannot follow exactly the same procedure for LaptopReview. To do that, I need a "domain-specific dictionary" and an "unknown-typed high-quality phrase" list.

I'm not sure that the source of the domain-specific dictionary has not been changed since then.
Could you share the dictionary with us?

In terms of the high-quality phrase list, we will make the same list with your "AutoPhrase" and Amazon laptop reviews as you say in your paper.
But, some preprocessing is required to feed the review dataset into AutoPhrase and there are some options about it. For example, whether or not we include the titles of the reviews, what sentence splitter we will use, and so on.
I would appreciate it if you could share the high-quality phrase list.

Thanks.

LaptopReview dataset

Hi,
Is it possible to share the Raw Sent. dataset and dictionary that you used in experiments for LaptopReview dataset?

Thanks.

File not found test.pk

When running the train script, it does not create a test.pk file. Is there a solution to this?

Question about train/dev/test data

Hi Jingbo,

Thanks for providing the tool, which is very useful.

I have a question about the data. How did you split the data into train/dev/test sets? I find some sentences in raw_text that are also in truth_dev.ck and truth_test.ck. Does this mean that some of the dev/test data are in the training set as well?

In addition, I also wonder whether you evaluated performance on the auto-annotated dataset or human-annotated dataset? You mention that dev/test files are optional, I think in this case, there are no human-annotated data for evaluation.

Thanks a lot.

Mistake when constructing new_w_map

In encode_folder.py, the function below narrows the word mapping from a pre-trained embedding file, like glove.100.pk, down to the embeddings of words that appear in the documents (train & test). Words in the documents can contain capital letters, but the words in a pre-trained embedding file like glove.100.pk are all lowercase, so words with capital letters are ignored. For example, if the training set contains the word "Japan" but not "japan", we cannot get the embedding of "japan" from glove.100.pk.
We should change word = line[0] to word = line[0].lower()

def filter_words(w_map, emb_array, ck_filenames):
    vocab = set()
    for filename in ck_filenames:
        for line in open(filename, 'r'):
            if not (line.isspace() or (len(line) > 10 and line[0:10] == '-DOCSTART-')):
                line = line.rstrip('\n').split()
                assert len(line) >= 3, 'wrong ck file format'
                word = line[0]
                vocab.add(word)
    new_w_map = {}
    new_emb_array = []
    # obtain the embedding of words appear in both wmap and vocab
    for (word, idx) in w_map.items():
        if word in vocab or word in ['<unk>', '<s>', '< >', '<\n>']:
            assert word not in new_w_map, "%s appears twice in ebd file"%word
            new_w_map[word] = len(new_emb_array)
            new_emb_array.append(emb_array[idx])
    print('filtered %d --> %d' % (len(emb_array), len(new_emb_array)))
    return new_w_map, new_emb_array
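
A variant of the proposed change can be sketched as follows: it keeps an embedding when the word matches a corpus token either exactly or after lowercasing the token. This is an illustration with a hypothetical function name, not the repository's fix, and it only helps if the code that later looks words up also falls back to the lowercase form, which is an assumption here.

```python
from typing import Dict, List, Sequence, Set, Tuple


def filter_words_caseless(w_map: Dict[str, int],
                          emb_array: Sequence[Sequence[float]],
                          vocab: Set[str]) -> Tuple[Dict[str, int], List[Sequence[float]]]:
    """Keep embeddings whose word matches a corpus token exactly or in lowercase."""
    lowered = {word.lower() for word in vocab}
    specials = {'<unk>', '<s>', '< >', '<\n>'}  # same special tokens as above
    new_w_map: Dict[str, int] = {}
    new_emb_array: List[Sequence[float]] = []
    for word, idx in w_map.items():
        if word in vocab or word in lowered or word in specials:
            new_w_map[word] = len(new_emb_array)
            new_emb_array.append(emb_array[idx])
    return new_w_map, new_emb_array
```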

training on CPU

I am running your code on macOS and got the following error. I searched online and found that Macs have no nvidia-smi command, which comes with the NVIDIA drivers. I tried commenting out the line pw.nvidia_memory_map(gpu_index = gpu_index), and now the code runs. But I am not sure whether that is correct.

[2019-10-02 16:09:23,892] Epoch: 0                                              
[2019-10-02 16:09:23,892] It's recommended to set ``CUDA_DEVICE_ORDER``to be ``PCI_BUS_ID`` by ``export CUDA_DEVICE_ORDER=PCI_BUS_ID``;otherwise, it's not guaranteed that the gpu index frompytorch to be consistent the ``nvidia-smi`` results.
Traceback (most recent call last):
  File "train_partial_ner.py", line 124, in <module>
    pw.nvidia_memory_map(gpu_index = gpu_index)
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/site-packages/torch_scope/wrapper.py", line 483, in nvidia_memory_map
    return basic_wrapper.nvidia_memory_map(use_logger = use_logger, gpu_index = gpu_index)
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/site-packages/torch_scope/wrapper.py", line 190, in nvidia_memory_map
    '--format=csv,noheader'])
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/subprocess.py", line 395, in check_output
    **kwargs).stdout
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/subprocess.py", line 472, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/subprocess.py", line 1522, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'nvidia-smi': 'nvidia-smi'
Done.
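
Instead of commenting the call out, it can be guarded so that it only runs when the nvidia-smi executable is present. The sketch below uses a hypothetical helper; the guarded call in the comment refers to the line from the traceback, and whether skipping it has other effects on training is not confirmed here.

```python
import shutil


def has_nvidia_smi() -> bool:
    """True if an nvidia-smi executable is found on the PATH."""
    return shutil.which("nvidia-smi") is not None


# Hypothetical guard around the call shown in the traceback:
# if has_nvidia_smi():
#     pw.nvidia_memory_map(gpu_index=gpu_index)
print(has_nvidia_smi())
```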

Chinese language experiments

Hi Shang,
Has your team experimented with this method on Chinese data? Could you share your progress and any plans for releasing a multilingual version of AutoNER?
Thanks

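Questions about AutoNER's performance

Hi,
I have a few questions about AutoNER and hope to get answers.
When will the AutoNER paper be released?
How well does AutoNER perform? Are there comparison charts on the relevant datasets?
What is the idea behind AutoNER? From what I have read so far, it introduces chunk labels? I have not fully understood the details yet and would appreciate an explanation.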

Retrain model

Hi,
I have trained a model. How do I set the args to continue training from the existing one?

FileNotFoundError: [Errno 2] No such file or directory: './models/BC5CDR/annotations.ck'

could anyone help me with this error?

Tailored Dict

Hi,
I am wondering whether the dictionary in the following file is the tailored dictionary or not. If not, where can I find it?
/data/BC5CDR/dict_core.txt
Thanks

Two input files (dict_core & dict_full)

Hi Jingbo,

Your model needs two input files (dict_core & dict_full), and I find that dict_full contains dict_core. So why don't you combine the two files and annotate all words?

Thanks.

About optional DEV_SET and TEST_SET

I got this error after running the AutoNER without DEV_SET and TEST_SET:

Traceback (most recent call last):
  File "preprocess_partial_ner/encode_folder.py", line 281, in <module>
    testa_dataset = encode_dataset(args.input_testa, w_map, c_map, cl_map, tl_map)
  File "preprocess_partial_ner/encode_folder.py", line 221, in encode_dataset
    features, labels_chunk, labels_point, labels_typing = read_corpus(lines)
  File "preprocess_partial_ner/encode_folder.py", line 115, in read_corpus
    assert len(line) == 3, "the format of corpus"
AssertionError: the format of corpus

I noticed that the ./autoner_train.sh tries to use TRAINING_SET as DEV_SET and TEST_SET:

if [ DEV_SET == "" ]; then
    DEV_SET=$TRAINING_SET
fi
if [ TEST_SET == "" ]; then
    TEST_SET=$TRAINING_SET
fi

But somehow such replacement wouldn't happen during execution, so I manually replaced them.
It seems that TRAINING_SET (or annotation.ck) has one more column than the required format of DEV_SET/TEST_SET, does it mean such replacement is not valid and DEV_SET and TEST_SET are actually required?
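
The column mismatch described here can be confirmed directly by counting columns per line in each file. The sketch below uses a hypothetical helper and assumes whitespace-separated columns; it only reports the counts and does not settle whether the replacement is valid.

```python
from collections import Counter
from typing import Iterable


def column_counts(lines: Iterable[str]) -> Counter:
    """Count how many non-empty lines have each number of columns."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():
            counts[len(line.split())] += 1
    return counts


# Example usage with the file names mentioned in this page:
# with open("models/BC5CDR/annotations.ck", encoding="utf-8") as f:
#     print(column_counts(f))
# with open("data/BC5CDR/truth_dev.ck", encoding="utf-8") as f:
#     print(column_counts(f))
```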