autoner's Introduction

AutoNER

Check Our New NER Toolkit🚀🚀🚀

  • Inference:
    • LightNER: inference w. models pre-trained / trained w. any following tools, efficiently.
  • Training:
    • LD-Net: train NER models w. efficient contextualized representations.
    • VanillaNER: train vanilla NER models w. pre-trained embedding.
  • Distant Training:
    • AutoNER: train NER models w.o. line-by-line annotations and get competitive performance.

AutoNER trains named entity taggers with distant supervision, without requiring line-by-line annotations.

Details about AutoNER can be accessed at: https://arxiv.org/abs/1809.03599

Model Notes

[Figure: AutoNER framework]

Benchmarks

Method                 Precision  Recall  F1
Supervised Benchmark   88.84      85.16   86.96
Dictionary Match       93.93      58.35   71.98
Fuzzy-LSTM-CRF         88.27      76.75   82.11
AutoNER                88.96      81.00   84.80

Training

Required Inputs

  • Tokenized Raw Texts
    • Example: data/BC5CDR/raw_text.txt
      • One token per line.
      • An empty line means the end of a sentence.
  • Two Dictionaries
    • Core Dictionary w/ Type Info
      • Example: data/BC5CDR/dict_core.txt
        • Two columns (i.e., Type, Tokenized Surface) per line.
        • Tab separated.
      • How to obtain?
        • From domain-specific dictionaries.
    • Full Dictionary w/o Type Info
      • Example: data/BC5CDR/dict_full.txt
        • One tokenized high-quality phrase per line.
      • How to obtain?
        • From domain-specific dictionaries.
        • By applying a high-quality phrase mining tool to a domain-specific corpus.
  • Pre-trained word embeddings
    • Train your own or download from the web.
    • The example run uses embedding/bio_embedding.txt, which can be downloaded from our group's server, for example: curl http://dmserv4.cs.illinois.edu/bio_embedding.txt -o embedding/bio_embedding.txt. Because the embedding encoding step consumes a lot of memory, we also provide the already-encoded file via autoner_train.sh.
  • [Optional] Development & Test Sets.
    • Example: data/BC5CDR/truth_dev.ck and data/BC5CDR/truth_test.ck
      • Three columns (i.e., token, Tie or Break label, entity type).
      • I is Break.
      • O is Tie.
      • Two special tokens <s> and <eof> mean the start and end of the sentence.
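
The input formats listed above can be read with a few lines of Python. The following is a minimal sketch, not code from the repository: the function names and the sample tokens and dictionary entries are hypothetical, and the .ck reader assumes whitespace-separated columns.

```python
def read_sentences(lines):
    """Group one-token-per-line text into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(current)
            current = []
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


def read_core_dict(lines):
    """Parse 'Type<TAB>Tokenized Surface' lines into (type, [tokens]) pairs."""
    entries = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue
        fields = text.split("\t")
        if len(fields) != 2:
            raise ValueError("line %d: expected 2 tab-separated columns" % number)
        entity_type, surface = fields
        entries.append((entity_type, surface.split()))
    return entries


def read_ck(lines):
    """Parse 'token label type' rows (assumed whitespace-separated) into tuples."""
    rows = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 columns" % number)
        rows.append(tuple(fields))
    return rows


# Made-up sample data, only to show the shapes of the parsed results.
raw_lines = ["Aspirin", "helps", ".", "", "It", "works", "."]
dict_lines = ["Chemical\taspirin\n", "Disease\theart failure\n"]
ck_lines = ["<s> O None", "aspirin I Chemical", "<eof> I None"]

sentences = read_sentences(raw_lines)
core_entries = read_core_dict(dict_lines)
ck_rows = read_ck(ck_lines)
```

In practice the lists would be replaced by open file handles, e.g. for data/BC5CDR/raw_text.txt and data/BC5CDR/dict_core.txt.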

Dependencies

This project requires Python >= 3.6. Its dependencies are listed below:

numpy==1.13.1
tqdm
torch-scope>=0.5.0
pytorch==0.4.1

Command

To train an AutoNER model, please run

./autoner_train.sh

To apply the trained AutoNER model, please run

./autoner_test.sh

You can specify the parameters in the bash files. The variable names are self-explanatory.

Citation

Please cite the following two papers if you are using our tool. Thanks!

@inproceedings{shang2018learning,
  title = {Learning Named Entity Tagger using Domain-Specific Dictionary}, 
  author = {Shang, Jingbo and Liu, Liyuan and Ren, Xiang and Gu, Xiaotao and Ren, Teng and Han, Jiawei}, 
  booktitle = {EMNLP}, 
  year = 2018, 
}

@article{shang2018automated,
  title = {Automated phrase mining from massive text corpora},
  author = {Shang, Jingbo and Liu, Jialu and Jiang, Meng and Ren, Xiang and Voss, Clare R and Han, Jiawei},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year = {2018},
  publisher = {IEEE}
}

autoner's People

Contributors

liyuanlucasliu, shangjingbo1226

autoner's Issues

Does it work for single label?

I was trying to run the model with a single label, i.e., marking all technical terms, and created the relevant datasets and dictionaries. But during training, the model shows no progress bar and does not train.
In the end, it reports an F1 of -inf on dev.

Does this model work for single-label NER?

What is the format of bio_embedding.txt?

It is hard for us to download bio_embedding.txt because its size is nearly 10 GB.

We hope to construct one ourselves, so what is the basic format of bio_embedding.txt?

Can you show us a simple screenshot? Thank you very much.

Can autoner be applied to Chinese NER?

First, I appreciate your work very much. I would now like to run some experiments on Chinese NER, so my questions are:
1. Is it possible or not?
2. If so, what changes should be made?
Thank you in advance.

bio_embedding.txt link not working

Hey Prof. Shang,

The link to bio_embedding.txt is broken. Do you have a new place to host that file? Or could you explain the format of the embedding file? Is it similar to GloVe's embedding file?

Thanks!

loading dataset error

When I train the model (./autoner_train.sh), an error occurs:
Traceback (most recent call last):
File "train_partial_ner.py", line 66, in
dataset = pickle.load(open(args.eval_dataset, 'rb'))
FileNotFoundError: [Errno 2] No such file or directory: './models/BC5CDR/encoded_data/test.pk'

Where can I find test.pk?

I also find that the directory './models/BC5CDR/encoded_data/' is empty, so train_0.pk is missing as well.

About the C++ code

Hi, I am reading the C++ code in the repo. I am not an expert in C++, so I only have a rough sense that the code annotates the raw texts. But the output format of the C++ code (annotation.ck) is a bit different from truth_dev.ck and truth_test.ck, as annotation.ck has a fourth column.

<s> O None S
( I None S
Sch O None D
) O None D
was I None S
administered I None S
i.v I None S
. I None S
<eof> I None S

Is the fourth column used for another model (Fuzzy-LSTM-CRF) in your paper? Is it using IOBES tagging?

I am going to translate the C++ code into Python in order to prepare data for AutoNER, so I don't need the fourth column, right? In addition, could you please give more insight into the C++ code? What algorithm do you use (e.g., a trie)? What exactly does the code do?

To be more specific, in the example I posted above, does the line Sch O None D indicate that Sch is tied with the previous token (? Similarly, for the next line, is ) tied with the previous token Sch? In summary, is ( Sch ) an entity detected as None type?
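
The reading proposed in this question can be made concrete with a short sketch. It assumes, as the question does and as the README's "I is Break, O is Tie" suggests, that the label on a token describes the boundary between that token and the one before it. This is an illustration of that interpretation, not the repository's decoding code; chunks_from_labels is a hypothetical name.

```python
def chunks_from_labels(rows):
    """Group tokens into chunks.

    rows is a list of (token, label) pairs, where 'I' means Break before the
    token and 'O' means Tie with the previous token (assumed interpretation).
    """
    chunks = []
    for token, label in rows:
        if label == "O" and chunks:
            chunks[-1].append(token)  # tie: extend the current chunk
        else:
            chunks.append([token])    # break (or very first token): new chunk
    return chunks


# The first two columns of the annotation.ck excerpt quoted above.
example = [
    ("<s>", "O"), ("(", "I"), ("Sch", "O"), (")", "O"), ("was", "I"),
    ("administered", "I"), ("i.v", "I"), (".", "I"), ("<eof>", "I"),
]
example_chunks = chunks_from_labels(example)
```

Under this assumption, ( Sch ) comes out as a single three-token chunk and every other token forms its own chunk.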

Moreover, why aren't there any Unknown tags in the annotation? According to your paper, if at least one of the tokens belongs to an unknown-typed high-quality phrase, the tokens would be tagged as Unknown.

Questions about the unknown type high quality phrases.

Hi, the original paper says

In our AutoNER model, these “unknown” positions have undefined boundary and type losses, because (1) they make the boundary labels unclear; and (2) they have no type labels. Therefore, they are skipped.

Does that mean high-quality phrases should not have the entity types that we are trying to identify? Otherwise, the model will predict them as Entity Type: None, as shown in Figure 2 for 8GB RAM. And if AutoNER is applied to the example in Figure 1, can it, and should it, identify prostaglandin synthesis as a named entity?

Thanks.

multi tag?

	label mapping: check --> 1
	label mapping: disease --> 2
	label mapping: food --> 3
	label mapping: drug --> 4
	label mapping: symptom --> 5
	label mapping: body --> 6
	label mapping: S --> 7
	label mapping: operation --> 8
	label mapping: D --> 9
	label mapping: check,drug --> 10
	label mapping: drug,operation --> 11

What does S stand for?
What does D stand for?
Do the last two mean multi-tag?

_pickle.UnpicklingError: pickle data was truncated ---on bio_embedding.pk

mldl@ub1604:~/ub16_prj/AutoNER$ md5sum models/BC5CDR/bio_embedding.pk
dd549629b7ea9cf97d7df62cd16c0e9f models/BC5CDR/bio_embedding.pk

mldl@ub1604:/ub16_prj/AutoNER$ python3.6 preprocess_partial_ner/encode_folder.py --input_train models/BC5CDR/annotations.ck --input_testa data/BC5CDR/truth_dev.ck --input_testb data/BC5CDR/truth_test.ck --pre_word_emb models/BC5CDR/embedding.pk --output_folder models/BC5CDR/encoded_data
args.pre_word_emb is models/BC5CDR/embedding.pk
Traceback (most recent call last):
File "preprocess_partial_ner/encode_folder.py", line 263, in
w_emb = pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated
mldl@ub1604:~/ub16_prj/AutoNER$

How does Fuzzy CRF work during decoding?

Hi, I'm curious about the decoding process of Fuzzy CRF.
The paper says "For inference, we apply the Viterbi algorithm to maximize the score." But the original Viterbi algorithm selects only the single path with the highest score. In the multi-label setting there may be multiple valid paths, so how is the number of paths decided? Is there a threshold or something similar in the decoding process? Since I am unfamiliar with C++, I can't find the decoding process in the repository.
Looking forward to your answer, thank you!
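
For reference, textbook Viterbi decoding over a linear chain does return exactly one highest-scoring label path, as the question notes. The sketch below shows that standard algorithm only; it makes no claim about how Fuzzy CRF decoding is implemented in the paper or the repository. The score layout (emissions[t][y] and transitions[p][y]) and the function name are assumptions made for illustration.

```python
def viterbi(emissions, transitions):
    """Return the single best label path.

    emissions[t][y]  : score of label y at position t
    transitions[p][y]: score of moving from label p to label y
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n_labels):
            # Best previous label for arriving at label y.
            best_prev = max(range(n_labels),
                            key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy scores: two labels, three positions, switching labels is penalized.
toy_emissions = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
toy_transitions = [[0.0, -5.0], [-5.0, 0.0]]
best_path = viterbi(toy_emissions, toy_transitions)
```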

Where can I find the file "test.pk"?

Traceback (most recent call last):
File "train_partial_ner.py", line 66, in
dataset = pickle.load(open(args.eval_dataset, 'rb'))
FileNotFoundError: [Errno 2] No such file or directory: './models/BC5CDR/encoded_data/test.pk'

Dictionaries used in the paper and other bio medical dataset

Hi, thanks for sharing your implementation. I loved reading your paper and have some doubts.
In your paper you mentioned using

MeSH database and the CTD Chemical and Disease vocabularies. The dictionary contains 322,882 Chemical and Disease entity surfaces.

Whereas the dictionary provided contains much fewer terms.

  1. Did you use the provided dictionary for the results presented in the paper?
  2. Also, is the same provided dictionary used for NCBI-disease dataset?
  3. Would it be possible to share the datasets and dictionaries used for other datasets?

Fuzzy CRF

Hi,
Is there any code related to Fuzzy CRF experiments that you reported in your paper?
Thanks.

dataset generate error

=== Generating Distant Supervision ===
loading KB...
generate: src/annotation.h:187: void Annotation::loadKBForMatching(const string&, const string&): Assertion `tokens.size() == 2' failed.

dict_core.txt is not separated by \t
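
The assertion tokens.size() == 2 appears to fire when a dict_core.txt line does not split into exactly two tab-separated fields (Type, Tokenized Surface). Below is a minimal sketch for locating such lines before running the generator. The function name and the sample entries are hypothetical, and blank lines are reported too, since it is not known how the C++ loader treats them.

```python
def find_malformed_lines(lines):
    """Return (line_number, text) for lines without exactly two tab-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if len(text.split("\t")) != 2:
            bad.append((number, text))
    return bad


# Made-up entries: the second one uses a space where a tab is required.
sample_dict = ["Chemical\taspirin\n", "Disease heart failure\n"]
problems = find_malformed_lines(sample_dict)
```

To check a real file, pass an open handle instead of the list, e.g. open("data/BC5CDR/dict_core.txt", encoding="utf-8").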

Question about the data processing

Thanks for the work.
When I dug into the code, I found that in the output of the function 'read_noisy_corpus', every start token is marked as 'O', but the end token is not. Is there a specific reason for this?

TypeError: __init__() got an unexpected keyword argument 'encoding'

(py3env) gpuws@gpuws32g:/ub16_prj/AutoNER$ CUDA_VISIBLE_DEVICES=0 python3 train_partial_ner.py --cp_root models/BC5CDR/checkpoint/ --checkpoint_name autoner
[2018-11-05 07:42:13,004] Checkpoint Folder Already Exists: models/BC5CDR/checkpoint/autoner
[2018-11-05 07:42:13,004] Input 'yes' to confirm deleting this folder; or 'no' to exit.
yes for delete or no for exit: yes
[2018-11-05 07:42:16,398] Saving system environemnt and python packages
[2018-11-05 07:42:16,709] It's recommended to set CUDA_DEVICE_ORDERto be PCI_BUS_ID by export CUDA_DEVICE_ORDER=PCI_BUS_ID;otherwise, it's not guaranteed that the gpu index frompytorch to be consistent the nvidia-smi results.
Traceback (most recent call last):
File "train_partial_ner.py", line 61, in
gpu_index = pw.auto_device() if 'auto' == args.gpu else int(args.gpu)
File "/home/gpuws/py3env/lib/python3.5/site-packages/torch_scope/wrapper.py", line 530, in auto_device
return basic_wrapper.auto_device(metrics = metrics, logger = self.logger, use_logger = use_logger, required_minimal = required_minimal, wait_time = wait_time)
File "/home/gpuws/py3env/lib/python3.5/site-packages/torch_scope/wrapper.py", line 250, in auto_device
memory_list = basic_wrapper.nvidia_memory_map(logger = logger)
File "/home/gpuws/py3env/lib/python3.5/site-packages/torch_scope/wrapper.py", line 175, in nvidia_memory_map
'--format=csv,noheader'], encoding='utf-8')
File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
**kwargs).stdout
File "/usr/lib/python3.5/subprocess.py", line 693, in run
with Popen(*popenargs, **kwargs) as process:
TypeError: __init__() got an unexpected keyword argument 'encoding'
(py3env) gpuws@gpuws32g:/ub16_prj/AutoNER$
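
The paths in the traceback point to Python 3.5, while the README asks for Python >= 3.6. The encoding keyword of subprocess.check_output was added in Python 3.6, so running under 3.6 or later should avoid this error. For illustration only, the sketch below shows a call that does not rely on that keyword and decodes the output bytes by hand. run_and_decode is a hypothetical helper, not part of torch_scope.

```python
import subprocess
import sys


def run_and_decode(command):
    """Run a command and return its stdout as text, without the `encoding` keyword."""
    raw = subprocess.check_output(command)  # bytes when no encoding is given
    return raw.decode("utf-8")


# Harmless demonstration command: ask the current interpreter to print a word.
output = run_and_decode([sys.executable, "-c", "print('ok')"])
```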

Question about the results for LaptopReview

Hi,
Could you please let me know how you ran the experiments on the LaptopReview dataset, and what the Gold and Distant supervision datasets were? The gold set has 3,845 sentences, and in Figure 3.c it seems that you used another dataset (raw text) for the distantly supervised annotation.

Also, if possible, could you give me the final scores of AutoNER-Gold-DistantSupervision for Figures 3.a and 3.c, where it uses the whole training set?

Thanks
Farhad

Could not replicate the Dictionary match results

Hello,

How to replicate the "DictionaryMatch" results on the BC5CDR/NCBI-disease/LaptopReview datasets?
I followed the "autoner_train.sh" script till "Generating Distant Supervision" to annotate the "test data" using the "dictionary_core". However, the precisions are a lot lower than the numbers mentioned in the paper.

Could you please help me figure out how to get the results of "DictionaryMatch" as mentioned in the paper?

Is there a separate script for "DictionaryMatch"?

Thanks :)

_pickle.UnpicklingError: pickle data was truncated

(autoner) A@7420:~/AutoNER-master$ ./autoner_train.sh
=== Compilation ===
mkdir -p bin
g++ -std=c++11 -Wall -O3 -msse2 -fopenmp -I.. -pthread -lm -Wno-unused-result -Wno-sign-compare -Wno-unused-variable -Wno-parentheses -Wno-format -o bin/generate src/generate.cpp
=== Generating Distant Supervision ===
loading KB...
core dict inserted
full dict marked
cleaning stopwords...
initialized! # of trie nodes = 23819
=== Encoding Dataset ===
Traceback (most recent call last):
File "preprocess_partial_ner/encode_folder.py", line 262, in
w_emb = pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated

For reproducing the experimental results with LaptopReview dataset

Hello,
Thank you for sharing your code!

I would like to reproduce your experiments. I can run the same experiments with BC5CDR (following the README) and NCBI-Disease (following the descriptions in your paper).
But I cannot follow exactly the same procedure for LaptopReview. To do that, I need a "domain-specific dictionary" and an "unknown-typed high-quality phrase" list.

I'm not sure whether the source of the domain-specific dictionary has changed since then.
Could you share the dictionary with us?

In terms of the high-quality phrase list, we will make the same list with your "AutoPhrase" and Amazon laptop reviews as you say in your paper.
But, some preprocessing is required to feed the review dataset into AutoPhrase and there are some options about it. For example, whether or not we include the titles of the reviews, what sentence splitter we will use, and so on.
I would appreciate it if you could share the high-quality phrase list.

Thanks.

LaptopReview dataset

Hi,
Is it possible to share the Raw Sent. dataset and dictionary that you used in experiments for LaptopReview dataset?

Thanks.

File not found test.pk

When running the training script, no test.pk file is created. Is there a solution to this?

Question about train/dev/test data

Hi Jingbo,

Thanks for providing the tool, which is very useful.

I have a question about the data. How did you split the data into train/dev/test sets? I find some sentences in raw_text that are also in truth_dev.ck and truth_test.ck. Does this mean that some of the dev/test data are in the training set as well?

In addition, I also wonder whether you evaluated performance on the auto-annotated dataset or human-annotated dataset? You mention that dev/test files are optional, I think in this case, there are no human-annotated data for evaluation.

Thanks a lot.
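
One way to check for such overlap is to compare token sequences between the two files. The following is a minimal sketch, assuming the formats described in the README (one token per line with blank lines between sentences for raw_text; <s> and <eof> delimiting sentences in the .ck files). The function names and sample sentences are hypothetical.

```python
def raw_sentences(lines):
    """Collect sentences (as token tuples) from one-token-per-line text."""
    sentences, current = set(), []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.add(tuple(current))
            current = []
    if current:
        sentences.add(tuple(current))
    return sentences


def ck_sentences(lines):
    """Collect sentences from .ck rows, using <s> and <eof> as delimiters."""
    sentences, current = set(), []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        token = fields[0]
        if token == "<s>":
            current = []
        elif token == "<eof>":
            if current:
                sentences.add(tuple(current))
            current = []
        else:
            current.append(token)
    return sentences


# Made-up sample data; real use would pass open file handles instead.
raw = ["Aspirin", "helps", "", "It", "works"]
dev = ["<s> O None", "It I None", "works I None", "<eof> I None"]
shared = raw_sentences(raw) & ck_sentences(dev)
```

The size of the intersection relative to the dev or test set gives a rough measure of how much evaluation text also appears in the raw training text.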

Mistake when constructing new_w_map

In encode_folder.py, the function below narrows the word mapping from a pre-trained embedding file, such as glove.100.pk, to the words that appear in the documents (train & test). Words in the documents can contain capital letters, but a pre-trained embedding file such as glove.100.pk contains only lowercase words, so words with capital letters are ignored. For example, if the training set contains "Japan" but not "japan", the embedding of "japan" is not kept from glove.100.pk.
We should change word = line[0] to word = line[0].lower()

def filter_words(w_map, emb_array, ck_filenames):
    vocab = set()
    for filename in ck_filenames:
        for line in open(filename, 'r'):
            if not (line.isspace() or (len(line) > 10 and line[0:10] == '-DOCSTART-')):
                line = line.rstrip('\n').split()
                assert len(line) >= 3, 'wrong ck file format'
                word = line[0]
                vocab.add(word)
    new_w_map = {}
    new_emb_array = []
    # obtain the embedding of words appear in both wmap and vocab
    for (word, idx) in w_map.items():
        if word in vocab or word in ['<unk>', '<s>', '< >', '<\n>']:
            assert word not in new_w_map, "%s appears twice in ebd file"%word
            new_w_map[word] = len(new_emb_array)
            new_emb_array.append(emb_array[idx])
    print('filtered %d --> %d' % (len(emb_array), len(new_emb_array)))
    return new_w_map, new_emb_array
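
The effect of the proposed change can be seen on a toy vocabulary. This is a simplified sketch, not the repository's function: filter_vocab, its lowercase switch, and the sample mapping are hypothetical, and only the <unk> special token is kept for brevity.

```python
def filter_vocab(w_map, doc_words, lowercase):
    """Keep only embedding entries whose word occurs in the documents."""
    vocab = {w.lower() if lowercase else w for w in doc_words}
    return {word: idx for word, idx in w_map.items()
            if word in vocab or word == "<unk>"}


# Toy embedding index with lowercase entries only, as in the GloVe example.
toy_w_map = {"<unk>": 0, "japan": 1, "paris": 2}
toy_doc_words = ["Japan"]

kept_without_lowercasing = filter_vocab(toy_w_map, toy_doc_words, lowercase=False)
kept_with_lowercasing = filter_vocab(toy_w_map, toy_doc_words, lowercase=True)
```

Without lowercasing, "japan" is dropped because only "Japan" occurs in the documents; with lowercasing it is retained. Whether lowercasing is appropriate depends on the embedding file: a case-sensitive embedding would lose information this way.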

training on CPU

I am running your code on macOS and got the following error. Searching online, I found that Macs do not have the nvidia-smi command, which ships with the NVIDIA drivers. I commented out the line pw.nvidia_memory_map(gpu_index = gpu_index) and now the code runs. But I am not sure whether that is the correct thing to do.

[2019-10-02 16:09:23,892] Epoch: 0                                              
[2019-10-02 16:09:23,892] It's recommended to set ``CUDA_DEVICE_ORDER``to be ``PCI_BUS_ID`` by ``export CUDA_DEVICE_ORDER=PCI_BUS_ID``;otherwise, it's not guaranteed that the gpu index frompytorch to be consistent the ``nvidia-smi`` results.
Traceback (most recent call last):
  File "train_partial_ner.py", line 124, in <module>
    pw.nvidia_memory_map(gpu_index = gpu_index)
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/site-packages/torch_scope/wrapper.py", line 483, in nvidia_memory_map
    return basic_wrapper.nvidia_memory_map(use_logger = use_logger, gpu_index = gpu_index)
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/site-packages/torch_scope/wrapper.py", line 190, in nvidia_memory_map
    '--format=csv,noheader'])
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/subprocess.py", line 395, in check_output
    **kwargs).stdout
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/subprocess.py", line 472, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "/Users/weiling.chen/anaconda2/envs/py3/lib/python3.7/subprocess.py", line 1522, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'nvidia-smi': 'nvidia-smi'
Done.
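
The traceback shows the failure comes from launching nvidia-smi, which is absent on that machine. Rather than commenting the line out, one option is to guard the call. The following is a minimal sketch, assuming it is acceptable to skip GPU memory reporting when NVIDIA tools are missing. The helper names are hypothetical, and whether the rest of training behaves correctly on CPU is not something this sketch addresses.

```python
import shutil


def gpu_tools_available(executable="nvidia-smi"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


def maybe_report_gpu_memory(report_fn, gpu_index, executable="nvidia-smi"):
    """Call report_fn(gpu_index=...) only when the executable exists.

    Returns True if the report was made and False if it was skipped.
    """
    if not gpu_tools_available(executable):
        return False
    report_fn(gpu_index=gpu_index)
    return True
```

In train_partial_ner.py the guarded call would look like maybe_report_gpu_memory(pw.nvidia_memory_map, gpu_index), using the function named in the traceback above.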

Chinese language experiments

Hi Shang,
Has your team tried this method on Chinese data? Could you share your progress and any plans for releasing a multilingual version of AutoNER?
Thanks

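Questions about AutoNER's performance

Hi,
I have a few questions about AutoNER and hope to get them answered.
When will the AutoNER paper be released?
How well does AutoNER perform? Are there comparison charts on the corresponding datasets?
What is the idea behind AutoNER? From what I have seen so far, it introduces chunk labels? I have not fully understood it yet and would appreciate an explanation.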

Retrain model

Hi,
I have trained a model. How do I set the arguments to continue training from the existing one?

Tailored Dict

Hi,
I am wondering whether the dictionary in the following file is the tailored dictionary. If not, where can I find it?
/data/BC5CDR/dict_core.txt
Thanks

Two input files (dict_core & dict_full)

Hi Jingbo,

Your model needs two input files (dict_core & dict_full), and I find that dict_full contains dict_core. So why not combine the two files and annotate all words?

Thanks.

About optional DEV_SET and TEST_SET

I got this error after running the AutoNER without DEV_SET and TEST_SET:

Traceback (most recent call last):
  File "preprocess_partial_ner/encode_folder.py", line 281, in <module>
    testa_dataset = encode_dataset(args.input_testa, w_map, c_map, cl_map, tl_map)
  File "preprocess_partial_ner/encode_folder.py", line 221, in encode_dataset
    features, labels_chunk, labels_point, labels_typing = read_corpus(lines)
  File "preprocess_partial_ner/encode_folder.py", line 115, in read_corpus
    assert len(line) == 3, "the format of corpus"
AssertionError: the format of corpus

I noticed that the ./autoner_train.sh tries to use TRAINING_SET as DEV_SET and TEST_SET:

if [ DEV_SET == "" ]; then
    DEV_SET=$TRAINING_SET
fi
if [ TEST_SET == "" ]; then
    TEST_SET=$TRAINING_SET
fi

But somehow such replacement wouldn't happen during execution, so I manually replaced them.
It seems that TRAINING_SET (or annotation.ck) has one more column than the required format of DEV_SET/TEST_SET, does it mean such replacement is not valid and DEV_SET and TEST_SET are actually required?

OSError: [Errno 12] Cannot allocate memory

Hi,
When I trained the model, I got an error like this:
[screenshot of the error]

So I printed the system memory every epoch and found that the available memory decreases with each epoch:
[screenshots of memory usage per epoch]

I think memory should be released after every epoch, but it seems that it is not. How can I fix this error? Thanks.

P.S.:
dict_core: 1,000 lines (phrases)
dict_full: 2,000 lines (phrases)
raw_text: 200,000 lines (words)
