BLINKout: Out-of-KB Mention Discovery

This is the official repository for the paper "Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking", accepted at CIKM 2023.

The study adapts BLINK, a BERT-based Entity Linking model, to identify mentions that have no corresponding KB entity by matching them to a special NIL entity. It does so through NIL entity representation, NIL classification, and synonym enhancement.

The study also applies KB Pruning and Versioning strategies to automatically construct out-of-KB datasets from common in-KB Entity Linking datasets. Please see the model training and data construction scripts below.

Model Training and Inference

See step_all_BLINK.sh for running BLINK models with Threshold-based and NIL-rep-based methods.

See step_all_BLINKout.sh for running BLINKout models and the dynamic feature baseline.

See step_all_BM25+cross-enc.sh for all BM25+BERT models.

For all scripts above:

  • Set dataset (and mm_onto_ver_model_mark for MedMentions).
  • Set bi_enc_bertmodel and cross_enc_bertmodel, and change further_model_mark accordingly.
  • Set train_bi (except for BM25), rep_ents, train_cross, and inference to true to perform each step.
  • Set use_best_top_k to true to use the tuned top-k; otherwise the default top-k is used.
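
As a rough illustration of the shared settings listed above, the block below assigns each variable the way it might appear near the top of one of the scripts. Only the variable names come from this README; every value (the dataset name, the BERT model identifiers, the empty model marks) is a placeholder assumption, and the actual scripts may define or consume these variables differently.

```shell
# Sketch only: variable names are from the README, all values are assumed placeholders.
dataset="my_dataset"                     # hypothetical dataset name
mm_onto_ver_model_mark=""                # only relevant when the dataset is MedMentions
bi_enc_bertmodel="bert-base-uncased"     # assumed model identifier
cross_enc_bertmodel="bert-base-uncased"  # assumed model identifier
further_model_mark=""                    # adjust to match the chosen BERT models

# Step flags named in the README; "true" switches the step on.
train_bi=true          # not applicable to the BM25 script
rep_ents=true
train_cross=true
inference=true
use_best_top_k=false   # assumed spelling of the "off" value; keeps the default top-k

# Collect and print the steps that are switched on.
enabled=""
for step in train_bi rep_ents train_cross inference; do
  eval "value=\$$step"
  if [ "$value" = "true" ]; then
    enabled="$enabled $step"
  fi
done
echo "enabled steps:$enabled"
```

Setting a flag to anything other than true removes that step from the printed list, which gives a quick check of what a run is expected to do before the script is started.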

For step_all_BLINK.sh, additionally:

  • Set use_NIL_threshold to true when using the Threshold-based approach, and set th2 to the corresponding threshold value for each dataset.
  • Set use_NIL_ranking to true when using the NIL-rep-based approach, and set the binary NIL representation parameters use_NIL_tag, use_NIL_desc, and use_NIL_desc_tag.
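
The two NIL-handling modes of step_all_BLINK.sh can be sketched as follows. The snippet assumes that the flags take the literal values true and false and that a run uses one approach at a time; the README states neither explicitly. The th2 value is an arbitrary placeholder, not a tuned threshold, and nil_mode is a hypothetical helper that is not part of the repository.

```shell
# Sketch only: values are placeholders; th2 has to be chosen per dataset.
use_NIL_threshold=true
th2=0.5                  # hypothetical threshold value
use_NIL_ranking=false
use_NIL_tag=false        # NIL representation flags for the NIL-rep-based approach
use_NIL_desc=false
use_NIL_desc_tag=false

# Hypothetical helper: report which approach the current settings select.
nil_mode() {
  if [ "$use_NIL_threshold" = "true" ] && [ "$use_NIL_ranking" = "true" ]; then
    echo "both"          # assumed to be unintended
  elif [ "$use_NIL_threshold" = "true" ]; then
    echo "threshold (th2=$th2)"
  elif [ "$use_NIL_ranking" = "true" ]; then
    echo "NIL-rep"
  else
    echo "none"
  fi
}

mode=$(nil_mode)
echo "NIL handling: $mode"
```

Switching use_NIL_threshold to false and use_NIL_ranking to true makes the helper report the NIL-rep-based approach instead, at which point the three NIL representation flags become the settings of interest.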

For step_all_BLINKout.sh, additionally:

  • Set the binary NIL representation parameters use_NIL_tag, use_NIL_desc, and use_NIL_desc_tag.
  • Set dynamic_emb_extra_ft_baseline to true and select the corresponding line (around lines 273-274) to use either the NIL regulariser (gu2021) or the dynamic feature baseline (full-features-NIL-infer); also set the value of lambda_NIL.
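
Because these settings are edited by hand, a small sanity check before launching a long run may be useful. The sketch below is not part of the repository: is_bool and is_decimal are hypothetical helpers, the flags are assumed to be written as true or false, and the lambda_NIL value is a placeholder rather than a recommended setting.

```shell
# Sketch only: placeholder values, not recommended settings.
use_NIL_tag=true
use_NIL_desc=false
use_NIL_desc_tag=false
dynamic_emb_extra_ft_baseline=false
lambda_NIL=0.25          # hypothetical value

# Succeed only if the argument is the literal string true or false.
is_bool() {
  case "$1" in
    true|false) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeed only if the argument is a non-negative decimal such as 0, 1 or 0.25.
is_decimal() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)?$'
}

# Check every flag, then the regulariser weight; keep the last problem found.
status="ok"
for name in use_NIL_tag use_NIL_desc use_NIL_desc_tag dynamic_emb_extra_ft_baseline; do
  eval "value=\$$name"
  is_bool "$value" || status="bad flag: $name"
done
is_decimal "$lambda_NIL" || status="bad lambda_NIL"
echo "$status"
```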

For step_all_BM25+cross-enc.sh:

  • This script requires the tokenizer of the saved biencoder model, so run step_all_BLINK.sh with the same biencoder model before running it.
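
The ordering requirement above can be sketched as a small wrapper that runs the scripts one after another and stops at the first failure. The run_in_order function and the DRY_RUN switch are hypothetical conveniences, not part of the repository; the sketch also assumes both scripts sit in the current directory and have already been configured with the same biencoder model.

```shell
# Hypothetical helper: run scripts in the given order, stopping at the first failure.
# With DRY_RUN=1 the commands are only printed, not executed.
run_in_order() {
  for script in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "bash $script"
    else
      bash "$script" || return 1
    fi
  done
}

# Print the plan first; set DRY_RUN=0 to actually run the scripts.
DRY_RUN=1
plan=$(run_in_order step_all_BLINK.sh "step_all_BM25+cross-enc.sh")
echo "$plan"
```

Quoting the second script name is not strictly required, but it keeps the "+" in the file name clearly visible as part of a single argument.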

Data Availability and Data Sources

Link to out-of-KB mention discovery datasets: https://zenodo.org/record/8228371.

We acknowledge the sources below for data construction:

Data Scripts

See the files under the preprocessing folder; the commands that create the datasets are in run_preprocess_ents_and_data.sh.

Acknowledgement

The repository is based on BLINK under the MIT license. We also acknowledge the data sources listed above.
