Code Monkey home page Code Monkey logo

ucphrase-exp's Introduction

UCPhrase: Unsupervised Context-aware Quality Phrase Tagging

To appear on KDD'21...[pdf]

This project provides an unsupervised framework for mining and tagging quality phrases on text corpora. In this work, we recognize the power of pretrained language models in identifying the structure of a sentence. The attention matrices generated by a Transformer model are informative to distinguish quality phrases from ordinary spans, as illustrated in the following example.

drawing

With a lightweight CNN model to capture inter-word relationships from various ranges, we can effectively tackle the task of phrase tagging as a multi-channel image classifiaction problem.

For model training, we seek to alleviate the need for human annotation and external knowledge bases. Instead, we show that sufficient supervision can be directly mined from large-scale unlabeled corpus. Specifically, we mine frequent max patterns with each document as context, since by definition, high-quality phrases are sequences that are consistently used in context. Compared with labels generated by distant supervision, silver labels mined from the corpus itself preserve better diversity, coverage, and contextual completeness. The superiority is supported by comparison on two public datasets.

image

We compare our method with existing ones on the KP20k dataset (publication data from CS domain) and the KPTimes dataset (news articles). UCPhrase significantly outperforms prior arts without supervision. Compared with off-the-shelf phrase tagging tools, UCPhrase also shows unique advantages, especially in its ability to generalize to specific domains without reliance on manually curated labels or KBs. We provide comprehensive case studies to demonstrate the comparison among different tagging methods. We also have some interesting findings in the discussion sections.

We aim to build UCPhrase as a practical tool for phrase tagging, though it is certainly far from perfect. Please feel free to try on your own corpus and give us feedbacks if you have any ideas that can help build better phrase tagging tools!

Facts: UCPhrase is a joint work by researchers from UI at Urbana Champaign, and University of California San Diago.

Quick Start

Step 1: Download and unzip the data folder

wget https://www.dropbox.com/s/1bv7dnjawykjsji/data.zip?dl=0 -O data.zip
unzip -n data.zip

Step 2: Install and compile dependencies

bash build.sh

Step 3: Run experiments

cd src
python exp.py --gpu 0 --dir_data ../data/devdata

The result files are organized under experiments/${expname}/model:

  • kpcand.decoded.epoch-${best_epoch}/eval.doc2cands-xxx.json
    • results of document-level candidate phrase extraction in the form of "docid": [phrases].
    • these phrases serve as the candidates of the phrase ranking phase for keyphrase extraction.
  • tagging.decoded.epoch-${best_epoch}/doc2sents-xxx.json
    • results of sentence-level phrase tagging in the form of a dictionary.
    • Each key is a document id.
    • The value is a list of sentences in the document.
    • Each sentence consists of its tokenized wordpieces ("tokens"), and the tagged phrases ("spans").
    • Each span is a tuple of (start_token_index, end_token_index, surface_string).

Citation

If you find the implementation useful, please consider citing the following paper:

Xiaotao Gu*, Zihan Wang*, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han, Jingbo Shang, "UCPhrase: Unsupervised Context-aware Quality Phrase Tagging", in Proc. of 2021 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'21), Aug. 2021

ucphrase-exp's People

Contributors

xgeric avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.