Large-scale Hierarchical Alignment

This code implements large-scale hierarchical alignment from the paper Large-scale Hierarchical Alignment for Data-driven Text Rewriting, presented at RANLP 2019.

The code builds Annoy indices from document/sentence embeddings of two datasets and then performs nearest neighbour search across the datasets. It first extracts similar documents (document alignment), and then similar sentences within those documents (sentence alignment). See the paper for more info.
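The two-stage idea can be illustrated with a minimal sketch. It is not the repository's implementation: it swaps Annoy's approximate search for a brute-force cosine search over toy vectors, and the function names (cosine, nearest, hierarchical_align) and thresholds (doc_th, sent_th) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, candidates):
    """Return (index, similarity) of the candidate most similar to query.

    Brute-force stand-in for an approximate nearest neighbour lookup.
    """
    sims = [cosine(query, c) for c in candidates]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


def hierarchical_align(src_docs, tgt_docs, doc_th, sent_th):
    """Align sentences in two stages: documents first, then sentences.

    src_docs / tgt_docs: lists of (doc_vector, [sentence_vectors]).
    Returns tuples (src_doc, src_sent, tgt_doc, tgt_sent, similarity).
    """
    pairs = []
    if not tgt_docs:
        return pairs
    tgt_doc_vecs = [doc_vec for doc_vec, _ in tgt_docs]
    for i, (doc_vec, sents) in enumerate(src_docs):
        # Stage 1: find the most similar target document.
        j, doc_sim = nearest(doc_vec, tgt_doc_vecs)
        tgt_sents = tgt_docs[j][1]
        if doc_sim < doc_th or not tgt_sents:
            continue
        # Stage 2: search for sentence matches only inside that document.
        for s, sent_vec in enumerate(sents):
            t, sim = nearest(sent_vec, tgt_sents)
            if sim >= sent_th:
                pairs.append((i, s, j, t, sim))
    return pairs


# Toy 2-dimensional "embeddings" (made up for illustration).
src_docs = [([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]])]
tgt_docs = [
    ([0.9, 0.1], [[1.0, 0.0], [0.5, 0.5]]),
    ([0.0, 1.0], [[0.0, 1.0]]),
]
pairs = hierarchical_align(src_docs, tgt_docs, doc_th=0.5, sent_th=0.8)
```

Restricting the sentence search to documents that were already matched keeps the second stage small, which is the reason for aligning hierarchically.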

Setting up

Install all project dependencies:

pip install -r requirements.txt

You will also need the linecache_light library, which is not available through pip and must be installed from source. To get it working in Python 3, I had to change one line in the file linecache_light/linecache_light.py: replace import cPickle as pkl with import pickle as pkl.
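As an alternative to editing the import by hand, a guarded import works on both Python versions. This is only a sketch of a common compatibility pattern, not what linecache_light itself ships; the data dictionary is a made-up example.

```python
try:
    import cPickle as pkl  # Python 2 only
except ImportError:
    import pickle as pkl  # Python 3: cPickle was merged into pickle

# Hypothetical payload, used only to show that the alias works.
data = {"offsets": [0, 12, 40]}
restored = pkl.loads(pkl.dumps(data))
```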

Running the aligner

1. Build document embedding indices

The first step is to build an index of document embeddings, which is implemented in the file build_annoy_index.py. For example, to align two files, source and target, that each contain one document per line, run:

python build_annoy_index.py -src_file source -emb sent2vec -vec_size 600
python build_annoy_index.py -src_file target -emb sent2vec -vec_size 600

These commands compute sent2vec embeddings. You will have to modify the script so that it points to the correct model file paths.

This will create two index files, source.sent2vec.ann and target.sent2vec.ann for the above example. Run python build_annoy_index.py --help for more info on all of the available options. The Annoy documentation also contains additional details.

2. Run the aligner

After you have the source and target indices prepared, you can run the aligner as:

python aligner.py -level hierarchical -src source -tgt target -emb sent2vec -vec_size 600 -batch_size 2000 -lower_th 0.65

Run python aligner.py --help for more info on the available options.

This will first extract similar document pairs and then similar sentence pairs. The final sentence pairs will be stored in two files: source.hier.None and target.hier.None. Additionally, a file source.target.sims.None will be created, which will contain the final similarities of the sentence pairs.

After alignment, you can post-filter the extracted pairs, e.g. using:

python filter.py -src source.hier.None -tgt target.hier.None -sim source.target.sims.None -low_sim_th 0.7
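The effect of a similarity threshold can be sketched as follows. This is an illustration under assumptions, not the code in filter.py: it assumes the three files are line-aligned, that pairs scoring below the threshold are dropped, and that a score equal to the threshold is kept. The function name filter_pairs is hypothetical, and the real script may apply further criteria.

```python
def filter_pairs(src_lines, tgt_lines, sims, low_sim_th):
    """Keep (source, target, similarity) triples with similarity >= low_sim_th.

    Assumes line i of each input describes the same sentence pair.
    """
    if not (len(src_lines) == len(tgt_lines) == len(sims)):
        raise ValueError("source, target and similarity lists must align")
    return [
        (src, tgt, sim)
        for src, tgt, sim in zip(src_lines, tgt_lines, sims)
        if sim >= low_sim_th
    ]


# Made-up sentence pairs and scores, standing in for the three output files.
src_lines = ["a cat sat on the mat", "hello there"]
tgt_lines = ["the cat sat on a mat", "see you tomorrow"]
sims = [0.82, 0.41]
kept = filter_pairs(src_lines, tgt_lines, sims, low_sim_th=0.7)
```

With real data the three lists would be read from the source, target and similarity files, one entry per line, with the similarities parsed as floats.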

Citation

@InProceedings{nikolov-alignment-ranlp19,
  author    = {Nikolov, Nikola  and  Hahnloser, Richard},
  title     = {Large-scale Hierarchical Alignment for Data-driven Text Rewriting},
  booktitle = {Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2019},
  year      = {2019}
}
