Colloquery

Colloquery is a web application for searching bilingual phrase translation tables for phrase translations (collocations) and synonyms.

It is developed for Van Dale by the Centre for Language and Speech Technology, Radboud University Nijmegen, and is licensed under the GNU Affero General Public License.

Installation

First, clone this repository and edit settings.py.

Colloquery is not trivial to set up and train, as it relies on several external dependencies: Python 3, MongoDB, MongoEngine and Django.

On Debian/Ubuntu systems, these can be installed using sudo apt-get install python3 mongodb python3-mongoengine python3-django.

For the data generation step, the following additional dependencies are required:

colibri-core (shipped as part of LaMachine)
colibri-mt

To create the phrase translation tables in the first place, use the Moses training pipeline, which in turn invokes GIZA++:

Moses
GIZA++

Data Generation

Prepare your parallel corpus files. A parallel corpus consists of two plain-text, UTF-8 encoded files: one for the source language (corpus.fr in our example) and one for the target language (corpus.en). Make sure they are tokenised, lower-cased and contain one sentence per line (you can use ucto for this). Sentences on the same line number in the two files are considered translations of each other.
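Because a misaligned corpus silently produces a poor phrase table, it can be worth checking the two files before training. The following is a minimal sketch, not part of Colloquery; the function name check_parallel_corpus is hypothetical. It assumes only the requirements stated above: equal line counts, no empty sentences, and lower-cased text.

```python
import sys
from itertools import zip_longest


def check_parallel_corpus(source_lines, target_lines):
    """Return a list of (line_number, message) problems in a sentence-aligned corpus.

    Hypothetical helper: it only checks the properties named in the text above.
    """
    problems = []
    pairs = zip_longest(source_lines, target_lines)
    for number, (src, tgt) in enumerate(pairs, start=1):
        if src is None or tgt is None:
            # One file ran out of lines before the other.
            problems.append((number, "files differ in length"))
            break
        for name, sentence in (("source", src), ("target", tgt)):
            text = sentence.strip()
            if not text:
                problems.append((number, f"empty {name} sentence"))
            elif text != text.lower():
                problems.append((number, f"{name} sentence is not lower-cased"))
    return problems


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_corpus.py corpus.fr corpus.en
    with open(sys.argv[1], encoding="utf-8") as src, \
         open(sys.argv[2], encoding="utf-8") as tgt:
        for number, message in check_parallel_corpus(src, tgt):
            print(f"line {number}: {message}")
```

The check does not verify tokenisation, since that depends on the tokeniser used.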

Train a phrase translation table using Moses:

$ /path/to/moses/scripts/training/train-model.perl -external-bin-dir /path/to/moses/bin -root-dir .  --parallel --corpus corpus --f fr --e en  --first-step 1 --last-step 8

Invoke the data generation pipeline of Colloquery and adjust the thresholds as needed (see ./manage.py generatedata --help). This assumes a running and properly configured MongoDB instance:

$ ./manage.py generatedata --title "YourCorpus" --phrasetable corpus.fr-en.phrasetable --sourcelang fr --targetlang en --sourcecorpus corpus.fr --targetcorpus corpus.en --pst 0.2 --pts 0.2 --divergencethreshold 0.1 --freqthreshold 4

The Moses and data generation pipeline may take considerable time and system resources (most notably memory). Set sane thresholds to prevent the data from becoming unmanageably large.
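To illustrate why thresholds shrink the data, here is a rough sketch of probability-based filtering over lines in the Moses phrase table format (source ||| target ||| scores ||| ...), where the first and third scores are P(source|target) and P(target|source). It is not Colloquery's implementation. It assumes, without confirmation from the source, that --pst and --pts are minimum values for those two probabilities; the function names are hypothetical, so consult ./manage.py generatedata --help for the actual semantics.

```python
def parse_phrase_table_line(line):
    """Split one Moses phrase table line into (source, target, p_st, p_ts).

    In the Moses format the score field holds four values; the first is
    P(source|target) and the third is P(target|source).
    """
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(value) for value in fields[2].split()]
    return source, target, scores[0], scores[2]


def keep_pair(line, min_p_st=0.2, min_p_ts=0.2):
    """Assumed reading of --pst/--pts: keep a pair only if both probabilities
    reach their minimum. The defaults mirror the values in the example command."""
    _, _, p_st, p_ts = parse_phrase_table_line(line)
    return p_st >= min_p_st and p_ts >= min_p_ts


example = "le chat ||| the cat ||| 0.5 0.4 0.6 0.3 ||| 0-0 1-1 ||| 10 12 8"
print(keep_pair(example))
```

Under this reading, raising either threshold discards more low-probability pairs and so reduces the amount of data loaded into MongoDB.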
