Chronoi Corpus Processing

This is a collection of loosely related scripts and other resources that were used in setting up the Chronoi pilot corpus as well as the extended multilingual corpus. They are meant to document and make reproducible the following steps in corpus setup and analysis:

(pre-)processing of the texts (with preprocessing.py using preprocessing)
guidelines and tools for annotation (at annotation)
time-tagging (dockerized at heideltime and using our fork as a subproject)
gathering temponym data (at heideltime/scripts)
evaluation and preparation for evaluation (postprocessing)
basic corpus analysis (also at postprocessing and experiments)

Some scripts cover experimental steps that were ultimately not used. These include:

automatic translation of temponyms and corpus data (at translation)
using machine learning approaches in the detection step (at learning)

Setup and use

The main container chronoi-pilot expects two directory paths in a .env file, one for output and one for input. An example is given in .env.example. The input folder is expected to contain PDF and/or text files to process.
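As a minimal sketch of this step, the following shell snippet creates the two directories and writes a .env file that points at them. The variable names INPUT_DIR and OUTPUT_DIR, the CHRONOI_DATA_DIR override and the data directory layout are assumptions made for illustration only; the names the containers actually read are the ones listed in .env.example.

```shell
#!/bin/sh
# Sketch only: INPUT_DIR and OUTPUT_DIR are hypothetical variable names.
# Take the real names from .env.example before using this.
set -eu

# Hypothetical location for the corpus data; override with CHRONOI_DATA_DIR.
base_dir="${CHRONOI_DATA_DIR:-$PWD/data}"

# Create the input and output directories the main container is pointed at.
mkdir -p "$base_dir/input" "$base_dir/output"

# Write the .env file into the current directory
# (assumed to be the directory that holds the docker-compose file).
cat > .env <<EOF
INPUT_DIR=$base_dir/input
OUTPUT_DIR=$base_dir/output
EOF

# Show the result so the paths can be checked by eye.
cat .env
```

The PDF and/or text files to process would then be copied into the input directory before the containers are started.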

To pull our heideltime fork as a submodule, run:

git submodule update --init

The complete setup consists of three Docker containers, which can now be started with:

docker-compose up

Besides the chronoi-pilot container, there are two additional containers, heideltime and tempeval3, which are started by that command.

The heideltime container provides a heideltime command that can be run with, for example, docker exec. It mounts the output directory of the chronoi-pilot container so that it can work on the data produced by that container.
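A small wrapper can make calling that command from the host less repetitive. This is a hedged sketch, not part of the repository: the function name heideltime_in_container, the HEIDELTIME_CONTAINER and DRY_RUN variables, and the default container name heideltime are assumptions. docker-compose may give the running container a different name, which docker ps will show.

```shell
# Sketch only: the container name is an assumption; check `docker ps` for the real one.
container="${HEIDELTIME_CONTAINER:-heideltime}"

# Run the heideltime command inside the container.
# All arguments are passed through unchanged.
# With DRY_RUN=1 the docker command is printed instead of executed.
heideltime_in_container() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo docker exec "$container" heideltime "$@"
    else
        docker exec "$container" heideltime "$@"
    fi
}
```

Setting DRY_RUN=1 before calling heideltime_in_container is a cheap way to check the resulting docker exec line before running it against the container.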

The tempeval3 container likewise mounts the output directory of the chronoi-pilot container. It was mainly used to check our evaluation against an official script and will probably not be needed for most use cases.

Examples for the usage of the containers from the host are given in the experiments folder.
