aXa-trees

aXa stands for arbustum ΧΡΙΣΤΙΑΝΩΝ antiquum, "ancient forest of Christians". Its symbol is a nomen sacrum, evoking ancient scribal practices and nodding to the centrality of Christian scripture in historical corpus linguistics.

The Problem

Many other treebanks of New Testament texts exist, but most of them do not conform to established international standards. This makes them difficult to parse and to compare with most Classical and Postclassical Greek linguistic corpora.

The Background

These treebanks follow both of the leading international standards for syntactical treebanks: Universal Dependencies (UD) and Ancient Greek Dependency Treebanks (AGDT). The UD standard is itself a combination of the Stanford dependencies and the Google universal tags. Universal Dependencies now extends to test/train data and models in over 200 languages. The AGDT standard was developed by leading scholars of Classical Greek, and it offers the largest body of gold-standard, expert-tagged syntactical data of Classical Greek in the world (over 500k tokens), with millions more tokens produced in a largely identical format by the GlauX project.

The Purpose

By adopting both UD and AGDT formats, then aligning and combining them into a single representation, we maximize the potential for cross-validation, for the use of these treebanks as crosswalks between the two leading formats, and for broad accessibility and comparison within computational linguistic analyses and applications.
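As a rough illustration of what aligning and combining the two formats can involve, the sketch below pairs tokens from a UD analysis and an AGDT analysis of the same sentence by their surface forms and merges both annotations into one record. It is an assumption-laden sketch, not the pipeline used in this repo: the function names `nfc` and `combine`, the output keys, and the sample tokens are all hypothetical and hand-written for illustration.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import Dict, List


def nfc(form: str) -> str:
    """Normalize a Greek form so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", form)


def combine(ud: List[Dict[str, str]], agdt: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Pair tokens whose forms match and merge both annotations into one record."""
    matcher = SequenceMatcher(
        a=[nfc(t["form"]) for t in ud],
        b=[nfc(t["form"]) for t in agdt],
        autojunk=False,
    )
    merged = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            u = ud[block.a + offset]
            g = agdt[block.b + offset]
            merged.append({
                "form": u["form"],
                "ud_head": u["head"],
                "ud_deprel": u["deprel"],
                "agdt_head": g["head"],
                "agdt_relation": g["relation"],
            })
    return merged


# Hand-written illustrative tokens (assumption), not taken from the repo files.
ud_tokens = [
    {"form": "ἦν", "head": "0", "deprel": "root"},
    {"form": "ὁ", "head": "3", "deprel": "det"},
    {"form": "λόγος", "head": "1", "deprel": "nsubj"},
]
agdt_tokens = [
    {"form": "ἦν", "head": "0", "relation": "PRED"},
    {"form": "ὁ", "head": "3", "relation": "ATR"},
    {"form": "λόγος", "head": "1", "relation": "SBJ"},
    {"form": ".", "head": "0", "relation": "AuxK"},  # present on one side only
]

merged = combine(ud_tokens, agdt_tokens)
```

Tokens that appear on only one side (here the final punctuation mark) are left out of the merged list. A real crosswalk would also have to reconcile head indices when the two tokenizations differ, which this sketch does not attempt.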

The Credits

These Postclassical Greek syntactical trees (also known as dependency treebanks) are based, to varying degrees, on earlier openly licensed treebanks and projects, as detailed below.

PROIEL - Universal Dependencies format - CoNLL-U

The PROIEL treebanks include most Greek New Testament texts, based on the Editio Octava (Eighth Edition) of the Greek New Testament (GNT) by Constantin Tischendorf (1869/1872), a work now in the public domain. The latest version of the PROIEL xml and conllu files may be found in the Syntacticus repo. The conllu data used in this repo are largely derived from the tripartite-split files used in the Universal Dependencies Treebanks PROIEL model. When specific verses, chapters, or books are missing, they are patched in two steps: UD-parallel elements are extracted from the PROIEL xml, and the remaining UD elements are obtained by passing the patch text through the UDPipe PROIEL model, with expert edits and corrections as needed. The PROIEL Treebank is freely available under a CC BY-NC-SA 3.0 license.

Dag T. T. Haug and Marius L. Jøhndal. 2008. "Creating a Parallel Treebank of the Old Indo-European Bible Translations". In Caroline Sporleder and Kiril Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pp. 27-34. doi:10.5281/zenodo.10655061.
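For readers new to CoNLL-U, the following minimal sketch shows how the ten tab-separated columns of the format can be read with the Python standard library alone. The function name `read_conllu` and the sample sentence are assumptions made for illustration. The annotations in the sample are hand-written and simplified, not copied from the PROIEL data, and the patching routine described above is not reproduced here.

```python
from typing import Dict, Iterator, List

# The ten column names defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# sent_id = ..."
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


# Illustrative rows only (assumption): simplified, hand-written annotation.
ROWS = [
    ["1", "ἐν", "ἐν", "ADP", "_", "_", "2", "case", "_", "_"],
    ["2", "ἀρχῇ", "ἀρχή", "NOUN", "_", "_", "3", "obl", "_", "_"],
    ["3", "ἦν", "εἰμί", "VERB", "_", "_", "0", "root", "_", "_"],
    ["4", "ὁ", "ὁ", "DET", "_", "_", "5", "det", "_", "_"],
    ["5", "λόγος", "λόγος", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
]
SAMPLE = "# sent_id = demo-1\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

sentences = list(read_conllu(SAMPLE))
```

Each token comes back as a plain dict, so `sentences[0][2]["deprel"]` gives the relation of the third token. For production work a dedicated CoNLL-U library may be preferable to this hand-rolled reader.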

GlauX - Ancient Greek and Latin Dependency format - XML

The GlauX treebanks of New Testament texts were produced from the manual tagging of the PROIEL project, again following Tischendorf's GNT, with automated, semi-automated, and manual transformations applied. The GlauX treebanks follow the AGDT standard in most respects and were produced under the expert direction of Toon Van Hal and Alek Keersmaekers at KU Leuven. They are now available in Alek Keersmaekers' GlauX repo under a CC BY-SA 3.0 license. For a representative publication of their work, see:

Keersmaekers, A. 2021. "The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek." Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021, 39–50. Online: Association for Computational Linguistics. doi:10.18653/v1/2021.lchange-1.6.
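The sketch below shows one way to read an AGDT-style XML treebank with the Python standard library. It assumes the common AGDT layout of `sentence` elements containing `word` elements with `id`, `form`, `lemma`, `postag`, `relation`, and `head` attributes. Since the GlauX files follow AGDT only "in most respects", their element and attribute names may differ and should be checked against the actual files. The function name `read_agdt` and the sample XML are hypothetical and hand-written for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Illustrative sample (assumption): hand-written, not copied from the GlauX files.
SAMPLE_XML = """<treebank>
  <sentence id="1">
    <word id="1" form="ἐν" lemma="ἐν" postag="r--------" relation="AuxP" head="3"/>
    <word id="2" form="ἀρχῇ" lemma="ἀρχή" postag="n-s---fd-" relation="ADV" head="1"/>
    <word id="3" form="ἦν" lemma="εἰμί" postag="v3siia---" relation="PRED" head="0"/>
    <word id="4" form="ὁ" lemma="ὁ" postag="l-s---mn-" relation="ATR" head="5"/>
    <word id="5" form="λόγος" lemma="λόγος" postag="n-s---mn-" relation="SBJ" head="3"/>
  </sentence>
</treebank>"""


def read_agdt(xml_text: str) -> List[List[Dict[str, str]]]:
    """Return every sentence as a list of word-attribute dicts."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("sentence"):
        # Copy each word's attributes so callers can modify them safely.
        sentences.append([dict(word.attrib) for word in sentence.iter("word")])
    return sentences


agdt_sentences = read_agdt(SAMPLE_XML)
```

In this layout, `head="0"` marks a token attached to the root of the sentence, and the nine-character `postag` packs the morphological analysis into fixed positions.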

The Invitation

Collaborators are welcome to join our efforts to curate gold-standard UD and AGDT treebanks of early Christian texts. The most effective way to contribute is to open pull requests that propose changes to the edited source files (xml and conllu) rather than to the original files or the combined output: the original files are kept for reference, and the combined output is generated by a scripting pipeline.

Contributors

mgbilby
