
eldpy's Introduction


This package provides tools for interfacing with endangered language archives.

For the time being, only the download functionality is robust enough for general use.

The package contains scripts for the analysis of ELAN files. These analyses are quantitative (durations, tiers, tokens) as well as qualitative (vernacular language, translations, glosses, semantic domains).

The analyses are cached in JSON format and can be exported to RDF.

Sample usage:

  • download all ELAN files from the AILLA archive:
from eldpy import download
download.bulk_download(archive='AILLA', filetype=1, username='janedoe', password='mypassword')
  • analyze all downloaded ELAN files
from eldpy.bulk import *
bulk_populate(cache=False)
  • cache the results for future use: run the steps above and then add
bulk_cache()
  • read cached information
from eldpy.bulk import *
bulk_populate(cache=True)
  • compute tokens and durations
from eldpy.bulk import *
bulk_populate()
bulk_statistics()
  • analyze ELAN tier hierarchies
from eldpy.bulk import *
bulk_populate()
bulk_fingerprints()

  • export as RDF

from eldpy.bulk import *
bulk_populate()
bulk_rdf()
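Put together, a typical session might look like this (the calls are those shown above; credentials and argument values are illustrative):

from eldpy import download
from eldpy.bulk import *

# download all ELAN files from AILLA
download.bulk_download(archive='AILLA', filetype=1, username='janedoe', password='mypassword')

# analyze the downloaded files and cache the analyses
bulk_populate(cache=False)
bulk_cache()

# reuse the cache to compute statistics, tier fingerprints, and RDF output
bulk_populate(cache=True)
bulk_statistics()
bulk_fingerprints()
bulk_rdf()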

eldpy's People

Contributors

glottotopia

Stargazers

David Huggins-Daines, Alexis Michaud, Alexey Koshevoy, Daniel W. Hieber

Watchers

James Cloos, Alexis Michaud

Forkers

taiqihe

eldpy's Issues

support for Pangloss format?

"As of today, the library can download files of a given type from AILLA, ELAR, PARADISEC and TLA. Futher processing and analysis functions are also available within the library, which will be covered in future blog posts." (from a blog post)

Would you consider supporting the Pangloss format? It's straightforward, and I know JSON enthusiasts are itching to convert it to get rid of the XML markup so as to speed up processing :)

The Document Type Definition is here. Export to the Text Encoding Initiative format is already implemented (some links here). The XML documents can also be automagically converted to a reasonable ELAN format with Benjamin Galliot's newfangled XSLT-2 converter, in case you prefer to harvest them that way.

The Pangloss Collection is not huge, admittedly, but it is carefully designed and makes progress, one corpus after another. Cocoon (the repository within which it is hosted) is consistently among five-star archives in the list maintained by OLAC. Cocoon is set up so as to facilitate bulk download of documents (audio/video & annotation), too. There are some notes here. Some other collections within Cocoon use the Pangloss DTD.

(Pangloss is now a proud member of the Digital Endangered Languages and Musics Archives Network.)

Just sayin' :)
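For reference, a minimal sketch of what harvesting a Pangloss-style document could look like; the element names <S>, <FORM> and <TRANSL> are assumptions to be checked against the actual DTD linked above.

import xml.etree.ElementTree as ET

def read_pangloss(path):
    # assumed structure: <S> sentence elements with <FORM> (transcription)
    # and <TRANSL> (free translation) children
    tree = ET.parse(path)
    sentences = []
    for s in tree.iter("S"):
        sentences.append({
            "form": s.findtext("FORM"),
            "translation": s.findtext("TRANSL"),
        })
    return sentences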

rdf output

The output files should have the form ailla-translations.n3, tla-glosses.n3, elar-nerd.n3, etc.
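A minimal sketch of the proposed naming scheme; rdf_filename is a hypothetical helper, not an existing eldpy function:

def rdf_filename(archive, tier_type):
    # e.g. ("AILLA", "translations") -> "ailla-translations.n3"
    #      ("TLA", "glosses")        -> "tla-glosses.n3"
    return f"{archive.lower()}-{tier_type.lower()}.n3"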

turn into a proper python package?

Are there any plans to turn this into a "proper" Python package (i.e. installable via pip, preferably from PyPI, ...)?

At least if community collaboration is intended, a proper Python project structure would help a lot with introducing new people to the functionality.
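A minimal setuptools sketch (as a setup.py) of what pip-installability could look like; the metadata and dependency list are illustrative, and a declarative pyproject.toml would be the more modern route:

from setuptools import setup, find_packages

setup(
    name="eldpy",
    version="0.1.0",                              # illustrative version
    description="Tools for interfacing with endangered language archives",
    packages=find_packages(),
    install_requires=["requests", "rdflib"],      # assumed dependencies
    python_requires=">=3.7",
)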

OWL

provide an OWL representation of the ontology. This representation should subclass LIGT (and possibly NIF)
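A minimal rdflib sketch of what such subclassing could look like; both namespace URIs and the class names are placeholders, not the published LIGT (or NIF) vocabulary:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

ELDPY = Namespace("http://example.org/eldpy#")   # placeholder namespace
LIGT = Namespace("http://example.org/ligt#")     # placeholder; use the real LIGT URI

g = Graph()
g.bind("eldpy", ELDPY)
g.bind("ligt", LIGT)

# declare each eldpy class and link it to its LIGT counterpart
for local_cls, ligt_cls in [("Utterance", "Utterance"), ("Word", "Word"), ("Morph", "Morph")]:
    g.add((ELDPY[local_cls], RDF.type, OWL.Class))
    g.add((ELDPY[local_cls], RDFS.subClassOf, LIGT[ligt_cls]))

print(g.serialize(format="turtle"))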

refactor bulk download

the download information printed to the terminal should be handled by a single function; currently there is some code duplication.
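One possible shape for the shared helper; report_progress is a hypothetical name, not an existing eldpy function:

def report_progress(archive, filename, index, total):
    # one uniformly formatted progress line for every archive backend
    print(f"[{archive}] ({index}/{total}) downloaded {filename}")

Each per-archive downloader (AILLA, ELAR, PARADISEC, TLA) would then call report_progress() instead of carrying its own print statements.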

provide eaf files for testing purposes

  • positive test files are expected to produce some output
  • negative (corrupt) test files should throw meaningful errors with hints on how to remedy the defects (see the sketch after this list)
    • missing tier for transcription
    • missing tier for translation
    • no glosses
    • glosses and source line do not match
    • corrupt XML
    • several translation tiers
    • several transcription tiers
    • several gloss tiers
    • "flat hierarchy tiers"
    • files with missing/blank/XXX/??/*** cells
  • test for Leipzig Glossing Rules compliance
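A minimal pytest sketch of how such fixtures could be exercised; parse_eaf and EldpyError are hypothetical names standing in for whatever entry point and exception eldpy actually exposes, and the fixture paths are illustrative:

import pytest
from eldpy import parse_eaf, EldpyError   # hypothetical names

def test_positive_file_produces_output():
    result = parse_eaf("tests/data/transcription_translation_gloss.eaf")
    assert result, "a well-formed file should yield some analysis output"

@pytest.mark.parametrize("path", [
    "tests/data/missing_transcription_tier.eaf",
    "tests/data/missing_translation_tier.eaf",
    "tests/data/corrupt_xml.eaf",
])
def test_negative_files_raise_helpful_errors(path):
    with pytest.raises(EldpyError) as excinfo:
        parse_eaf(path)
    # the message should hint at how to remedy the defect
    assert "tier" in str(excinfo.value).lower() or "xml" in str(excinfo.value).lower()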

naming of entities

check whether the entities in n3 files must be named, or whether blank nodes can be used instead
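A minimal rdflib sketch contrasting the two options; the eldpy namespace and property names are placeholders:

from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import RDF

ELDPY = Namespace("http://example.org/eldpy#")   # placeholder namespace
g = Graph()

# option 1: a named entity with a stable URI
named = ELDPY["annotation/ailla-0001"]
g.add((named, RDF.type, ELDPY.Translation))
g.add((named, ELDPY.text, Literal("the dog barks", lang="en")))

# option 2: a blank node -- anonymous, only addressable within this graph
blank = BNode()
g.add((blank, RDF.type, ELDPY.Translation))
g.add((blank, ELDPY.text, Literal("the dog barks", lang="en")))

print(g.serialize(format="n3"))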

treatment of glosses

currently, the counts returned for glosses are way too high. Recheck the data model, parsing, and tallying.
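One plausible source of inflation is tallying the same gloss annotation more than once when it is reached via several tier paths; a minimal sketch of deduplicating by annotation ID (the input format here is an assumption, not eldpy's actual data model):

from collections import Counter

def tally_glosses(gloss_annotations):
    # gloss_annotations: iterable of (annotation_id, gloss) pairs
    seen = set()
    counts = Counter()
    for annotation_id, gloss in gloss_annotations:
        if annotation_id in seen:   # skip annotations already counted
            continue
        seen.add(annotation_id)
        counts[gloss] += 1
    return counts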
