
eldpy's Introduction


This package provides tools for interfacing with endangered language archives.

For the time being, only the download functionality is robust enough for general use.

The package contains scripts for the analysis of ELAN files. These analyses are quantitative (durations, tiers, tokens) as well as qualitative (vernacular language, translations, glosses, semantic domains).

The analyses are cached in JSON format and can be exported to RDF.

Sample usage:

  • download all ELAN files from the AILLA archive:
from eldpy import download
download.bulk_download(archive='AILLA', filetype=1, username='janedoe', password='mypassword')
  • analyze all downloaded ELAN files
from eldpy.bulk import *
bulk_populate(cache=False)
  • cache the results for future use: run the steps above and then add
bulk_cache()
  • read cached information
from eldpy.bulk import *
bulk_populate(cache=True)
  • compute tokens and durations
from eldpy.bulk import *
bulk_populate()
bulk_statistics()
  • analyze ELAN tier hierarchies
from eldpy.bulk import *
bulk_populate()
bulk_fingerprints()

  • export as RDF

from eldpy.bulk import *
bulk_populate()
bulk_rdf()
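Put together, a typical session might look like this (the calls are those shown above; credentials and argument values are illustrative):

from eldpy import download
from eldpy.bulk import *

# download all ELAN files from AILLA
download.bulk_download(archive='AILLA', filetype=1, username='janedoe', password='mypassword')

# analyze the downloaded files and cache the analyses
bulk_populate(cache=False)
bulk_cache()

# reuse the cache to compute statistics, tier fingerprints, and RDF output
bulk_populate(cache=True)
bulk_statistics()
bulk_fingerprints()
bulk_rdf()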

eldpy's People

Contributors

glottotopia

Stargazers

David Huggins-Daines, Alexis Michaud, Alexey Koshevoy, Daniel W. Hieber

Watchers

James Cloos, Alexis Michaud

Forkers

taiqihe

eldpy's Issues

support for Pangloss format?

"As of today, the library can download files of a given type from AILLA, ELAR, PARADISEC and TLA. Futher processing and analysis functions are also available within the library, which will be covered in future blog posts." (from a blog post)

Would you consider supporting the Pangloss format? It's straightforward, and I know JSON enthusiasts are itching to convert it to get rid of the XML markup so as to speed up processing :)

The Document Type Definition is here. Export to the Text Encoding Initiative format is already implemented (some links here). The XML documents can also be automagically converted to a reasonable ELAN format with Benjamin Galliot's newfangled XSLT-2 converter, in case you prefer to harvest them that way.

The Pangloss Collection is not huge, admittedly, but it is carefully designed and makes progress, one corpus after another. Cocoon (the repository within which it is hosted) is consistently among five-star archives in the list maintained by OLAC. Cocoon is set up so as to facilitate bulk download of documents (audio/video & annotation), too. There are some notes here. Some other collections within Cocoon use the Pangloss DTD.

(Pangloss is now a proud member of the Digital Endangered Languages and Musics Archives Network.)

Just sayin' :)
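For reference, a minimal sketch of what harvesting a Pangloss-style document could look like; the element names <S>, <FORM> and <TRANSL> are assumptions to be checked against the actual DTD linked above.

import xml.etree.ElementTree as ET

def read_pangloss(path):
    # assumed structure: <S> sentence elements with <FORM> (transcription)
    # and <TRANSL> (free translation) children
    tree = ET.parse(path)
    sentences = []
    for s in tree.iter("S"):
        sentences.append({
            "form": s.findtext("FORM"),
            "translation": s.findtext("TRANSL"),
        })
    return sentences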

rdf output

The output files should have the form ailla-translations.n3, tla-glosses.n3, elar-nerd.n3, etc.
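A minimal sketch of the proposed naming scheme; rdf_filename is a hypothetical helper, not an existing eldpy function:

def rdf_filename(archive, tier_type):
    # e.g. ("AILLA", "translations") -> "ailla-translations.n3"
    #      ("TLA", "glosses")        -> "tla-glosses.n3"
    return f"{archive.lower()}-{tier_type.lower()}.n3"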

turn into a proper python package?

Are there any plans to turn this into a "proper" Python package (i.e. installable via pip, preferably from PyPI, ...)?

At least if community collaboration is intended, a proper Python project structure would help a lot with introducing new people to the functionality.
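A minimal setuptools sketch (as a setup.py) of what pip-installability could look like; the metadata and dependency list are illustrative, and a declarative pyproject.toml would be the more modern route:

from setuptools import setup, find_packages

setup(
    name="eldpy",
    version="0.1.0",                              # illustrative version
    description="Tools for interfacing with endangered language archives",
    packages=find_packages(),
    install_requires=["requests", "rdflib"],      # assumed dependencies
    python_requires=">=3.7",
)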

OWL

provide an OWL representation of the ontology. This representation should subclass LIGT (and possibly NIF)
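A minimal rdflib sketch of what such subclassing could look like; both namespace URIs and the class names are placeholders, not the published LIGT (or NIF) vocabulary:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

ELDPY = Namespace("http://example.org/eldpy#")   # placeholder namespace
LIGT = Namespace("http://example.org/ligt#")     # placeholder; use the real LIGT URI

g = Graph()
g.bind("eldpy", ELDPY)
g.bind("ligt", LIGT)

# declare each eldpy class and link it to its LIGT counterpart
for local_cls, ligt_cls in [("Utterance", "Utterance"), ("Word", "Word"), ("Morph", "Morph")]:
    g.add((ELDPY[local_cls], RDF.type, OWL.Class))
    g.add((ELDPY[local_cls], RDFS.subClassOf, LIGT[ligt_cls]))

print(g.serialize(format="turtle"))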

refactor bulk download

the download information printed to the terminal should be handled by a single function; currently there is some code duplication.
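One possible shape for the shared helper; report_progress is a hypothetical name, not an existing eldpy function:

def report_progress(archive, filename, index, total):
    # one uniformly formatted progress line for every archive backend
    print(f"[{archive}] ({index}/{total}) downloaded {filename}")

Each per-archive downloader (AILLA, ELAR, PARADISEC, TLA) would then call report_progress() instead of carrying its own print statements.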

provide eaf files for testing purposes

  • positive test files are expected to produce some output
  • negative (corrupt) test files should throw meaningful errors with hints on how to remedy the defects (see the sketch after this list)
    • missing tier for transcription
    • missing tier for translation
    • no glosses
    • glosses and source line do not match
    • corrupt XML
    • several translation tiers
    • several transcription tiers
    • several gloss tiers
    • "flat hierarchy tiers"
    • files with missing/blank/XXX/??/*** cells
  • test for Leipzig Glossing Rules compliance
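A minimal pytest sketch of how such fixtures could be exercised; parse_eaf and EldpyError are hypothetical names standing in for whatever entry point and exception eldpy actually exposes, and the fixture paths are illustrative:

import pytest
from eldpy import parse_eaf, EldpyError   # hypothetical names

def test_positive_file_produces_output():
    result = parse_eaf("tests/data/transcription_translation_gloss.eaf")
    assert result, "a well-formed file should yield some analysis output"

@pytest.mark.parametrize("path", [
    "tests/data/missing_transcription_tier.eaf",
    "tests/data/missing_translation_tier.eaf",
    "tests/data/corrupt_xml.eaf",
])
def test_negative_files_raise_helpful_errors(path):
    with pytest.raises(EldpyError) as excinfo:
        parse_eaf(path)
    # the message should hint at how to remedy the defect
    assert "tier" in str(excinfo.value).lower() or "xml" in str(excinfo.value).lower()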

naming of entities

check whether the entities in n3 files must be named, or whether blank nodes can be used instead
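A minimal rdflib sketch contrasting the two options; the eldpy namespace and property names are placeholders:

from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import RDF

ELDPY = Namespace("http://example.org/eldpy#")   # placeholder namespace
g = Graph()

# option 1: a named entity with a stable URI
named = ELDPY["annotation/ailla-0001"]
g.add((named, RDF.type, ELDPY.Translation))
g.add((named, ELDPY.text, Literal("the dog barks", lang="en")))

# option 2: a blank node -- anonymous, only addressable within this graph
blank = BNode()
g.add((blank, RDF.type, ELDPY.Translation))
g.add((blank, ELDPY.text, Literal("the dog barks", lang="en")))

print(g.serialize(format="n3"))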

treatment of glosses

currently, the counts returned for glosses are way too high. Recheck the data model, parsing, and tallying.
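One plausible source of inflation is tallying the same gloss annotation more than once when it is reached via several tier paths; a minimal sketch of deduplicating by annotation ID (the input format here is an assumption, not eldpy's actual data model):

from collections import Counter

def tally_glosses(gloss_annotations):
    # gloss_annotations: iterable of (annotation_id, gloss) pairs
    seen = set()
    counts = Counter()
    for annotation_id, gloss in gloss_annotations:
        if annotation_id in seen:   # skip annotations already counted
            continue
        seen.add(annotation_id)
        counts[gloss] += 1
    return counts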
