lingpy3's Introduction

LingPy: A Python Library for Automatic Tasks in Historical Linguistics

This repository contains the Python package lingpy which can be used for various tasks in computational historical linguistics.

Authors (Version 2.6.12): Johann-Mattis List and Robert Forkel

Collaborators: Christoph Rzymski, Simon J. Greenhill, Steven Moran, Peter Bouda, Johannes Dellert, Taraka Rama, Tiago Tresoldi, Gereon Kaiping, Frank Nagel, and Patrick Elmer.

LingPy is a Python library for historical linguistics. It is being developed for Python 2.7 and Python 3.x using a single codebase.

Quick Installation

For our latest stable version, you can simply use pip or easy_install for installation:

$ pip install lingpy

or

$ easy_install lingpy

Depending on which easy_install or pip version you use, either the Python2 or the Python3 version of LingPy will be installed.

If you want to install the current GitHub version of LingPy on your system, open a terminal and type in the following:

$ git clone https://github.com/lingpy/lingpy/
$ cd lingpy
$ python setup.py install

If the last command above returns an error regarding user permissions (usually "Errno 13"), you can install LingPy in your home Python setup:

$ python setup.py install --user

In order to use the library, start an interactive Python session and import LingPy as follows:

>>> from lingpy import *

To install LingPy to hack on it, fork the repository on GitHub, open a terminal and type:

$ git clone https://github.com/<your-github-user>/lingpy/
$ cd lingpy
$ python setup.py develop

This will install LingPy in "development mode", i.e. you will be able to edit the sources in the cloned repository and import the altered code just as you would the regular Python package.

lingpy3's Issues

ISchema as a shortcut for similar orthographies

Lingpy distinguishes "schemas" for sound classes, including:

  1. one routine for segmentation
  2. one routine for conversion to sound classes (and a default sound class model)
  3. one default routine for the scoring function in alignments

Currently, lingpy has two schemas: "ipa" and "asjp", the latter working on the ASJP alphabet.

We should add an additional schema in lingpy3, and the possibility to register new schemas by the user:

  1. plain ipa (assuming that the orthography is more or less regular IPA)
  2. fuzzy ipa (assuming a messy IPA, with aspiration not written as superscript, etc., requiring a segmentation function based on a clean_string strategy)
  3. asjp

More schemas are possible, for example "starling", as all the data in the Tower of Babel database is given in its own IPA variant. The main argument for schemas is that it is too time-consuming to write individual orthography profiles for all datasets, while on the other hand, many datasets are consistent enough to be analysed by an enhanced function that is simpler than a full-fledged orthography profile.
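As a rough illustration, user-registered schemas could be handled through a simple registry that bundles the three routines listed above. This is only a sketch with hypothetical names, not a proposal for the actual lingpy3 API:

# Hypothetical sketch of a schema registry; names and signatures are
# illustrative only and not part of lingpy.
SCHEMAS = {}

def register_schema(name, segment, to_classes, score):
    """Bundle the three routines that make up a schema."""
    SCHEMAS[name] = {
        "segment": segment,        # string -> list of segments
        "to_classes": to_classes,  # list of segments -> sound classes
        "score": score,            # scoring function used in alignments
    }

def get_schema(name):
    return SCHEMAS[name]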

Model specification for Sound-Class Models

Here's the current description of the basic structure.

The core is a "converter" file, essentially a dictionary with sound classes as keys and lists of sounds as values:

B : ɸ, β, f, p͡f, p͜f, ƀ
E : ɛ, æ, ɜ, ɐ, ʌ, e, ᴇ, ə, ɘ, ɤ, è, é, ē, ě, ê, ɚ
D : θ, ð, ŧ, þ, đ
G : x, ɣ, χ
...

This is then converted into a dictionary in which each list item is a key and the original key is its value. This is done for maintenance reasons, as such a structure is much easier to handle than, say, a csv-file in which all sound-class symbols are repeated. This format could be stored in JSON, but JSON is awkward to handle here, as the specification replaces unicode symbols with \uXXXX escape sequences. Maybe an INI-style structure is best, as one could define a key with a list of items. Internally, the list can be converted to JSON or any other format to make it quicker to load, but there should be a backend for quick editing of the files.
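A minimal sketch of how such a converter file could be read and inverted; the file name and parsing details are assumptions for illustration only:

import io

def load_converter(path):
    """Read a converter file as shown above and return a dict mapping each
    sound to its sound class."""
    sound_to_class = {}
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            if ":" not in line:
                continue
            cls, sounds = line.split(":", 1)
            for sound in sounds.split(","):
                sound_to_class[sound.strip()] = cls.strip()
    return sound_to_class

# e.g. load_converter("sca.converter")["ɸ"] would return "B"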

Documenting Wordlist Metrics

We should not only provide but also carefully document the wordlist metrics we define in LingPy, as some of these metrics are quite novel, might become influential, and have proven useful in the past.

basic wordlist statistics: external functions or built into the main class?

There are a couple of interesting metrics we want to have at hand when dealing with wordlists:

  • diversity (as the measure proposed in my diss, but I have ideas for enhancement, including partial cognates)
  • synonymity or semantic diversity of cognate judgments (we don't care at the moment, but with more cross-semantically coded cognate sets, it will be interesting to calculate how many meanings a cognate set has on average)
  • colexification coefficient (not clear for now, but we should have a metric that gives us a numeric impression of how pervasive colexification is)
  • coverage measures (how many gaps do we have in the data, good metrics pending, but also important for automatic cognate detection)

Should we create some extra class or a script that offers these metrics and can be applied to any wordlist object, or should we build them into the wordlist base class itself? Note also that we will always need to define both a normal and a partial version of each metric, as partial cognate sets are becoming increasingly available.
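As an example of the simplest of these, a coverage measure can be computed from nothing more than the (doculect, concept) pairs in a wordlist. This is only a sketch, not necessarily the definition we will settle on:

from collections import defaultdict

def coverage(pairs, concepts):
    """pairs: iterable of (doculect, concept) tuples taken from a wordlist.
    Returns the proportion of concepts attested per doculect."""
    attested = defaultdict(set)
    for doculect, concept in pairs:
        attested[doculect].add(concept)
    return {d: len(cs) / float(len(concepts)) for d, cs in attested.items()}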

Set up some standardised benchmark tests

We should have some benchmarks to check performance against -- especially a good-sized LexStat analysis or something similar.

Alternatively, since nose is being used for testing, we could time all tests using nose-timer.
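A standalone benchmark script could be as simple as the following sketch; the data file is a placeholder and the parameters are just plausible defaults for a medium-sized LexStat run:

import time
from lingpy import LexStat

start = time.time()
lex = LexStat("benchmark-wordlist.tsv")  # placeholder dataset
lex.get_scorer(runs=1000)
lex.cluster(method="lexstat", threshold=0.6)
print("LexStat benchmark: %.2f seconds" % (time.time() - start))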

Test methods for correct data structure

We have this in rudimentary form in lexstat in lingpy2, but it should be more principled, as I often run into errors in other approaches. For example, when using partial cognate annotations, the number of morphemes needs to be the same as the number of cognate ids in a given row.

I'd suggest that each major class, be it Alignments, LexStat, or Partial, defines its own explicit routines for checking the input. In Partial, this would be the above-mentioned problem of partial cognates and morphemes. In LexStat, we would also check coverage, how well segments have been recognized, etc.
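For the partial-cognate case mentioned above, the check boils down to comparing two counts per row. A minimal, lingpy-independent sketch (the column names are assumptions for illustration):

def check_partial_cognates(rows):
    """rows: iterable of dicts with 'tokens' (segments, morphemes separated
    by '+') and 'cogids' (list of partial cognate ids). Returns bad rows."""
    bad = []
    for row in rows:
        morphemes = [m for m in " ".join(row["tokens"]).split("+") if m.strip()]
        if len(morphemes) != len(row["cogids"]):
            bad.append(row)
    return bad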

Need numpy just for sqrt?

Just spotted this as the build is failing because Travis isn't installing numpy. I was going to fix this, but it seems that the code so far only uses numpy in one place, and just for sqrt.

There's an implementation of sqrt in the stdlib (math.sqrt). Does numpy.sqrt have advantages? Is numpy going to be used elsewhere? If not, we could remove numpy as a prerequisite.
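For the scalar case the stdlib call is a drop-in replacement:

from math import sqrt

print(sqrt(2.0))  # 1.4142135623730951, the same value numpy.sqrt(2.0) returns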

Wordlist Specification

My earlier thoughts and reports on functionality, which have largely not changed, are given here:

Generally, one should be able to trigger output like this:

>>> wl.ipa
[['wɔldemɔrt', 'valdemar', 'vladimir', 'volodimir'],
 ['hæri', 'haralt', 'gari', 'hari'],
 ['lɛg', 'bain', 'noga', 'noha'],
 ['hænd', 'hant', 'ruka', 'ruka']]
>>> wl.cognate
[[6, 6, 6, 6], [7, 7, 7, 7], [4, 3, 5, 5], [1, 1, 2, 2]]

Based on a file like this:

ID   CONCEPT     COUNTERPART   IPA         DOCULECT     COGID
1    hand        Hand          hant        German       1
2    hand        hand          hænd        English      1
3    hand        рука          ruka        Russian      2
4    hand        рука          ruka        Ukrainian    2
5    leg         Bein          bain        German       3
6    leg         leg           lɛg         English      4
7    leg         нога          noga        Russian      5
8    leg         нога          noha        Ukrainian    5
9    Woldemort   Waldemar      valdemar    German       6
10   Woldemort   Woldemort     wɔldemɔrt   English      6
11   Woldemort   Владимир      vladimir    Russian      6
12   Woldemort   Володимир     volodimir   Ukrainian    6
13   Harry       Harald        haralt      German       7
14   Harry       Harry         hæri        English      7
15   Harry       Гарри         gari        Russian      7
16   Harry       Гаррi         hari        Ukrainian    7

But now that I'm smarter than in the past, I would not make this a class attribute, as it has led to inconsistencies in the current lingpy. It is also not consistent, since language and concept only return one-dimensional lists so far.
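A hypothetical alternative would be an explicit method call rather than a dynamic attribute, e.g. something along the lines of wl.get_matrix('ipa'). The following is only a sketch over the flat table shown above, not the lingpy API:

def get_matrix(rows, column, concepts, doculects):
    """rows: list of dicts with 'concept', 'doculect' and the requested column.
    Returns a concept-by-doculect matrix of values from that column."""
    index = {(r["concept"], r["doculect"]): r[column] for r in rows}
    return [[index.get((c, d)) for d in doculects] for c in concepts]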

General Folder Structure

Apart from re-arranging, I think we'll still have the following structure:

  • algorithms/: reserved for algorithmic procedures that are used by different methods, partially supporting cython
  • sequence: package for sequence handling, not comparison, including all cleaning functions
  • align: package for sequence comparison
  • wordlist: package for wordlist and etymological dictionary handling (basic input format)
  • compare: package for handling comparative routines in wordlists
  • trees: package for extended tree handling, based on python-newick
  • phylogeny: package for handling tree-climbing stuff etc., like Sankoff parsimony (code already there)
  • misc (?): package for writing files to different formats
  • data: package for data-handling

The names are not really important, but the responsibilities are: handling sequences, wordlists, and trees, both as objects in their own right and in comparison, seems important to me.

NEXUS output format tweaks.

Looking at the NEXUS template in wordlist.py I can see a few tweaks that can be made.

I'm happy to make the changes and open a pull request; I just need confirmation/thoughts where I've tagged you below.

  1. The format declaration needs the attribute SYMBOLS, which should list the symbols in use in the data. For PAPS this should just be 0 and 1, I think (@LinguList)? We could hard-code this as:
FORMAT DATATYPE=STANDARD GAP=- MISSING={2} SYMBOLS="01" interleave=yes;

... unless there's an easy way to get a list of symbols in use in the PAPS output (@xrotwang?)

  2. The Nexus format can specify which characters are which using a CHARSTATELABELS command, which looks like this:

BEGIN DATA;
    DIMENSIONS NTAX=3 NCHAR=3;
    FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
    CHARSTATELABELS
        1 char1,
        2 char2,
        3 char3
    ;
    MATRIX

This could replace/augment the [PAPS-REFERENCE] thing at the end of the template, unless PAPS-REFERENCE is being used for anything (I suspect it's just for documentation @LinguList?)

  1. "ntax" should be "NTAX" for consistency (and interleave -> INTERLEAVE). The NEXUS format doesn't care if it's upper or lower case, but we should be consistent. Any preference?
