
lingrex's Introduction

LingPy: A Python Library for Automatic Tasks in Historical Linguistics

This repository contains the Python package lingpy which can be used for various tasks in computational historical linguistics.


Authors (Version 2.6.12): Johann-Mattis List and Robert Forkel

Collaborators: Christoph Rzymski, Simon J. Greenhill, Steven Moran, Peter Bouda, Johannes Dellert, Taraka Rama, Tiago Tresoldi, Gereon Kaiping, Frank Nagel, and Patrick Elmer.

LingPy is a Python library for historical linguistics. It is being developed for Python 2.7 and Python 3.x using a single codebase.

Quick Installation

For our latest stable version, you can simply use pip or easy_install for installation:

$ pip install lingpy

or

$ easy_install lingpy

Depending on which easy_install or pip version you use, either the Python2 or the Python3 version of LingPy will be installed.

If you want to install the current GitHub version of LingPy on your system, open a terminal and type in the following:

$ git clone https://github.com/lingpy/lingpy/
$ cd lingpy
$ python setup.py install

If the last command above returns an error regarding user permissions (usually "Errno 13"), you can install LingPy in your home Python setup:

$ python setup.py install --user

In order to use the library, start an interactive Python session and import LingPy as follows:

>>> from lingpy import *

To install LingPy to hack on it, fork the repository on GitHub, open a terminal and type:

$ git clone https://github.com/<your-github-user>/lingpy/
$ cd lingpy
$ python setup.py develop

This will install LingPy in "development mode", i.e. you will be able to edit the sources in the cloned repository and import the altered code just like the regular Python package.

lingrex's People

Contributors

fredericblum, lingulist, xrotwang


Forkers

somiyagawa

lingrex's Issues

Cleaning data prior to correspondence pattern analysis

We might need some basic checks as to whether a correspondence pattern analysis is useful at all, since I detected one pattern that causes huge problems:

    {'ID': [365, 371, 370, 367, 364, 369, 368, 366, 362],
 'taxa': ['Hachijo',
  'Hachijo',
  'Kagoshima',
  'Kochi',
  'Kyoto',
  'Oki',
  'Sado',
  'Shuri',
  'Tokyo'],
 'seqs': [['k', 'iː', '-', '-', '-', '-'],
  ['k', 'e', 'b', 'u', 'ɕ', 'o'],
  ['k', 'e', '-', '-', '-', 'i'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'eː', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'iː', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-']],
 'alignment': [['k', 'iː', '-', '-', '-', '-'],
  ['k', 'e', 'b', 'u', 'ɕ', 'o'],
  ['k', 'e', '-', '-', '-', 'i'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'eː', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'iː', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-']],
 'dataset': 'japonic',
 'seq_id': '449 ("hair")'}

Here, we have two words from Hachijo in the same cognate set, but they differ (!). We can argue that for correspondence patterns, strictly cognate words from the same language cannot differ, so a preprocessing step can in fact arbitrarily decide for one of them.

sanity checks on every dataset: strict cognates

  1. all words assigned to the same cognate set should be identical when they come from the same language
  2. all words in the same cognate set from the same language should be aligned identically
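
A minimal sketch of these two checks, assuming the data comes as rows with "doculect", "cogid", "tokens", and "alignment" fields (this input format is an assumption, not the CoPaR API):

from collections import defaultdict

def check_strict_cognates(rows):
    """Flag cognate sets in which one language has non-identical words
    or non-identical alignments."""
    by_set = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_set[row["cogid"]][row["doculect"]].append(row)
    problems = []
    for cogid, languages in by_set.items():
        for doculect, entries in languages.items():
            if len({tuple(e["tokens"]) for e in entries}) > 1:
                problems.append((cogid, doculect, "non-identical words"))
            if len({tuple(e["alignment"]) for e in entries}) > 1:
                problems.append((cogid, doculect, "non-identical alignments"))
    return problems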

test whether context is actually needed

Context may not be needed at all for this project: since we know well that different patterns result from different contexts, it may well be sufficient to compute things once for all alignment columns.

General alignment score derived from pattern compatibility

We can arrive at a principled score for alignment sites based on the frequency of the pattern, how much of the data it explains, etc. The score could either be threshold-based (good vs. bad) or continuous, i.e., derived from comparing across all patterns and somehow computing how well they explain the data.
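
As a toy illustration of the two variants (the scoring scheme here is made up for the example, not part of lingrex): normalize the number of cognate sets a site's pattern covers by the coverage of the best pattern, and threshold that value for the good/bad variant.

def site_score(coverage, max_coverage):
    """Continuous score for an alignment site: coverage of its assigned
    pattern relative to the best-covered pattern in the data."""
    return coverage / max_coverage

def site_is_good(coverage, max_coverage, threshold=0.1):
    """Threshold-based variant: good vs. bad."""
    return site_score(coverage, max_coverage) >= threshold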

Import correspondence patterns

Importing correspondence patterns is not yet provided. Ideally, we'd have a function:

CoPaR.load_patterns(patterns="patterns")

Minor fixes to setup.py

I changed the header of setup.py to:

try:
    from setuptools import setup
except ImportError:
    from distribute_setup import use_setuptools
    use_setuptools()
    from setuptools import setup

It works flawlessly.

new format for patterns, tied to tokens?

If we define alignments by structure, we can basically get rid of the pattern annotation and display it attached to each token rather than to the alignment. This is, however, dangerous, as we also want to store gaps as parts of correspondence patterns.

Patterns from MSA function

This function is needed again and again; one might want to just add it to util:

from lingpy.sequence.sound_classes import token2class

def patterns_from_msa(msa, languages, missing="Ø", gap="-"):
    """
    Retrieve patterns from an msa object.
    """
    # one row per alignment column, one cell per language in `languages`
    out = [[missing for language in languages] for site in msa["alignment"][0]]
    for j in range(len(msa["alignment"][0])):
        for i, language in enumerate(languages):
            if language in msa["taxa"]:
                out[j][i] = msa["alignment"][msa["taxa"].index(language)][j]
    # label each column by the CV class of its first non-gap, non-missing segment
    return [
        (
            token2class(
                ([c for c in row if c not in (gap, missing)] or ["?"])[0],
                "cv",
            ).lower(),
            tuple(row),
        )
        for row in out
    ]

The function takes an MSA object and returns the patterns. One can also do it without the structure involved, or one could pass the structure explicitly, as done in lingrex.
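
For illustration, a toy call with a hand-made MSA dictionary (the data here is invented, following the keys used in the example above):

msa = {
    "taxa": ["Hachijo", "Tokyo"],
    "alignment": [["k", "e"], ["k", "e"]],
}
patterns_from_msa(msa, ["Hachijo", "Kyoto", "Tokyo"])
# -> [('c', ('k', 'Ø', 'k')), ('v', ('e', 'Ø', 'e'))]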

Individual word deviation score inside an alignment

We have tested this before in LingPy but it was less convincing. Now, we can do it with the correspondence patterns in a more principled way. The score should reflect how well a word fits into an alignment. One simple way to do this would be:

  • exclude the word from the alignment and see whether the alignment overall score increases, or whether it leads to a shift in pattern
  • flag the sound in the word which shifts the pattern

EDICTOR should then offer the possibility to display these scores inside an alignment by allowing the user to press a button and see the alignment colored by the alignment scores.
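
A minimal sketch of the leave-one-out idea in the first bullet above, using a crude column-consensus score as a stand-in for the real pattern-based score (the scoring proxy is an assumption for illustration):

from collections import Counter

def site_regularity(column, gap="-"):
    """Toy column score: share of the most frequent non-gap segment."""
    segments = [c for c in column if c != gap]
    if not segments:
        return 1.0
    return Counter(segments).most_common(1)[0][1] / len(segments)

def word_deviation(alignment, index, gap="-"):
    """How much the mean column score improves when the word at `index`
    is excluded from the alignment; high values flag deviant words."""
    def mean_score(rows):
        columns = list(zip(*rows))
        return sum(site_regularity(col, gap) for col in columns) / len(columns)
    reduced = [row for i, row in enumerate(alignment) if i != index]
    return mean_score(reduced) - mean_score(alignment)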

refine workflow

workflow now is:

  1. search for cognates
  2. align cognate sets
  3. merge alignments which are compatible using clique partitioning
  4. compute compatible patterns using clique partitioning
  5. order the patterns by frequency
  6. check for alignment quality by counting how many possible patterns there are and how good they are

Especially for point 5, we should not only order by frequency, but also by how often the patterns contain missing data. There should be some simple hierarchy for comparing patterns, as sketched after this list:

  1. how many cognate sets are covered by a pattern
  2. how many missing spots does a pattern contain
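
A minimal sketch of that hierarchy, assuming each pattern is given as a pair of (covered alignment sites, pattern tuple) with "Ø" marking missing data (these input conventions are assumptions, not the lingrex format):

def rank_patterns(patterns, missing="Ø"):
    """Sort patterns by coverage (descending) and by the number of
    missing slots (ascending)."""
    return sorted(
        patterns,
        key=lambda item: (-len(item[0]), item[1].count(missing)),
    )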

But alternatively, one could even try to find out (see the compatibility sketch after this list):

  1. how many patterns are almost compatible (Hamming distance below a threshold T)
  2. how many patterns are actually invoked in the data
  3. how many alignment sites can actually be ambiguously assigned to patterns
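
A minimal sketch of pattern compatibility and near-compatibility, assuming patterns are tuples with one segment per language and "Ø" as the missing marker:

def compatible(p1, p2, missing="Ø"):
    """Two patterns are compatible if they agree at every position in
    which neither of them has missing data."""
    return all(a == b for a, b in zip(p1, p2) if missing not in (a, b))

def almost_compatible(p1, p2, threshold=1, missing="Ø"):
    """Near-compatibility: at most `threshold` mismatches among the
    positions attested in both patterns."""
    shared = [(a, b) for a, b in zip(p1, p2) if missing not in (a, b)]
    return sum(1 for a, b in shared if a != b) <= threshold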

All these things need to be put into a better workflow, so that it is clear what the method does.

As a goal, I'd like to have:

  • a numerical evaluation of all alignment sites regarding their regularity
  • an individual deviation score for each word inside an alignment

`find_colexified_alignments` fails in some cases

The reason is that the algorithm won't recognize two alignments as the same when one has an additional gap. These cases need to be handled by reducing all gaps in all alignments and THEN comparing their compatibility.
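
One reading of that reduction step, stripping gaps from each aligned sequence before comparison (a sketch only, not the actual lingrex fix):

def reduce_gaps(aligned_seq, gap="-"):
    """Strip gap symbols from an aligned sequence so that two alignments
    differing only in gaps compare as equal."""
    return [segment for segment in aligned_seq if segment != gap]

# reduce_gaps(['k', 'e', '-', '-']) == reduce_gaps(['k', '-', 'e'])  # True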

Add structure from existing alignment

I am currently running into trouble while trying to create a CV structure based on pre-existing alignments with manually trimmed data. Let's use the following data as an example:

ID	DOCULECT	CONCEPT	VALUE	FORM	TOKEN	COGID	ALIGNMENT
1	Marubo	type of bank	kɨnã	kɨnã	k ɨ n ã	60	['k', 'ɨ', 'n', 'ã', '-', '(', '-', '-', ')']
2	Chakobo	type of bank	kɨˈnanɨ	kɨˈnanɨ	k ɨ n a n ɨ	60	['k', 'ɨ', 'n', 'a', '-', '(', 'n', 'ɨ', ')']

Is there any pre-defined way of doing this, or would I need to create a new tokens-column based on the alignments, removing all content within brackets in a dynamic way? @LinguList Maybe you can help me on this.
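
A minimal sketch of deriving trimmed tokens and a CV structure from such an alignment, dropping gaps and anything inside the (unalignable) brackets; the function name and the bracket convention are assumptions, not a pre-defined lingrex routine:

from lingpy.sequence.sound_classes import token2class

def structure_from_alignment(alignment, gap="-", brackets=("(", ")")):
    """Derive trimmed tokens and a CV structure from one alignment row."""
    tokens, in_brackets = [], False
    for segment in alignment:
        if segment == brackets[0]:
            in_brackets = True
        elif segment == brackets[1]:
            in_brackets = False
        elif segment != gap and not in_brackets:
            tokens.append(segment)
    structure = [token2class(t, "cv").lower() for t in tokens]
    return tokens, structure

# structure_from_alignment(['k', 'ɨ', 'n', 'ã', '-', '(', '-', '-', ')'])
# -> (['k', 'ɨ', 'n', 'ã'], ['c', 'v', 'c', 'v'])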

ranked singletons

Rank singletons in the data by assigning them to the closest pattern they could match with. This will give more fine-grained possibilities for analysing what is going wrong in a particular alignment.

prediction experiments

Two aspects of prediction are important:

  1. prediction can be used to approximate some average patterns, as it would kick out irregular data points (better than Hamming distance), and we can create profiles for all words in the data, assessing divergences
  2. prediction can be used to identify the most important languages for successful prediction (by simply counting, for each language, what prediction rates it yields on average). This would allow us to assess which languages in our sample are "archaic".

Export correspondence patterns to wordlist format

This should be added to a column "CORRESPONDENCES", and the best format seems to be a simple unique identifier in space-separated form for each column of an alignment. This format is redundant, insofar as all participants of a cognate set will have the same vector, but it is probably more convenient than inventing a new structure where, e.g., only one item has a form.
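
For illustration only (the identifier scheme below is made up for the example, not the lingrex format): each word in a cognate set gets one pattern identifier per alignment column, joined by spaces.

def correspondences_column(pattern_ids):
    """Build the space-separated CORRESPONDENCES value from the pattern
    identifiers assigned to the columns of an alignment."""
    return " ".join(str(i) for i in pattern_ids)

# correspondences_column([87, 12, 344]) -> "87 12 344"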

Proto-forms can be added by adding a proto-form and assigning its constituents to patterns. In fact, it means that upon insertion of a proto-form, only the pattern has to be cloned.

If linguists then decide to change patterns, we can offer a way to do so by manually editing the identifier of each pattern.

Re-applying evaluation for pattern matches

The current introduction of information for alignments is based on a unique matching of each alignment site to one given pattern. However, we want to display uncertainty for pattern matching by checking for each concrete pattern whether it could not also be part of another consensus pattern. This means that, before applying our "critics" to each alignment in the data, we should go through all alignments in our data and assign each column to the n patterns with which it is compatible.

handling cross-semantic cognate sets with irregularities

We can handle cross-semantic cognates in lingpy, but we need to make sure that all words in one language are in fact identical. If they are not, the patterns need to be separated from the rest and marked as problematic for the given normal cognate set.

In general, handling cross-semantic cognates is an aspect that deserves a bit more attention and needs to be handled properly. For the time being, it is still a bit shaky how to handle ALL cross-semantic cognates within one analysis.

Can LingRex get frequencies of sound correspondences?

Dear all,

I have a CLDF repository with aligned cognates and want to transform them into this type of thing:

[image: table of sound correspondence frequencies]
Can Lingrex do that? Or is there some other Python software out there that should preferably be used? So far I'm using my own script, but I am not keen on re-inventing the wheel.
Thank you in advance for the support

write_frequency() inflates Frequency: How to use correctly?

I am trying to compute the correspondence patterns for the oliveiraprotopanoan dataset. However, when using CoPaR.write_patterns(), the frequency column counts many cases where the actual sound is not attested. Can this be deactivated?

cop = CoPaR("data/opp_alg.tsv", segments="tokens", transcription="tokens", ref="cogid")
cop.get_sites()
cop.cluster_sites()
cop.sites_to_pattern()
cop.add_patterns()
cop.write_patterns("data/opp_patterns.tsv", proto="Proto-Panoan")

Output:

ID | STRUCTURE | FREQUENCY | Proto-Panoan | Amawaka | Chakobo | Chaninawa | Kakataibo | Kapanawa | Katukina | Kaxarari | Kaxinawa | Korubo | Marinawa | Marubo | Matis | Mayoruna | Poyanawa | Shanenawa | Sharanawa | ShipiboKonibo | Yaminawa | Yawanawa | COGNATESETS | CONCEPTS
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
87-3 | V | 3 | i | i | i | i | i | i | i | a | i | e | i | ĭ | e | e | i | i | i | i | i | i | 120:3, 142:3, 220:4 | canoe / cold / cutia (species of rodent)

This posits an /e/ for Mayoruna, but Mayoruna has no item in cognate set 120. In other cases, the gap is correctly added, but not here. Can this somehow be made stricter?

template_alignment

When a vowel is missing in the morpheme, template_alignment reports an incorrect result: "f ŋ ³¹³" should be assigned the structure "i c t", but the alignment result is "i t".

[paper] correspondence patterns paper: evaluation

Evaluation can be based on the following ideas:

  1. introduce fake borrowings into a dataset on a random basis and try to find them using the prediction algorithm (Hamming distance)
  2. make wrong cognate assumptions (using low thresholds or similar) and cross bad cognates out based on:
     • disentangling cases where a full column is a singleton (crude method)
     • kicking out words from a cognate set which were identified due to the Hamming distance method
  3. estimate the regularity of a dataset
     • counting singletons and non-singletons
     • counting how many words are good
     • making a metric that counts, along the lines of what was done before, the number of words explained versus the number of cognate sets (the objective function for regularity)
  4. identify layers
     • difficult, since layers are intertwined, but could be done on a language-to-language basis
     • can be mentioned in the discussion
  5. improve reconstructions
     • introduce wrong reconstructions and see whether the algorithm picks the right ones

All in all: the evaluation does not need to assess the accuracy of the clique partitioning itself, but instead the accuracy of what can be done on top of it. Clique partitioning is already good in itself, since it directly reflects the thinking of historical linguists.

new methods from the evaluation study

There are three more functions to be added officially to lingrex now, for a paper accepted with minor modifications.

  • a function to compute cross-semantic cognate statistics
  • a function to compute "normal" cognates from salient cognates (annotated with a column for morpheme glosses)
  • a function to compute B-cubed F-scores to assess the degree of variation resulting from strict vs. loose cognate coding
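
A minimal sketch of the B-cubed scores for comparing two cognate codings; the input format (dicts mapping word IDs to cognate-set IDs, with the first coding treated as the test and the second as the reference) is an assumption, not necessarily how lingrex organizes the data:

def bcubed_scores(coding_a, coding_b):
    """Return B-cubed precision, recall, and F-score for two codings."""
    def cluster_members(coding):
        members = {}
        for idx, cog in coding.items():
            members.setdefault(cog, set()).add(idx)
        return members

    clusters_a = cluster_members(coding_a)
    clusters_b = cluster_members(coding_b)
    precisions, recalls = [], []
    for idx in coding_a:
        set_a = clusters_a[coding_a[idx]]
        set_b = clusters_b[coding_b[idx]]
        overlap = len(set_a & set_b)
        precisions.append(overlap / len(set_a))
        recalls.append(overlap / len(set_b))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return p, r, 2 * p * r / (p + r)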
