
lingrex's Introduction

LingPy: A Python Library for Automatic Tasks in Historical Linguistics

This repository contains the Python package lingpy which can be used for various tasks in computational historical linguistics.


Authors (Version 2.6.12): Johann-Mattis List and Robert Forkel

Collaborators: Christoph Rzymski, Simon J. Greenhill, Steven Moran, Peter Bouda, Johannes Dellert, Taraka Rama, Tiago Tresoldi, Gereon Kaiping, Frank Nagel, and Patrick Elmer.

LingPy is a Python library for historical linguistics. It is being developed for Python 2.7 and Python 3.x using a single codebase.

Quick Installation

For our latest stable version, you can simply use pip or easy_install for installation:

$ pip install lingpy

or

$ easy_install lingpy

Depending on which easy_install or pip version you use, either the Python2 or the Python3 version of LingPy will be installed.

If you want to install the current GitHub version of LingPy on your system, open a terminal and type in the following:

$ git clone https://github.com/lingpy/lingpy/
$ cd lingpy
$ python setup.py install

If the last command above returns an error regarding user permissions (usually "Errno 13"), you can install LingPy in your home Python setup:

$ python setup.py install --user

In order to use the library, start an interactive Python session and import LingPy as follows:

>>> from lingpy import *

To install LingPy to hack on it, fork the repository on GitHub, open a terminal and type:

$ git clone https://github.com/<your-github-user>/lingpy/
$ cd lingpy
$ python setup.py develop

This will install LingPy in "development mode", i.e. you will be able to edit the sources in the cloned repository and import the altered code just like the regular Python package.

lingrex's People

Contributors

fredericblum, lingulist, xrotwang


Forkers

somiyagawa

lingrex's Issues

Cleaning data prior to correspondence pattern analysis

We might need some basic checks as to whether a correspondence pattern analysis is useful at all, since I detected one pattern that causes huge problems:

    {'ID': [365, 371, 370, 367, 364, 369, 368, 366, 362],
 'taxa': ['Hachijo',
  'Hachijo',
  'Kagoshima',
  'Kochi',
  'Kyoto',
  'Oki',
  'Sado',
  'Shuri',
  'Tokyo'],
 'seqs': [['k', 'iː', '-', '-', '-', '-'],
  ['k', 'e', 'b', 'u', 'ɕ', 'o'],
  ['k', 'e', '-', '-', '-', 'i'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'eː', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'iː', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-']],
 'alignment': [['k', 'iː', '-', '-', '-', '-'],
  ['k', 'e', 'b', 'u', 'ɕ', 'o'],
  ['k', 'e', '-', '-', '-', 'i'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'eː', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-'],
  ['k', 'iː', '-', '-', '-', '-'],
  ['k', 'e', '-', '-', '-', '-']],
 'dataset': 'japonic',
 'seq_id': '449 ("hair")'}

Here, we have two words from Hachijo in the same cognate set, but they differ (!). We can argue that for correspondence patterns, strictly cognate words from the same language cannot differ, so a preprocessing step can in fact arbitrarily decide for one of them.

sanity checks on every dataset: strict cognates

  1. all words assigned to the same cognate set should be identical when they come from the same language
  2. all words in the same cognate set from the same language should be aligned identically
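
A minimal sketch of these two checks, assuming the data comes as rows with "doculect", "cogid", "tokens", and "alignment" fields (this input format is an assumption, not the CoPaR API):

from collections import defaultdict

def check_strict_cognates(rows):
    """Flag cognate sets in which one language has non-identical words
    or non-identical alignments."""
    by_set = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_set[row["cogid"]][row["doculect"]].append(row)
    problems = []
    for cogid, languages in by_set.items():
        for doculect, entries in languages.items():
            if len({tuple(e["tokens"]) for e in entries}) > 1:
                problems.append((cogid, doculect, "non-identical words"))
            if len({tuple(e["alignment"]) for e in entries}) > 1:
                problems.append((cogid, doculect, "non-identical alignments"))
    return problems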

test whether context is actually needed

Context may not be needed at all for this project: since we know well that different patterns result from different contexts, it may well be sufficient to compute things once for all alignment columns.

General alignment score derived from pattern compatibility

We can arrive at a principled score for alignment sites based on the frequency of the pattern, how much of the data it explains, etc. The score could either be threshold-based (good vs. bad) or continuous, i.e., derived from comparing across all patterns and somehow computing how well they explain the data.
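
As a toy illustration of the two variants (the scoring scheme here is made up for the example, not part of lingrex): normalize the number of cognate sets a site's pattern covers by the coverage of the best pattern, and threshold that value for the good/bad variant.

def site_score(coverage, max_coverage):
    """Continuous score for an alignment site: coverage of its assigned
    pattern relative to the best-covered pattern in the data."""
    return coverage / max_coverage

def site_is_good(coverage, max_coverage, threshold=0.1):
    """Threshold-based variant: good vs. bad."""
    return site_score(coverage, max_coverage) >= threshold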

Import correspondence patterns

Importing correspondence patterns is not yet provided. Ideally, we'd have a function:

CoPaR.load_patterns(patterns="patterns")

Minor fixes to setup.py

I changed the header of setup.py to:

try:
    from setuptools import setup
except ImportError:
    from distribute_setup import use_setuptools
    use_setuptools()
    from setuptools import setup

It works flawlessly.

new format for patterns, tied to tokens?

If we define alignments by structure, we can basically get rid of the pattern annotation and display it attached to each token rather than to the alignment. This is, however, dangerous, as we also want to store gaps as parts of correspondence patterns.

Patterns from MSA function

This function is needed again and again; one might want to just add it to util:

from lingpy.sequence.sound_classes import token2class

def patterns_from_msa(msa, languages, missing="Ø", gap="-"):
    """
    Retrieve patterns from an msa object.
    """
    # one row per alignment column, one cell per language in `languages`
    out = [[missing for language in languages] for site in msa["alignment"][0]]
    for j in range(len(msa["alignment"][0])):
        for i, language in enumerate(languages):
            if language in msa["taxa"]:
                out[j][i] = msa["alignment"][msa["taxa"].index(language)][j]
    # label each column by the CV class of its first non-gap, non-missing segment
    return [
        (
            token2class(
                ([c for c in row if c not in (gap, missing)] or ["?"])[0],
                "cv",
            ).lower(),
            tuple(row),
        )
        for row in out
    ]

The function takes an MSA object and returns the patterns. One can also do it without the structure involved, or one could pass the structure explicitly, as done in lingrex.
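
For illustration, a toy call with a hand-made MSA dictionary (the data here is invented, following the keys used in the example above):

msa = {
    "taxa": ["Hachijo", "Tokyo"],
    "alignment": [["k", "e"], ["k", "e"]],
}
patterns_from_msa(msa, ["Hachijo", "Kyoto", "Tokyo"])
# -> [('c', ('k', 'Ø', 'k')), ('v', ('e', 'Ø', 'e'))]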

Individual word deviation score inside an alignment

We have tested this before in LingPy but it was less convincing. Now, we can do it with the correspondence patterns in a more principled way. The score should reflect how well a word fits into an alignment. One simple way to do this would be:

  • exclude the word from the alignment and see whether the alignment overall score increases, or whether it leads to a shift in pattern
  • flag the sound in the word which shifts the pattern

EDICTOR should then offer the possibility to display these scores inside an alignment by allowing the user to press a button and see the alignment colored by the alignment scores.
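
A minimal sketch of the leave-one-out idea in the first bullet above, using a crude column-consensus score as a stand-in for the real pattern-based score (the scoring proxy is an assumption for illustration):

from collections import Counter

def site_regularity(column, gap="-"):
    """Toy column score: share of the most frequent non-gap segment."""
    segments = [c for c in column if c != gap]
    if not segments:
        return 1.0
    return Counter(segments).most_common(1)[0][1] / len(segments)

def word_deviation(alignment, index, gap="-"):
    """How much the mean column score improves when the word at `index`
    is excluded from the alignment; high values flag deviant words."""
    def mean_score(rows):
        columns = list(zip(*rows))
        return sum(site_regularity(col, gap) for col in columns) / len(columns)
    reduced = [row for i, row in enumerate(alignment) if i != index]
    return mean_score(reduced) - mean_score(alignment)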

refine workflow

workflow now is:

  1. search for cognates
  2. align cognate sets
  3. merge alignments which are compatible using clique partitioning
  4. compute compatible patterns using clique partitioning
  5. order the patterns by frequency
  6. check for alignment quality by counting how many possible patterns there are and how good they are

Especially for point 5, we should not only order by frequency, but also by how often the patterns contain missing data. There should be some simple hierarchy for comparing patterns, as sketched after this list:

  1. how many cognate sets are covered by a pattern
  2. how many missing spots does a pattern contain
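
A minimal sketch of that hierarchy, assuming each pattern is given as a pair of (covered alignment sites, pattern tuple) with "Ø" marking missing data (these input conventions are assumptions, not the lingrex format):

def rank_patterns(patterns, missing="Ø"):
    """Sort patterns by coverage (descending) and by the number of
    missing slots (ascending)."""
    return sorted(
        patterns,
        key=lambda item: (-len(item[0]), item[1].count(missing)),
    )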

But alternatively, one could even try to find out (see the compatibility sketch after this list):

  1. how many patterns are almost compatible (Hamming distance below a threshold T)
  2. how many patterns are actually invoked in the data
  3. how many alignment sites can actually be ambiguously assigned to patterns
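
A minimal sketch of pattern compatibility and near-compatibility, assuming patterns are tuples with one segment per language and "Ø" as the missing marker:

def compatible(p1, p2, missing="Ø"):
    """Two patterns are compatible if they agree at every position in
    which neither of them has missing data."""
    return all(a == b for a, b in zip(p1, p2) if missing not in (a, b))

def almost_compatible(p1, p2, threshold=1, missing="Ø"):
    """Near-compatibility: at most `threshold` mismatches among the
    positions attested in both patterns."""
    shared = [(a, b) for a, b in zip(p1, p2) if missing not in (a, b)]
    return sum(1 for a, b in shared if a != b) <= threshold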

All these things need to be put into a better workflow, so that it is clear what the method does.

As a goal, I'd like to have:

  • a numerical evaluation of all alignment sites regarding their regularity
  • an individual deviation score for each word inside an alignment

`find_colexified_alignments` fails in some cases

The reason is that the algorithm won't recognize two alignments as the same when one has an additional gap. These cases need to be handled by reducing all gaps in all alignments and THEN comparing their compatibility.
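
One reading of that reduction step, stripping gaps from each aligned sequence before comparison (a sketch only, not the actual lingrex fix):

def reduce_gaps(aligned_seq, gap="-"):
    """Strip gap symbols from an aligned sequence so that two alignments
    differing only in gaps compare as equal."""
    return [segment for segment in aligned_seq if segment != gap]

# reduce_gaps(['k', 'e', '-', '-']) == reduce_gaps(['k', '-', 'e'])  # True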

Add structure from existing alignment

I am currently running into trouble while trying to create a CV structure based on pre-existing alignments with manually trimmed data. Let's use the following data as an example:

ID	DOCULECT	CONCEPT	VALUE	FORM	TOKEN	COGID	ALIGNMENT
1	Marubo	type of bank	kɨnã	kɨnã	k ɨ n ã	60	['k', 'ɨ', 'n', 'ã', '-', '(', '-', '-', ')']
2	Chakobo	type of bank	kɨˈnanɨ	kɨˈnanɨ	k ɨ n a n ɨ	60	['k', 'ɨ', 'n', 'a', '-', '(', 'n', 'ɨ', ')']

Is there any pre-defined way of doing this, or would I need to create a new tokens-column based on the alignments, removing all content within brackets in a dynamic way? @LinguList Maybe you can help me on this.
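
A minimal sketch of deriving trimmed tokens and a CV structure from such an alignment, dropping gaps and anything inside the (unalignable) brackets; the function name and the bracket convention are assumptions, not a pre-defined lingrex routine:

from lingpy.sequence.sound_classes import token2class

def structure_from_alignment(alignment, gap="-", brackets=("(", ")")):
    """Derive trimmed tokens and a CV structure from one alignment row."""
    tokens, in_brackets = [], False
    for segment in alignment:
        if segment == brackets[0]:
            in_brackets = True
        elif segment == brackets[1]:
            in_brackets = False
        elif segment != gap and not in_brackets:
            tokens.append(segment)
    structure = [token2class(t, "cv").lower() for t in tokens]
    return tokens, structure

# structure_from_alignment(['k', 'ɨ', 'n', 'ã', '-', '(', '-', '-', ')'])
# -> (['k', 'ɨ', 'n', 'ã'], ['c', 'v', 'c', 'v'])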

ranked singletons

Rank singletons in the data by assigning them to the closest pattern they could match with. This will give more fine-grained possibilities for analysing what is going wrong in a particular alignment.

prediction experiments

Two aspects of prediction are important:

  1. prediction can be used to approximate some average patterns, as it would kick out irregular data points (better than Hamming distance), and we can create profiles for all words in the data, assessing divergences
  2. prediction can be used to identify the most important languages for successful prediction (by simply counting, for each language, what prediction rates it yields on average). This would allow us to assess which languages in our sample are "archaic".

Export correspondence patterns to wordlist format

This should be added to a column "CORRESPONDENCES", and the best format seems to be a simple unique identifier in space-separated form for each column of an alignment. This format is redundant, insofar as all participants of a cognate set will have the same vector, but it is probably more convenient than inventing a new structure where, e.g., only one item has a form.
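
For illustration only (the identifier scheme below is made up for the example, not the lingrex format): each word in a cognate set gets one pattern identifier per alignment column, joined by spaces.

def correspondences_column(pattern_ids):
    """Build the space-separated CORRESPONDENCES value from the pattern
    identifiers assigned to the columns of an alignment."""
    return " ".join(str(i) for i in pattern_ids)

# correspondences_column([87, 12, 344]) -> "87 12 344"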

Proto-forms can be added by adding a proto-form and assigning its constituents to patterns. In fact, it means that upon insertion of a proto-form, only the pattern has to be cloned.

If linguists then decide to change patterns, we can offer a way to do so by manually editing the identifier of each pattern.

Re-applying evaluation for pattern matches

The current introduction of information for alignments is based on a unique matching of each alignment site to one given pattern. However, we want to display uncertainty for pattern matching by checking for each concrete pattern whether it could not also be part of another consensus pattern. This means that, before applying our "critics" to each alignment in the data, we should go through all alignments in our data and assign each column to the n patterns with which it is compatible.

handling cross-semantic cognate sets with irregularities

We can handle cross-semantic cognates in lingpy, but we need to make sure that all words in one language are in fact identical. If they are not, the patterns need to be separated from the rest and marked as problematic for the given normal cognate set.

In general, handling cross-semantic cognates is an aspect that deserves a bit more attention and needs to be handled properly. For the time being, it is still a bit shaky how to handle ALL cross-semantic cognates within one analysis.

Can LingRex get frequencies of sound correspondences?

Dear all,

I have a CLDF repository with aligned cognates and want to transform them into this type of thing:

[image: table of sound correspondence frequencies]
Can Lingrex do that? Or is there some other Python software out there that should preferably be used? So far I'm using my own script, but I am not keen on re-inventing the wheel.
Thank you in advance for the support

write_frequency() inflates Frequency: How to use correctly?

I am trying to compute the correspondence patterns for the oliveiraprotopanoan dataset. However, when using CoPaR.write_patterns(), the frequency column counts many cases where the actual sound is not attested. Can this be deactivated?

cop = CoPaR("data/opp_alg.tsv", segments="tokens", transcription="tokens", ref="cogid")
cop.get_sites()
cop.cluster_sites()
cop.sites_to_pattern()
cop.add_patterns()
cop.write_patterns("data/opp_patterns.tsv", proto="Proto-Panoan")

Output:

ID | STRUCTURE | FREQUENCY | Proto-Panoan | Amawaka | Chakobo | Chaninawa | Kakataibo | Kapanawa | Katukina | Kaxarari | Kaxinawa | Korubo | Marinawa | Marubo | Matis | Mayoruna | Poyanawa | Shanenawa | Sharanawa | ShipiboKonibo | Yaminawa | Yawanawa | COGNATESETS | CONCEPTS
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
87-3 | V | 3 | i | i | i | i | i | i | i | a | i | e | i | ĭ | e | e | i | i | i | i | i | i | 120:3, 142:3, 220:4 | canoe / cold / cutia (species of rodent)

This posits an /e/ for Mayoruna, but Mayoruna has no item in cognate set 120. In other cases, the gap is correctly added, but not here. Can this somehow be made stricter?

template_alignment

When a vowel is missing in the morpheme, template_alignment reports an incorrect result: "f ŋ ³¹³" should be assigned the structure "i c t", but the alignment result is "i t".

[paper] correspondence patterns paper: evaluation

Evaluation can be based on the following ideas:

  1. introduce fake borrowings into a dataset on a random basis and try to find them using the prediction algorithm (Hamming distance)
  2. make wrong cognate assumptions (using low thresholds or similar) and cross bad cognates out based on:
     • disentangling cases where a full column is a singleton (crude method)
     • kicking out words from a cognate set which were identified due to the Hamming distance method
  3. estimate the regularity of a dataset
     • counting singletons and non-singletons
     • counting how many words are good
     • making a metric that counts, along the lines of what was done before, the number of words explained versus the number of cognate sets (the objective function for regularity)
  4. identify layers
     • difficult, since layers are intertwined, but could be done on a language-to-language basis
     • can be mentioned in the discussion
  5. improve reconstructions
     • introduce wrong reconstructions and see whether the algorithm picks the right ones

All in all: the evaluation does not need to assess the accuracy of the clique partitioning itself, but instead the accuracy of what can be done on top of it. Clique partitioning is already good in itself, since it directly reflects the thinking of historical linguists.

new methods from the evaluation study

There are three more functions to be added officially to lingrex now, for a paper accepted with minor modifications.

  • a function to compute cross-semantic cognate statistics
  • a function to compute "normal" cognates from salient cognates (annotated with a column for morpheme glosses)
  • a function to compute B-cubed F-scores to assess the degree of variation resulting from strict vs. loose cognate coding
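
A minimal sketch of the B-cubed scores for comparing two cognate codings; the input format (dicts mapping word IDs to cognate-set IDs, with the first coding treated as the test and the second as the reference) is an assumption, not necessarily how lingrex organizes the data:

def bcubed_scores(coding_a, coding_b):
    """Return B-cubed precision, recall, and F-score for two codings."""
    def cluster_members(coding):
        members = {}
        for idx, cog in coding.items():
            members.setdefault(cog, set()).add(idx)
        return members

    clusters_a = cluster_members(coding_a)
    clusters_b = cluster_members(coding_b)
    precisions, recalls = [], []
    for idx in coding_a:
        set_a = clusters_a[coding_a[idx]]
        set_b = clusters_b[coding_b[idx]]
        overlap = len(set_a & set_b)
        precisions.append(overlap / len(set_a))
        recalls.append(overlap / len(set_b))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return p, r, 2 * p * r / (p + r)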
