
lingpy-tutorial's Introduction

LingPy: A Python Library for Automatic Tasks in Historical Linguistics

This repository contains the Python package lingpy which can be used for various tasks in computational historical linguistics.


Authors (Version 2.6.12): Johann-Mattis List and Robert Forkel

Collaborators: Christoph Rzymski, Simon J. Greenhill, Steven Moran, Peter Bouda, Johannes Dellert, Taraka Rama, Tiago Tresoldi, Gereon Kaiping, Frank Nagel, and Patrick Elmer.

LingPy is a Python library for historical linguistics. It is being developed for Python 2.7 and Python 3.x using a single codebase.

Quick Installation

For our latest stable version, you can simply use pip or easy_install for installation:

$ pip install lingpy

or

$ easy_install lingpy

Depending on which easy_install or pip version you use, either the Python 2 or the Python 3 version of LingPy will be installed.

If you want to install the current GitHub version of LingPy on your system, open a terminal and type in the following:

$ git clone https://github.com/lingpy/lingpy/
$ cd lingpy
$ python setup.py install

If the last command above raises an error regarding user permissions (usually "Errno 13"), you can install LingPy in your user-level Python setup instead:

$ python setup.py install --user

In order to use the library, start an interactive Python session and import LingPy as follows:

>>> from lingpy import *

If you want to work on the LingPy code itself, fork the repository on GitHub, then open a terminal and type:

$ git clone https://github.com/<your-github-user>/lingpy/
$ cd lingpy
$ python setup.py develop

This will install LingPy in "development mode", i.e. you will be able to edit the sources in the cloned repository, and the altered code can be imported just like the regular Python package.

lingpy-tutorial's People

Contributors: lingulist, maryewal, simongreenhill, tresoldi, xrotwang


lingpy-tutorial's Issues

change the data which is currently being used

The current data is not in sync with the data we have in the Polynesian project. There, using East Polynesian is a straightforward choice: it gives good coverage and allows for some closer analyses without taking too long. I used the following code block to get the subset:

from lingpy import Wordlist

def subset():
    # East Polynesian doculects with good coverage
    eastern = ['NorthMarquesan_38', 'Austral_128', 'Austral_1213',
            'Tahitian_173', 'Sikaiana_243', 'Maori_85', 'Hawaiian_52',
            'Mangareva_239', 'Tuamotuan_246', 'Rapanui_264']

    wl = Wordlist('polynesian-2.tsv')
    # write only the rows whose doculect is in the East Polynesian list
    wl.output('tsv', filename='east-polynesian', ignore='all', subset=True,
            rows=dict(doculect='in ' + str(eastern)), prettify=False)
    wl = Wordlist('east-polynesian.tsv')
    print(wl.height, wl.width)  # number of concepts, number of doculects

The tutorial should be built in such a way that it takes the most recent version of the Polynesian data built from Lexibank (with corrections provided by @maryewal), and the coverage-based subset function in the story should be changed so that the East Polynesian languages are chosen explicitly, not just some random subset of high-coverage languages.
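The coverage-based selection mentioned above can be sketched in plain Python, independently of the LingPy API. Note that the data layout used here, a flat list of (doculect, concept) pairs, is a simplifying assumption for illustration, not the LingPy wordlist format:

```python
from collections import defaultdict

def coverage(rows):
    """Map each doculect to the number of distinct concepts it attests."""
    concepts = defaultdict(set)
    for doculect, concept in rows:
        concepts[doculect].add(concept)
    return {d: len(c) for d, c in concepts.items()}

def select_doculects(rows, threshold):
    """Keep doculects attesting at least `threshold` distinct concepts."""
    cov = coverage(rows)
    return sorted(d for d, n in cov.items() if n >= threshold)

rows = [("Maori_85", "hand"), ("Maori_85", "eye"),
        ("Hawaiian_52", "hand"),
        ("Rapanui_264", "hand"), ("Rapanui_264", "eye")]
print(select_doculects(rows, 2))  # doculects covering both concepts
```

Replacing the threshold filter with an explicit membership test against the `eastern` list reproduces the behaviour the issue asks for.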

Detail requirements

We use python-newick to display a tree in the tutorial, but this functionality is only available as of python-newick>=0.8. This must be stated clearly.
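One way to state the requirement explicitly is to check the installed version at the top of the notebook. This is a minimal sketch using only the standard library; the helper name and the truncation to three numeric version components are assumptions:

```python
from importlib import metadata

def meets_min_version(package, minimum):
    """Return True if `package` is installed at version >= `minimum`.

    Assumes plain dotted release versions (e.g. "0.8.1"); pre-release
    suffixes are not handled in this sketch.
    """
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    as_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return as_tuple(installed) >= as_tuple(minimum)

if not meets_min_version("newick", "0.8"):
    print("Please install python-newick >= 0.8 to display trees.")
```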

refine functions defined in autocogs.py or decide to drop it

Previously, I had added everything from the ipynb file to a file called "autocogs.py", which can be run from the command line and contains the same commands. Should we stick to this practice, which may be useful for testing or for users who prefer to just use the command line, or should we drop it? If we stick to it, we need to adjust the code (commands are still specified in brackets in the ipynb file).
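If the script is kept, a thin argparse wrapper would make the correspondence between notebook cells and command-line steps explicit. This is a hedged sketch: the step names and the `run_step` dispatcher are hypothetical, not the current contents of autocogs.py:

```python
import argparse

# Hypothetical registry mapping step names to callables; in the real
# script each step would wrap the corresponding notebook cell.
STEPS = {
    "subset": lambda: "select East Polynesian doculects",
    "cognates": lambda: "run automatic cognate detection",
    "export": lambda: "write results to TSV/CLDF",
}

def run_step(name):
    """Dispatch a named pipeline step and return its result."""
    return STEPS[name]()

def main(argv=None):
    parser = argparse.ArgumentParser(prog="autocogs")
    parser.add_argument("step", choices=sorted(STEPS))
    args = parser.parse_args(argv)
    print(run_step(args.step))

main(["subset"])
```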

export data to cldf

This is another more important requirement. In fact, we face the following situation with the data now:

  • we started with simple ortho-profiles, proposed by @maryewal
  • in the meantime, Mary further worked on the data, and added both corrections for entries AND morphological segmentation

In order to keep the tutorial simple, I plan to ignore the segmentation. As this is work that @maryewal will further pursue anyway, the additional work won't be lost, but I plan deliberately to not show it in this version of the data.

What we need to consider, however, is that the dataset is now in an excellent state: it was NOT treated by a machine alone, but manually post-corrected. So our authoritative version of the data is Mary's latest file, which still needs:

  • modified language names (Mary has already sent me the list)
  • deletion of the segmentation markers (so as not to confuse readers)
  • possibly deletion of some superfluous columns

For CLDF, there will be some additional fields, like the URL linking to BVD, etc. I think this is a good example of how data can be successively improved.
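The cleanup steps listed above can be sketched in plain Python. The column names, the "+" morpheme-boundary marker, and the rename table below are assumptions for illustration, not the actual file layout:

```python
# Sketch of the pre-CLDF cleanup: rename languages, strip segmentation
# markers, drop superfluous columns. All concrete names are hypothetical.
RENAME = {"Maori_85": "Maori"}   # assumed rename table from Mary's list
DROP = {"internal_note"}         # assumed superfluous columns

def clean_row(row):
    """Return a cleaned copy of one wordlist row (a plain dict)."""
    cleaned = {k: v for k, v in row.items() if k not in DROP}
    cleaned["doculect"] = RENAME.get(cleaned["doculect"], cleaned["doculect"])
    # drop morpheme-boundary markers from the segmented form
    cleaned["tokens"] = [t for t in cleaned["tokens"] if t != "+"]
    return cleaned

row = {"doculect": "Maori_85", "tokens": ["r", "i", "+", "m", "a"],
       "internal_note": "check later"}
print(clean_row(row))
```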

Adjust for one error and modify output format

alm.get_msa('infomap')['1']

The key should not be a string but an integer, since LingPy now enforces the data type:

alm.get_msa('infomap')[1]

The output format can further be adjusted so that it is not "prettified".

Thanks to Alexandru Craevschi for pointing this out.
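Until all call sites in the tutorial are updated, a small defensive lookup can paper over the key-type change. This is a hedged sketch: the helper is hypothetical, and `msa` stands in for the mapping returned by `get_msa`:

```python
def get_msa_entry(msa, cogid):
    """Look up a cognate-set alignment whether keys are int or str."""
    for key in (cogid, str(cogid)):
        if key in msa:
            return msa[key]
    try:
        return msa[int(cogid)]
    except (KeyError, ValueError, TypeError):
        raise KeyError(cogid)

msa = {1: ["t", "a", "h", "i"]}   # integer keys, as in newer LingPy
print(get_msa_entry(msa, "1"))    # a string key still resolves
```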

add missing sections to be in line with the draft

The draft has two missing sections to be re-worked (with new examples): cognate detection and evaluation / additional stuff (e.g., nexus export).

The tutorial has now been renumbered, so we can refer to its sections directly in the paper draft. For publication, the tutorial should additionally be provided as a refined LaTeX/PDF document (automatically generated LaTeX from ipynb is ugly).

Later on, I will list the missing sections in this thread and try to divide up roles to make sure everybody can contribute something.

Checklist for submission

  • adding final version of data (polynesian.tsv)
  • installation section (with additional libraries, probably only "segments" and "pycldf" and python-igraph)
  • check and synchronize section 4 in tutorial-online with tutorial-draft
  • 4.5 intro on inspection with edictor (maybe add screenshot)
  • 4.6 also do this section with edictor (add screenshot)
  • 5.1 explain the structure of the "DIFF" file in bcubes / lingpy.evaluate
  • 5.3 code for "benefits of segmentation" needs update
  • 6.1 nexus export (add section, export is already in lingpy)
  • 6.2 consider dropping this part, as it is not discussed in tutorial
  • 6.3 cldf export: add section, lingpy needs function for cldf-import, export to cldf is done via @xrotwang's code
  • add references, delete links to "evobib"
  • submit lingpy-2.6
  • official release of segments package
  • small pypi version of pycldf for reference in the tutorial
  • small CLDF version (paper-less, but needed so that the tutorial/main paper can reference it as, say, Forkel et al. 2017)
  • convert to pdf via markdown and pandoc (tedious, but needs to be done to ease submission, some reviewers won't look at the notebook)
  • upload data and code to osf.io to allow for anonymous review
  • change the paper by adding lumper and splitter baselines instead of the random cognate detection, as this is more telling
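Lumper and splitter baselines are telling because of how B-cubed scores behave: a lumper (everything in one cognate set) gets perfect recall, while a splitter (every word in its own set) gets perfect precision. A minimal sketch of B-cubed precision and recall over flat clusterings, in plain Python (not the lingpy.evaluate implementation):

```python
def bcubed(gold, test):
    """B-cubed precision and recall for two clusterings, each given
    as a dict mapping item -> cluster label (same item set assumed)."""
    def clusters(labels):
        out = {}
        for item, lab in labels.items():
            out.setdefault(lab, set()).add(item)
        return out
    g, t = clusters(gold), clusters(test)
    precision = recall = 0.0
    for item in gold:
        gold_cls, test_cls = g[gold[item]], t[test[item]]
        overlap = len(gold_cls & test_cls)
        precision += overlap / len(test_cls)
        recall += overlap / len(gold_cls)
    n = len(gold)
    return precision / n, recall / n

gold = {"a": 1, "b": 1, "c": 2}
lumper = {"a": 1, "b": 1, "c": 1}    # one big cognate set
splitter = {"a": 1, "b": 2, "c": 3}  # every word on its own
print(bcubed(gold, lumper))          # recall is 1.0
print(bcubed(gold, splitter))        # precision is 1.0
```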
