
lingpy-tutorial's Introduction

LingPy: A Python Library for Automatic Tasks in Historical Linguistics

This repository contains the Python package lingpy which can be used for various tasks in computational historical linguistics.


Authors (Version 2.6.12): Johann-Mattis List and Robert Forkel

Collaborators: Christoph Rzymski, Simon J. Greenhill, Steven Moran, Peter Bouda, Johannes Dellert, Taraka Rama, Tiago Tresoldi, Gereon Kaiping, Frank Nagel, and Patrick Elmer.

LingPy is a Python library for historical linguistics. It is being developed for Python 2.7 and Python 3.x using a single codebase.

Quick Installation

For our latest stable version, you can simply use pip or easy_install for installation:

$ pip install lingpy

or

$ easy_install lingpy

Depending on which easy_install or pip version you use, either the Python 2 or the Python 3 version of LingPy will be installed.

If you want to install the current GitHub version of LingPy on your system, open a terminal and type in the following:

$ git clone https://github.com/lingpy/lingpy/
$ cd lingpy
$ python setup.py install

If the last command above raises an error regarding user permissions (usually "Errno 13"), you can install LingPy in your user-level Python setup instead:

$ python setup.py install --user

In order to use the library, start an interactive Python session and import LingPy as follows:

>>> from lingpy import *

If you want to work on the LingPy code itself, fork the repository on GitHub, then open a terminal and type:

$ git clone https://github.com/<your-github-user>/lingpy/
$ cd lingpy
$ python setup.py develop

This will install LingPy in "development mode", i.e. you will be able to edit the sources in the cloned repository, and the altered code can be imported just like the regular Python package.

lingpy-tutorial's People

Contributors: lingulist, maryewal, simongreenhill, tresoldi, xrotwang


lingpy-tutorial's Issues

change the data which is currently being used

The current data is not in sync with the data we have in the Polynesian project. There, using East Polynesian is a straightforward choice: it gives good coverage and allows for some closer analyses without taking too long. I used the following code block to get the subset:

from lingpy import Wordlist

def subset():
    # East Polynesian doculects with good coverage
    eastern = ['NorthMarquesan_38', 'Austral_128', 'Austral_1213',
            'Tahitian_173', 'Sikaiana_243', 'Maori_85', 'Hawaiian_52',
            'Mangareva_239', 'Tuamotuan_246', 'Rapanui_264']

    wl = Wordlist('polynesian-2.tsv')
    # write only the rows whose doculect is in the East Polynesian list
    wl.output('tsv', filename='east-polynesian', ignore='all', subset=True,
            rows=dict(doculect='in ' + str(eastern)), prettify=False)
    wl = Wordlist('east-polynesian.tsv')
    print(wl.height, wl.width)  # number of concepts, number of doculects

The tutorial should be built in such a way that it takes the most recent version of the Polynesian data built from Lexibank (with corrections provided by @maryewal), and the coverage-based subset function in the story should be changed so that the East Polynesian languages are chosen explicitly, not just some random subset of high-coverage languages.
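The coverage-based selection mentioned above can be sketched in plain Python, independently of the LingPy API. Note that the data layout used here, a flat list of (doculect, concept) pairs, is a simplifying assumption for illustration, not the LingPy wordlist format:

```python
from collections import defaultdict

def coverage(rows):
    """Map each doculect to the number of distinct concepts it attests."""
    concepts = defaultdict(set)
    for doculect, concept in rows:
        concepts[doculect].add(concept)
    return {d: len(c) for d, c in concepts.items()}

def select_doculects(rows, threshold):
    """Keep doculects attesting at least `threshold` distinct concepts."""
    cov = coverage(rows)
    return sorted(d for d, n in cov.items() if n >= threshold)

rows = [("Maori_85", "hand"), ("Maori_85", "eye"),
        ("Hawaiian_52", "hand"),
        ("Rapanui_264", "hand"), ("Rapanui_264", "eye")]
print(select_doculects(rows, 2))  # doculects covering both concepts
```

Replacing the threshold filter with an explicit membership test against the `eastern` list reproduces the behaviour the issue asks for.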

Detail requirements

We use python-newick to display a tree in the tutorial, but this functionality is only available as of python-newick>=0.8. This must be stated clearly.
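One way to state the requirement explicitly is to check the installed version at the top of the notebook. This is a minimal sketch using only the standard library; the helper name and the truncation to three numeric version components are assumptions:

```python
from importlib import metadata

def meets_min_version(package, minimum):
    """Return True if `package` is installed at version >= `minimum`.

    Assumes plain dotted release versions (e.g. "0.8.1"); pre-release
    suffixes are not handled in this sketch.
    """
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    as_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return as_tuple(installed) >= as_tuple(minimum)

if not meets_min_version("newick", "0.8"):
    print("Please install python-newick >= 0.8 to display trees.")
```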

refine functions defined in autocogs.py or decide to drop it

Previously, I had added everything from the ipynb file to a file called "autocogs.py", which can be run from the command line and contains the same commands. Should we stick to this practice, which may be useful for testing or for users who prefer to just use the command line, or should we drop it? If we stick to it, we need to adjust the code (commands are still specified in brackets in the ipynb file).
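If the script is kept, a thin argparse wrapper would make the correspondence between notebook cells and command-line steps explicit. This is a hedged sketch: the step names and the `run_step` dispatcher are hypothetical, not the current contents of autocogs.py:

```python
import argparse

# Hypothetical registry mapping step names to callables; in the real
# script each step would wrap the corresponding notebook cell.
STEPS = {
    "subset": lambda: "select East Polynesian doculects",
    "cognates": lambda: "run automatic cognate detection",
    "export": lambda: "write results to TSV/CLDF",
}

def run_step(name):
    """Dispatch a named pipeline step and return its result."""
    return STEPS[name]()

def main(argv=None):
    parser = argparse.ArgumentParser(prog="autocogs")
    parser.add_argument("step", choices=sorted(STEPS))
    args = parser.parse_args(argv)
    print(run_step(args.step))

main(["subset"])
```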

export data to cldf

This is another more important requirement. In fact, we face the following situation with the data now:

  • we started with simple ortho-profiles, proposed by @maryewal
  • in the meantime, Mary further worked on the data, and added both corrections for entries AND morphological segmentation

In order to keep the tutorial simple, I plan to ignore the segmentation. As this is work that @maryewal will further pursue anyway, the additional work won't be lost, but I plan deliberately to not show it in this version of the data.

What we need to consider, however, is that the dataset is now in an excellent state: it was NOT treated by a machine alone, but manually post-corrected. So our authoritative version of the data is Mary's latest file, which still needs:

  • modified language names (Mary has already sent me the list)
  • deletion of the segmentation markers (so as not to confuse readers)
  • possibly deletion of some superfluous columns

For CLDF, there will be some additional fields, like the URL linking to BVD, etc. I think this is a good example of how data can be successively improved.
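The cleanup steps listed above can be sketched in plain Python. The column names, the "+" morpheme-boundary marker, and the rename table below are assumptions for illustration, not the actual file layout:

```python
# Sketch of the pre-CLDF cleanup: rename languages, strip segmentation
# markers, drop superfluous columns. All concrete names are hypothetical.
RENAME = {"Maori_85": "Maori"}   # assumed rename table from Mary's list
DROP = {"internal_note"}         # assumed superfluous columns

def clean_row(row):
    """Return a cleaned copy of one wordlist row (a plain dict)."""
    cleaned = {k: v for k, v in row.items() if k not in DROP}
    cleaned["doculect"] = RENAME.get(cleaned["doculect"], cleaned["doculect"])
    # drop morpheme-boundary markers from the segmented form
    cleaned["tokens"] = [t for t in cleaned["tokens"] if t != "+"]
    return cleaned

row = {"doculect": "Maori_85", "tokens": ["r", "i", "+", "m", "a"],
       "internal_note": "check later"}
print(clean_row(row))
```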

Adjust for one error and modify output format

alm.get_msa('infomap')['1']

The key should not be a string but an integer, since LingPy now enforces the data type:

alm.get_msa('infomap')[1]

The output format can further be adjusted so that it is not "prettified".

Thanks to Alexandru Craevschi for pointing this out.
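Until all call sites in the tutorial are updated, a small defensive lookup can paper over the key-type change. This is a hedged sketch: the helper is hypothetical, and `msa` stands in for the mapping returned by `get_msa`:

```python
def get_msa_entry(msa, cogid):
    """Look up a cognate-set alignment whether keys are int or str."""
    for key in (cogid, str(cogid)):
        if key in msa:
            return msa[key]
    try:
        return msa[int(cogid)]
    except (KeyError, ValueError, TypeError):
        raise KeyError(cogid)

msa = {1: ["t", "a", "h", "i"]}   # integer keys, as in newer LingPy
print(get_msa_entry(msa, "1"))    # a string key still resolves
```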

add missing sections to be in line with the draft

The draft has two missing sections to be re-worked (with new examples): cognate detection and evaluation / additional stuff (e.g., nexus export).

The tutorial has now been renumbered, so we can refer to its sections directly in the paper draft. For publication, the tutorial should additionally be provided as a refined LaTeX/PDF document (automatically generated LaTeX from ipynb is ugly).

Later on, I will list the missing sections in this thread and try to divide up roles to make sure everybody can contribute something.

Checklist for submission

  • adding final version of data (polynesian.tsv)
  • installation section (with additional libraries, probably only "segments" and "pycldf" and python-igraph)
  • check and synchronize section 4 in tutorial-online with tutorial-draft
  • 4.5 intro on inspection with edictor (maybe add screenshot)
  • 4.6 also do this section with edictor (add screenshot)
  • 5.1 explain the structure of the "DIFF" file in bcubes / lingpy.evaluate
  • 5.3 code for "benefits of segmentation" needs update
  • 6.1 nexus export (add section, export is already in lingpy)
  • 6.2 consider dropping this part, as it is not discussed in tutorial
  • 6.3 cldf export: add section, lingpy needs function for cldf-import, export to cldf is done via @xrotwang's code
  • add references, delete links to "evobib"
  • submit lingpy-2.6
  • official release of segments package
  • small pypi version of pycldf for reference in the tutorial
  • small CLDF version (paper-less, but needed so that the tutorial/main paper can reference it as, say, Forkel et al. 2017)
  • convert to pdf via markdown and pandoc (tedious, but needs to be done to ease submission, some reviewers won't look at the notebook)
  • upload data and code to osf.io to allow for anonymous review
  • change the paper by adding lumper and splitter baselines instead of the random cognate detection, as this is more telling
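Lumper and splitter baselines are telling because of how B-cubed scores behave: a lumper (everything in one cognate set) gets perfect recall, while a splitter (every word in its own set) gets perfect precision. A minimal sketch of B-cubed precision and recall over flat clusterings, in plain Python (not the lingpy.evaluate implementation):

```python
def bcubed(gold, test):
    """B-cubed precision and recall for two clusterings, each given
    as a dict mapping item -> cluster label (same item set assumed)."""
    def clusters(labels):
        out = {}
        for item, lab in labels.items():
            out.setdefault(lab, set()).add(item)
        return out
    g, t = clusters(gold), clusters(test)
    precision = recall = 0.0
    for item in gold:
        gold_cls, test_cls = g[gold[item]], t[test[item]]
        overlap = len(gold_cls & test_cls)
        precision += overlap / len(test_cls)
        recall += overlap / len(gold_cls)
    n = len(gold)
    return precision / n, recall / n

gold = {"a": 1, "b": 1, "c": 2}
lumper = {"a": 1, "b": 1, "c": 1}    # one big cognate set
splitter = {"a": 1, "b": 2, "c": 3}  # every word on its own
print(bcubed(gold, lumper))          # recall is 1.0
print(bcubed(gold, splitter))        # precision is 1.0
```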
