
autocogphylo's People

Contributors

erathorn, gerhardjaeger, ktrama

Forkers

asaber5492

autocogphylo's Issues

selection of subsets based on coverage

Okay, I just checked PN languages, and I set up the following requirements for the data quality:

1. average coverage should be 95% or higher
2. mutual coverage should be > 100 for concept lists with more than 100 concepts, and minimally 90

This leaves us with the following scores:

dataset   mutual coverage   average coverage   languages   concepts
PN        149               95%                53 / 169    183
ST        90                96%                64 / 81     110
IE        161               97%                42 / 53     207
AA        163               95%                58 / 127    200

AN is out: it is by no means close to our criteria and cannot be used in this form, unless you provide another dataset.

I can prepare the data accordingly and submit reduced lists along the lines mentioned above.
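
As a rough illustration of how such a reduction could be done (not the repository's actual code): assume the wordlist has been collapsed into a mapping from language to the set of concepts it covers; the helper below drops the most poorly attested languages until every remaining pair shares at least the required number of concepts. The function name and the threshold are illustrative only.

```python
# Sketch only: iteratively drop the language with the smallest concept
# inventory until every pair of remaining languages shares at least
# `threshold` concepts (the mutual-coverage criterion above).
from itertools import combinations

def reduce_by_mutual_coverage(coverage, threshold):
    """coverage: dict mapping language -> set of concept labels."""
    langs = dict(coverage)
    while len(langs) > 1:
        worst = min(len(langs[a] & langs[b])
                    for a, b in combinations(langs, 2))
        if worst >= threshold:
            break
        # drop the language with the fewest attested concepts first
        weakest = min(langs, key=lambda l: len(langs[l]))
        del langs[weakest]
    return langs

# e.g. keep only languages that pairwise share at least 100 concepts:
# reduced = reduce_by_mutual_coverage(coverage, 100)
```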

adjust pmi score creation to regular wordlist output

I am just testing the online-pmi script, and it works so far, but I'd appreciate being able to see the judgments (also in general), so having an output written to computed/opmi-*.tsv with an additional column "inferred_cognates", etc. would be ideal. In this way, we can compute all scores (like B-cubes) and do the nexus export with LingPy. This would make our system more coherent, as we can check that no errors are introduced by the conversion to nexus.

In technical terms, this is easy: if you provide a Python dictionary with the ID of the original list as key and the cognate-set ID as value, I can add the remaining code to use LingPy for nexus export and for wordlist output!
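
A minimal sketch of what that hand-over could look like, using only the standard library (the column name "ID" and the file paths are assumptions, not the script's actual layout): the inferred cognate classes are appended to the original wordlist as an extra column, so the result can later be fed to LingPy for B-cubes and nexus export.

```python
# Sketch only: append the cognate judgments to the original wordlist,
# keyed by the ID of the original list, and write computed/opmi-*.tsv.
import csv

def write_opmi_wordlist(in_path, out_path, cognates):
    """cognates: dict mapping original word ID -> cognate-set ID."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fields = reader.fieldnames + ["INFERRED_COGNATES"]
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            row["INFERRED_COGNATES"] = cognates.get(row["ID"], "")
            writer.writerow(row)

# write_opmi_wordlist("data/ie.tsv", "computed/opmi-ie.tsv", cognates)
```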

Nexus: taxa and format

When LingPy generates a nexus file, some of the language names contain characters such as spaces and parentheses, which MrBayes complains about. Another issue is the length of the name field in the nexus file. You might want to consider writing a tab-separated line such as:
GERMAN\t010101
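
For illustration, a small sanitiser along these lines would avoid the characters MrBayes rejects (the exact character set MrBayes accepts should still be checked; this is just a sketch):

```python
# Sketch only: replace characters MrBayes chokes on (spaces, parentheses,
# etc.) before writing the taxon name into the nexus matrix.
import re

def safe_taxon(name):
    # keep letters, digits and underscores; replace everything else
    return re.sub(r"[^A-Za-z0-9_]", "_", name)

# "Kui (Huffman 1979)" -> "Kui__Huffman_1979_"
line = f"{safe_taxon('GERMAN')}\t010101"
```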

Turchin

Turchin-based nexus files for all the datasets. I think @LinguList might do this.
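
For reference, the idea behind the Turchin method is that two words are judged cognate when their first two consonant classes (in Dolgopolsky's sense) agree. The toy sketch below only illustrates this criterion; the class table is heavily simplified, and the real mapping should come from the sound-class models used elsewhere in the pipeline.

```python
# Illustration only: the Turchin / Dolgopolsky consonant-class criterion.
# The class table is a small, simplified stand-in, not a full model.
DOLGO = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "k": "K", "g": "K", "q": "K", "x": "K",
    "s": "S", "z": "S",
    "m": "M", "n": "N",
    "r": "R", "l": "R",
    "w": "W", "j": "J",
}

def first_two_classes(segments):
    classes = [DOLGO[s] for s in segments if s in DOLGO]
    return tuple(classes[:2])

def turchin_cognate(word_a, word_b):
    """Cognate if the first two consonant classes of both words agree."""
    return first_two_classes(word_a) == first_two_classes(word_b)

# turchin_cognate(list("hand"), list("hund"))  -> True with this toy table
```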

reference trees

We'll need to add the Glottolog reference trees. Do we still have them? I saw that we don't have glottocodes for all languages in the data. Is this something that can be done quickly, or something we need to worry about?

Pama-Nyungan

@LinguList: do you think we should include Pama-Nyungan? Is it essential? I added the data to the data folder.

improper segmentation in aa data

We have some cases where the segmentation failed: before using ipa2tokens, spaces were not converted to "_", which is the usual way to use the function. Note that ipa2tokens splits on whitespace by default, which has advantages when using the function internally, but it requires removing whitespace (replacing it either with nothing or with an underscore) prior to initial segmentation.

I estimate that there are about 20 errors in the data (judging by eyeballing), and we can tolerate that, but it is important, in case you apply ipa2tokens in the future, to keep in mind how the function works. I might add an additional comment to lingpy.org, but I think the description there is basically exhaustive enough already. The best way to prepare data is in any case to use orthography profiles; this is how we arrived at the segmentation of the PN data and also the ST data. But that requires, of course, more user input than using LingPy for the segmentation task...
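
For future runs, something along these lines should avoid the problem (this assumes LingPy's top-level ipa2tokens import; the form is made up):

```python
# Sketch only: convert word-internal whitespace to "_" before segmenting,
# since ipa2tokens splits on whitespace by default.
from lingpy import ipa2tokens

raw = "kaha pela"                      # a made-up form with an internal space
tokens = ipa2tokens(raw.replace(" ", "_"))
# without the replacement, the space would be treated as a split point
# rather than as part of the form
```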

Generalized quartet distance

Hi. My research lab is currently looking for implementations of generalized quartet distance and was wondering where you obtained your implementation of GQD.

In gqd.py, I see that you call "/home/tarakark/tools/qdist/qdist," so I was wondering if you could point us to where you got qdist from.

Thanks!!

Language names to glottocodes: omitted languages

I changed the language names to glottocodes. I could not find glottocodes for some languages. Overall, this means skipping 23 languages. We end up with 253 languages in total.

Kui(Huffman1979) and Kui(Sriwises1978) had the wrong ISO code in the data file. I am skeptical that it is the same language as Glottolog says. I removed them in any case.

The statistics are as follows:

ST

Language with repeated glottocode Rourou nusu1239
Language with repeated glottocode Written Tibetan tibe1272
Language with repeated glottocode Xiaxe Tibetan amdo1237

AA

Language not found Kui(Huffman1979)
Language not found Kui(Sriwises1978)
Language with repeated glottocode Palaung-Kalaw ruch1235
Language with repeated glottocode So-Khammouane sooo1254
Language with repeated glottocode So-SakonNakhon sooo1254
Language not found Souei-Saravan

AN

Language with repeated glottocode ChuukeseAKATrukese chuu1238
Language with repeated glottocode Iraralay ivat1242
Language with repeated glottocode Isamorong ivat1242
Language with repeated glottocode Itbayat ivat1242
Language with repeated glottocode Ivasay ivat1242
Language with repeated glottocode Katingan ngaj1237
Language with repeated glottocode MalayBahasaIndonesia indo1316
Language with repeated glottocode NakanaiBilekiDialect naka1262
Language with repeated glottocode TagalogAnthonydelaPaz taga1270

PN

Language not found Wirangu-Nauo

IE

Language not found DANISH_FJOLDE
Language not found OLD_SWEDISH
Language with repeated glottocode OSSETIC_IRON osse1243
Language not found STAVANGERSK
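
For transparency, the kind of check behind the report above can be sketched as follows (hypothetical: the real mapping file and function names may differ). It flags languages without a glottocode and glottocodes assigned to more than one language.

```python
# Hypothetical sketch of the consistency check reported above.
from collections import defaultdict

def report_glottocodes(mapping):
    """mapping: dict of language name -> glottocode (None if unmatched)."""
    by_code = defaultdict(list)
    for language, code in mapping.items():
        if code is None:
            print("Language not found", language)
        else:
            by_code[code].append(language)
    for code, languages in by_code.items():
        if len(languages) > 1:
            for language in languages:
                print("Language with repeated glottocode", language, code)
```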

LexStat

Running LexStat and generating nexus files for ABVD (400 languages), ABVD (Bouchard-Cote), and IELex.
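
A hedged sketch of the kind of run meant here (the file name, the number of runs and the clustering threshold are placeholders, not the settings actually used):

```python
# Sketch only: LexStat cognate detection on one dataset; the inferred
# "cogid" column is what the nexus export would then be built from.
from lingpy import LexStat

lex = LexStat("data/abvd.tsv")        # wordlist in LingPy's TSV format
lex.get_scorer(runs=1000)             # compute the language-pair scorer
lex.cluster(method="lexstat", threshold=0.6, ref="cogid")
lex.output("tsv", filename="computed/lexstat-abvd")
```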

Sino-Tibetan data contains many errors and needs to be replaced

I have the impression that the ST data was not segmented following my suggestion, but independently. What is important to know is that I made an explicit SEGMENTATION, so the data that needs to be used with LingPy is the column TOKENS, nothing else.

I'll update the file and replace it with the version which I originally prepared.

Mutual coverage reports

I am just running mutual coverage reports (remember: mutual coverage is the minimal number of concepts shared between any pair of languages). Average coverage means: how many concepts the languages cover on average (out of the full set).

dataset   minimal mutual coverage   average coverage   outlier
st        60                        94%                Bai (74%)
ie        65                        92%                IRISH (45%)
an        9                         60%                Canala (16%)
aa        0                         84%                Jru-Laven (1%)

This shows beyond doubt that we need to refine and clean the data substantially. While we have to tolerate a lower coverage for ST (due to the smaller number of words), I'd suggest we go only with a coverage of 100 concepts for the 200-concept lists. I'll prepare and add the PN data as well, where coverage is at times similarly low. But just imagine: there are languages in AA which have NO words in common. We can't analyse these; this is simply not serious anymore.
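
For concreteness, the two statistics can be computed along these lines, assuming the wordlist has been collapsed into a language-to-concept-set mapping (the names are illustrative, not the report script's actual code):

```python
# Sketch only: the two coverage statistics reported in the table above.
from itertools import combinations

def mutual_coverage(coverage):
    """Smallest number of concepts shared by any pair of languages."""
    return min(len(coverage[a] & coverage[b])
               for a, b in combinations(coverage, 2))

def average_coverage(coverage, all_concepts):
    """Average share of the full concept list covered per language."""
    return sum(len(c) for c in coverage.values()) / (
        len(coverage) * len(all_concepts))
```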
