
autocogphylo's People

Contributors

erathorn, gerhardjaeger, ktrama

Forkers

asaber5492

autocogphylo's Issues

selection of subsets based on coverage

Okay, I just checked PN languages, and I set up the following requirements for the data quality:

1. average coverage should be 95% or higher
2. mutual coverage should be > 100 for concept lists with more than 100 concepts, and minimally 90

This leaves us with the following scores:

dataset   mutual coverage   average coverage   languages   concepts
PN        149               95%                53 / 169    183
ST        90                96%                64 / 81     110
IE        161               97%                42 / 53     207
AA        163               95%                58 / 127    200

AN is out: it is by no means close to our criteria and cannot be used in this form, unless you provide another dataset.

I can prepare the data accordingly and submit reduced lists along the lines mentioned above.
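
As a rough illustration of how such a reduction could be done (not the repository's actual code): assume the wordlist has been collapsed into a mapping from language to the set of concepts it covers; the helper below drops the most poorly attested languages until every remaining pair shares at least the required number of concepts. The function name and the threshold are illustrative only.

```python
# Sketch only: iteratively drop the language with the smallest concept
# inventory until every pair of remaining languages shares at least
# `threshold` concepts (the mutual-coverage criterion above).
from itertools import combinations

def reduce_by_mutual_coverage(coverage, threshold):
    """coverage: dict mapping language -> set of concept labels."""
    langs = dict(coverage)
    while len(langs) > 1:
        worst = min(len(langs[a] & langs[b])
                    for a, b in combinations(langs, 2))
        if worst >= threshold:
            break
        # drop the language with the fewest attested concepts first
        weakest = min(langs, key=lambda l: len(langs[l]))
        del langs[weakest]
    return langs

# e.g. keep only languages that pairwise share at least 100 concepts:
# reduced = reduce_by_mutual_coverage(coverage, 100)
```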

adjust pmi score creation to regular wordlist output

I am just testing the online-pmi script, and it works so far, but I'd appreciate being able to see the judgments (also in general), so having an output written to computed/opmi-*.tsv with an additional column "inferred_cognates", etc. would be ideal. In this way, we can compute all scores (like B-cubes) and do the nexus export with LingPy. This would make our system more coherent, as we can check that no errors are introduced by the conversion to nexus.

In technical terms, this is easy: if you provide a Python dictionary with the ID of the original list as key and the cognate-set ID as value, I can add the remaining code to use LingPy for nexus export and for wordlist output!
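
A minimal sketch of what that hand-over could look like, using only the standard library (the column name "ID" and the file paths are assumptions, not the script's actual layout): the inferred cognate classes are appended to the original wordlist as an extra column, so the result can later be fed to LingPy for B-cubes and nexus export.

```python
# Sketch only: append the cognate judgments to the original wordlist,
# keyed by the ID of the original list, and write computed/opmi-*.tsv.
import csv

def write_opmi_wordlist(in_path, out_path, cognates):
    """cognates: dict mapping original word ID -> cognate-set ID."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fields = reader.fieldnames + ["INFERRED_COGNATES"]
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            row["INFERRED_COGNATES"] = cognates.get(row["ID"], "")
            writer.writerow(row)

# write_opmi_wordlist("data/ie.tsv", "computed/opmi-ie.tsv", cognates)
```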

Nexus: taxa and format

When LingPy generates a nexus file, some of the language names contain characters such as spaces and parentheses, which MrBayes complains about. Another issue is the length of the name field in the nexus file. You might want to consider writing a tab-separated line such as:
GERMAN\t010101
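
For illustration, a small sanitiser along these lines would avoid the characters MrBayes rejects (the exact character set MrBayes accepts should still be checked; this is just a sketch):

```python
# Sketch only: replace characters MrBayes chokes on (spaces, parentheses,
# etc.) before writing the taxon name into the nexus matrix.
import re

def safe_taxon(name):
    # keep letters, digits and underscores; replace everything else
    return re.sub(r"[^A-Za-z0-9_]", "_", name)

# "Kui (Huffman 1979)" -> "Kui__Huffman_1979_"
line = f"{safe_taxon('GERMAN')}\t010101"
```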

Turchin

Turchin-based nexus files for all the datasets. I think @LinguList might do this.
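
For reference, the idea behind the Turchin method is that two words are judged cognate when their first two consonant classes (in Dolgopolsky's sense) agree. The toy sketch below only illustrates this criterion; the class table is heavily simplified, and the real mapping should come from the sound-class models used elsewhere in the pipeline.

```python
# Illustration only: the Turchin / Dolgopolsky consonant-class criterion.
# The class table is a small, simplified stand-in, not a full model.
DOLGO = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "k": "K", "g": "K", "q": "K", "x": "K",
    "s": "S", "z": "S",
    "m": "M", "n": "N",
    "r": "R", "l": "R",
    "w": "W", "j": "J",
}

def first_two_classes(segments):
    classes = [DOLGO[s] for s in segments if s in DOLGO]
    return tuple(classes[:2])

def turchin_cognate(word_a, word_b):
    """Cognate if the first two consonant classes of both words agree."""
    return first_two_classes(word_a) == first_two_classes(word_b)

# turchin_cognate(list("hand"), list("hund"))  -> True with this toy table
```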

reference trees

We'll need to add the Glottolog reference trees. Do we still have them? I saw that we don't have glottocodes for all languages in the data. Is this something that can be done quickly, or something we need to worry about?

Pama-Nyungan

@LinguList: do you think we should include Pama-Nyungan? Is it essential? I added the data to the data folder.

improper segmentation in aa data

We have some cases where the segmentation failed: before using ipa2tokens, spaces were not converted to "_", which is the usual way to use the function. Note that ipa2tokens splits on whitespace by default, which has advantages when using the function internally, but it requires removing whitespace (replacing it either with nothing or with an underscore) prior to initial segmentation.

I estimate that there are about 20 errors in the data (judging by eyeballing), and we can tolerate that, but it is important, in case you apply ipa2tokens in the future, to keep in mind how the function works. I might add an additional comment to lingpy.org, but I think the description there is basically exhaustive enough already. The best way to prepare data is in any case to use orthography profiles; this is how we arrived at the segmentation of the PN data and also the ST data. But that requires, of course, more user input than using LingPy for the segmentation task...
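
For future runs, something along these lines should avoid the problem (this assumes LingPy's top-level ipa2tokens import; the form is made up):

```python
# Sketch only: convert word-internal whitespace to "_" before segmenting,
# since ipa2tokens splits on whitespace by default.
from lingpy import ipa2tokens

raw = "kaha pela"                      # a made-up form with an internal space
tokens = ipa2tokens(raw.replace(" ", "_"))
# without the replacement, the space would be treated as a split point
# rather than as part of the form
```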

Generalized quartet distance

Hi. My research lab is currently looking for implementations of generalized quartet distance and was wondering where you obtained your implementation of GQD.

In gqd.py, I see that you call "/home/tarakark/tools/qdist/qdist," so I was wondering if you could point us to where you got qdist from.

Thanks!!

Language names to glottocodes: omitted languages

I changed the language names to glottocodes. I could not find glottocodes for some languages. Overall, this means skipping 23 languages. We end up with 253 languages in total.

Kui(Huffman1979) and Kui(Sriwises1978) had the wrong ISO code in the data file. I am skeptical that it is the same language as Glottolog says. I removed them in any case.

The statistics are as follows:

ST

Language with repeated glottocode Rourou nusu1239
Language with repeated glottocode Written Tibetan tibe1272
Language with repeated glottocode Xiaxe Tibetan amdo1237

AA

Language not found Kui(Huffman1979)
Language not found Kui(Sriwises1978)
Language with repeated glottocode Palaung-Kalaw ruch1235
Language with repeated glottocode So-Khammouane sooo1254
Language with repeated glottocode So-SakonNakhon sooo1254
Language not found Souei-Saravan

AN

Language with repeated glottocode ChuukeseAKATrukese chuu1238
Language with repeated glottocode Iraralay ivat1242
Language with repeated glottocode Isamorong ivat1242
Language with repeated glottocode Itbayat ivat1242
Language with repeated glottocode Ivasay ivat1242
Language with repeated glottocode Katingan ngaj1237
Language with repeated glottocode MalayBahasaIndonesia indo1316
Language with repeated glottocode NakanaiBilekiDialect naka1262
Language with repeated glottocode TagalogAnthonydelaPaz taga1270

PN

Language not found Wirangu-Nauo

IE

Language not found DANISH_FJOLDE
Language not found OLD_SWEDISH
Language with repeated glottocode OSSETIC_IRON osse1243
Language not found STAVANGERSK
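
For transparency, the kind of check behind the report above can be sketched as follows (hypothetical: the real mapping file and function names may differ). It flags languages without a glottocode and glottocodes assigned to more than one language.

```python
# Hypothetical sketch of the consistency check reported above.
from collections import defaultdict

def report_glottocodes(mapping):
    """mapping: dict of language name -> glottocode (None if unmatched)."""
    by_code = defaultdict(list)
    for language, code in mapping.items():
        if code is None:
            print("Language not found", language)
        else:
            by_code[code].append(language)
    for code, languages in by_code.items():
        if len(languages) > 1:
            for language in languages:
                print("Language with repeated glottocode", language, code)
```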

LexStat

Running LexStat and generating nexus files for ABVD (400 languages), ABVD (Bouchard-Cote), and IELex.
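
A hedged sketch of the kind of run meant here (the file name, the number of runs and the clustering threshold are placeholders, not the settings actually used):

```python
# Sketch only: LexStat cognate detection on one dataset; the inferred
# "cogid" column is what the nexus export would then be built from.
from lingpy import LexStat

lex = LexStat("data/abvd.tsv")        # wordlist in LingPy's TSV format
lex.get_scorer(runs=1000)             # compute the language-pair scorer
lex.cluster(method="lexstat", threshold=0.6, ref="cogid")
lex.output("tsv", filename="computed/lexstat-abvd")
```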

Sino-Tibetan data contains many errors and needs to be replaced

I have the impression that the ST data was not segmented following my suggestion, but independently. What is important to know is that I made an explicit SEGMENTATION, so the data that needs to be used with LingPy is the column TOKENS, nothing else.

I'll update the file and replace it with the version which I originally prepared.

Mutual coverage reports

I am just running mutual coverage reports (remember: mutual coverage is the minimal number of concepts shared between any pair of languages). Average coverage means: how many concepts the languages cover on average (out of the full set).

dataset   minimal mutual coverage   average coverage   outlier
st        60                        94%                Bai (74%)
ie        65                        92%                IRISH (45%)
an        9                         60%                Canala (16%)
aa        0                         84%                Jru-Laven (1%)

This shows beyond doubt that we need to refine and clean the data substantially. While we have to tolerate a lower coverage for ST (due to the smaller number of words), I'd suggest we go only with a coverage of 100 concepts for the 200-concept lists. I'll prepare and add the PN data as well, where coverage is at times similarly low. But just imagine: there are languages in AA which have NO words in common. We can't analyse these; this is simply not serious anymore.
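
For concreteness, the two statistics can be computed along these lines, assuming the wordlist has been collapsed into a language-to-concept-set mapping (the names are illustrative, not the report script's actual code):

```python
# Sketch only: the two coverage statistics reported in the table above.
from itertools import combinations

def mutual_coverage(coverage):
    """Smallest number of concepts shared by any pair of languages."""
    return min(len(coverage[a] & coverage[b])
               for a, b in combinations(coverage, 2))

def average_coverage(coverage, all_concepts):
    """Average share of the full concept list covered per language."""
    return sum(len(c) for c in coverage.values()) / (
        len(coverage) * len(all_concepts))
```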
