Comments (32)

LinguList commented on July 21, 2024

Let me check the cleaning of the data first. I'll then recompute the cognates, as Sino-Tibetan was flawed and we had some bugs in the other data as well. I'll try to add a column each, in your preferred format, for SCA, DOLGO, and ASJP, okay?

PhyloStar commented on July 21, 2024

The average coverage is given as a percentage; I got it. It is calculated as the average of the mutual coverage values, converted into a percentage. Average coverage seems to include almost all the languages. The average is a measure that is easily skewed by a few high-coverage language pairs.

Minimal mutual coverage is much more reasonable, since it means that language pairs with a mutual coverage below the threshold are excluded. What happens when a language has low mutual coverage (< 100) with one other language but high mutual coverage with the rest of the languages?

Regarding AN: I added Bouchard-Côté's dataset of 640 languoids to the repo.

LinguList commented on July 21, 2024

Wait: average coverage is just: sum over all languages of (concepts-of-language-n / concepts-in-data), divided by the number of languages.

Minimal mutual coverage is: the minimum over all language pairs of len(set(concepts-language-1).intersection(concepts-language-2)).
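
For concreteness, here is a minimal sketch of both scores in plain Python (my own illustration, not LingPy code; it assumes wordlist is simply a dict mapping each language to the set of concepts for which it has a word):

from itertools import combinations

def average_coverage(wordlist, all_concepts):
    # mean share of the full concept list that each language covers
    return sum(len(c) / len(all_concepts) for c in wordlist.values()) / len(wordlist)

def minimal_mutual_coverage(wordlist):
    # smallest number of concepts shared by any pair of languages
    return min(len(c1 & c2) for c1, c2 in combinations(wordlist.values(), 2))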

LinguList commented on July 21, 2024

The procedure you can see in the script is easy: iterate over all languages, exclude those with low average coverage, and see whether minimal mutual coverage increases; if it does, and the score is good, keep this set.
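
A rough rendering of that loop (my own sketch, not the actual script; it reuses the two helpers above and the same dict-of-sets representation):

def prune_and_check(wordlist, all_concepts, ac_threshold=0.8):
    # drop every language whose individual coverage falls below the
    # threshold, then report the new minimal mutual coverage
    kept = {lang: concepts for lang, concepts in wordlist.items()
            if len(concepts) / len(all_concepts) >= ac_threshold}
    return kept, minimal_mutual_coverage(kept)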

LinguList commented on July 21, 2024

We'll need to clean the 640 languoids, which is a pain in the neck. I wonder whether we should ask Pavel to look into whether the 400 languages that we had already prepared for the SVM paper (split into four sets, but more or less cleaned) could be combined, and we then pick the set with the highest coverage?

PhyloStar commented on July 21, 2024

Now I understand the table much better. So, the mutual coverage requirement reduces PN and AA by more than 50%. One good thing is that the Bayesian programs will converge faster, which is good.

I added the ABVD 400-language dataset to the repo.

LinguList commented on July 21, 2024

Excellent, I'll test right away.

LinguList commented on July 21, 2024

Okay, this result hurts (though I believe there is a systematic error in the data that we introduced when first working on it): there are 640 languages, but with a coverage of 89%, I get only 31 languages! Mutual coverage is then 154.

LinguList commented on July 21, 2024

If we accept 86% average coverage, we get 65 languages, with a mutual coverage of 136.
Or 45 languages with an MC of 144 and an AC of 88%.
Anyway, this hurts, right? So many languages, and not one fully represented?

PhyloStar commented on July 21, 2024

What about the other dataset of 396 languages used in the Gray et al. paper?

https://github.com/PhyloStar/AutoCogPhylo/blob/master/data/abvd2.tsv

LinguList commented on July 21, 2024

Ugly ASJP, I don't want to touch it. But I'll give it a try.

LinguList commented on July 21, 2024

Coverage is better there: 33 languages, an MC of 162, an AC of 92%.

PhyloStar commented on July 21, 2024

Yes. Both coverages are so low.

Sorry for the ASJP. I uploaded the ASJP file and added the original file extracted by Pavel.

LinguList commented on July 21, 2024

No problem ;) we could also handle it with the SVM approach, so it's no big deal.

PhyloStar commented on July 21, 2024

Okay. This is much better. 33 languages for ABVD is not bad given the coverage.

PhyloStar commented on July 21, 2024

I extracted an Oceanic subset of 160 languages from the full ABVD. Do you want to test the coverage on that dataset as well?

PhyloStar commented on July 21, 2024

The 160 languages come from the punctuational bursts paper by Atkinson.

LinguList commented on July 21, 2024

26 languages, an MC of 157, an AC of 89% for Oceanic...

PhyloStar commented on July 21, 2024

Oceanic can be thrown out then.
This issue of mutual coverage is actually problematic for the dating of language phylogenies. If the Bayesian software treats the "?" symbol as ambiguous and does not exclude it from the calculation, the dates will come out too old. Just a side point, not relevant for the current paper, since we don't do dating. This is how Felsenstein treats missing data: encode it as "1" for all states.
This is the main reason why Bouckaert's paper gives 8000 years as the median root age of Indo-European, whereas excluding languages with a high number of missing data points pulls the date towards 6000 years. I will have to check how MrBayes treats missing data.

PhyloStar commented on July 21, 2024

What is the command to run the coverage test? I want to check the mutual coverage for a lower average coverage of around 70%.

LinguList commented on July 21, 2024

You run:

$ python check_data.py an coverage cutout 100

The 100 means: retain only those languages that have a word for more than 100 different concepts.

It returns some scores, first for the original word list, then for the derived one. You need to rename the "IPA" column to "tokens", unless I have already done so, and "Gloss" to "concept"; this works better.

I usually use cutout=180 or cutout=170 for 200-concept lists.

LinguList commented on July 21, 2024

I'm glad you immediately grasped the point about mutual pairwise coverage. It is just one score, for the worst language pair, but it is a warning sign. There are more refined algorithms in LingPy, but this is the easiest way to at least see whether you have a problem in the data.

It's funny: we always thought that having 200-concept lists should be fine for LingPy's LexStat, but we never checked the actual number of concepts for which there is a word in the data. Only when I realized this could I understand why LingPy constantly performs so badly on AN languages.

In our ST dataset (not the one here, but our own freshly collected data), we said that we only retain languages with a coverage of 80% of our base list. Later we reduced the base list, so we now have 89% coverage, which is a good score, I think, but ideally it should be 90% or higher. I discussed this a lot with colleagues; many did not believe me that it is a problem if you have a couple of languages with low mutual coverage or low average coverage. While it is evident why LingPy struggles, it was only my gut feeling that told me this should also have an impact on phylogenies.

In fact, ASJP retains languages with 32 words. This means the potential lowest mutual coverage of two languages is 40 - 8 - 8 = 24 words, as the gaps may not overlap, and the average coverage may be below 70%. If you allow each language to have only 70% of the concepts in the sample, you may end up with a mutual coverage of 40%! This deserves more attention in computational historical linguistics in general!
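
To spell out the arithmetic (my own sketch): with a concept list of size n and two languages attested for fractions c1 and c2 of it, the gaps may be disjoint in the worst case, so the mutual coverage can drop to n - n(1 - c1) - n(1 - c2) = n(c1 + c2 - 1):

def worst_case_mc(n_concepts, c1, c2):
    # if the two languages' gaps do not overlap, every concept missing
    # in either language is also missing from the shared set
    return max(0, round(n_concepts * (c1 + c2 - 1)))

print(worst_case_mc(40, 0.8, 0.8))   # ASJP: 32 of 40 words each -> 24
print(worst_case_mc(200, 0.7, 0.7))  # 70% on a 200-concept list -> 80, i.e. 40%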

PhyloStar commented on July 21, 2024

A coverage of 70% means that the worst-case mutual coverage is 40%. This is really terrifying for longer word lists. Imagine what can happen with ABVD. What we need to know is the average mutual coverage after the initial pruning threshold of X%; this depends on the dataset. The minimal mutual coverage then needs to be at least 70% of the size of the concept list.

I tried to look for the minimal mutual coverage with 70% of the dataset. Here are the statistics for the different datasets:

dataset   mutual coverage   average coverage   languages (kept / total)   concepts
PN        135               155 (84%)          80 / 169                   183
ST        77                90 (81%)           76 / 81                    110
IE        154               175 (84%)          43 / 52                    207
AA        146               170 (85%)          82 / 127                   200

It seems we can salvage more languages with a 70% minimal mutual coverage cutoff.

LinguList commented on July 21, 2024

I'd say, as a rule of thumb, for LexStat-like operations, everything above 100 should in principle work, but I'd prefer even more. In ST we don't have a chance and don't need to bother, so 90 is about the best we can get, but we can also go for the 70% thingy. The good thing with MMC is that we know that this is the worst case, so PN, IE, and AA are above this threshold, and for ST, we don't have a chance. So we could go with that.

Should we just discard AN, or is there a simple way to find out whether the concepts there are equally badly distributed? E.g., by reducing the number of concepts initially, concentrating on the best ones, taking some 180 (as I did in PN, where I only took Swadesh and ABVD concepts, which is why we have 183; compare pn-full.tsv, where we have all concepts, very skewed): could we increase MMC and ACC to arrive at a better ACC + MMC? I have the gut feeling that this is a hard problem to optimize; I was thinking a lot about it, but couldn't come up with a deterministic algorithmic solution.

LinguList commented on July 21, 2024

In general, I think this is quite an interesting problem: find the partition of the dataset, by deleting languages and concepts, which maximizes both the number of languages in the sample and the number of concepts per language for which there is a word. It is difficult to balance these two, and it is not trivial which concepts to delete...
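
A naive greedy heuristic for this (entirely my own sketch, with no claim of optimality; same dict-of-sets representation as above) would drop, one at a time, whichever language buys the biggest gain in minimal mutual coverage:

from itertools import combinations

def mmc(wl):
    # minimal mutual coverage over all language pairs
    return min(len(a & b) for a, b in combinations(wl.values(), 2))

def greedy_prune(wordlist, target_mmc):
    # repeatedly remove the language whose removal raises the minimal
    # mutual coverage the most; deleting concepts could be interleaved
    # analogously, but that blows up the search space considerably
    wl = dict(wordlist)
    while len(wl) > 2 and mmc(wl) < target_mmc:
        best = max(wl, key=lambda lang: mmc({l: c for l, c in wl.items() if l != lang}))
        del wl[best]
    return wl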

PhyloStar commented on July 21, 2024

We will be left with 54 languages and an MMC of 153, which is more than 70%, for the abvd2 dataset. Reviewers might well ask why we discarded abvd2.

A lot of effort went into ABVD, and it is still in such a shape. :(

LinguList commented on July 21, 2024

It's something nobody considered some 10 years ago, I'm afraid. I am so glad we did better with our ST database: I initially insisted on 80% coverage, knowing we could still discard some meanings.

If we add one sentence, quoting lingpy-2.6, saying that our choice of test sets is based on coverage, as we know that this may influence cognate detection, we should be fine, though. So should we start from there? I can add a statement tomorrow and output the datasets in the revised form.

But it's also comforting to see that Bouchard-Côté's ABVD was not better either. This shows that Pavel and I didn't mess things up when cleaning ABVD for the SVM paper...

PhyloStar commented on July 21, 2024

Yes, an 80% threshold is a simple way to prune languages with low coverage. I agree that quoting LingPy should be sufficient. I will run the Turchin, PMI, and Levenshtein systems and generate nexus files. Getting the gold-standard trees should be fast; I will do it if Johannes is on vacation. Gerhard should be back from vacation tomorrow, and he can generate the SVM nexus files. I will put the Bayesian runs on the server. The runs should be fast, since we work with a smaller number of languages and do not perform dating.

I was thinking of a simple procedure to test the effect of coverage on LexStat. On a dataset, apply coverage thresholds ranging from 10% to 100% and prune the languages accordingly. Estimate LexStat parameters on each of the pruned datasets. Then use the trained system to cluster the unpruned dataset; when a language pair is missing from the LexStat training data, use SCA to calculate the word similarities for that pair. This would demonstrate the effect of missing languages on the estimation of sound alignment probabilities in LexStat.

This is not for the current paper, but for a follow-up paper where we look at the effect of hyperparameters such as the weighting parameter in LexStat; in the case of PMI, it would be the number of word pairs to process in each batch. The effect of the input sound classes also needs to be compared. This could go to LREC or a SIGMORPHON workshop paper.
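
A skeleton of what such a run could look like with LingPy's LexStat class (a sketch under assumptions: the pre-pruned input files are hypothetical, and the cross-application step with the SCA fallback for unseen language pairs is left out):

from lingpy import LexStat

for cutoff in range(10, 101, 10):
    # hypothetical files, pruned beforehand at the given coverage cutoff
    lex = LexStat('abvd_pruned_{0}.tsv'.format(cutoff))
    lex.get_scorer(runs=1000)  # estimate the sound-correspondence scorer
    lex.cluster(method='lexstat', threshold=0.6, ref='cogid')
    lex.output('tsv', filename='abvd_lexstat_{0}'.format(cutoff))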

PhyloStar commented on July 21, 2024

A somewhat relevant topic for our discussion -- looking into the data -- on the NAACL blog by the COLING chair:

https://naacl2018.wordpress.com/2017/12/19/putting-the-linguistics-in-computational-linguistics/

At least the first point is quite relevant for us.

LinguList commented on July 21, 2024

Yes, I think we should consider making this a little spin-off project, looking at the degree to which phylogenetic reconstruction algorithms suffer from distorted data. In fact, this is easy to simulate: just take a high-coverage dataset, compute the trees and dates, then delete data points at random, recompute, and compare. The problem: Bayesian approaches take a long time, so running many analyses will be a pain in the neck. But one could probably approximate this by using distance measures.
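
For the random-deletion part, a minimal sketch (my own; build_tree and tree_distance are placeholders for whichever tree method and comparison metric one would pick, not existing code):

import random

def delete_randomly(wordlist, proportion, seed=42):
    # knock out a fixed share of all (language, concept) data points
    rng = random.Random(seed)
    pruned = {lang: set(concepts) for lang, concepts in wordlist.items()}
    points = [(l, c) for l, cs in pruned.items() for c in cs]
    for lang, concept in rng.sample(points, int(proportion * len(points))):
        pruned[lang].discard(concept)
    return pruned

# then compare build_tree(delete_randomly(wl, p)) against build_tree(wl)
# with tree_distance() for p = 0.1, 0.2, ...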

erathorn commented on July 21, 2024

If you consider making this a spin-off project, you should have a look at Igor's work: https://www.lorentzcenter.nl/lc/web/2015/767/abstracts.pdf (p. 35). According to his webpage, there is a corresponding paper currently under review.

There is another thing about MMC we should consider: how bad is it for a particular dataset? By this I mean: is it just one pair of languages with a low MC, while the mean of the remaining language pairs is much higher? In other words, what does the distribution of MC scores look like? This may be another point to add to the paper, to convince reviewers why certain languages, or even entire datasets, need to be thrown out.
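
Something like this would show the distribution (again just a sketch on the dict-of-sets representation used above):

from itertools import combinations
from collections import Counter

def mc_distribution(wordlist, bin_size=10):
    # histogram of pairwise mutual coverage scores, to see whether a low
    # MMC comes from one outlier pair or from the bulk of the pairs
    scores = [len(a & b) for a, b in combinations(wordlist.values(), 2)]
    return Counter(score // bin_size * bin_size for score in scores)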

LinguList commented on July 21, 2024

Yes, I mean, it is trivial to even average MC (AMC) across all language pairs. I refrained from this, as it may hide a particularly bad dataset. LingPy offers both scores, but so far I only considered MMC, since it is faster to compute, and ACC (average concept coverage) additionally shows us whether we have a problem (AN HAS a problem, this is clear now, as its ACC is very low as well). We might add the AMC to our calculations; in LingPy it's just:

>>> from lingpy import Wordlist
>>> from lingpy.compare.sanity import mutual_coverage
>>> wordlist = Wordlist('an.tsv')  # load any wordlist file (filename made up here)
>>> mc = mutual_coverage(wordlist)
>>> amc = sum(mc.values()) / len(mc)  # average over all language pairs

As for Igor's paper: I remember reading it, but it's a pity it was never published, as in this dense form I really don't know what to do with it. But you're right that it may point in a similar direction, although Igor is less pessimistic about humans messing up the coding...
