
2021task0's People

Contributors

bleonar5, echodroff, elejalea, garrettnicolai, ivri, ryancotterell, ryskina, shijie-wu, tpimentelms


2021task0's Issues

English IPA data

The English IPA data released for part 2 of this task contains a lot of errors. There are many instances of spurious vowel changes between the lemma and inflected forms (e.g., [əfɪlieɪt] 'affiliate' has the 3sg present listed as [əfɪliəts], which is the plural noun rather than the intended verbal form, and [θɹɛt] 'threat' has the present participle listed as [θɹiːtɪŋ]). Affricates are inconsistently transcribed (e.g., [t͡ʃɔk] 'chalk' has the 3sg present listed as [tʃæks]; note also the spurious vowel change). There are some past and other forms that I have not encountered before and suspect are not real, such as fret -> frate. (I haven't tried to catalogue the errors comprehensively.)

I have cleaned up the transcriptions by drawing them from an IPA version of the CMU dictionary (available here, https://github.com/menelik3/cmudict-ipa) whenever possible, using the original lemma pronunciations when these weren't available in CMU-IPA, applying the regular rules (with their standard allomorphy) to generate inflected forms that are not in CMU-IPA, and doing some very basic data checking / cleaning. I simplified the transcriptions by removing tie-bars from affricates, syllabic-consonant diacritics, and length marks on vowels; these were inconsistent and aren't needed in a broad transcription. CMU unfortunately has pervasive vacillation between some vowel pairs (e.g., ə ~ ɪ, ɑː ~ ɔ, ɝ ~ ɛr) and I have tried to normalize these within rows (but not across them). There are some remaining entries that I have marked to be checked -- they are relatively few in number and their treatment probably won't matter too much for the task, but it would be nice to eliminate them if they are truly spurious. More generally it would be good for an organizer to look over the revisions that I have made.
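For concreteness, the "regular rules with their standard allomorphy" are just the usual voicing-based distribution of the suffixes. A minimal sketch of the 3sg rule over space-segmented IPA (not the exact code I used; the segment classes are abbreviated and would need to match the symbol set of the data):

# Regular English 3sg -(e)s allomorphy: [ɪz] after sibilants, [s] after
# other voiceless segments, [z] elsewhere. Segment classes are illustrative only.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def third_singular(segments):
    final = segments[-1]
    if final in SIBILANTS:
        return segments + ["ɪ", "z"]   # e.g. p æ s -> p æ s ɪ z
    if final in VOICELESS:
        return segments + ["s"]        # e.g. ə f ɪ l i eɪ t -> ə f ɪ l i eɪ t s
    return segments + ["z"]            # e.g. k ɔ l -> k ɔ l z

print(" ".join(third_singular("ə f ɪ l i eɪ t".split())))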

Two questions. (1) Who should I send the revised data set to? (2) Many of the errors were located with simple string comparisons (principally, testing whether the lemma is a prefix of an inflected form). Are similar tests going to be applied to ensure a basic level of accuracy for the other languages?
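For reference, the prefix test is nothing more sophisticated than the sketch below. It assumes tab-separated rows of lemma, inflected form, and tags, and it will of course also flag genuine stem changes, so flagged rows still need a manual look:

# Flag rows where the inflected form does not begin with the lemma.
# Assumed layout per row: lemma <TAB> inflected form <TAB> tags.
import sys

def suspicious_rows(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2:
                continue
            # Drop segmentation spaces so we compare plain strings.
            lemma = cols[0].replace(" ", "")
            form = cols[1].replace(" ", "")
            if not form.startswith(lemma):
                yield lineno, lemma, form

if __name__ == "__main__":
    for lineno, lemma, form in suspicious_rows(sys.argv[1]):
        print(lineno, lemma, form, sep="\t")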

How to check phoneme inventories against CLTS

Assuming your SIGMORPHON IPA data are sound-segmented (with spaces used for segmentation) and stored as a simple table, you can use the following code to check whether they conform to CLTS (https://clts.clld.org).

To get started, download clts and pyclts:

$ git clone https://github.com/cldf-clts/clts
$ git clone https://github.com/cldf-clts/pyclts

Install pyclts:

$ pip install -e pyclts

Now, create the following Python script:

from collections import defaultdict
from sys import argv

from pyclts import CLTS
from tqdm import tqdm as progressbar

# Load the broad IPA ("bipa") transcription system from a local CLTS checkout.
bipa = CLTS(argv[1]).bipa


def evaluate(data, indices):
    """Collect all segments in the given columns and look them up in BIPA."""
    sounds = defaultdict(list)
    for i, row in progressbar(enumerate(data)):
        for idx in indices:
            for j, t in enumerate(row[idx].split()):
                # Remember where each segment occurs: (row, column, position, full form).
                sounds[t] += [(i, idx, j, row[idx])]
    out = {}
    for t, values in sounds.items():
        sound = bipa[t]
        if sound.type == 'unknownsound':
            # CLTS cannot interpret this segment at all.
            out[t, 'unknownsound', '?'] = values
        else:
            # Known sound; if str(sound) differs from t, t is merely an alias.
            out[t, sound.name, str(sound)] = values
    return out


# Read the tab-separated input file, skipping empty lines.
with open(argv[2]) as f:
    data = [[cell.strip() for cell in row.split('\t')] for row in f if row.strip()]

# The column indices to check are passed as the remaining command-line arguments.
analysis = evaluate(data, [int(x) for x in argv[3:]])

# Write a report with one line per distinct segment, sorted by frequency.
with open(argv[2] + '-analysis', 'w') as f:
    f.write('SOUND\tCLTS\tALIAS\tFREQUENCY\tNAME\tEXAMPLE\n')
    for (token, name, sound), values in sorted(
            analysis.items(),
            key=lambda x: (len(x[1]), x[0][1], x[0][0]),
            reverse=True):
        f.write('\t'.join([
            token,
            sound,
            '*' if token != sound else '',
            str(len(values)),
            name,
            values[0][3],
            ]) + '\n')

With this script (save it as clts.py), you can now check your file as follows:

$ python clts.py PATH/TO/CLTS INPUT/FILE IDXA [IDXB]

Here, PATH/TO/CLTS is the path to the clts folder that you downloaded via git, and INPUT/FILE is your data file. IDXA is the first column index (Python-style, counting from 0) in which segments can be found; you can pass additional indices to check further columns.
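If the report is long, it can help to pull out only the rows that need attention, i.e. unknown sounds and aliased symbols. A small follow-up script along these lines (save it, say, as filter_analysis.py; the column names are the ones written by clts.py above):

# Print only the analysis rows flagging an unknown sound or an aliased symbol.
import csv
import sys

with open(sys.argv[1], encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["CLTS"] == "?" or row["ALIAS"] == "*":
            print(row["SOUND"], row["CLTS"], row["FREQUENCY"], row["EXAMPLE"], sep="\t")

Run it on the report written by clts.py:

$ python filter_analysis.py INPUT/FILE-analysis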

Inconsistencies in German IPA?

It is not clear to me whether this is intended, but I find the German IPA inconsistent on many points. E.g., the nasalis sonans (syllabic nasal) is used in many cases for infinitive endings, as in abaʁbaɪ̯tn̩, but the participle (abarbeitend) is then abaʁbaɪ̯tənt. Here the nasalis sonans should also be used in the second form if one uses it consistently, as abaʁbaɪ̯tn̩t would be the normal pronunciation (in both cases there is variation between n̩ and ən, but this variation is due to carefulness of speech and education, not to a following consonant, as the spelling might suggest).

Another case is olʏmpiastaːdioːn, where the st is in fact ʃt, following the normal rule by which st > ʃt except at the end of a syllable, and the final o is in fact short (it may be long only dialectally, but I find this unlikely).

Furthermore, toːdəzɔpfɐ should be toːdəsɔpfɐ: the genitive s in toːdəs is at a morpheme break, so the rule by which intervocalic s > z does not apply.

toːləʁants is missing the tie bar across ts, which is then given in the derived forms (toːləʁant͡sn̩).

All in all, it would be very useful to have this double-checked, as these cases may well have an impact on machine learning approaches: such inconsistencies invite models to regularize something that is not a fact of the language.

I haven't checked the other systems, but in general I recommend having a look at more standardized transcription resources for German, such as the CELEX data, which provide a large lexicon, to check for these cases.

As to the number of inconsistencies: I am afraid I find them in every fifth item, based on checking the data for the last ten minutes while writing this.
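Some of these points can be scanned for automatically. As a rough sketch (assuming tab-separated rows with a space-segmented IPA form in the second column; the two patterns are just the ones mentioned above, nothing exhaustive):

# Rough scan for two of the issues mentioned above:
#   - ts written without the tie bar (where t͡s is the intended affricate)
#   - forms ending in ənt where n̩t would be the expected syllabic-nasal variant
# Assumed layout per row: lemma <TAB> space-segmented IPA form <TAB> tags.
import sys

with open(sys.argv[1], encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        joined = cols[1].replace(" ", "")
        if "ts" in joined and "t͡s" not in joined:
            print(lineno, "untied ts", cols[1], sep="\t")
        if joined.endswith("ənt"):
            print(lineno, "ənt (vs. n̩t)", cols[1], sep="\t")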
