
2021task0's People

Contributors

bleonar5, echodroff, elejalea, garrettnicolai, ivri, ryancotterell, ryskina, shijie-wu, tpimentelms


2021task0's Issues

English IPA data

The English IPA data released for part 2 of this task contains a lot of errors. There are many instances of spurious vowel changes between the lemma and inflected forms (e.g., [əfɪlieɪt] 'affiliate' has the 3sg present listed as [əfɪliəts], which is the plural noun rather than the intended verbal form, and [θɹɛt] 'threat' has the present participle listed as [θɹiːtɪŋ]). Affricates are inconsistently transcribed (e.g., [t͡ʃɔk] 'chalk' has the 3sg present listed as [tʃæks]; note also the spurious vowel change). There are some past and other forms that I have not encountered before and suspect are not real, such as fret -> frate. (I haven't tried to catalogue the errors comprehensively.)

I have cleaned up the transcriptions by drawing them from an IPA version of the CMU dictionary (available here, https://github.com/menelik3/cmudict-ipa) whenever possible, using the original lemma pronunciations when these weren't available in CMU-IPA, applying the regular rules (with their standard allomorphy) to generate inflected forms that are not in CMU-IPA, and doing some very basic data checking / cleaning. I simplified the transcriptions by removing tie-bars from affricates, syllabic-consonant diacritics, and length marks on vowels; these were inconsistent and aren't needed in a broad transcription. CMU unfortunately has pervasive vacillation between some vowel pairs (e.g., ə ~ ɪ, ɑː ~ ɔ, ɝ ~ ɛr) and I have tried to normalize these within rows (but not across them). There are some remaining entries that I have marked to be checked -- they are relatively few in number and their treatment probably won't matter too much for the task, but it would be nice to eliminate them if they are truly spurious. More generally it would be good for an organizer to look over the revisions that I have made.
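For concreteness, the "regular rules with their standard allomorphy" are just the usual voicing-based distribution of the suffixes. A minimal sketch of the 3sg rule over space-segmented IPA (not the exact code I used; the segment classes are abbreviated and would need to match the symbol set of the data):

# Regular English 3sg -(e)s allomorphy: [ɪz] after sibilants, [s] after
# other voiceless segments, [z] elsewhere. Segment classes are illustrative only.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def third_singular(segments):
    final = segments[-1]
    if final in SIBILANTS:
        return segments + ["ɪ", "z"]   # e.g. p æ s -> p æ s ɪ z
    if final in VOICELESS:
        return segments + ["s"]        # e.g. ə f ɪ l i eɪ t -> ə f ɪ l i eɪ t s
    return segments + ["z"]            # e.g. k ɔ l -> k ɔ l z

print(" ".join(third_singular("ə f ɪ l i eɪ t".split())))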

Two questions. (1) Who should I send the revised data set to? (2) Many of the errors were located with simple string comparisons (principally, testing whether the lemma is a prefix of an inflected form). Are similar tests going to be applied to ensure a basic level of accuracy for the other languages?
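For reference, the prefix test is nothing more sophisticated than the sketch below. It assumes tab-separated rows of lemma, inflected form, and tags, and it will of course also flag genuine stem changes, so flagged rows still need a manual look:

# Flag rows where the inflected form does not begin with the lemma.
# Assumed layout per row: lemma <TAB> inflected form <TAB> tags.
import sys

def suspicious_rows(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2:
                continue
            # Drop segmentation spaces so we compare plain strings.
            lemma = cols[0].replace(" ", "")
            form = cols[1].replace(" ", "")
            if not form.startswith(lemma):
                yield lineno, lemma, form

if __name__ == "__main__":
    for lineno, lemma, form in suspicious_rows(sys.argv[1]):
        print(lineno, lemma, form, sep="\t")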

How to check phoneme inventories against CLTS

Assuming your SIGMORPHON IPA data are sound-segmented (with spaces used for segmentation) and stored as a simple table, you can use the following code to check whether they conform to CLTS (https://clts.clld.org).

To get started, download clts and pyclts:

$ git clone https://github.com/cldf-clts/clts
$ git clone https://github.com/cldf-clts/pyclts

Install pyclts:

$ pip install -e pyclts

Now, create the following Python script:

from collections import defaultdict
from sys import argv

from pyclts import CLTS
from tqdm import tqdm as progressbar

# Load the broad IPA ("bipa") transcription system from a local CLTS checkout.
bipa = CLTS(argv[1]).bipa


def evaluate(data, indices):
    """Collect all segments in the given columns and look them up in BIPA."""
    sounds = defaultdict(list)
    for i, row in progressbar(enumerate(data)):
        for idx in indices:
            for j, t in enumerate(row[idx].split()):
                # Remember where each segment occurs: (row, column, position, full form).
                sounds[t] += [(i, idx, j, row[idx])]
    out = {}
    for t, values in sounds.items():
        sound = bipa[t]
        if sound.type == 'unknownsound':
            # CLTS cannot interpret this segment at all.
            out[t, 'unknownsound', '?'] = values
        else:
            # Known sound; if str(sound) differs from t, t is merely an alias.
            out[t, sound.name, str(sound)] = values
    return out


# Read the tab-separated input file, skipping empty lines.
with open(argv[2]) as f:
    data = [[cell.strip() for cell in row.split('\t')] for row in f if row.strip()]

# The column indices to check are passed as the remaining command-line arguments.
analysis = evaluate(data, [int(x) for x in argv[3:]])

# Write a report with one line per distinct segment, sorted by frequency.
with open(argv[2] + '-analysis', 'w') as f:
    f.write('SOUND\tCLTS\tALIAS\tFREQUENCY\tNAME\tEXAMPLE\n')
    for (token, name, sound), values in sorted(
            analysis.items(),
            key=lambda x: (len(x[1]), x[0][1], x[0][0]),
            reverse=True):
        f.write('\t'.join([
            token,
            sound,
            '*' if token != sound else '',
            str(len(values)),
            name,
            values[0][3],
            ]) + '\n')

With this script (save it as clts.py), you can now check your file as follows:

$ python clts.py PATH/TO/CLTS INPUT/FILE IDXA [IDXB]

Here, PATH/TO/CLTS is the path to the clts folder that you downloaded via git, and INPUT/FILE is your data file. IDXA is the first column index (Python-style, counting from 0) in which segments can be found; you can pass additional indices to check further columns.
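If the report is long, it can help to pull out only the rows that need attention, i.e. unknown sounds and aliased symbols. A small follow-up script along these lines (save it, say, as filter_analysis.py; the column names are the ones written by clts.py above):

# Print only the analysis rows flagging an unknown sound or an aliased symbol.
import csv
import sys

with open(sys.argv[1], encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["CLTS"] == "?" or row["ALIAS"] == "*":
            print(row["SOUND"], row["CLTS"], row["FREQUENCY"], row["EXAMPLE"], sep="\t")

Run it on the report written by clts.py:

$ python filter_analysis.py INPUT/FILE-analysis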

Inconsistencies in German IPA?

It is not clear to me whether this is intended, but I find the German IPA inconsistent on many points. E.g., the nasalis sonans (syllabic nasal) is used in many cases for infinitive endings, as in abaʁbaɪ̯tn̩, but the participle (abarbeitend) is then abaʁbaɪ̯tənt. Here the nasalis sonans should also be used in the second form if one uses it consistently, as abaʁbaɪ̯tn̩t would be the normal pronunciation (in both cases there is variation between n̩ and ən, but this variation is due to carefulness of speech and education, not to a following consonant, as the spelling might suggest).

Another case is olʏmpiastaːdioːn, where the st is in fact ʃt, following the normal rule by which st > ʃt except at the end of a syllable, and the final o is in fact short (it may be long only dialectally, but I find this unlikely).

Furthermore, toːdəzɔpfɐ should be toːdəsɔpfɐ: the genitive s in toːdəs is at a morpheme break, so the rule by which intervocalic s > z does not apply.

toːləʁants is missing the tie bar across ts, which is then given in the derived forms (toːləʁant͡sn̩).

All in all, it would be very useful to have this double-checked, as these cases may well have an impact on machine learning approaches: such inconsistencies invite models to regularize something that is not a fact of the language.

I haven't checked the other systems, but in general I recommend having a look at more standardized transcription resources for German, such as the CELEX data, which provide a large lexicon, to check for these cases.

As to the number of inconsistencies: I am afraid I find them in every fifth item, based on checking the data for the last ten minutes while writing this.
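Some of these points can be scanned for automatically. As a rough sketch (assuming tab-separated rows with a space-segmented IPA form in the second column; the two patterns are just the ones mentioned above, nothing exhaustive):

# Rough scan for two of the issues mentioned above:
#   - ts written without the tie bar (where t͡s is the intended affricate)
#   - forms ending in ənt where n̩t would be the expected syllabic-nasal variant
# Assumed layout per row: lemma <TAB> space-segmented IPA form <TAB> tags.
import sys

with open(sys.argv[1], encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        joined = cols[1].replace(" ", "")
        if "ts" in joined and "t͡s" not in joined:
            print(lineno, "untied ts", cols[1], sep="\t")
        if joined.endswith("ənt"):
            print(lineno, "ənt (vs. n̩t)", cols[1], sep="\t")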
