2021task0's Issues
English IPA data
The English IPA data released for part 2 of this task contains a lot of errors. There are many instances of spurious vowel changes between the lemma and inflected forms (e.g., [əfɪlieɪt] 'affiliate' has the 3sg present listed as [əfɪliəts], which is a plural, not the intended verbal form; [θɹɛt] 'threat' has the present participle listed as [θɹiːtɪŋ]). Affricates are inconsistently transcribed (e.g., [t͡ʃɔk] 'chalk' has the 3sg present listed as [tʃæks]; note also the spurious vowel change). There are some past and other forms that I have not encountered before and suspect are not real, such as fret -> frate. (I haven't tried to catalogue the errors comprehensively.)
I have cleaned up the transcriptions by drawing them from an IPA version of the CMU dictionary (available here, https://github.com/menelik3/cmudict-ipa) whenever possible, using the original lemma pronunciations when these weren't available in CMU-IPA, applying the regular rules (with their standard allomorphy) to generate inflected forms that are not in CMU-IPA, and doing some very basic data checking / cleaning. I simplified the transcriptions by removing tie-bars from affricates, syllabic-consonant diacritics, and length marks on vowels; these were inconsistent and aren't needed in a broad transcription. CMU unfortunately has pervasive vacillation between some vowel pairs (e.g., ə ~ ɪ, ɑː ~ ɔ, ɝ ~ ɛr) and I have tried to normalize these within rows (but not across them). There are some remaining entries that I have marked to be checked -- they are relatively few in number and their treatment probably won't matter too much for the task, but it would be nice to eliminate them if they are truly spurious. More generally it would be good for an organizer to look over the revisions that I have made.
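The simplification and the regular rules described above could be sketched roughly as follows. This is a minimal illustration, not the script actually used for the clean-up: the function names, the choice of ɪ as the epenthetic vowel, and the segment classes are assumptions, and the allomorphy rules only inspect the final character, which works because tie bars have already been removed.

```python
# Hypothetical helpers; names and segment classes are assumptions for illustration.
TIE_BARS = {"\u0361", "\u035c"}   # tie bar above / below
SYLLABIC = {"\u0329", "\u030d"}   # syllabicity mark below / above
LENGTH = {"\u02d0", "\u02d1"}     # long / half-long
STRIP = TIE_BARS | SYLLABIC | LENGTH

SIBILANTS = set("szʃʒ")           # tʃ and dʒ end in ʃ and ʒ once tie bars are gone
VOICELESS = set("ptkfθsʃ")


def simplify(ipa):
    """Remove tie bars, syllabicity diacritics, and length marks."""
    return "".join(ch for ch in ipa if ch not in STRIP)


def third_singular(lemma, vowel="ɪ"):
    """Regular 3sg present: -ɪz after sibilants, -s after voiceless, else -z."""
    last = lemma[-1]
    if last in SIBILANTS:
        return lemma + vowel + "z"
    if last in VOICELESS:
        return lemma + "s"
    return lemma + "z"


def past(lemma, vowel="ɪ"):
    """Regular past: -ɪd after t/d, -t after other voiceless sounds, else -d."""
    last = lemma[-1]
    if last in "td":
        return lemma + vowel + "d"
    if last in VOICELESS:
        return lemma + "t"
    return lemma + "d"


lemma = simplify("t\u0361ʃɔk")    # lemma from the post, written with a tie bar
print(lemma, third_singular(lemma), past(lemma))
```

Irregular forms would still have to come from a dictionary such as CMU-IPA; the rules above only cover the regular pattern.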
Two questions. (1) Who should I send the revised data set to? (2) Many of the errors were located with simple string comparisons (principally, testing whether the lemma is a prefix of an inflected form). Are similar tests going to be applied to ensure a basic level of accuracy for the other languages?
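For reference, the prefix test mentioned in (2) can be as simple as the sketch below. The column layout (lemma in column 0, inflected form in column 1) and the function name are assumptions for illustration; the first two sample rows are the examples quoted above, and the third is a regular form added for contrast.

```python
def lemma_prefix_violations(rows, lemma_col=0, form_col=1):
    """Return (row index, lemma, form) for forms that do not start with the lemma."""
    flagged = []
    for i, row in enumerate(rows):
        lemma, form = row[lemma_col], row[form_col]
        if not form.startswith(lemma):
            flagged.append((i, lemma, form))
    return flagged


# Sample rows: (lemma, inflected form).
rows = [
    ("əfɪlieɪt", "əfɪliəts"),   # spurious vowel change, from the post
    ("θɹɛt", "θɹiːtɪŋ"),        # spurious vowel change, from the post
    ("tʃɔk", "tʃɔks"),          # regular form, added for contrast
]

for i, lemma, form in lemma_prefix_violations(rows):
    print(f"row {i}: {form} does not start with {lemma}")
```

Flagged rows are only candidates for review: genuinely irregular forms (e.g., ablaut pasts) fail the test too, so the output needs a manual pass.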
How to check phoneme inventories against CLTS
Assuming your SIGMORPHON data in IPA is sound-segmented (with spaces separating the segments) and stored as a simple tab-separated table, you can use the following code to check whether your data conforms to CLTS (https://clts.clld.org).
To get started, download clts and pyclts:
$ git clone https://github.com/cldf-clts/clts
$ git clone https://github.com/cldf-clts/pyclts
Install pyclts:
$ pip install -e pyclts
Now create the following Python script:
from collections import defaultdict
from sys import argv

from pyclts import CLTS
from tqdm import tqdm as progressbar

# Load the BIPA transcription system from the cloned clts folder.
bipa = CLTS(argv[1]).bipa


def evaluate(data, indices):
    # Collect every occurrence of each space-separated segment.
    sounds = defaultdict(list)
    for i, row in progressbar(enumerate(data)):
        for idx in indices:
            for j, t in enumerate(row[idx].split()):
                sounds[t] += [(i, idx, j, row[idx])]
    # Look up each distinct segment in BIPA.
    out = {}
    for t, values in sounds.items():
        sound = bipa[t]
        if sound.type == 'unknownsound':
            out[t, 'unknownsound', '?'] = values
        else:
            # str(sound) differs from t when t is an alias of a BIPA sound.
            out[t, sound.name, str(sound)] = values
    return out


# Read the tab-separated input file, skipping empty lines.
with open(argv[2], encoding='utf-8') as f:
    data = [[cell.strip() for cell in row.split('\t')] for row in f if row.strip()]

analysis = evaluate(data, [int(x) for x in argv[3:]])

# Write one line per segment, sorted by frequency, name, and token (descending).
with open(argv[2] + '-analysis', 'w', encoding='utf-8') as f:
    f.write('SOUND\tCLTS\tALIAS\tFREQUENCY\tNAME\tEXAMPLE\n')
    for (token, name, sound), values in sorted(
            analysis.items(),
            key=lambda x: (len(x[1]), x[0][1], x[0][0]),
            reverse=True):
        f.write('\t'.join([
            token,
            sound,
            '*' if token != sound else '',
            str(len(values)),
            name,
            values[0][3],
        ]) + '\n')
With this script (saved as clts.py), you can now check your file as follows:
$ python clts.py PATH/TO/CLTS INPUT/FILE IDXA [IDXB]
Here, PATH/TO/CLTS is the path to the clts folder that you cloned via git, and INPUT/FILE is your data file. IDXA is the index of the first column that contains segmented transcriptions (Python style, counting from 0); you can pass further column indices to check more columns. The results are written to INPUT/FILE-analysis.
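To skim the resulting analysis file, something like the following sketch may help. It only relies on the column names the script writes (SOUND, CLTS, ALIAS, FREQUENCY, NAME, EXAMPLE); the function name and the sample rows are made up for illustration and are not real output.

```python
import csv
import io


def summarize(handle):
    """Split the analysis table into unknown sounds and aliased sounds."""
    unknown, aliased = [], []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        entry = (row["SOUND"], row["CLTS"], int(row["FREQUENCY"]))
        if row["NAME"] == "unknownsound":
            unknown.append(entry)      # segment not recognized by BIPA
        elif row["ALIAS"] == "*":
            aliased.append(entry)      # segment recognized, but written differently
    return unknown, aliased


# Made-up sample in the format written by clts.py.
sample = io.StringIO(
    "SOUND\tCLTS\tALIAS\tFREQUENCY\tNAME\tEXAMPLE\n"
    "a\ta\t\t120\tsample vowel name\ta b a\n"
    "g\t\u0261\t*\t7\tsample stop name\tg a\n"
    "X\t?\t*\t3\tunknownsound\ta X a\n"
)
unknown, aliased = summarize(sample)
print("unknown:", unknown)
print("aliased:", aliased)
```

For a real run, replace the sample with the analysis file opened with encoding='utf-8' and newline=''.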
Inconsistencies in German IPA?
It is not clear to me whether this is intended, but I find the German IPA inconsistent in many places. For example, you use the syllabic nasal (nasalis sonans) in many cases for infinitive endings, as in abaʁbaɪ̯tn̩, but then the participle (abarbeitend) is abaʁbaɪ̯tənt. If the syllabic nasal is used at all, it should be used consistently, so the second form should have it too: abaʁbaɪ̯tn̩t would be the normal pronunciation here. (In both cases there is variation between n̩ and ən, but this variation depends on carefulness of speech and education, not on a following consonant, as the spelling suggests.)
Other cases include olʏmpiastaːdioːn, where the st is in fact ʃt, following the normal rule by which st > ʃt except at the end of a syllable, and where the final o is in fact short (it may be long only dialectally, but I find this unlikely).
Furthermore, toːdəzɔpfɐ should be toːdəsɔpfɐ: the genitive s in toːdəs is at a morpheme boundary, so the rule by which intervocalic s > z does not apply.
toːləʁants is missing the tie bar across ts, which is then given in the derived forms (toːləʁant͡sn̩).
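A tie-bar mismatch like this one can be found mechanically. The sketch below is a rough illustration with assumed names and an assumed list of affricate candidates: it reports every consonant pair that occurs both with and without a tie bar in a list of transcriptions.

```python
TIE = "\u0361"  # combining tie bar


def tie_bar_variants(forms, pairs=("ts", "pf", "tʃ")):
    """Map each pair to (tied forms, untied forms) when both spellings occur."""
    report = {}
    for pair in pairs:
        tied = pair[0] + TIE + pair[1]
        tied_forms = [f for f in forms if tied in f]
        untied_forms = [f for f in forms if pair in f]
        if tied_forms and untied_forms:
            report[pair] = (tied_forms, untied_forms)
    return report


# The two forms from the post; the tie bar is written via the TIE constant.
forms = ["toːləʁants", "toːləʁant" + TIE + "sn̩"]
for pair, (tied_forms, untied_forms) in tie_bar_variants(forms).items():
    print(pair, "tied in", tied_forms, "but untied in", untied_forms)
```

An untied sequence is not always an error, since t and s can meet across a morpheme boundary without forming an affricate, so the report is a list of candidates to check by hand.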
All in all, it would be very useful to have this double-checked, since these cases may very likely have an impact on machine learning approaches: they invite a model to learn regularities that are not facts of the language.
I haven't checked the other systems, but in general I recommend looking at fairly standardized resources for German, such as the CELEX data, which has a large lexicon, to check for these cases.
As to the number of inconsistencies: I am afraid I find them in about every fifth item, based on checking the data over the last ten minutes while writing.