phoible / dev
PHOIBLE data and development.
Home Page: https://phoible.org/
License: GNU General Public License v3.0
Add phonological inventory data from the Woi-Bhalaga dialect of Loom (Guinean)
See: 45mishchenko.pdf
Something to think about for the long-term. It is possible to release R packages that are data sets only. It would be a nice distribution model (worldwide mirrors) with a built-in versioning and upgrade mechanism. We could also release some purpose-built functions (trump, feature reduction, etc) along with it. Thoughts?
Out of 2211 phonemes, 221 are not merging properly, although at least some of them are definitely present in the feature table.
1747 Ramaswami knn Konkani 1 64141 2856 d̤z consonant c-d-c 3
1318 CASL cce Copi 1 46057 2509 d̤z̤ consonant c-d-c-d 4
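One way to narrow down the failed merges is to separate phonemes that are truly absent from the feature table from those that only differ in Unicode normalization. The sketch below is an illustration, not the aggregation script's logic: the function name and the toy data are hypothetical, and normalization is only one possible cause of the mismatches.

```python
import unicodedata


def unmatched_phonemes(phonemes, feature_segments):
    """Classify phonemes that fail an exact match against the feature table."""
    exact = set(feature_segments)
    # Assumption: comparing NFD forms is enough to reveal normalization-only mismatches.
    normalized = {unicodedata.normalize("NFD", s) for s in feature_segments}
    report = {"missing": [], "normalization_only": []}
    for phoneme in phonemes:
        if phoneme in exact:
            continue  # merges fine as-is
        if unicodedata.normalize("NFD", phoneme) in normalized:
            report["normalization_only"].append(phoneme)
        else:
            report["missing"].append(phoneme)
    return report


# Toy data: the feature table holds c + combining cedilla, the inventory holds precomposed ç.
report = unmatched_phonemes(["\u00e7", "d\u0324z"], ["c\u0327"])
print(report)
```

Anything listed under `missing` would need a new row in the feature table; anything under `normalization_only` points at the encoding of the source data instead.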
S. Moisik points out that our feature set probably can't distinguish between a pre-stopped nasal and a stop with nasal release. This is possibly even a within-language contrast in Wolof? Will need to think about how to distinguish these segment types featurally.
There are some remarks in there that don't seem to be true (e.g., about having merged the Africa and SE Asia data into one source file). Fix after #28 is merged.
of Guinea, source from Vydrin.
Also a good source:
http://www.deniscreissels.fr/public/Creissels-lexique_mandinka_2012.pdf
Not sure if this is an issue @drammock, but I'm in the middle of the aspiration cleaning, so I wanted to file a quick issue so as not to forget what I came across (Source, LanguageCode, Phoneme, Allophone):
ra awa cçʰ NA
ra ben cçʰ NA
ra bft cçʰ NA
ra bkk cçʰ NA
ra bns cçʰ NA
ra cdn cçʰ NA
gm xuu bʰ bʰ
ph ahk cçʰ cçʰ
spa aka cçʰ cçʰ
http://phoible.org/languages/daf
has been split into Dan [dnj] and Kla-Dan [lda]
Several SAPHON languages list two ISO codes for the same inventory:
Chorote lists crq crt
Bolivian Quechua lists qul quh
Ancash Quechua lists qwa qws
Huaylas-Conchucos Quechua lists qxn qwh
Any idea why this is, or how we should choose the "correct" one, @bambooforest ?
This issue is for keeping track of inventories that are known to have pitch accents among their tonemes, and how the pitch accents were encoded. It should not be considered exhaustive.
nld, dialect: Hasselt, source: Peters 2006, accent 1: no underlying tone, accent 2: low tone (cf. pp 121-123 and uzling/phoible-data#7)
nld, dialect: Maastricht, source: Gussenhoven 1999, accent 1: no tone, accent 2: high tone (cf. p 162 and uzling/phoible-data#8)
y -> j
all c-cedillas should be pre-composed. SM: double check.
line 22611:
ISO Name Dialect Phonemes Allophones Allophones Allophones Allophones Allophones gm 0050+0068+006F+006E+0065+006D+0065+0073 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
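The hex string in that row can be decoded to see what text it came from. A minimal sketch (the helper name is hypothetical) that reads the `+`-separated values as hexadecimal Unicode code points:

```python
def decode_codepoints(encoded: str) -> str:
    """Turn a '0050+0068'-style string of hex code points back into text."""
    return "".join(chr(int(part, 16)) for part in encoded.split("+"))


print(decode_codepoints("0050+0068+006F+006E+0065+006D+0065+0073"))  # prints "Phonemes"
```

The result is the column header itself, which suggests (though this is only a guess) that the header row was passed through the same glyph-to-code-point conversion as the data rows.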
The "gold standard" document lists language azb "Azerbaijani" (SPA). The current output of the aggregation script has the same data (from SPA) under ISO code azj. Which is correct? Did you change this in the gold standard doc or in the SPA raw data file at some point, @bambooforest? I couldn't find any open or closed issues mentioning those ISO codes or language name.
Additional info: according to the ISO table, azj is "North Azerbaijani" and azb is "South Azerbaijani", and there is also a macrolanguage aze.
Not sure why yet, but 343 of the ISO codes are showing more than one value in the Source column.
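To list the affected codes, one could group the aggregated rows by ISO code and collect the distinct sources. This is a sketch only: it assumes each row is a mapping with `LanguageCode` and `Source` keys (column names used elsewhere in these issues), and the sample rows are made up.

```python
from collections import defaultdict


def codes_with_multiple_sources(rows):
    """Map each LanguageCode to its sources, keeping only codes with more than one."""
    sources = defaultdict(set)
    for row in rows:
        sources[row["LanguageCode"]].add(row["Source"])
    return {code: sorted(found) for code, found in sources.items() if len(found) > 1}


# Made-up rows, purely for illustration.
rows = [
    {"LanguageCode": "aaa", "Source": "spa"},
    {"LanguageCode": "aaa", "Source": "upsid"},
    {"LanguageCode": "bbb", "Source": "ph"},
]
print(codes_with_multiple_sources(rows))
```

Whether a code with several sources is an error or an expected case for the trump ordering still has to be judged per code.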
Notes have been added, but the corresponding IPA graphs haven't been updated yet.
Add inventory of Fataluku (Papuan, East Timor).
The SAPHON raw data has two columns at the end of the phoneme block: tone and nasal harmony. They are boolean columns just like all the phoneme columns. Nasal harmony is a phonological rule, so we don't need to include it at this stage of phoible. But how should we handle their boolean "tone / no tone" distinction? Is it generally true that South American languages that have tone all have the same number and type of tonemes?
The sounds from nmn:
dʼkxʼ : not in PDF... should be dtʼkxʼ I think? described as
prevoicing followed by sequence of two ejectives (three-part contour
seg)
d̪ʼkxʼ : not in PDF... should be d̪t̪ʼkxʼ I think? (same as above)
ɡʼkxʼ : not in PDF... I do see ɡkxʼ which ought to be a prevoiced
velar ejective (two-part contour segment: ɡ features + kxʼ features)
pʼkxʼ : this ought to be fine, should be treated as two-part contour
(the features for pʼ (bilabial ejective stop) followed by the features
for kxʼ (velar ejective affricate))
tʼkxʼ : ditto
t̪ʼkxʼ : ditto
As to why they are crashing the script... the first three might be
crashing because the ejective marker is on a voiced segment (ejectives
can't be voiced)... ?
Regarding the prenasalized ones, a contour segment with the features
you've mentioned (+nas, +son, 0delrel) sounds right to me. The rest
it sounds like you already figured out right?
Found it. I was looking in the wrong thread... it got buried in the
other thread after richard's question about "round" and "tense".
Short answer is that there is not really enough info in that resource
to really know all the phonemes. My best guesses:
Honestly I don't have time to try to puzzle out the rest of this
inventory at the moment... there's not enough information there for
me to do anything other than make wild guesses, and I'd rather focus
on the Clements scripts.
ɲɟʝ is fine
ndʑ is fine
ɲdʑ does not obey our rules for homorganic place for prenasalization.
Should check the grammars.
Both zoc and ind have two entries with the same ISO code (and the same Glottocode) and the trump value of 1.
Add the newly collected sources (UZ isolates, Iranian languages, JIPA, etc.).
Review known issues and then run through Unicode IPA checker.
I noticed on the release that it was tagged with version "v2014". I understand this was probably motivated by Martin's preferences for CLLD releases being once a year at most. However, phoible is still software, and as such I think it's worth considering, at least, whether we want a more fine-grained versioning scheme (personally I like semantic versioning). The advantage of this is that if we (or others) use phoible in publications, and want to use the most current version (because of new data or features added since last CLLD release, for example), then we can create a new intermediate release that we can reference in that publication. Thoughts?
This diacritic currently has no features associated with it, so phonemes with and without the diacritic that are otherwise identical will be featurally indistinct.
@drammock - for the data aggregation script, Forkel and I thought it made sense to generate unique IDs by taking the decimal number of each character in each glyph and concatenating them together, e.g.
pʼ == 112700
This would be more stable than my previous approach.
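A minimal sketch of that ID scheme, assuming "decimal number" means the decimal Unicode code point of each character (the function name is hypothetical):

```python
def glyph_id(glyph: str) -> str:
    """Concatenate the decimal code points of every character in the glyph."""
    return "".join(str(ord(ch)) for ch in glyph)


# p is U+0070 (112) and the ejective marker is U+02BC (700).
print(glyph_id("p\u02bc"))  # prints "112700"
```

Because the numbers are joined without a separator, two different character sequences could in principle produce the same string; a delimiter or fixed-width padding would rule that out if it ever becomes a concern.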
According to our old diacritics spreadsheet, ˤ and a̙ (without the a) are featurally synonymous (+RTR -ATR), but the modifier pharyngeal is for consonants and the combining tack is for vowels. This convention is not followed in the UPSID data, where segments like ãõ̞ˤ occur in !Xu. However, since the latter half of that segment already has a lower diacritic (combining down tack), adding a combining right tack makes it hard/impossible to read.
Maybe this is a non-issue since there seem to be some changes afoot in the UPSID data (possibly done by one of @bambooforest's UZH minions?) that include getting rid of the downtacks for vowels described simply as "mid". So two related questions: what is the story with those UPSID changes? What to do about the modifier pharyngeal on vowels in UPSID?
add phoneme conventions, etc.
perhaps add addressed decisions in separate section via the paper trail
https://github.com/uzling/phoible-data/blob/master/agg-vs-gold-mismatches.tsv
Add Gumer from Völlmin's description.
Völlmin, Sascha. forthcoming. Towards a Grammar of Gumer: Phonology and Morphology of a Western Gurage variety. PhD thesis, University of Zurich.
Also to look at:
Banksira, Degif Petros. 2000. Sound mutations. The morphophonology of Chaha. (Te 11 080)
Fix the remaining 6 missing geo-coordinates.
The allophone code in aggregate-raw-data.R is a bit fragile, given that the different data sources encode allophone information differently. This was discussed at some length here. This issue is here to remind us to revisit the allophone issue once the rest of the aggregation is stable.
Add inventories for languages Hadoti [hoj] and Bhadarwahi [bhd]:
Dwivedi, A. V. (2012). A Grammar of Hadoti. München: LINCOM EUROPA.
http://www.worldcat.org/title/grammar-of-hadoti/oclc/822017249
Dwivedi, A. V. (2013). A Grammar of Bhadarwahi. München: LINCOM EUROPA.
http://www.worldcat.org/title/agrammar-of-bhadarwahi/oclc/861960466
There are some languages that are showing duplicate phonemes. Not sure if this is caused by the denormRenorm function, or if there are errors in the source data files, or some other cause. Affects 19 languages in SAPHON, 10 languages in AA, 3 languages in RA.
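A quick way to locate the duplicates is to count phonemes per inventory after normalizing them to a single Unicode form. The sketch below is not the aggregation script's logic; it assumes rows with `Source`, `LanguageCode` and `Phoneme` keys, uses NFD as the comparison form, and runs on made-up sample rows.

```python
import unicodedata
from collections import Counter


def duplicate_phonemes(rows):
    """Return (Source, LanguageCode, phoneme) keys that occur more than once."""
    counts = Counter(
        (row["Source"], row["LanguageCode"], unicodedata.normalize("NFD", row["Phoneme"]))
        for row in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up inventory: precomposed ç and c + combining cedilla are the same phoneme.
rows = [
    {"Source": "saphon", "LanguageCode": "xxx", "Phoneme": "\u00e7"},
    {"Source": "saphon", "LanguageCode": "xxx", "Phoneme": "c\u0327"},
    {"Source": "saphon", "LanguageCode": "xxx", "Phoneme": "a"},
]
print(duplicate_phonemes(rows))
```

Running the same check on the raw files and on the aggregated output would show whether the duplicates already exist in the sources or appear during aggregation.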
Languages ahk (Akha) and pib (Yine) in the UW data have affricates where the stop and fricative parts don't match in place of articulation: tç and tçʰ
AA raw data has three different lects listed under ISO code izi:
LanguageName "ẹzaa"
LanguageName "ikwo"
LanguageName "izi"
This breaks the trump selection procedure, which currently can only handle:
the same language code (LanguageCode) but different Source
the same Source but different SpecificDialect
Any idea what's going on here @bambooforest? Can you (or one of your colleagues) examine the raw AA data and figure out what to do here? Possible solutions I can see are:
Add SpecificDialect information to each (in AA data, there is no SpecificDialect column, so this would take the form of parenthetical info in the LanguageName column)
... LanguageName field of the entry that is kept.
The palatal diacritic ʲ currently overwrites features +dorsal +high -low +front -back. There are some phonemes whose base glyphs are already palatal and yet have that diacritic, like cçʲ and cʲ, which at present are featurally indistinct from their base glyph counterparts cç and c.
Now, there are only 6 languages lacking geo coordinates:
I should also check if the language codes are the latest or have been updated in the interim.
Some weird things come out of the aggregation script that I'm discovering via the R code on multivariate variables I emailed you. For example:
filter(multivariate, coronal==FALSE)
returns:
fan
ksf
mky
sgw
thm
xbr
filter(final.data, LanguageCode=="ksf") --> 7 phonemes
filter(final.data, LanguageCode=="fan") --> 1 phoneme
filter(final.data, LanguageCode=="mky") --> 1 phoneme
filter(final.data, LanguageCode=="sgw") --> 3 phonemes
filter(final.data, LanguageCode=="thm") --> 1 phoneme
filter(final.data, LanguageCode=="xbr") --> 1 phoneme
table(final.data$LanguageCode)
shows the phoneme counts are off...
This language code [zoh] is wrong. It should be [zoc], as per WALS identification of Wonderly 1951a
http://wals.info/refdb/record/Wonderly-1951b
which would also align the SPA inventory with the UPSID inventory under the same reference.
The code change has been updated in the SPA langname codes file
https://github.com/phoible/phoible/blob/master/data/SPA/SPA_LangNamesCodes.tsv
and the aggregated and phoneme level files will be updated when they are newly generated.
phoible / data / README.md
fix me.
http://phoible.org/languages/sgw
(probably affects all Sebat Bet Gurage inventories and possibly other Ethiopian languages)
/q/ -> kʼ
/q/ -> kʲʼ
/q/ -> kʷʼ
as per Hetzron, Robert 1977.
@drammock - we also have /kʼʲ/, but conventions say to use /kʲʼ/, etc. Is there a qualitative difference that we should capture? (I'm told that in Ethiopianist tradition the ejective should come before the secondary articulation.) Either way, something needs to be fixed.
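If the decision is to keep the existing convention (secondary articulation before the ejective marker), the reordering could be sketched as below. The function name is hypothetical, and only ʲ and ʷ are treated as secondary articulations here, which is an assumption made for the example.

```python
import re

EJECTIVE = "\u02bc"          # ʼ modifier letter apostrophe
SECONDARY = "\u02b2\u02b7"   # ʲ and ʷ (assumed set, for illustration only)

# Match an ejective marker immediately followed by a secondary articulation.
_EJECTIVE_FIRST = re.compile(f"{EJECTIVE}([{SECONDARY}])")


def normalize_ejective_order(phoneme: str) -> str:
    """Move the ejective marker after a following secondary-articulation diacritic."""
    return _EJECTIVE_FIRST.sub(lambda m: m.group(1) + EJECTIVE, phoneme)


print(normalize_ejective_order("k\u02bc\u02b2"))  # kʼʲ becomes kʲʼ
```

Phonemes that already follow the convention pass through unchanged, so the function can be applied to a whole column.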
I also need to look at <hʲ> which may be better represented as <ç>. Further investigation needed.
The aggregation script doesn't pull SpecificDialect out of the SAPHON raw data.
http://phoible.org/languages/blk
should be further to the north (pc Mathias)
There should never be a standard aspiration diacritic ʰ on a base glyph that is voiced. All such instances of ʰ in the data sources should be converted to ʱ. The features table will need to be updated as well, to make sure that the revised phonemes are still getting assigned a feature vector.
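The conversion could look roughly like the sketch below. It is illustrative only: `VOICED_BASES` is a small, hypothetical and incomplete set of voiced base glyphs, and the check looks only at the first character of the phoneme, which will not be right for every contour segment.

```python
ASPIRATED = "\u02b0"         # ʰ
BREATHY_ASPIRATED = "\u02b1"  # ʱ

# Hypothetical, incomplete list of voiced base glyphs; extend from the features table.
VOICED_BASES = {"b", "d", "\u0261", "\u025f", "\u0256", "\u0262"}  # b d ɡ ɟ ɖ ɢ


def fix_voiced_aspiration(phoneme: str) -> str:
    """Replace ʰ with ʱ when the phoneme starts with a voiced base glyph."""
    if phoneme and phoneme[0] in VOICED_BASES:
        return phoneme.replace(ASPIRATED, BREATHY_ASPIRATED)
    return phoneme


print(fix_voiced_aspiration("b\u02b0"))  # bʰ becomes bʱ
```

Voiceless bases are left alone, so a segment such as cçʰ keeps its plain aspiration diacritic.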
Let me know if this doesn't work for the SPA LanguageName - to - ISO639.3 mappings, @drammock
https://github.com/phoible/phoible/blob/master/data/METADATA/phoible_index.tsv
@bambooforest check that these codes are up-to-date (from the 1241 version of phoible) -- also merge in data from inventoryids_filenames.tsv
lmn is named as Lamani on line 60 and Banjara on line 13.
iru is named as Irula on line 33 and Kasaba on line 40.
Need to determine if this is just an error of the ISO code (in which case need to correct), or if these are actually lects that are classified with the same ISO code (in which case need to figure out how to apply trump to them; easiest is probably adding parenthetical info to the LanguageName field).
Blocks resolution of #53.
Changing things like
%tʃ%
%dʒ%
to phoible conventions would require that the glyph ids are also updated
See Vydrin 2007, which contains phonemic inventories for South Mande languages (known at that time).
In particular, on p. 8, there is a vocalic inventory of Dan-Gweetaa. It should be modified: the semi-closed vowels (ɩ, ʋ, ʋ̈) are not separate phonemes but allophones of e, o, ɤ respectively under extra-high tone; the semi-closed nasal vowels (given in brackets) should also be eliminated (they are allophones, rather than phonemes). Phoneme ɒ is not necessarily long; it can be short. A third modulated tone (extrahigh-extralow) has been discovered.
I'm told there are corrections to be done on other South Mande languages as well.
Further: there are two "Dan"'s in the database: Dan (GM) and Dan (UPSID). Dan (GM) is an early and pioneering study by Bearth and Zemp. The latter seems to refer to a Liberian variety (Vydrin, pc).
The former needs to be updated to reflect what has been learned about the language. The latter contains inexactitudes (ibid).
Probably want separate columns for mixed, pidgin, creole, signed, ancient, extinct. Could also conceivably do a single column that had one of those strings or NA for each language. That would be more compact, but a bit harder to deal with if we had a language that was creole + signed, mixed + extinct, etc.
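The two layouts can be compared with a small sketch. Everything here is hypothetical: the function names, the `+` separator for combined values, and the use of the string "NA" for languages with no tag.

```python
CATEGORIES = ("mixed", "pidgin", "creole", "signed", "ancient", "extinct")


def to_flag_columns(tags):
    """One boolean column per category; combinations need no special handling."""
    unknown = set(tags) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return {category: category in tags for category in CATEGORIES}


def to_single_column(tags):
    """One compact string column; combined values must be split again before filtering."""
    return "+".join(c for c in CATEGORIES if c in tags) or "NA"


tags = {"creole", "signed"}
print(to_flag_columns(tags))
print(to_single_column(tags))
```

With separate columns, selecting all signed languages is a single boolean test, whereas the single column requires parsing values such as "creole+signed" first.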
Closes #67
the "Kxoe" language citing Christa König and Bernd Heine. 2008. A concise dictionary of Northwestern !Xun. Ruediger Koppe. > This should be ǃXun, not Kxoe/Khwe!
@bambooforest : check the language name and code