phoible / dev

PHOIBLE data and development.

Home Page: https://phoible.org/

License: GNU General Public License v3.0

R 4.99% TeX 95.01%

linguistics phonology typology

dev's Introduction

PHOIBLE

PHOIBLE is a database of phonological inventories and distinctive features, encompassing more than 3000 phonological inventories (doculects), representing more than 2100 ISO 639-3 language codes. PHOIBLE data is published in browsable form online at PHOIBLE Online, which corresponds with the most recent release of this repository.

Data in machine-readable form is available in this repository. It is not guaranteed to exactly match what is published at PHOIBLE Online, due to the occasional discovery and correction of errors, and the addition of new languages to the database. For this reason, it is recommended that you make use of the most recent release in your own analyses, rather than working from the tip of the master branch.

Documentation for PHOIBLE is hosted at http://phoible.github.io/, including notational conventions, departures from official IPA usage, citation information, etc.

How to use this repository

Most people will not need to look beyond the data folder of this repository, which contains a phoneme-level data file (one row per languoid-phoneme pair) and a BibTeX file of all the data sources. The rest of the repo contains scripts used in the development and testing of PHOIBLE, such as code to aggregate the raw data files from the various donor databases. These are probably not of general interest or utility. The raw-data directory contains the raw data from the various donor data sources, as well as the feature mapping tables. This is also probably not what you want, so if in doubt, stick to the data directory.
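Because the phoneme-level file has one row per languoid-phoneme pair, inventory sizes can be recovered with a simple group-and-count. The following is a minimal Python sketch on an inline sample rather than the real file; the column names `InventoryID`, `LanguageName` and `Phoneme` and the sample rows are assumptions for illustration and should be checked against the actual header.

```python
import csv
import io
from collections import Counter

# Inline stand-in for the phoneme-level file. The column names and rows are
# assumptions for illustration, not copied from the real data file.
SAMPLE = """InventoryID,LanguageName,Phoneme
1,Lect A,p
1,Lect A,t
1,Lect A,a
2,Lect B,k
2,Lect B,i
"""

def inventory_sizes(handle):
    """Count rows (phonemes) per inventory in a one-row-per-phoneme table."""
    return Counter(row["InventoryID"] for row in csv.DictReader(handle))

sizes = inventory_sizes(io.StringIO(SAMPLE))
print(dict(sizes))
```

To run this against a real file, pass an open file handle instead of the `io.StringIO` wrapper.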

Citing PHOIBLE

If you are citing the database as a whole, or making use of the phonological distinctive feature systems in PHOIBLE, please cite as follows:

Moran, Steven & McCloy, Daniel (eds.) 2019. PHOIBLE 2.0. Jena: Max Planck Institute for the Science of Human History. (Available online at http://phoible.org). DOI: 10.5281/zenodo.2626687

If you are citing phoneme inventory data for a particular language or languages, please use the name of the language as the title, and include the original data source as an element within PHOIBLE. For example:

UCLA Phonological Segment Inventory Database. 2019. Lelemi sound inventory (UPSID). In: Moran, Steven & McCloy, Daniel (eds.) PHOIBLE 2.0. Jena: Max Planck Institute for the Science of Human History. (Available online at http://phoible.org/inventories/view/441)

If you are using the raw data from this repository but are not using a labeled release, we recommend citing using the last commit hash at the time of your most recent cloning/forking of the repository, so that others can reproduce your work starting from the same snapshot of the repository that you are using. For example:

Moran, Steven & McCloy, Daniel (eds.) 2019. PHOIBLE. https://github.com/phoible/phoible/commit/444a46c9a94641d6c99f5c8bbe85b8ae1c6ce65f

History

PHOIBLE was originally developed as an SQL database and RDF knowledgebase for Moran’s dissertation, which explains many of the technical details and developmental challenges:

Moran, Steven. 2012. Phonetics Information Base and Lexicon. PhD thesis, University of Washington. http://hdl.handle.net/1773/22452

Here is a brief list of some publications that we have used PHOIBLE data for:

Blasi, Damián, Steven Moran, Scott Moisik, Paul Widmer, Dan Dediu, & Balthasar Bickel. 2019. Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363(6432), eaav3218. doi:10.1126/science.aav3218
Cysouw, Michael, Dan Dediu and Steven Moran. 2012. Still No Evidence for an Ancient Language Expansion from Africa. Science, 335(6069):657. doi:10.1126/science.1208841
Moran, Steven. 2012. Using Linked Data to Create a Typological Knowledge Base. In Christian Chiarcos, Sebastian Nordhoff and Sebastian Hellmann (eds), Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata. Springer, Heidelberg.
Moran, Steven, Daniel McCloy, and Richard Wright. 2012. Revisiting Population Size vs. Phoneme Inventory Size. Language 88(4): 877-893. doi:10.1353/lan.2012.0087
Moran, Steven and Damián Blasi. 2014. Cross-linguistic Comparison of Complexity Measures in Phonological Systems. In Frederick J. Newmeyer and Laurel Preston (eds), Measuring Grammatical Complexity. Oxford UP, Oxford.

A more complete list of research papers using PHOIBLE can be found on Google Scholar.

dev's Issues

SAPHON specific dialect

the aggregation script doesn't pull SpecificDialect out of the SAPHON raw data.

Update language codes

http://www.language-archives.org/checks/phoible.org

Old issue to close

The sounds from nmn:
dʼkxʼ : not in PDF... should be dtʼkxʼ I think? described as prevoicing followed by sequence of two ejectives (three-part contour seg)
d̪ʼkxʼ : not in PDF... should be d̪t̪ʼkxʼ I think? (same as above)
ɡʼkxʼ : not in PDF... I do see ɡkxʼ which ought to be a prevoiced velar ejective (two-part contour segment: ɡ features + kxʼ features)
pʼkxʼ : this ought to be fine, should be treated as two-part contour (the features for pʼ (bilabial ejective stop) followed by the features for kxʼ (velar ejective affricate))
tʼkxʼ : ditto
t̪ʼkxʼ : ditto
As to why they are crashing the script... the first three might be crashing because the ejective marker is on a voiced segment (ejectives can't be voiced)... ?

Regarding the prenasalized ones, a contour segment with the features you've mentioned (+nas, +son, 0delrel) sounds right to me. The rest it sounds like you already figured out, right?

Found it. I was looking in the wrong thread... it got buried in the other thread after richard's question about "round" and "tense". Short answer is that there is not really enough info in that resource to really know all the phonemes. My best guesses:

treat fortis stops as in !Xoo, namely as a prevoiced aspirated stop like dtʰ or ɡkʰ
use unicode point 203c (‼) for the retroflex click
the square brackets in the chart on page 5 indicate the IPA interpretation of the orthography, in cases where the authors thought it was unclear.
the complex stops are a piece of work... sighting down the EGR-AL column on page 5, here is my best guess at what I see:
l t d dtʰ tʼ tʰ d̤ tx dx

Honestly I don't have time to try to puzzle out the rest of this inventory at the moment... there's not enough information there for me to do anything other than make wild guesses, and I'd rather focus on the Clements scripts.

ɲɟʝ is fine

ndʑ is fine

ɲdʑ does not obey our rules for homorganic place for prenasalization.

Should check the grammars.

How to represent pitch accents glyphically?

This issue is for keeping track of inventories that are known to have pitch accents among their tonemes, and how the pitch accents were encoded. It should not be considered exhaustive.

nld, dialect: Hasselt, source: Peters 2006, accent 1: no underlying tone, accent 2: low tone (cf. pp 121-123 and uzling/phoible-data#7)
nld, dialect: Maastricht, source: Gussenhoven 1999, accent 1: no tone, accent 2: high tone (cf. p 162 and uzling/phoible-data#8)

markup screwed up in README.md

phoible / data / README.md

fix me.

add conventions text file to root level

add phoneme conventions, etc.

perhaps add addressed decisions in separate section via the paper trail

https://github.com/uzling/phoible-data/blob/master/agg-vs-gold-mismatches.tsv

add indicator columns for unusual language types

Probably want separate columns for mixed, pidgin, creole, signed, ancient, extinct. Could also conceivably do a single column that had one of those strings or NA for each language. That would be more compact, but a bit harder to deal with if we had a language that was creole + signed, mixed + extinct, etc.
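The trade-off between the two layouts can be made concrete with a short Python sketch. The `+` separator for the single-column variant and the function names are assumptions for illustration; nothing here reflects a decision recorded in the repository.

```python
LANGUAGE_TYPES = ["mixed", "pidgin", "creole", "signed", "ancient", "extinct"]

def indicator_columns(types):
    """Layout 1: one boolean column per unusual language type."""
    return {name: name in types for name in LANGUAGE_TYPES}

def single_column(types):
    """Layout 2: one compact column, 'NA' or the types joined with '+'.

    The '+' separator is a hypothetical choice for combined types.
    """
    ordered = [name for name in LANGUAGE_TYPES if name in types]
    return "+".join(ordered) if ordered else "NA"

language = {"creole", "signed"}
print(indicator_columns(language))
print(single_column(language))
```

With indicator columns, a combined type such as creole + signed needs no special handling; with the single column, every consumer has to split the string before filtering.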

Closes #67

New style glyph IDs

@drammock - for the data aggregation script, Forkel and I thought it made sense to generate unique IDs by taking the decimal number of each character in each glyph and concatenating them together, e.g.

pʼ == 112700

This would be more stable than my previous approach.
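A minimal Python sketch of the proposed scheme (the function name `glyph_id` is hypothetical; the aggregation script itself is written in R):

```python
def glyph_id(glyph):
    """Concatenate the decimal code point of each character in the glyph."""
    return "".join(str(ord(char)) for char in glyph)

# p is U+0070 (decimal 112); the modifier letter apostrophe is U+02BC (decimal 700).
print(glyph_id("p\u02bc"))
```

One caveat: without a separator between code points, the mapping is not guaranteed to be reversible for arbitrary Unicode strings, so the IDs are best treated as opaque keys.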

Add new inventory data

Add the newly collected sources (UZ isolates, Iranian languages, JIPA, etc.).

Review known issues and then run through Unicode IPA checker.

Fix remaining 6 geo-coordinates

Fix the remaining 6 missing geo-coordinates.

Duplicate languages `lmn` and `iru` in Ramaswami raw data

lmn is named as Lamani on line 60 and Banjara on line 13.

iru is named as Irula on line 33 and Kasaba on line 40.

Need to determine if this is just an error of the ISO code (in which case need to correct), or if these are actually lects that are classified with the same ISO code (in which case need to figure out how to apply trump to them; easiest is probably adding parenthetical info to the LanguageName field).

Blocks resolution of #53.
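A check of this kind can be automated. The Python sketch below groups language names by ISO code and reports codes that carry more than one name; the tuple layout is an assumption, and the `knn` row (including its line number) is a made-up control, not taken from the raw file.

```python
from collections import defaultdict

# (line_number, iso_code, language_name); the lmn and iru rows mirror the
# report above, the knn row is a hypothetical control.
ROWS = [
    (13, "lmn", "Banjara"),
    (33, "iru", "Irula"),
    (40, "iru", "Kasaba"),
    (60, "lmn", "Lamani"),
    (47, "knn", "Konkani"),
]

def codes_with_multiple_names(rows):
    """Map each ISO code that appears under several names to those names."""
    names = defaultdict(set)
    for _line, code, name in rows:
        names[code].add(name)
    return {code: sorted(found) for code, found in names.items() if len(found) > 1}

print(codes_with_multiple_names(ROWS))
```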

Check diacritic consistency

1747 Ramaswami knn Konkani 1 64141 2856 d̤z consonant c-d-c 3
1318 CASL cce Copi 1 46057 2509 d̤z̤ consonant c-d-c-d 4
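The last two fields of each row look like a per-character class pattern and a character count. Assuming `c` marks a base character and `d` a diacritic (an interpretation, not something stated in the issue), the patterns can be recomputed from the Unicode categories of the characters. This Python sketch only covers consonant glyphs and treats combining marks and modifier letters as diacritics.

```python
import unicodedata

def glyph_pattern(glyph):
    """Label each character 'd' (combining mark or modifier letter) or 'c' (base)."""
    classes = []
    for char in glyph:
        category = unicodedata.category(char)
        is_diacritic = category.startswith("M") or category == "Lm"
        classes.append("d" if is_diacritic else "c")
    return "-".join(classes)

# U+0324 is COMBINING DIAERESIS BELOW (breathy voice).
print(glyph_pattern("d\u0324z"), len("d\u0324z"))
print(glyph_pattern("d\u0324z\u0324"), len("d\u0324z\u0324"))
```

Comparing patterns for near-identical segments is one way to surface cases where a diacritic is marked on only part of an affricate.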

Missing geo-coordinates

Now, there are only 6 languages lacking geo coordinates:

Pisamira (ID 2143)
Yãroamë of Serra do Pacu/Ajarani (ID 2150)
Arara do Acre (ID 1999)
Parkateje (ID 1970)
Günün Yajich (ID 1906)
Dinka (ID 1398)

I should also check if the language codes are the latest or have been updated in the interim.

Add Mandinka

of Guinea, source from Vydrin.

Also a good source:

http://www.deniscreissels.fr/public/Creissels-lexique_mandinka_2012.pdf

Sebat Bet Gurage -- transcription mistakes

http://phoible.org/languages/sgw

(probably affects all Sebat Bet Gurage inventories and possibly other Ethiopian languages)

/q/ -> kʼ
/q/ -> kʲʼ
/q/ -> kʷʼ

as per Hetzron, Robert 1977.

@drammock - we also have /kʼʲ/, but conventions say to use /kʲʼ/, etc. Is there a qualitative difference that we should capture? (I'm told that in Ethiopianist tradition the ejective should come before the secondary articulation.) Either way, something needs to be fixed.

I also need to look at <hʲ> which may be better represented as <ç>. Further investigation needed.

Glyphs IDs out-of-line

Changing things like

%tʃ%
%dʒ%

to phoible conventions would require that the glyph ids are also updated

Fix GM README

There are some remarks in there that don't seem to be true (e.g., about having merged the Africa and SE Asia data into one source file). Fix after #28 is merged.

make phoible an R package?

Something to think about for the long-term. It is possible to release R packages that are data sets only. It would be a nice distribution model (worldwide mirrors) with a built-in versioning and upgrade mechanism. We could also release some purpose-built functions (trump, feature reduction, etc) along with it. Thoughts?

Azerbaijani ISO code

The "gold standard" document lists language azb "Azerbaijani" (SPA). The current output of the aggregation script has the same data (from SPA) under ISO code azj. Which is correct? Did you change this in the gold standard doc or in the SPA raw data file at some point, @bambooforest? I couldn't find any open or closed issues mentioning those ISO codes or language name.

Additional info: according to the ISO table, azj is "North Azerbaijani" and azb is "South Azerbaijani", and there is also a macrolanguage aze.

Check sources: affricate place mismatch

Languages ahk (Akha) and pib (Yine) in the UW data have affricates where the stop and fricative parts don't match in place of articulation: tç and tçʰ

trump reduction not working

not sure why yet, but 343 of the ISO codes are showing more than one value in the Source column.
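For reference, the expected post-condition can be expressed as a small check. The Python sketch below is not the repository's R implementation; it assumes a numeric `Trump` column where the lowest value wins, and the language codes and rows are placeholders.

```python
def trump_reduce(rows):
    """Keep, for each LanguageCode, only the rows with the lowest Trump value."""
    best = {}
    for row in rows:
        code = row["LanguageCode"]
        best[code] = min(best.get(code, row["Trump"]), row["Trump"])
    return [row for row in rows if row["Trump"] == best[row["LanguageCode"]]]

def sources_per_code(rows):
    """Collect the set of Source values seen for each LanguageCode."""
    found = {}
    for row in rows:
        found.setdefault(row["LanguageCode"], set()).add(row["Source"])
    return found

# Placeholder rows; "code1" and "code2" are not real ISO codes.
ROWS = [
    {"LanguageCode": "code1", "Source": "spa", "Trump": 1, "Phoneme": "p"},
    {"LanguageCode": "code1", "Source": "upsid", "Trump": 2, "Phoneme": "p"},
    {"LanguageCode": "code2", "Source": "gm", "Trump": 1, "Phoneme": "k"},
]

reduced = trump_reduce(ROWS)
leftover = {c: s for c, s in sources_per_code(reduced).items() if len(s) > 1}
print(leftover)  # an empty dict means every code has exactly one Source
```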

feature vectors for some phonemes not getting properly merged

out of 2211 phonemes, 221 are not merging properly, although at least some of them are definitely present in feature table.

Wari

y -> j

Wrong language

The "Kxoe" language cites Christa König and Bernd Heine. 2008. A concise dictionary of Northwestern !Xun. Ruediger Koppe. This should be ǃXun, not Kxoe/Khwe!

@bambooforest : check the language name and code

wrong geo-location

http://phoible.org/languages/blk

should be further to the north (pc Mathias)

aspiration in source data

There should never be a standard aspiration diacritic ʰ on a base glyph that is voiced. All such instances of ʰ in the data sources should be converted to ʱ. The features table will need to be updated as well, to make sure that the revised phonemes are still getting assigned a feature vector.
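The conversion rule can be sketched in Python as follows. The set of voiced base glyphs is a deliberately tiny illustrative list, not PHOIBLE's own, and "base glyph" is taken here to mean the nearest preceding character that is neither a combining mark nor a modifier letter (an assumption).

```python
import unicodedata

ASPIRATION = "\u02b0"          # ʰ
BREATHY_ASPIRATION = "\u02b1"  # ʱ
# Illustrative subset only; a real list would come from the feature table.
VOICED_BASES = {"b", "d", "g", "\u0261", "v", "z"}

def is_base(char):
    """True for characters that are not combining marks or modifier letters."""
    category = unicodedata.category(char)
    return not category.startswith("M") and category != "Lm"

def fix_aspiration(phoneme):
    """Replace ʰ with ʱ when the nearest preceding base glyph is voiced."""
    fixed = []
    last_base = None
    for char in phoneme:
        if char == ASPIRATION and last_base in VOICED_BASES:
            fixed.append(BREATHY_ASPIRATION)
            continue
        if is_base(char):
            last_base = char
        fixed.append(char)
    return "".join(fixed)

print(fix_aspiration("b\u02b0"))   # voiced base: converted
print(fix_aspiration("dt\u02b0"))  # nearest base is t: left alone
```

Any phoneme rewritten this way still needs a matching row in the features table, so the output of such a pass is worth diffing against the table's phoneme list.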

Allophone handling

The allophone code in aggregate-raw-data.R is a bit fragile, given that the different data sources encode allophone information differently. This was discussed at some length here. This issue is here to remind us to revisit the allophone issue once the rest of the aggregation is stable.

Language code update

http://phoible.org/languages/daf

has been split into Dan [dnj] and Kla-Dan [lda]

http://www-01.sil.org/iso639-3/documentation.asp?id=daf

Nasal release vs. pre-stopped nasals

S. Moisik points out that our feature set probably can't distinguish between a pre-stopped nasal and a stop with nasal release. This is possibly even a within-language contrast in Wolof? Will need to think about how to distinguish these segment types featurally.

SAPHON questionable graphemes

Notes have been added, but the corresponding IPA graphs haven't been updated yet.

features: palatal diacritic

palatal diacritic ʲ currently overwrites features +dorsal +high -low +front -back. There are some phonemes whose base glyphs are already palatal and yet have that diacritic, like cçʲ and cʲ which at present are featurally indistinct from their base glyph counterparts cç and c.
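The effect is easy to reproduce in miniature. In this Python sketch the diacritic's overrides are the five values named above, while the base feature vectors are partial and illustrative, not copied from the PHOIBLE feature table.

```python
# The values the palatal diacritic writes, as listed in this issue.
PALATAL_OVERRIDES = {"dorsal": "+", "high": "+", "low": "-", "front": "+", "back": "-"}

def apply_palatal(base_features):
    """Return a copy of the base features with the palatal values written over it."""
    merged = dict(base_features)
    merged.update(PALATAL_OVERRIDES)
    return merged

# Illustrative partial vectors (assumed values).
PALATAL_STOP = {"dorsal": "+", "high": "+", "low": "-", "front": "+", "back": "-", "continuant": "-"}
ALVEOLAR_STOP = {"dorsal": "-", "high": "0", "low": "0", "front": "0", "back": "0", "continuant": "-"}

print(apply_palatal(PALATAL_STOP) == PALATAL_STOP)    # already palatal: no change
print(apply_palatal(ALVEOLAR_STOP) == ALVEOLAR_STOP)  # non-palatal base: changed
```

Because overwriting is idempotent on a base that already carries those values, distinguishing cʲ from c would need either an additional feature or a different treatment of the diacritic on palatal bases.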

Duplicate languages `izi` and `xsm` in AA raw data

AA raw data has three different lects listed under ISO code izi:

Lines 2480-2517 are LanguageName "ẹzaa"
Lines 3716-3755 are LanguageName "ikwo"
Lines 3757-3826 are LanguageName "izi"

This breaks the trump selection procedure, which currently can only handle:

same ISO code (LanguageCode) but different Source
same ISO code and same Source but different SpecificDialect

Any idea what's going on here @bambooforest? Can you (or one of your colleagues) examine the raw AA data and figure out what to do here? Possible solutions I can see are:

keep all three, but add SpecificDialect information to each (in AA data, there is no SpecificDialect column, so this would take the form of parenthetical info in the LanguageName column)
remove two of them, if the three are genuine duplicates (no differences in inventory), and incorporate the alternate names into the LanguageName field of the entry that is kept.

aggregation script aggregating header row from somewhere

line 22611:

ISO Name Dialect Phonemes Allophones Allophones Allophones Allophones Allophones gm 0050+0068+006F+006E+0065+006D+0065+0073 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
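The long field in that row reads like a plus-joined list of hexadecimal code points. Assuming that format, a short Python sketch (the function name is hypothetical) decodes it and shows that the "glyph" is the header text itself, which supports the reading that a header row was processed as data.

```python
def decode_code_points(field):
    """Decode a '+'-joined list of hexadecimal code points into a string."""
    return "".join(chr(int(part, 16)) for part in field.split("+"))

print(decode_code_points("0050+0068+006F+006E+0065+006D+0065+0073"))
```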

semantic versioning

I noticed on the release that it was tagged with version "v2014". I understand this was probably motivated by Martin's preferences for CLLD releases being once a year at most. However, phoible is still software, and as such I think it's worth considering, at least, whether we want a more fine-grained versioning scheme (personally I like semantic versioning). The advantage of this is that if we (or others) use phoible in publications, and want to use the most current version (because of new data or features added since last CLLD release, for example), then we can create a new intermediate release that we can reference in that publication. Thoughts?

SPA LanguageNames to ISO639.3 codes

Let me know if this doesn't work for the SPA LanguageName - to - ISO639.3 mappings, @drammock

https://github.com/phoible/phoible/blob/master/data/METADATA/phoible_index.tsv

@bambooforest check that these codes are up-to-date (from the 1241 version of phoible) -- also merge in data from inventoryids_filenames.tsv

aggregating script dropping data?

some weird things come out of the aggregation script that i'm discovering via the R code on multivariate variables i emailed you. for example:

filter(multivariate, coronal==FALSE)

returns:

fan
ksf
mky
sgw
thm
xbr

filter(final.data, LanguageCode=="ksf") --> 7 phonemes
filter(final.data, LanguageCode=="fan") --> 1 phoneme
filter(final.data, LanguageCode=="mky") --> 1 phoneme
filter(final.data, LanguageCode=="sgw") --> 3 phonemes
filter(final.data, LanguageCode=="thm") --> 1 phoneme
filter(final.data, LanguageCode=="xbr") --> 1 phoneme

table(final.data$LanguageCode)

shows the phoneme counts are off...

Languages with duplicate phonemes

There are some languages that are showing duplicate phonemes. Not sure if this is caused by the denormRenorm function, or if there are errors in the source data files, or some other cause. Affects 19 languages in SAPHON, 10 languages in AA, 3 languages in RA.
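One way to narrow down the cause is to look for duplicates both before and after Unicode normalization. The Python sketch below uses NFD as a stand-in; whether this corresponds to what denormRenorm does is an assumption, and the rows are invented.

```python
import unicodedata
from collections import Counter

def duplicate_phonemes(rows, normalize=True):
    """Return (source, code, phoneme) keys that occur more than once."""
    keys = []
    for source, code, phoneme in rows:
        if normalize:
            phoneme = unicodedata.normalize("NFD", phoneme)
        keys.append((source, code, phoneme))
    return sorted(key for key, count in Counter(keys).items() if count > 1)

# Invented rows: the same vowel encoded precomposed and decomposed.
ROWS = [
    ("src", "xxx", "\u00e3"),   # precomposed a with tilde
    ("src", "xxx", "a\u0303"),  # a followed by COMBINING TILDE
    ("src", "xxx", "p"),
]

print(duplicate_phonemes(ROWS, normalize=False))  # raw strings differ
print(duplicate_phonemes(ROWS))                   # identical after NFD
```

Duplicates that appear only after normalization point at encoding differences in the source files; duplicates present in the raw strings point at the source data itself.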

Updates / additions to Mande languages

See Vydrin 2007, which contains phonemic inventories for South Mande languages (known at that time).

In particular, on p. 8 there is a vocalic inventory of Dan-Gweetaa. It should be modified: the semi-closed vowels (ɩ, ʋ, ʋ̈) are not separate phonemes but allophones of e, o, ɤ respectively under extra-high tone; the semi-closed nasal vowels (given in brackets) should also be eliminated (they are allophones, rather than phonemes). Phoneme ɒ is not necessarily long; it can be short. A third modulated tone (extrahigh-extralow) has been discovered.

I'm told there are corrections to be done on other South Mande languages as well.

Further: there are two "Dan" entries in the database: Dan (GM) and Dan (UPSID). Dan (GM) is an early and pioneering study by Bearth and Zemp. The latter seems to refer to a Liberian variety (Vydrin, pc).

The former needs to be updated to reflect what has been learned about the language. The latter contains inexactitudes (ibid).

SPA zoc -> zoh

This language code [zoh] is wrong. It should be [zoc], as per WALS identification of Wonderly 1951a

http://wals.info/refdb/record/Wonderly-1951b

which would also align the SPA inventory with the UPSID inventory under the same reference.

The code has been updated in the SPA langname codes file

https://github.com/phoible/phoible/blob/master/data/SPA/SPA_LangNamesCodes.tsv

and the aggregated and phoneme level files will be updated when they are newly generated.

Fix two languages that both have same trump order

Both zoc and ind have two entries with the same ISO code (and the same Glottocode) and the trump value of 1.
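Ties like these can be listed mechanically. In the Python sketch below the inventory labels are placeholders and the `nld` rows are a hypothetical control; only the zoc and ind ties come from the report above.

```python
from collections import Counter

def trump_ties(entries):
    """Return codes where more than one inventory shares the same trump value."""
    counts = Counter((code, trump) for code, _inventory, trump in entries)
    return sorted({code for (code, _trump), count in counts.items() if count > 1})

# (iso_code, inventory_label, trump_value); labels are placeholders.
ENTRIES = [
    ("zoc", "inventory A", 1),
    ("zoc", "inventory B", 1),
    ("ind", "inventory C", 1),
    ("ind", "inventory D", 1),
    ("nld", "inventory E", 1),
    ("nld", "inventory F", 2),
]

print(trump_ties(ENTRIES))
```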

features: combining plus sign below

This diacritic currently has no features associated with it, so phonemes with and without the diacritic that are otherwise identical will be featurally indistinct.

Glyphs: modifier turned glottal stop vs combining right tack

According to our old diacritics spreadsheet, ˤ and a̙ (without the a) are featurally synonymous (+RTR -ATR), but the modifier pharyngeal is for consonants and the combining tack is for vowels. This convention is not followed in the UPSID data, where segments like ãõ̞ˤ occur in !Xu. However, since the latter half of that segment already has a lower diacritic (combining down tack), adding a combining right tack makes it hard/impossible to read.

Maybe this is a non-issue since there seem to be some changes afoot in the UPSID data (possibly done by one of @bambooforest's UZH minions?) that include getting rid of the downtacks for vowels described simply as "mid". So two related questions: what is the story with those UPSID changes? What to do about the modifier pharyngeal on vowels in UPSID?

NAs in Allophones

Not sure if this is an issue @drammock, but I'm in the middle of the aspiration cleaning so wanted to file a quick issue so as not to forget what I came across (Source, LanguageCode, Phoneme, Allophone):

ra awa cçʰ NA
ra ben cçʰ NA
ra bft cçʰ NA
ra bkk cçʰ NA
ra bns cçʰ NA
ra cdn cçʰ NA
gm xuu bʰ bʰ
ph ahk cçʰ cçʰ
spa aka cçʰ cçʰ

Add Gumer

Add Gumer from Völlmin's description.

Völlmin, Sascha. forthcoming. Towards a Grammar of Gumer: Phonology and Morphology of a Western Gurage variety. PhD thesis, University of Zurich.

Also to look at:

Banksira, Degif Petros. 2000. Sound mutations. The morphophonology of Chaha. (Te 11 080)

SAPHON languages missing tone information

The SAPHON raw data has two columns at the end of the phoneme block: tone and nasal harmony. They are boolean columns just like all the phoneme columns. Nasal harmony is a phonological rule, so we don't need to include it at this stage of phoible. But how should we handle their boolean "tone / no tone" distinction? Is it generally true that South American languages that have tone all have the same number and type of tonemes?