dev's Introduction

PHOIBLE

PHOIBLE is a database of phonological inventories and distinctive features, encompassing more than 3000 phonological inventories (doculects), representing more than 2100 ISO 639-3 language codes. PHOIBLE data is published in browsable form online at PHOIBLE Online, which corresponds with the most recent release of this repository.

Data in machine-readable form is available in this repository. It is not guaranteed to exactly match what is published at PHOIBLE Online, due to the occasional discovery and correction of errors, and the addition of new languages to the database. For this reason, it is recommended that you make use of the most recent release in your own analyses, rather than working from the tip of the master branch.

Documentation for PHOIBLE is hosted at http://phoible.github.io/, including notational conventions, departures from official IPA usage, citation information, etc.

How to use this repository

Most people will not need to look beyond the data folder of this repository, which contains a phoneme-level data file (one row per languoid-phoneme pair) and a BibTeX file of all the data sources. The rest of the repo contains scripts used in the development and testing of PHOIBLE, such as code to aggregate the raw data files from the various donor databases. These are probably not of general interest or utility. The raw-data directory contains the raw data from the various donor data sources, as well as the feature mapping tables. This is also probably not what you want, so if in doubt, stick to the data directory.
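As a rough illustration of working with a one-row-per-phoneme table, the sketch below counts phonemes per inventory. It reads a tiny inline sample rather than the real file, and the column names (InventoryID, LanguageCode, Phoneme) are assumptions made for the example; check the header of the actual data file before adapting it.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the phoneme-level data file.
# Column names and rows are illustrative assumptions, not the real header or data.
SAMPLE = """InventoryID,LanguageCode,Phoneme
1,aaa,a
1,aaa,k
1,aaa,m
2,bbb,a
2,bbb,t
"""


def inventory_sizes(handle):
    """Count rows (phonemes) per inventory in a one-row-per-phoneme table."""
    reader = csv.DictReader(handle)
    return Counter(row["InventoryID"] for row in reader)


sizes = inventory_sizes(io.StringIO(SAMPLE))
for inventory_id, n_phonemes in sorted(sizes.items()):
    print(inventory_id, n_phonemes)
```

To run this against a real file, pass an open file handle (opened with encoding="utf-8") instead of the io.StringIO sample.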

Citing PHOIBLE

If you are citing the database as a whole, or making use of the phonological distinctive feature systems in PHOIBLE, please cite as follows:

Moran, Steven & McCloy, Daniel (eds.) 2019. PHOIBLE 2.0. Jena: Max Planck Institute for the Science of Human History. (Available online at http://phoible.org). DOI: 10.5281/zenodo.2626687

If you are citing phoneme inventory data for a particular language or languages, please use the name of the language as the title, and include the original data source as an element within PHOIBLE. For example:

UCLA Phonological Segment Inventory Database. 2019. Lelemi sound inventory (UPSID). In: Moran, Steven & McCloy, Daniel (eds.) PHOIBLE 2.0. Jena: Max Planck Institute for the Science of Human History. (Available online at http://phoible.org/inventories/view/441)

If you are using the raw data from this repository but are not using a labeled release, we recommend citing using the last commit hash at the time of your most recent cloning/forking of the repository, so that others can reproduce your work starting from the same snapshot of the repository that you are using. For example:

Moran, Steven & McCloy, Daniel (eds.) 2019. PHOIBLE. https://github.com/phoible/phoible/commit/444a46c9a94641d6c99f5c8bbe85b8ae1c6ce65f

History

PHOIBLE was originally developed as an SQL database and RDF knowledgebase for Moran’s dissertation, which explains many of the technical details and developmental challenges:

Moran, Steven. 2012. Phonetics Information Base and Lexicon. PhD thesis, University of Washington. http://hdl.handle.net/1773/22452

Here is a brief list of some publications that we have used PHOIBLE data for:

  • Blasi, Damián, Steven Moran, Scott Moisik, Paul Widmer, Dan Dediu, & Balthasar Bickel. 2019. Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363(6432), eaav3218. doi:10.1126/science.aav3218

  • Cysouw, Michael, Dan Dediu and Steven Moran. 2012. Still No Evidence for an Ancient Language Expansion from Africa. Science, 335(6069):657. doi:10.1126/science.1208841

  • Moran, Steven. 2012. Using Linked Data to Create a Typological Knowledge Base. In Christian Chiarcos, Sebastian Nordhoff and Sebastian Hellmann (eds), Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata. Springer, Heidelberg.

  • Moran, Steven, Daniel McCloy, and Richard Wright. 2012. Revisiting Population Size vs. Phoneme Inventory Size. Language 88(4): 877-893. doi:10.1353/lan.2012.0087

  • Moran, Steven and Damián Blasi. 2014. Cross-linguistic Comparison of Complexity Measures in Phonological Systems. In Frederick J. Newmeyer and Laurel Preston (eds), Measuring Grammatical Complexity. Oxford UP, Oxford.

A more complete list of research papers using PHOIBLE can be found on Google Scholar.

dev's Issues

Old issue to close

The sounds from nmn:
dʼkxʼ : not in PDF... should be dtʼkxʼ I think? described as
prevoicing followed by sequence of two ejectives (three-part contour
seg)
d̪ʼkxʼ : not in PDF... should be d̪t̪ʼkxʼ I think? (same as above)
ɡʼkxʼ : not in PDF... I do see ɡkxʼ which ought to be a prevoiced
velar ejective (two-part contour segment: ɡ features + kxʼ features)
pʼkxʼ : this ought to be fine, should be treated as two-part contour
(the features for pʼ (bilabial ejective stop) followed by the features
for kxʼ (velar ejective affricate))
tʼkxʼ : ditto
t̪ʼkxʼ : ditto
As to why they are crashing the script... the first three might be
crashing because the ejective marker is on a voiced segment (ejectives
can't be voiced)... ?

Regarding the prenasalized ones, a contour segment with the features
you've mentioned (+nas, +son, 0delrel) sounds right to me. The rest
it sounds like you already figured out right?

Found it. I was looking in the wrong thread... it got buried in the
other thread after richard's question about "round" and "tense".
Short answer is that there is not really enough info in that resource
to really know all the phonemes. My best guesses:

  1. treat fortis stops as in !Xoo, namely as a prevoiced aspirated stop
    like dtʰ or ɡkʰ
  2. use unicode point 203c (‼) for the retroflex click
  3. the square brackets in the chart on page 5 indicate the IPA
    interpretation of the orthography, in cases where the authors thought
    it was unclear.
  4. the complex stops are a piece of work... sighting down the EGR-AL
    column on page 5, here is my best guess at what I see:
    l t d dtʰ tʼ tʰ d̤ tx dx

Honestly I don't have time to try to puzzle out the rest of this
inventory at the moment... there's not enough information there for
me to do anything other than make wild guesses, and I'd rather focus
on the Clements scripts.

ɲɟʝ is fine

ndʑ is fine

ɲdʑ does not obey our rules for homorganic place for prenasalization.

Should check the grammars.

How to represent pitch accents glyphically?

This issue is for keeping track of inventories that are known to have pitch accents among their tonemes, and how the pitch accents were encoded. It should not be considered exhaustive.

  • nld, dialect: Hasselt, source: Peters 2006, accent 1: no underlying tone, accent 2: low tone (cf. pp 121-123 and uzling/phoible-data#7)
  • nld, dialect: Maastricht, source: Gussenhoven 1999, accent 1: no tone, accent 2: high tone (cf. p 162 and uzling/phoible-data#8)

add indicator columns for unusual language types

Probably want separate columns for mixed, pidgin, creole, signed, ancient, extinct. Could also conceivably do a single column that had one of those strings or NA for each language. That would be more compact, but a bit harder to deal with if we had a language that was creole + signed, mixed + extinct, etc.

Closes #67

New style glyph IDs

@drammock - for the data aggregation script, Forkel and I thought it made sense to generate unique IDs by taking the decimal number of each character in each glyph and concatenating them together, e.g.

pʼ == 112700

This would be more stable than my previous approach.
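A minimal sketch of that ID scheme, assuming the ID is kept as a string and that no separator is placed between the per-character numbers (the function name glyph_id is made up for this example):

```python
def glyph_id(glyph: str) -> str:
    """Concatenate the decimal code point of each character in the glyph."""
    return "".join(str(ord(ch)) for ch in glyph)


# p is U+0070 (decimal 112); the ejective mark is U+02BC (decimal 700).
print(glyph_id("p\u02bc"))  # prints 112700
```

Because there is no separator, two different character sequences could in principle produce the same digit string; whether that can happen for the glyphs actually in use is not checked here.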

Add new inventory data

Add the newly collected sources (UZ isolates, Iranian languages, JIPA, etc.).

Review known issues and then run through Unicode IPA checker.

Duplicate languages `lmn` and `iru` in Ramaswami raw data

lmn is named as Lamani on line 60 and Banjara on line 13.

iru is named as Irula on line 33 and Kasaba on line 40.

Need to determine if this is just an error of the ISO code (in which case need to correct), or if these are actually lects that are classified with the same ISO code (in which case need to figure out how to apply trump to them; easiest is probably adding parenthetical info to the LanguageName field).

Blocks resolution of #53.

Check diacritic consistency

1747 Ramaswami knn Konkani 1 64141 2856 d̤z consonant c-d-c 3
1318 CASL cce Copi 1 46057 2509 d̤z̤ consonant c-d-c-d 4

Missing geo-coordinates

Now, there are only 6 languages lacking geo coordinates:

  • Pisamira (ID 2143)
  • Yãroamë of Serra do Pacu/Ajarani (ID 2150)
  • Arara do Acre (ID 1999)
  • Parkateje (ID 1970)
  • Günün Yajich (ID 1906)
  • Dinka (ID 1398)

I should also check if the language codes are the latest or have been updated in the interim.

Sebat Bet Gurage -- transcription mistakes

http://phoible.org/languages/sgw

(probably affects all Sebat Bet Gurage inventories and possibly other Ethiopian languages)

/q/ -> kʼ
/q/ -> kʲʼ
/q/ -> kʷʼ

as per Hetzron, Robert 1977.

@drammock - we also have /kʼʲ/, but conventions say to use /kʲʼ/, etc. Is there a qualitative difference that we should capture? (I'm told that in Ethiopianist tradition the ejective should come before the secondary articulation.) Either way, something needs to be fixed.

I also need to look at <hʲ> which may be better represented as <ç>. Further investigation needed.

Glyphs IDs out-of-line

Changing things like

%tʃ%
%dʒ%

to phoible conventions would require that the glyph ids are also updated

Fix GM README

There are some remarks in there that don't seem to be true (e.g., about having merged the Africa and SE Asia data into one source file). Fix after #28 is merged.

make phoible an R package?

Something to think about for the long-term. It is possible to release R packages that are data sets only. It would be a nice distribution model (worldwide mirrors) with a built-in versioning and upgrade mechanism. We could also release some purpose-built functions (trump, feature reduction, etc) along with it. Thoughts?

Azerbaijani ISO code

The "gold standard" document lists language azb "Azerbaijani" (SPA). The current output of the aggregation script has the same data (from SPA) under ISO code azj. Which is correct? Did you change this in the gold standard doc or in the SPA raw data file at some point, @bambooforest? I couldn't find any open or closed issues mentioning those ISO codes or language name.

Additional info: according to the ISO table, azj is "North Azerbaijani" and azb is "South Azerbaijani", and there is also a macrolanguage aze.

Check sources: affricate place mismatch

Languages ahk (Akha) and pib (Yine) in the UW data have affricates where the stop and fricative parts don't match in place of articulation: and tçʰ

Wrong language

The "Kxoe" language entry cites Christa König and Bernd Heine. 2008. A concise dictionary of Northwestern !Xun. Ruediger Koppe. This should be ǃXun, not Kxoe/Khwe!

@bambooforest : check the language name and code

aspiration in source data

There should never be a standard aspiration diacritic ʰ on a base glyph that is voiced. All such instances of ʰ in the data sources should be converted to ʱ. The features table will need to be updated as well, to make sure that the revised phonemes are still getting assigned a feature vector.
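The conversion could be sketched roughly as follows. The set of voiced base glyphs is a hypothetical, deliberately incomplete stand-in (the real list would come from the features table), and the rule for finding the base glyph (skip combining marks and modifier letters, then look at the nearest preceding character) is an assumption of this example.

```python
import unicodedata

ASPIRATION = "\u02b0"  # ʰ, modifier letter small h
BREATHY = "\u02b1"     # ʱ, modifier letter small h with hook

# Hypothetical, incomplete set of voiced base glyphs (assumption for this sketch).
VOICED_BASES = set("bd\u0261\u0256\u025fvz\u0292")


def _is_diacritic(ch: str) -> bool:
    """Treat combining marks and modifier letters as diacritics."""
    return unicodedata.combining(ch) != 0 or unicodedata.category(ch) == "Lm"


def fix_aspiration(segment: str) -> str:
    """Replace ʰ with ʱ when the nearest preceding base glyph is voiced."""
    out = []
    for ch in segment:
        if ch == ASPIRATION:
            # Walk backwards past any diacritics to find the base glyph.
            base = next((c for c in reversed(out) if not _is_diacritic(c)), "")
            if base in VOICED_BASES:
                ch = BREATHY
        out.append(ch)
    return "".join(out)


print(fix_aspiration("b\u02b0"))  # voiced base: aspiration mark is converted
print(fix_aspiration("t\u02b0"))  # voiceless base: left unchanged
```

Any phoneme changed this way would still need a matching row in the features table, as noted above.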

Allophone handling

The allophone code in aggregate-raw-data.R is a bit fragile, given that the different data sources encode allophone information differently. This was discussed at some length here. This issue is here to remind us to revisit the allophone issue once the rest of the aggregation is stable.

Nasal release vs. pre-stopped nasals

S. Moisik points out that our feature set probably can't distinguish between a pre-stopped nasal and a stop with nasal release. This is possibly even a within-language contrast in Wolof? Will need to think about how to distinguish these segment types featurally.

features: palatal diacritic

palatal diacritic ʲ currently overwrites features +dorsal +high -low +front -back. There are some phonemes whose base glyphs are already palatal and yet have that diacritic, like cçʲ and which at present are featurally indistinct from their base glyph counterparts and c.

Duplicate languages `izi` and `xsm` in AA raw data

AA raw data has three different lects listed under ISO code izi:

  • Lines 2480-2517 are LanguageName "ẹzaa"
  • Lines 3716-3755 are LanguageName "ikwo"
  • Lines 3757-3826 are LanguageName "izi"

This breaks the trump selection procedure, which currently can only handle:

  • same ISO code (LanguageCode) but different Source
  • same ISO code and same Source but different SpecificDialect
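One way to surface such cases before trump selection runs is to group rows by the fields the procedure can distinguish and flag any group that still contains more than one language name. This is only a sketch: the function name is invented, the rows are reduced to one per lect, and the second language code ("abc") is a made-up placeholder.

```python
from collections import defaultdict


def ambiguous_groups(rows):
    """Return groups sharing code, source and dialect but differing in name."""
    names = defaultdict(set)
    for row in rows:
        key = (row["LanguageCode"], row["Source"], row.get("SpecificDialect"))
        names[key].add(row["LanguageName"])
    return {key: sorted(found) for key, found in names.items() if len(found) > 1}


# Reduced illustration of the izi case; "abc" is a hypothetical extra code.
rows = [
    {"LanguageCode": "izi", "Source": "aa", "LanguageName": "ẹzaa"},
    {"LanguageCode": "izi", "Source": "aa", "LanguageName": "ikwo"},
    {"LanguageCode": "izi", "Source": "aa", "LanguageName": "izi"},
    {"LanguageCode": "abc", "Source": "aa", "LanguageName": "example"},
]

problems = ambiguous_groups(rows)
for key, found in problems.items():
    print(key, found)
```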

Any idea what's going on here @bambooforest? Can you (or one of your colleagues) examine the raw AA data and figure out what to do here? Possible solutions I can see are:

  1. keep all three, but add SpecificDialect information to each (in AA data, there is no SpecificDialect column, so this would take the form of parenthetical info in the LanguageName column)
  2. remove two of them, if the three are genuine duplicates (no differences in inventory), and incorporate the alternate names into the LanguageName field of the entry that is kept.

aggregation script aggregating header row from somewhere

line 22611:

ISO Name Dialect Phonemes Allophones Allophones Allophones Allophones Allophones gm 0050+0068+006F+006E+0065+006D+0065+0073 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

semantic versioning

I noticed on the release that it was tagged with version "v2014". I understand this was probably motivated by Martin's preferences for CLLD releases being once a year at most. However, phoible is still software, and as such I think it's worth considering, at least, whether we want a more fine-grained versioning scheme (personally I like semantic versioning). The advantage of this is that if we (or others) use phoible in publications, and want to use the most current version (because of new data or features added since last CLLD release, for example), then we can create a new intermediate release that we can reference in that publication. Thoughts?

aggregating script dropping data?

some weird things come out of the aggregation script that i'm discovering via the R code on multivariate variables i emailed you. for example:

filter(multivariate, coronal==FALSE)

returns:

fan
ksf
mky
sgw
thm
xbr

filter(final.data, LanguageCode=="ksf") --> 7 phonemes
filter(final.data, LanguageCode=="fan") --> 1 phoneme
filter(final.data, LanguageCode=="mky") --> 1 phoneme
filter(final.data, LanguageCode=="sgw") --> 3 phonemes
filter(final.data, LanguageCode=="thm") --> 1 phoneme
filter(final.data, LanguageCode=="xbr") --> 1 phoneme

table(final.data$LanguageCode)

shows the phoneme counts are off...

Languages with duplicate phonemes

There are some languages that are showing duplicate phonemes. Not sure if this is caused by the denormRenorm function, or if there are errors in the source data files, or some other cause. Affects 19 languages in SAPHON, 10 languages in AA, 3 languages in RA.
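A quick way to list the affected entries, independent of where the duplication comes from, is to count identical (source, language, phoneme) triples. The sketch below uses invented sample rows and a made-up function name; it is not part of the aggregation script.

```python
from collections import Counter


def duplicate_phonemes(rows):
    """Return (source, language code, phoneme) triples that occur more than once."""
    counts = Counter(rows)
    return sorted(triple for triple, n in counts.items() if n > 1)


# Invented sample rows: (Source, LanguageCode, Phoneme).
rows = [
    ("saphon", "xxx", "a"),
    ("saphon", "xxx", "t"),
    ("saphon", "xxx", "a"),
    ("aa", "yyy", "k"),
]

dupes = duplicate_phonemes(rows)
print(dupes)
```

Running the same check on the raw source rows and again on the aggregated output would help show whether the duplicates are already present in the source files or are introduced later.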

Updates / additions to Mande languages

See Vydrin 2007, which contains phonemic inventories for South Mande languages (known at that time).

In particular, on p. 8 there is a vocalic inventory of Dan-Gweetaa. It should be modified: the semi-closed vowels (ɩ, ʋ, ʋ̈) are not separate phonemes but allophones of e, o, ɤ respectively under extra-high tone; the semi-closed nasal vowels (given in brackets) should also be eliminated (they are allophones rather than phonemes). Phoneme ɒ is not necessarily long; it can be short. A third modulated tone (extrahigh-extralow) has been discovered.

I'm told there are corrections to be done on other South Mande languages as well.

Further: there are two "Dan"'s in the database: Dan (GM) and Dan (UPSID). Dan (GM) is an early and pioneering study by Bearth and Zemp. The latter seems to refer to a Liberian variety (Vydrin, pc).

The former needs to be updated to reflect what has been learned about the language. The latter contains inexactitudes (ibid).

features: combining plus sign below

This diacritic currently has no features associated with it, so phonemes with and without the diacritic that are otherwise identical will be featurally indistinct.

Glyphs: modifier turned glottal stop vs combining right tack

According to our old diacritics spreadsheet, ˤ and (without the a) are featurally synonymous (+RTR -ATR), but the modifier pharyngeal is for consonants and the combining tack is for vowels. This convention is not followed in the UPSID data, where segments like ãõ̞ˤ occur in !Xu. However, since the latter half of that segment already has a lower diacritic (combining down tack), adding a combining right tack makes it hard/impossible to read.

Maybe this is a non-issue since there seem to be some changes afoot in the UPSID data (possibly done by one of @bambooforest's UZH minions?) that include getting rid of the downtacks for vowels described simply as "mid". So two related questions: what is the story with those UPSID changes? What to do about the modifier pharyngeal on vowels in UPSID?

NAs in Allophones

Not sure if this is an issue @drammock, but I'm in the middle of the aspiration cleaning, so I wanted to file a quick issue so as not to forget what I came across (Source, LanguageCode, Phoneme, Allophone):

ra awa cçʰ NA
ra ben cçʰ NA
ra bft cçʰ NA
ra bkk cçʰ NA
ra bns cçʰ NA
ra cdn cçʰ NA
gm xuu bʰ bʰ
ph ahk cçʰ cçʰ
spa aka cçʰ cçʰ

Add Gumer

Add Gumer from Völlmin's description.

Völlmin, Sascha. forthcoming. Towards a Grammar of Gumer: Phonology and Morphology of a Western Gurage variety. PhD thesis, University of Zurich.

Also to look at:

Banksira, Degif Petros. 2000. Sound mutations. The morphophonology of Chaha. (Te 11 080)

SAPHON languages missing tone information

The SAPHON raw data has two columns at the end of the phoneme block: tone and nasal harmony. They are boolean columns just like all the phoneme columns. Nasal harmony is a phonological rule, so we don't need to include it at this stage of phoible. But how should we handle their boolean "tone / no tone" distinction? Is it generally true that South American languages that have tone all have the same number and type of tonemes?

c-cedilla

all c-cedillas should be pre-composed. SM: double check.
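A minimal sketch of the check-and-fix, assuming "pre-composed" means the single code point U+00E7 rather than c followed by the combining cedilla U+0327. The replacement is deliberately targeted so that other base-plus-diacritic sequences are left alone; the function name is made up.

```python
DECOMPOSED = "c\u0327"  # c + COMBINING CEDILLA
PRECOMPOSED = "\u00e7"  # LATIN SMALL LETTER C WITH CEDILLA


def precompose_c_cedilla(text: str) -> str:
    """Replace decomposed c-cedilla with the single pre-composed code point."""
    return text.replace(DECOMPOSED, PRECOMPOSED)


segment = "c" + DECOMPOSED + "\u02b0"  # c, decomposed c-cedilla, aspiration
fixed = precompose_c_cedilla(segment)
print([hex(ord(ch)) for ch in fixed])
```

unicodedata.normalize("NFC", text) would also compose this pair, but it composes every other composable sequence too, which may or may not be wanted here.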

Add Looma data

Add phonological inventory data from the Woi-Bhalaga dialect of Looma (Guinean).

See: 45mishchenko.pdf

SAPHON languages with two ISO codes

Several SAPHON languages list two ISO codes for the same inventory:

Chorote lists crq crt
Bolivian Quechua lists qul quh
Ancash Quechua lists qwa qws
Huaylas-Conchucos Quechua lists qxn qwh

Any idea why this is, or how we should choose the "correct" one, @bambooforest ?
