phoible / dev


PHOIBLE data and development.

Home Page: https://phoible.org/

License: GNU General Public License v3.0

R 4.99% TeX 95.01%

linguistics phonology typology

dev's Issues

Add Looma data

Add phonological inventory data from the Woi-Bhalaga dialect of Looma (Guinea).

See: 45mishchenko.pdf

Something to think about for the long-term. It is possible to release R packages that are data sets only. It would be a nice distribution model (worldwide mirrors) with a built-in versioning and upgrade mechanism. We could also release some purpose-built functions (trump, feature reduction, etc) along with it. Thoughts?

Update language codes

http://www.language-archives.org/checks/phoible.org

feature vectors for some phonemes not getting properly merged

Out of 2211 phonemes, 221 are not merging properly, although at least some of them are definitely present in the feature table.
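
A quick way to list the phonemes that fail to merge is an anti-join between the inventory phonemes and the feature table keys, compared after Unicode normalization. Differing code point sequences for visually identical glyphs may be one cause of the failed merges, though that is an assumption, not a diagnosis. A minimal sketch in Python (the project's own scripts are in R; the function name and sample data are hypothetical):

```python
import unicodedata

def unmatched_phonemes(phonemes, feature_keys, form="NFD"):
    """Return phonemes that have no feature-table entry after normalization."""
    normalized_keys = {unicodedata.normalize(form, key) for key in feature_keys}
    return sorted(
        p for p in set(phonemes)
        if unicodedata.normalize(form, p) not in normalized_keys
    )

# Hypothetical sample data: "\u00e7" is precomposed c-cedilla,
# "c\u0327" is the decomposed sequence, "t\u02b0" is aspirated t.
inventory = ["p", "c\u0327", "t\u02b0"]
feature_table = ["p", "\u00e7"]
missing = unmatched_phonemes(inventory, feature_table)
```

Anything left in `missing` is genuinely absent from the feature table under this comparison, so it needs a new feature row, not an encoding fix.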

Check diacritic consistency

1747 Ramaswami knn Konkani 1 64141 2856 d̤z consonant c-d-c 3
1318 CASL cce Copi 1 46057 2509 d̤z̤ consonant c-d-c-d 4

Nasal release vs. pre-stopped nasals

S. Moisik points out that our feature set probably can't distinguish between a pre-stopped nasal and a stop with nasal release. This is possibly even a within-language contrast in Wolof? Will need to think about how to distinguish these segment types featurally.

Fix GM README

There are some remarks in there that don't seem to be true (e.g., about having merged the Africa and SE Asia data into one source file). Fix after #28 is merged.

Add Mandinka

of Guinea, source from Vydrin.

Also a good source:

http://www.deniscreissels.fr/public/Creissels-lexique_mandinka_2012.pdf

NAs in Allophones

Not sure if this is an issue @drammock but I'm in the middle of the aspiration cleaning, so I wanted to file a quick issue so as not to forget what I came across (Source, LanguageCode, Phoneme, Allophone):

ra awa cçʰ NA
ra ben cçʰ NA
ra bft cçʰ NA
ra bkk cçʰ NA
ra bns cçʰ NA
ra cdn cçʰ NA
gm xuu bʰ bʰ
ph ahk cçʰ cçʰ
spa aka cçʰ cçʰ

Language code update

http://phoible.org/languages/daf

has been split into Dan [dnj] and Kla-Dan [lda]

http://www-01.sil.org/iso639-3/documentation.asp?id=daf

SAPHON languages with two ISO codes

Several SAPHON languages list two ISO codes for the same inventory:

Chorote lists crq crt
Bolivian Quechua lists qul quh
Ancash Quechua lists qwa qws
Huaylas-Conchucos Quechua lists qxn qwh

Any idea why this is, or how we should choose the "correct" one, @bambooforest ?

How to represent pitch accents glyphically?

This issue is for keeping track of inventories that are known to have pitch accents among their tonemes, and how the pitch accents were encoded. It should not be considered exhaustive.

nld, dialect: Hasselt, source: Peters 2006, accent 1: no underlying tone, accent 2: low tone (cf. pp 121-123 and uzling/phoible-data#7)
nld, dialect: Maastricht, source: Gussenhoven 1999, accent 1: no tone, accent 2: high tone (cf. p 162 and uzling/phoible-data#8)

Wari

y -> j

c-cedilla

all c-cedillas should be pre-composed. SM: double check.
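
A minimal Python sketch of the check and the fix, assuming the only decomposed form in the data is `c` immediately followed by U+0327 COMBINING CEDILLA. The function names are hypothetical. The replacement is deliberately targeted, because full NFC normalization would also compose other base-plus-diacritic sequences.

```python
DECOMPOSED = "c\u0327"   # c + COMBINING CEDILLA
PRECOMPOSED = "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA

def has_decomposed_c_cedilla(text):
    """True if the text contains c followed directly by a combining cedilla."""
    return DECOMPOSED in text

def precompose_c_cedilla(text):
    """Replace each decomposed c-cedilla with the precomposed character."""
    return text.replace(DECOMPOSED, PRECOMPOSED)

# Example: a decomposed c-cedilla followed by an aspiration diacritic.
fixed = precompose_c_cedilla("c\u0327\u02b0")
```

A sequence with another combining mark between the `c` and the cedilla would not be caught by this simple substring test.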

aggregation script aggregating header row from somewhere

line 22611:

ISO Name Dialect Phonemes Allophones Allophones Allophones Allophones Allophones gm 0050+0068+006F+006E+0065+006D+0065+0073 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Azerbaijani ISO code

The "gold standard" document lists language azb "Azerbaijani" (SPA). The current output of the aggregation script has the same data (from SPA) under ISO code azj. Which is correct? Did you change this in the gold standard doc or in the SPA raw data file at some point, @bambooforest? I couldn't find any open or closed issues mentioning those ISO codes or language name.

Additional info: according to the ISO table, azj is "North Azerbaijani" and azb is "South Azerbaijani", and there is also a macrolanguage aze.

trump reduction not working

not sure why yet, but 343 of the ISO codes are showing more than one value in the Source column.
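
A sketch of how the affected codes could be listed, written in Python instead of the project's R. It assumes the aggregated data is available as rows with `LanguageCode` and `Source` fields; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict

def codes_with_multiple_sources(rows):
    """Map each LanguageCode to its sources, keeping codes with more than one."""
    sources = defaultdict(set)
    for row in rows:
        sources[row["LanguageCode"]].add(row["Source"])
    return {code: sorted(s) for code, s in sources.items() if len(s) > 1}

# Hypothetical rows as they might look after a faulty trump reduction.
rows = [
    {"LanguageCode": "lang1", "Source": "spa"},
    {"LanguageCode": "lang1", "Source": "upsid"},
    {"LanguageCode": "lang2", "Source": "gm"},
    {"LanguageCode": "lang2", "Source": "gm"},
]
leftover = codes_with_multiple_sources(rows)
```

After a correct trump reduction the returned dictionary should be empty.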

SAPHON questionable graphemes

Notes have been added, but the corresponding IPA graphs haven't been updated yet.

Add Fataluku (Papuan, East Timor)

Add inventory of Fataluku (Papuan, East Timor).

SAPHON languages missing tone information

The SAPHON raw data has two columns at the end of the phoneme block: tone and nasal harmony. They are boolean columns just like all the phoneme columns. Nasal harmony is a phonological rule, so we don't need to include it at this stage of phoible. But how should we handle their boolean "tone / no tone" distinction? Is it generally true that South American languages that have tone all have the same number and type of tonemes?

Old issue to close

The sounds from nmn:
dʼkxʼ : not in PDF... should be dtʼkxʼ I think? described as prevoicing followed by sequence of two ejectives (three-part contour seg)
d̪ʼkxʼ : not in PDF... should be d̪t̪ʼkxʼ I think? (same as above)
ɡʼkxʼ : not in PDF... I do see ɡkxʼ which ought to be a prevoiced velar ejective (two-part contour segment: ɡ features + kxʼ features)
pʼkxʼ : this ought to be fine, should be treated as two-part contour (the features for pʼ (bilabial ejective stop) followed by the features for kxʼ (velar ejective affricate))
tʼkxʼ : ditto
t̪ʼkxʼ : ditto

As to why they are crashing the script... the first three might be crashing because the ejective marker is on a voiced segment (ejectives can't be voiced)... ?

Regarding the prenasalized ones, a contour segment with the features you've mentioned (+nas, +son, 0delrel) sounds right to me. The rest it sounds like you already figured out right?

Found it. I was looking in the wrong thread... it got buried in the other thread after richard's question about "round" and "tense". Short answer is that there is not really enough info in that resource to really know all the phonemes. My best guesses:

treat fortis stops as in !Xoo, namely as a prevoiced aspirated stop like dtʰ or ɡkʰ
use unicode point 203c (‼) for the retroflex click
the square brackets in the chart on page 5 indicate the IPA interpretation of the orthography, in cases where the authors thought it was unclear.
the complex stops are a piece of work... sighting down the EGR-AL column on page 5, here is my best guess at what I see:
l t d dtʰ tʼ tʰ d̤ tx dx

Honestly I don't have time to try to puzzle out the rest of this inventory at the moment... there's not enough information there for me to do anything other than make wild guesses, and I'd rather focus on the Clements scripts.

ɲɟʝ is fine

ndʑ is fine

ɲdʑ does not obey our rules for homorganic place for prenasalization.

Should check the grammars.

Fix two languages that both have same trump order

Both zoc and ind have two entries with the same ISO code (and the same Glottocode) and the trump value of 1.

Add new inventory data

Add the newly collected sources (UZ isolates, Iranian languages, JIPA, etc.).

Review known issues and then run through Unicode IPA checker.

semantic versioning

I noticed on the release that it was tagged with version "v2014". I understand this was probably motivated by Martin's preferences for CLLD releases being once a year at most. However, phoible is still software, and as such I think it's worth considering, at least, whether we want a more fine-grained versioning scheme (personally I like semantic versioning). The advantage of this is that if we (or others) use phoible in publications, and want to use the most current version (because of new data or features added since last CLLD release, for example), then we can create a new intermediate release that we can reference in that publication. Thoughts?

features: combining plus sign below

This diacritic currently has no features associated with it, so phonemes with and without the diacritic that are otherwise identical will be featurally indistinct.

New style glyph IDs

@drammock - for the data aggregation script, Forkel and I thought it made sense to generate unique IDs by taking the decimal number of each character in each glyph and concatenating them together, e.g.

pʼ == 112700

This would be more stable than my previous approach.
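
The scheme can be sketched in a few lines of Python (the aggregation script itself is in R). `glyph_id` follows the description above; `glyph_id_delimited` is a hypothetical variant, not part of the proposal, shown only because plain concatenation could in principle map two different code point sequences to the same digit string.

```python
def glyph_id(glyph):
    """Concatenate the decimal code point of every character in the glyph."""
    return "".join(str(ord(ch)) for ch in glyph)

def glyph_id_delimited(glyph, sep="-"):
    """Hypothetical variant: the same code points, joined with a separator."""
    return sep.join(str(ord(ch)) for ch in glyph)

# p is U+0070 (decimal 112); the ejective marker is U+02BC (decimal 700).
ejective_p = "p\u02bc"
plain_id = glyph_id(ejective_p)
delimited_id = glyph_id_delimited(ejective_p)
```

Here `plain_id` is `"112700"`, matching the example above, and `delimited_id` is `"112-700"`.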

Glyphs: modifier turned glottal stop vs combining right tack

According to our old diacritics spreadsheet, ˤ and a̙ (without the a) are featurally synonymous (+RTR -ATR), but the modifier pharyngeal is for consonants and the combining tack is for vowels. This convention is not followed in the UPSID data, where segments like ãõ̞ˤ occur in !Xu. However, since the latter half of that segment already has a lower diacritic (combining down tack), adding a combining right tack makes it hard/impossible to read.

Maybe this is a non-issue since there seem to be some changes afoot in the UPSID data (possibly done by one of @bambooforest's UZH minions?) that include getting rid of the downtacks for vowels described simply as "mid". So two related questions: what is the story with those UPSID changes? What to do about the modifier pharyngeal on vowels in UPSID?

add conventions text file to root level

add phoneme conventions, etc.

perhaps add addressed decisions in separate section via the paper trail

https://github.com/uzling/phoible-data/blob/master/agg-vs-gold-mismatches.tsv

Add Gumer

Add Gumer from Völlmin's description.

Völlmin, Sascha. forthcoming. Towards a Grammar of Gumer: Phonology and Morphology of a Western Gurage variety. PhD thesis, University of Zurich.

Also to look at:

Banksira, Degif Petros. 2000. Sound mutations. The morphophonology of Chaha. (Te 11 080)

Fix remaining 6 geo-coordinates

Fix the remaining 6 missing geo-coordinates.

Allophone handling

The allophone code in aggregate-raw-data.R is a bit fragile, given that the different data sources encode allophone information differently. This was discussed at some length here. This issue is here to remind us to revisit the allophone issue once the rest of the aggregation is stable.

Add hoj and bhd

Add inventories for languages Hadoti [hoj] and Bhadarwahi [bhd]:

Dwivedi, A. V. (2012). A Grammar of Hadoti. München: LINCOM EUROPA.
http://www.worldcat.org/title/grammar-of-hadoti/oclc/822017249

Dwivedi, A. V. (2013). A Grammar of Bhadarwahi. München: LINCOM EUROPA.
http://www.worldcat.org/title/agrammar-of-bhadarwahi/oclc/861960466

Languages with duplicate phonemes

There are some languages that are showing duplicate phonemes. Not sure if this is caused by the denormRenorm function, or if there are errors in the source data files, or some other cause. Affects 19 languages in SAPHON, 10 languages in AA, 3 languages in RA.
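
To narrow down where the duplicates come from, one option is to count repeated phonemes per inventory in the raw files and again after aggregation, then compare the two lists. A minimal Python sketch, assuming rows with `Source`, `LanguageCode`, and `Phoneme` fields; the function name and sample rows are hypothetical, and dialect information is ignored here.

```python
from collections import Counter

def duplicate_phonemes(rows):
    """Return (Source, LanguageCode, Phoneme) keys that occur more than once."""
    counts = Counter(
        (row["Source"], row["LanguageCode"], row["Phoneme"]) for row in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical rows: one inventory lists "a" twice.
rows = [
    {"Source": "saphon", "LanguageCode": "lang1", "Phoneme": "a"},
    {"Source": "saphon", "LanguageCode": "lang1", "Phoneme": "a"},
    {"Source": "saphon", "LanguageCode": "lang1", "Phoneme": "p"},
    {"Source": "aa", "LanguageCode": "lang2", "Phoneme": "a"},
]
dupes = duplicate_phonemes(rows)
```

Duplicates that already appear in the raw files point to source errors; duplicates that only appear after aggregation point to the script.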

Check sources: affricate place mismatch

Languages ahk (Akha) and pib (Yine) in the UW data have affricates where the stop and fricative parts don't match in place of articulation: tç and tçʰ

Duplicate languages `izi` and `xsm` in AA raw data

AA raw data has three different lects listed under ISO code izi:

Lines 2480-2517 are LanguageName "ẹzaa"
Lines 3716-3755 are LanguageName "ikwo"
Lines 3757-3826 are LanguageName "izi"

This breaks the trump selection procedure, which currently can only handle:

same ISO code (LanguageCode) but different Source
same ISO code and same Source but different SpecificDialect

Any idea what's going on here @bambooforest? Can you (or one of your colleagues) examine the raw AA data and figure out what to do here? Possible solutions I can see are:

keep all three, but add SpecificDialect information to each (in AA data, there is no SpecificDialect column, so this would take the form of parenthetical info in the LanguageName column)
remove two of them, if the three are genuine duplicates (no differences in inventory), and incorporate the alternate names into the LanguageName field of the entry that is kept.

features: palatal diacritic

palatal diacritic ʲ currently overwrites features +dorsal +high -low +front -back. There are some phonemes whose base glyphs are already palatal and yet have that diacritic, like cçʲ and cʲ which at present are featurally indistinct from their base glyph counterparts cç and c.

Missing geo-coordinates

Now, there are only 6 languages lacking geo coordinates:

Pisamira (ID 2143)
Yãroamë of Serra do Pacu/Ajarani (ID 2150)
Arara do Acre (ID 1999)
Parkateje (ID 1970)
Günün Yajich (ID 1906)
Dinka (ID 1398)

I should also check if the language codes are the latest or have been updated in the interim.

aggregating script dropping data?

Some weird things come out of the aggregation script that I'm discovering via the R code on multivariate variables I emailed you. For example:

filter(multivariate, coronal==FALSE)

returns:

fan
ksf
mky
sgw
thm
xbr

filter(final.data, LanguageCode=="ksf") --> 7 phonemes
filter(final.data, LanguageCode=="fan") --> 1 phoneme
filter(final.data, LanguageCode=="mky") --> 1 phoneme
filter(final.data, LanguageCode=="sgw") --> 3 phonemes
filter(final.data, LanguageCode=="thm") --> 1 phoneme
filter(final.data, LanguageCode=="xbr") --> 1 phoneme

table(final.data$LanguageCode)

shows the phoneme counts are off...

SPA zoc -> zoh

This language code [zoh] is wrong. It should be [zoc], as per WALS identification of Wonderly 1951a

http://wals.info/refdb/record/Wonderly-1951b

which would also align the SPA inventory with the UPSID inventory under the same reference.

The code has been updated in the SPA langname codes file

https://github.com/phoible/phoible/blob/master/data/SPA/SPA_LangNamesCodes.tsv

and the aggregated and phoneme level files will be updated when they are newly generated.

markup screwed up in README.md

phoible / data / README.md

fix me.

Sebat Bet Gurage -- transcription mistakes

http://phoible.org/languages/sgw

(probably affects all Sebat Bet Gurage inventories and possibly other Ethiopian languages)

/q/ -> kʼ
/q/ -> kʲʼ
/q/ -> kʷʼ

as per Hetzron, Robert 1977.

@drammock - we also have /kʼʲ/, but conventions say to use /kʲʼ/, etc. Is there a qualitative difference that we should capture? (I'm told that in the Ethiopianist tradition the ejective should come before the secondary articulation.) Either way, something needs to be fixed.

I also need to look at <hʲ> which may be better represented as <ç>. Further investigation needed.

SAPHON specific dialect

the aggregation script doesn't pull SpecificDialect out of the SAPHON raw data.

wrong geo-location

http://phoible.org/languages/blk

should be further to the north (pc Mathias)

aspiration in source data

There should never be a standard aspiration diacritic ʰ on a base glyph that is voiced. All such instances of ʰ in the data sources should be converted to ʱ. The features table will need to be updated as well, to make sure that the revised phonemes are still getting assigned a feature vector.
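
A minimal Python sketch of the conversion, assuming voicing can be decided from the first character of the phoneme string. `VOICED_BASES` is a hypothetical and deliberately incomplete set; in practice voicing would presumably be read from the features table.

```python
ASPIRATION = "\u02b0"          # MODIFIER LETTER SMALL H
BREATHY_ASPIRATION = "\u02b1"  # MODIFIER LETTER SMALL H WITH HOOK

# Hypothetical, incomplete set of voiced base glyphs: b, d, script g,
# barred dotless j, d with tail.
VOICED_BASES = {"b", "d", "\u0261", "\u025f", "\u0256"}

def fix_aspiration(phoneme, voiced_bases=VOICED_BASES):
    """Swap the aspiration diacritic when the phoneme starts with a voiced base."""
    if phoneme and phoneme[0] in voiced_bases:
        return phoneme.replace(ASPIRATION, BREATHY_ASPIRATION)
    return phoneme

voiced_fixed = fix_aspiration("b\u02b0")
voiceless_kept = fix_aspiration("p\u02b0")
```

The first-character test is too coarse for prevoiced segments such as dtʰ, where the aspiration belongs to the voiceless part, so those would need manual review before any bulk replacement.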

SPA LanguageNames to ISO639.3 codes

Let me know if this doesn't work for the SPA LanguageName - to - ISO639.3 mappings, @drammock

https://github.com/phoible/phoible/blob/master/data/METADATA/phoible_index.tsv

@bambooforest check that these codes are up-to-date (from the 1241 version of phoible) -- also merge in data from inventoryids_filenames.tsv

Duplicate languages `lmn` and `iru` in Ramaswami raw data

lmn is named as Lamani on line 60 and Banjara on line 13.

iru is named as Irula on line 33 and Kasaba on line 40.

Need to determine if this is just an error of the ISO code (in which case need to correct), or if these are actually lects that are classified with the same ISO code (in which case need to figure out how to apply trump to them; easiest is probably adding parenthetical info to the LanguageName field).

Blocks resolution of #53.

Glyphs IDs out-of-line

Changing things like

%tʃ%
%dʒ%

to phoible conventions would require that the glyph ids are also updated

Updates / additions to Mande languages

See Vydrin 2007, which contains phonemic inventories for South Mande languages (known at that time).

In particular, on p. 8, there is a vocalic inventory of Dan-Gweetaa. It should be modified: the semi-closed vowels (ɩ, ʋ, ʋ̈) are not separate phonemes but allophones of e, o, ɤ respectively under extra-high tone; the semi-closed nasal vowels (given in brackets) should also be eliminated (they are allophones rather than phonemes). Phoneme ɒ is not necessarily long; it can be short. A third modulated tone (extrahigh-extralow) has been discovered.

I'm told there are corrections to be done on other South Mande languages as well.

Further: there are two "Dan" entries in the database: Dan (GM) and Dan (UPSID). Dan (GM) is an early and pioneering study by Bearth and Zemp. The latter seems to refer to a Liberian variety (Vydrin, pc).

The former needs to be updated to reflect what has been learned about the language. The latter contains inexactitudes (ibid).

add indicator columns for unusual language types

Probably want separate columns for mixed, pidgin, creole, signed, ancient, extinct. Could also conceivably do a single column that had one of those strings or NA for each language. That would be more compact, but a bit harder to deal with if we had a language that was creole + signed, mixed + extinct, etc.

Closes #67
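
The trade-off between the two layouts can be illustrated with a small Python sketch. The type names come from the issue text; the function names, the `+` separator, and the use of `None` in place of NA are assumptions made for the example.

```python
LANGUAGE_TYPES = ("mixed", "pidgin", "creole", "signed", "ancient", "extinct")

def to_indicator_columns(types):
    """One boolean column per language type."""
    unknown = set(types) - set(LANGUAGE_TYPES)
    if unknown:
        raise ValueError(f"unknown language types: {sorted(unknown)}")
    return {t: t in types for t in LANGUAGE_TYPES}

def to_single_column(types, sep="+"):
    """One string column; None stands in for NA when no type applies."""
    return sep.join(t for t in LANGUAGE_TYPES if t in types) or None

creole_signed = {"creole", "signed"}
wide = to_indicator_columns(creole_signed)
compact = to_single_column(creole_signed)
```

With separate columns, a combined case is just two `True` values. With the single column, every consumer has to split the string on the separator before filtering.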

Wrong language

The "Kxoe" language entry cites Christa König and Bernd Heine. 2008. A concise dictionary of Northwestern !Xun. Ruediger Koppe. This should be ǃXun, not Kxoe/Khwe!

@bambooforest : check the language name and code
