The bdproto from bdproto

bib-file

add TimeDepthSource and HomelandSource to bib-file

10/70 Proto-Hlai

Check if it's Proto-Hlai or Proto-Hlaic.

Ancient Egyptian

Asked Harald for additional Glottocodes for the different stages of Ancient Egyptian.

Akkadian

We could have more datapoints for Akkadian if we want, with good datings. Basically there are a few main dialects that are attested over a few thousand years. Do we want that?

Uralic

Check all the Uralic entries, there's some fishy stuff there.

time depth

time depth for geographic area
- general minimum time depth (8000 BP/BC in the Fertile Crescent and China; more recent developments, 1000-2000 BC, in the Americas)
we need dates for the proto-languages

Standardized names (common, Proto-)

Some reconstructed languages are called "Common X" rather than "Proto-X." Keep or find a better term?

Make family checklist?

Hi, can we make a unified list of proto-languages, as well as a list of proto-languages (by family) to be hunted down? The latter can be made manually and added to.

Transform the aggregation dump to CLDF 1.0

Dump the inventories data into CLDF format
Add the CLDF JSON-DL metadata file (see pycldf)

6:pittayaporn2009_phon, 9:wiki_ostapirat2009, 10:norquest2007_phon

The structure of the three proto-tones is not reconstructable, so how do we represent the tonal phonemes?

163/1076 Zapotec/Zapotecan?

Which is the right group for this inventory?

Smaller areas (Urban)

Ask Matthias Urban about the smaller areas he uses.

1015 - Ijo/Ijoid

Check to see which the source refers to.

Coverage?

I think it would be good to have a summary of the coverage per area, so we can know what areas we should target in searches, rather than focusing on particular areas.
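
A rough sketch of how such a summary could be computed once the metadata has an area column. The field name `Macroarea`, the `UNKNOWN` label, and the toy rows are assumptions for illustration, not the actual bdproto schema or data.

```python
from collections import Counter


def coverage_by_area(rows, area_field="Macroarea"):
    """Count inventories per area; rows without an area are grouped as UNKNOWN."""
    counts = Counter()
    for row in rows:
        area = (row.get(area_field) or "").strip() or "UNKNOWN"
        counts[area] += 1
    return counts


# Toy rows (hypothetical IDs and areas, for illustration only).
rows = [
    {"BdprotoID": "a", "Macroarea": "Eurasia"},
    {"BdprotoID": "b", "Macroarea": "Eurasia"},
    {"BdprotoID": "c", "Macroarea": "Africa"},
    {"BdprotoID": "d", "Macroarea": ""},
]

coverage = coverage_by_area(rows)

# List the least-covered areas first, since those are the ones to target.
for area, n in sorted(coverage.items(), key=lambda item: item[1]):
    print(f"{area}\t{n}")
```

Sorting ascending puts the thinnest areas at the top, which is the list we would actually search from.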

Need a way to distinguish duplicate entries

this could be by adding a column LanguageName with a standardized way of naming the inventories (so as to be able to easily identify duplicate entries) and replacing the current LanguageName with something like SourceLanguageName
or it could be by assigning a Glottocode to each and every inventory

this should be done in the g-spreadsheet for the time being
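
If we go the Glottocode route, flagging duplicates could look roughly like the sketch below. The column names `BdprotoID` and `Glottocode` are assumed, and the Glottocodes in the toy rows are placeholders, not real codes.

```python
from collections import defaultdict


def find_duplicates(rows, key_field="Glottocode", id_field="BdprotoID"):
    """Return {key: [ids]} for every key shared by more than one inventory."""
    groups = defaultdict(list)
    for row in rows:
        key = (row.get(key_field) or "").strip()
        if key:  # inventories without a code cannot be matched this way
            groups[key].append(row[id_field])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Toy rows with placeholder Glottocodes (illustration only).
rows = [
    {"BdprotoID": "97", "Glottocode": "xxxx0001"},
    {"BdprotoID": "1089", "Glottocode": "xxxx0001"},
    {"BdprotoID": "20", "Glottocode": "yyyy0002"},
    {"BdprotoID": "85", "Glottocode": ""},
]

duplicates = find_duplicates(rows)
print(duplicates)
```

The same function works with a standardized `LanguageName` column as the key, by passing `key_field="LanguageName"`.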

Create metadata spreadsheet in sheets

so that we can fill in missing fields and re-merge into the csv file later

I think we need a special annotation for low-confidence segments, with two possibilities: low confidence in the phoneme as a distinctive unit, and low confidence in its phonological ID. This can be useful because if people think there is a distinctive phoneme but don't agree on its interpretation, we can use it for total inventory counts but might want to exclude it from other analyses. If people don't know if it is necessary at all, we can exclude it or include it, depending on whether we want to be conservative or not.
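
One possible shape for this annotation, as a sketch only: two independent flags per segment, one for doubt about the phoneme's existence and one for doubt about its identity. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    symbol: str
    uncertain_existence: bool = False  # unclear whether this is a distinctive unit at all
    uncertain_identity: bool = False   # accepted as a phoneme, but its value is disputed


def inventory_size(segments, conservative=False):
    """Total count; the conservative count drops segments whose existence is doubtful."""
    return sum(1 for s in segments if not (conservative and s.uncertain_existence))


def segments_for_analysis(segments):
    """Segments with no doubt attached, for analyses that depend on the phonological value."""
    return [s for s in segments if not (s.uncertain_existence or s.uncertain_identity)]


# Toy inventory (illustration only).
inventory = [
    Segment("p"),
    Segment("t"),
    Segment("k"),
    Segment("h", uncertain_identity=True),
    Segment("ʔ", uncertain_existence=True),
]

print(inventory_size(inventory))                     # all listed segments
print(inventory_size(inventory, conservative=True))  # without doubtful phonemes
print([s.symbol for s in segments_for_analysis(inventory)])
```

A segment with disputed identity still counts toward inventory size in both modes, which matches the idea of using it for totals but leaving it out of other analyses.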

Clean up directory

make a docs folder for scratch stuff and dump everything there. root folder should be:

bdproto.bib
bdproto-inventories.tsv
bdproto-metadata.tsv
README.md

Source for 16 etc Proto-Alor-Pantar

Add macroareas to metadata sheets

36, 87 Sources

Fix incorrect bibtex keys in UZ

1020 - Mon

Check if it's Proto-Mon or Old Mon.

New column for Type

This is to distinguish reconstructed from ancient languages, because we might want to exclude one type for some analyses.

161 - Eastern Oceanic

No Glottocode, no node, check to see if valid classification.

Update READMEs with LREJ paper

Inventory types

Todo: automatically detect whether an inventory is all Cs or Vs.
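
A minimal sketch of the detection, assuming a segment counts as a vowel when its first character is an IPA vowel letter. That heuristic is an assumption: it does not handle tones or segments that begin with a diacritic, and the labels returned are made up for illustration.

```python
# The 28 vowel letters of the IPA chart.
IPA_VOWELS = set("iyɨʉɯuɪʏʊeøɘɵɤoəɛœɜɞʌɔæɐaɶɑɒ")


def inventory_type(segments):
    """Classify an inventory as consonants only, vowels only, full, or empty."""
    kinds = {"V" if seg[0] in IPA_VOWELS else "C" for seg in segments if seg}
    if kinds == {"C"}:
        return "consonants only"
    if kinds == {"V"}:
        return "vowels only"
    if kinds == {"C", "V"}:
        return "full"
    return "empty"


print(inventory_type(["p", "t", "k", "m", "n"]))
print(inventory_type(["i", "a", "u", "ai"]))
print(inventory_type(["p", "t", "a", "i"]))
```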

Duplicate columns?

What's the difference between LanguageFamily and Classification? Do we need both?

85 - Proto-Tanoan

Is this a real grouping? No Glottocode. Check source.

Add missing feature vectors

1003 - Proto-Lakkia

Is Lakkia a language or a family? Check source

Add columns to metadata spreadsheet

a column in g-sheets that bins the proto-languages into rough ages
if we can't get complete glottocode coverage, we'll need to add some rough geo-coordinates
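
The age binning could be done with something like the sketch below. The bin edges and labels are placeholders to be agreed on, and time depth is assumed to be stored as years before present.

```python
import bisect

# Hypothetical bin edges in years before present; each bin includes its lower edge.
BIN_EDGES = [2000, 4000, 6000, 8000]
BIN_LABELS = ["under 2000", "2000-3999", "4000-5999", "6000-7999", "8000 or more"]


def age_bin(years_bp):
    """Map a time depth in years BP to a rough age bin; None means no date yet."""
    if years_bp is None:
        return "unknown"
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, years_bp)]


for depth in (1500, 2000, 5500, 8000, None):
    print(depth, age_bin(depth))
```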

Delete inventories

There are a few inventories of proto-languages that are prob too controversial/fringe to keep in the db.

This isn't a final list, but right now I'd get rid of BDPROTO ID numbers:

1055 - Nostratic
1053 - Proto-Altaic -- does Robeets have a sound inventory yet?
1059 - Proto-Australian -- not sure but prob to be got rid of
1061 - Proto-Afroasiatic -- v skeptical of any proposed concrete inventory; Nichols says it looks like a pseudo-phylum.
148 - Proto-Nilo-Saharan - not sure about this but looks like the evidence is for the splitters.
97/1089 - Ob-Ugric is sketchy, but I will check out the whole Uralic story, so maybe leave this for now.
20 - Is Proto-TNG really a thing?
1114 - Uralo-Siberian
14 - Proto-Dene-Caucasian

1106 Coptic ANE

Probably better to find another source.

2010 Kassite

There isn't a single connected text in Kassite, so the phonology might be extremely iffy. Suggest deleting.

Source for HUJI Old Egyptian 153

Convert segments to PHOIBLE conventions

Each segment in BDPROTO should conform to phoible conventions:

http://phoible.github.io/conventions/

This includes:

converting to strict/valid IPA (e.g. keyboard g, apostrophe, clicks)
clean up legacy characters introduced during conversion (zero-width character)
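
A rough sketch of the character-level part of this cleanup. The replacement table is an assumption and deliberately small: keyboard g to IPA ɡ (U+0261), and apostrophes to the modifier letter apostrophe ʼ (U+02BC) on the assumption that they mark ejectives in the sources. Clicks are not covered, and the final NFD normalization is assumed to match the PHOIBLE conventions page and should be checked against it.

```python
import unicodedata

# Invisible characters that tend to survive copy-paste and file conversion.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# Assumed one-to-one replacements; extend after checking the conventions page.
REPLACEMENTS = {
    "g": "\u0261",       # keyboard g -> IPA script g
    "'": "\u02bc",       # ASCII apostrophe -> modifier letter apostrophe
    "\u2019": "\u02bc",  # right single quote -> modifier letter apostrophe
}


def normalize_segment(segment):
    """Strip zero-width characters, apply replacements, and normalize to NFD."""
    kept = (ch for ch in segment if ch not in ZERO_WIDTH)
    replaced = "".join(REPLACEMENTS.get(ch, ch) for ch in kept)
    return unicodedata.normalize("NFD", replaced)


for raw in ["g", "k'", "t\u200b"]:
    print(repr(raw), "->", repr(normalize_segment(raw)))
```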

Fields for BDPROTO

ID - We should merge them all, but later. UZ has 1-15, we started new ones from 16. Since we should go through the original data too, we can merge them once they've been vetted.
LanguageFamilyRoot/Family/Classification: we should drop all but Family unless there is a reason to keep them all. This should be taken automatically from Glottolog if possible.
TimeDepth etc - I think that we only need one field for this, and another for the reference source. As for how to proceed, I suggest we ask experts for a reliable ref or pc. I can handle this, as I bother people regularly by mail.
Homeland - same.
BibTexKey, FileName, Squib - why not a single entry?
LanguageCode - why do we need this if we have the Glottocode? Can we drop it?
Syllable structure - I don't think we'll have this for many languages, so I suggest we drop it, or otherwise have a fixed choice of data to be entered, otherwise the data is likely to be messy.
Region - how is this different from homeland? Can we drop it? If it is different, can we move it next to homeland?
Allophone - in many cases, some variation is given (g~gh). Could this field be used for that, and if so, how does one select the variant? Here the issue isn't really allophony, though; it's generally uncertainty as to which variant is to be reconstructed.
Another - we probably need a tag for marginal phonemes (i.e., phonemes in parentheses in the doculect).
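
For the last two points, a sketch of how a raw segment entry could be parsed into variants plus a marginal tag, assuming the sources write variation as `g~gh` and marginal phonemes as `(h)`. The returned keys are hypothetical, and the function keeps all variants rather than picking one, since the selection question is still open.

```python
def parse_segment(raw):
    """Split a raw entry into its variants and flag marginal/uncertain status."""
    text = raw.strip()
    marginal = text.startswith("(") and text.endswith(")")
    if marginal:
        text = text[1:-1].strip()
    variants = [v.strip() for v in text.split("~") if v.strip()]
    return {
        "variants": variants,             # every alternative given in the source
        "marginal": marginal,             # parenthesized in the doculect
        "uncertain": len(variants) > 1,   # more than one candidate reconstruction
    }


for raw in ["g~gh", "(h)", "p"]:
    print(raw, parse_segment(raw))
```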

update aggregation script
regenerate data
publish on Zenodo
