loanpydatahub / ronataswestoldturkic

CLDF dataset derived from 'West Old Turkic' by András Róna-Tas and Árpád Berta from 2011

Home Page: https://www.harrassowitz-verlag.de/title_4002.ahtml

License: Creative Commons Attribution 4.0 International

TeX 0.01% Python 99.98% Shell 0.01%

dataset etymology loanwords hungarian turkic

ronataswestoldturkic's Issues

Remove quotation marks from etc/comments.tsv

"We can make this file beautiful and searchable if this error is corrected: Illegal quoting in line 11."
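A minimal sketch of one way to locate and strip the quotation marks, assuming etc/comments.tsv is plain tab-separated text in which no cell needs quoting. The function names and the sample text in the usage note are hypothetical, not taken from the repository.

```python
def find_quoted_lines(tsv_text):
    """Return the 1-based numbers of all lines that contain a double quote."""
    return [
        number
        for number, line in enumerate(tsv_text.splitlines(), start=1)
        if '"' in line
    ]


def strip_quotes(tsv_text):
    """Return the text with every double quote removed."""
    return tsv_text.replace('"', "")
```

Calling find_quoted_lines on the file contents first shows which lines would change, so the edit can be reviewed before strip_quotes is applied and the result written back.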

Add Glottocode of Old Hungarian to languages.csv after the next Glottolog release

See glottolog/glottolog#899

Create a full version

In the current version I have extracted only the data relevant to my own research, especially the columns WOT, H and years. There is, however, much more interesting information in the PDF that could be included, such as:

the Cumanian words from raw, to be included in the CLDF as well
the 39 Alanic borrowings on pp. 1331-1339
Mongolic and sometimes Chinese etymologies for West Old Turkic words
the exact year and form of Hungarian words that appear in the written sources (phonetic transcriptions of the various Old Hungarian orthographies are already provided in the PDF!)
related words in Turkic and other language families
the list of sources in which the given etymology has been discussed

This could be a future project idea. It may be worth getting in touch with the authors of the original work to check whether a database still exists somewhere and is available. Currently it is too much work for one person to parse and manually clean the PDF.

connect to own concept-list

https://concepticon.clld.org/contributions/RonaTas-2011-431

Follow-up Check

@LinguList I added you as a collaborator for a general follow-up check to see what's left to do to make this repository ready to be added to the Lexibank community :)

FB_vowel_harmony

stick to this:

(https://en.wikipedia.org/wiki/Hungarian_phonology)

EAH vowel inventory: {'y', 'yː', 'ø', 'øː', 'i', 'iː', 'e', 'eː', 'ɛ', 'a', 'aː', 'ɒ', 'ɯ', 'u', 'uː', 'o' }

replace tʹ by c in col IPA

replace γ by ɣ in col IPA
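The two replacements above could be applied in one pass, as in the following sketch. It assumes the IPA column is available as plain strings; the table name REPLACEMENTS and the function name are hypothetical.

```python
# Hypothetical replacement table covering the two issues above.
REPLACEMENTS = {
    "tʹ": "c",  # t followed by a prime mark
    "γ": "ɣ",   # Greek gamma replaced by the IPA (Latin) gamma
}


def normalise_ipa(value):
    """Apply every replacement in REPLACEMENTS to one IPA string."""
    for old, new in REPLACEMENTS.items():
        value = value.replace(old, new)
    return value
```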

Add col "year"

delete col FB_segments

correct geminates kk, ll, mm, tt, t͡ʃt͡ʃ in phon. transcr.

They are messing with the alignment and should not be there, e.g. "kk" should be "k:". They occur in very few words, so they can be ignored for now.
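A possible way to rewrite the geminates, sketched under two assumptions: the transcription is an unsegmented string, and the IPA length mark ː (U+02D0) is wanted instead of an ASCII colon. All names are hypothetical.

```python
import re

TIE = "\u0361"     # combining tie bar, as in the affricate t͡ʃ
LENGTH = "\u02d0"  # IPA length mark ː

# Sounds listed in the issue; the affricate comes first so it wins over plain t.
SOUNDS = ["t" + TIE + "ʃ", "k", "l", "m", "t"]

# A sound followed by itself. The lookahead refuses a match when the second
# copy is really the start of an affricate (e.g. t + t͡ʃ).
PATTERN = re.compile("(" + "|".join(SOUNDS) + r")\1(?!" + TIE + ")")


def mark_geminates(ipa):
    """Replace each doubled sound with a single sound plus the length mark."""
    return PATTERN.sub(lambda match: match.group(1) + LENGTH, ipa)
```

The negative lookahead is the one design choice worth noting: without it, a plain t followed by t͡ʃ would be merged and the tie bar would end up attached to the length mark.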

Remove col Orthography and include in forms/values

Need to identify which of the two columns it should be integrated into, and make sure everything else stays in place.

correct col "loans"

It should only be "true" for recipient words (i.e. all languages except WOT, since that is the donor language).
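The rule above can be written as a one-line condition. This sketch assumes each form is a dictionary with a "Language_ID" key and that the column is called "Loan"; both names are assumptions about the table layout, and the function name is hypothetical.

```python
DONOR_LANGUAGE = "WOT"  # the donor language named in the issue


def mark_loans(rows, donor=DONOR_LANGUAGE):
    """Set row["Loan"] to True for every language except the donor."""
    for row in rows:
        row["Loan"] = row["Language_ID"] != donor
    return rows
```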

trying to add borrowings.csv

I'm having trouble creating the borrowings table. I have to connect EAH (the recipient words) and WOT (the donor words), but EAH is sometimes empty within a cognate set, so I'd need to skip those. I have been experimenting for a while now, trying to use a stack, but I assume there must be a much easier solution that I'm not aware of.
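One alternative to a stack is to group the forms by cognate set first and pair them afterwards. The sketch below assumes each form is a dictionary with the keys "ID", "Language_ID", "Cognateset_ID" and "Form", and that each language has at most one form per cognate set; these are assumptions about the data, and the function name is hypothetical. The output keys follow the CLDF BorrowingTable column names Target_Form_ID and Source_Form_ID.

```python
from collections import defaultdict


def pair_borrowings(forms, recipient="EAH", donor="WOT"):
    """Return one recipient/donor pair per cognate set that has both forms."""
    by_set = defaultdict(dict)
    for form in forms:
        if form["Form"]:  # empty forms never enter the lookup
            by_set[form["Cognateset_ID"]][form["Language_ID"]] = form["ID"]
    return [
        {"Target_Form_ID": langs[recipient], "Source_Form_ID": langs[donor]}
        for _, langs in sorted(by_set.items())
        if recipient in langs and donor in langs
    ]
```

Because empty forms are dropped while the lookup is built, cognate sets without an EAH form simply fail the final membership check and no explicit skipping logic is needed.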

remove shell file wot.sh

outsource functions in lexibankscript to loanpy

Add clusterwise segmentation

Currently segmentation happens through orthography.py. All it would take to apply the clusterwise segmentation is to add "from ipatok import clusterise" and to replace tokenise with clusterise in line 12. But I don't know how to make that column eventually appear in forms.csv.
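To illustrate what clusterwise segmentation produces, here is a pure-Python stand-in that merges already tokenised segments into vowel and consonant clusters. It is not ipatok's implementation; the function name is hypothetical and the vowel set is simply the EAH inventory quoted in the FB_vowel_harmony issue. How to write the result into forms.csv is not covered.

```python
from itertools import groupby

# Vowel inventory as listed in the FB_vowel_harmony issue.
VOWELS = {
    "y", "yː", "ø", "øː", "i", "iː", "e", "eː",
    "ɛ", "a", "aː", "ɒ", "ɯ", "u", "uː", "o",
}


def cluster_tokens(tokens, vowels=VOWELS):
    """Join consecutive vowels, and consecutive consonants, into clusters."""
    return [
        "".join(group)
        for _, group in groupby(tokens, key=lambda token: token in vowels)
    ]
```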

Add short YouTube explanation video

how to find BIPA errors

There seems to be a small BIPA error hidden somewhere, but I can't find out which one. If I make a set of all the IPA characters used, this is the result: ['', 'a', 'aː', 'b', 'c', 'd', 'd͡z', 'd͡ʒ', 'e', 'eː', 'f', 'h', 'i', 'iː', 'j', 'k', 'l', 'm', 'n', 'o', 'oː', 'p', 'r', 's', 't', 't͡s', 't͡ʃ', 'u', 'uː', 'v', 'w', 'y', 'yː', 'z', 'ø', 'øː', 'ŋ', 'ɐ', 'ɒ', 'ɛ', 'ɟ', 'ɡ', 'ɣ', 'ɥ', 'ɯ', 'ɲ', 'ʃ', 'ʎ', 'δ', 'χ']. I checked for every element whether it is in the master list, and each one is. What other strategies are there to pin down bad characters?
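One generic strategy, independent of any BIPA tooling, is to inspect the Unicode code points behind each segment, since look-alike characters are easy to miss by eye. The sketch below uses only the standard library; the function names and the three checks are illustrative choices. A flag is only a hint: some Greek letters (β, θ, χ) are regular IPA symbols.

```python
import unicodedata


def describe(segment):
    """Return (code point, Unicode name) for every character in a segment."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in segment
    ]


def suspicious(segments):
    """Return (segment, reason) pairs for segments that deserve a closer look."""
    flagged = []
    for segment in segments:
        if not segment:
            flagged.append((segment, "empty segment"))
        elif segment != unicodedata.normalize("NFC", segment):
            flagged.append((segment, "not NFC-normalised"))
        elif any(unicodedata.name(ch, "").startswith("GREEK") for ch in segment):
            flagged.append((segment, "contains Greek letter"))
    return flagged
```

Running suspicious over the set of segments and then describe on each flagged item narrows the search to a handful of candidates that can be compared against the master list by code point rather than by appearance.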

change Python version to 3.7

The CLDF validation is failing, so change 3.6 to 3.7 in https://github.com/martino-vic/ronataswestoldturkic/blob/165acbba26265bcfc51d8f498bce6fe7f71a5cc5/.github/workflows/python-package.yml
The long-term support version has presumably moved from 3.6 to 3.7 by now.

insert comments

Will need a comments.csv in the etc folder, referring to specific entries, as well as general comments that go into the README.

Dump from my personal notes:

Replace etc/concepts.tsv with new mappings

new mappings found here: https://github.com/martino-vic/rtbwestoldturkic/blob/main/etc/concepts.tsv
then re-run the lexibank script

borrowings.csv is wrong

Half of the lines are wrong or superfluous, e.g. EAH-0_carpenter-1 EAH-0_carpenter-1
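If the superfluous lines are those in which a form is paired with itself, as the example above suggests, they can be filtered out with a simple comparison. This is a sketch under that assumption; the function name is hypothetical and the keys follow the CLDF BorrowingTable column names.

```python
def drop_self_references(rows):
    """Keep only rows whose target and source form IDs differ."""
    return [
        row for row in rows
        if row["Target_Form_ID"] != row["Source_Form_ID"]
    ]
```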

convert the .py files from /raw to cli-commands

Add Cuman to languages

Figure out the remaining 2 BIPA errors

col Frequency

I'm trying to add a column to forms.csv that counts how often each prosodic structure occurs in total in the column "ProsodicStructure" in forms.csv, and I can't figure out how to add this to the lexibank script. It's similar to this issue: I can't add the info from within the loop because I can count the number of occurrences only after the loop has ended. But I can't manage to start a second loop at the bottom where I insert this info. Is there some kind of workaround for this, @LinguList?
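A common workaround for this kind of problem is a two-pass approach: collect the rows first, count, then fill in the counts before anything is written. The sketch below shows only that generic pattern on a list of dictionaries; the function name is hypothetical, and how the rows are collected and written inside the lexibank script is left open.

```python
from collections import Counter


def add_frequency(rows, column="ProsodicStructure"):
    """Add a "Frequency" entry holding the total count of each row's structure."""
    # First pass: count every value of the column.
    counts = Counter(row[column] for row in rows)
    # Second pass: attach the finished count to each row.
    for row in rows:
        row["Frequency"] = counts[row[column]]
    return rows
```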
