
ronataswestoldturkic's People

Contributors

dependabot[bot], martino-vic

ronataswestoldturkic's Issues

Create a full version

The current version contains only the data relevant to my own research, mainly the cols WOT, H and years. The PDF, however, holds much more interesting information that could be included, such as:

  • the Cumanian words from raw, to be included in CLDF as well,
  • the 39 Alanic borrowings on pp. 1331-1339,
  • Mongolic and sometimes Chinese etymologies for West Old Turkic words,
  • the exact year and form of Hungarian words as they appear in the written sources (phonetic transcriptions of the various Old Hungarian orthographies are already provided in the PDF!),
  • related words in Turkic and other language families,
  • a list of sources that discuss the given etymology.

This could be a future project. It may be a good idea to get in touch with the authors of the original work and ask whether a database still exists somewhere and is available. Parsing and manually cleaning the PDF is currently too much work for one person.

Follow-up Check

@LinguList I added you as a collaborator for a general follow-up check to see what is left to do before this repository is ready to be added to the lexibank community :)

correct col "loans"

The col should be "true" only for recipient words, i.e. for all languages except WOT, since WOT is the donor language.

trying to add borrowings.csv

I'm having trouble creating the borrowings table. I have to connect EAH (the recipient words) and WOT (the donor words), but EAH is sometimes empty within a cognate set, so I'd need to skip those sets. I have been experimenting for a while, trying to use a stack, but I assume there must be a much easier solution that I'm not aware of.
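One possible approach that avoids a stack is to group the forms by cognate set first and then emit a row only for sets that contain both an EAH and a WOT form. The following is a minimal sketch under assumptions: the rows and the column names ID, Language_ID, Cognateset_ID and Form are made up for illustration, and the output keys Target_Form_ID and Source_Form_ID follow the CLDF BorrowingTable naming as far as I understand it.

```python
from collections import defaultdict

# Hypothetical rows standing in for forms.csv (column names are assumptions).
forms = [
    {"ID": "f1", "Language_ID": "WOT", "Cognateset_ID": "1", "Form": "aaa"},
    {"ID": "f2", "Language_ID": "EAH", "Cognateset_ID": "1", "Form": "bbb"},
    {"ID": "f3", "Language_ID": "WOT", "Cognateset_ID": "2", "Form": "ccc"},
    {"ID": "f4", "Language_ID": "EAH", "Cognateset_ID": "2", "Form": ""},  # empty EAH
    {"ID": "f5", "Language_ID": "WOT", "Cognateset_ID": "3", "Form": "ddd"},
    {"ID": "f6", "Language_ID": "EAH", "Cognateset_ID": "3", "Form": "eee"},
]


def build_borrowings(forms, donor="WOT", recipient="EAH"):
    """Pair donor and recipient forms per cognate set, skipping incomplete sets."""
    by_set = defaultdict(dict)
    for row in forms:
        if row["Form"]:  # ignore empty forms entirely
            by_set[row["Cognateset_ID"]][row["Language_ID"]] = row["ID"]

    borrowings = []
    for cogset, langs in by_set.items():
        # Skip cognate sets that lack either side of the pair.
        if donor not in langs or recipient not in langs:
            continue
        borrowings.append({
            "ID": f"{cogset}-{recipient}",
            "Target_Form_ID": langs[recipient],  # recipient word
            "Source_Form_ID": langs[donor],      # donor word
        })
    return borrowings


for row in build_borrowings(forms):
    print(row)
```

Because the grouping happens before any row is written, it does not matter in which order the EAH and WOT forms appear within a cognate set.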

Add clusterwise segmentation

Currently segmentation happens through orthography.py. All it would take to apply clusterwise segmentation is to add from ipatok import clusterise and to replace tokenise with clusterise in line 12. But I don't know how to make that column appear in forms.csv.

how to find BIPA errors

There seems to be a small BIPA error hidden somewhere, but I can't find out which one. If I make a set of all the IPA characters used, this comes out: ['', 'a', 'aː', 'b', 'c', 'd', 'd͡z', 'd͡ʒ', 'e', 'eː', 'f', 'h', 'i', 'iː', 'j', 'k', 'l', 'm', 'n', 'o', 'oː', 'p', 'r', 's', 't', 't͡s', 't͡ʃ', 'u', 'uː', 'v', 'w', 'y', 'yː', 'z', 'ø', 'øː', 'ŋ', 'ɐ', 'ɒ', 'ɛ', 'ɟ', 'ɡ', 'ɣ', 'ɥ', 'ɯ', 'ɲ', 'ʃ', 'ʎ', 'δ', 'χ']. I checked for every element whether it's in the master list, and it is. What other strategies are there to pin down bad characters?
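One generic strategy is to look at the code points rather than the glyphs, since lookalike characters and empty segments are easy to miss by eye. The sketch below uses only the standard library and does not consult BIPA itself. The function name inspect_segments and the sample list are made up for illustration. It flags empty segments, segments that change under NFD normalisation, and characters from the Greek or Cyrillic blocks. A flag only means "worth a second look": some Greek-block letters, χ for instance, are regular IPA symbols. Whether any flagged item is what actually triggers the error here is an assumption that would need checking.

```python
import unicodedata


def inspect_segments(segments):
    """Return {segment: [reasons]} for segments that deserve a second look."""
    flagged = {}
    for seg in segments:
        reasons = []
        if not seg.strip():
            reasons.append("empty or whitespace-only segment")
        if seg != unicodedata.normalize("NFD", seg):
            reasons.append("changes under NFD normalisation")
        for ch in seg:
            name = unicodedata.name(ch, "UNNAMED")
            # Lookalikes often come from neighbouring scripts.
            if name.startswith(("GREEK", "CYRILLIC")):
                reasons.append(f"U+{ord(ch):04X} {name}")
        if reasons:
            flagged[seg] = reasons
    return flagged


# A few segments written with escapes so the code points are unambiguous.
segments = ["", "a", "d\u0361z", "\u03b4", "\u03c7", "\u0261"]
for seg, reasons in inspect_segments(segments).items():
    print(repr(seg), reasons)
```

Running the same function over the full set of segments from forms.csv gives a short list to compare by hand against the intended symbols.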

insert comments

We will need a comments.csv in the folder etc. referring to specific entries, as well as general comments that go into the readme.

Dump from my personal notes:

  • kåråmbēl: the form with "n" must be a typo (kårånbēl),
  • changed csipa acc. to corrigenda, dzsartagan and dzarta to yartagan and yarta,
  • ige: doesn't explicitly say WOT, but better putting it under WOT than skipping it altogether.
  • add to readme: pages are not provided, entries must be looked up through the index that is provided at the end of the monograph WOT.
  • check csigolya, gyón, köpönyeg, oktat, örmény
  • comment on the "V" like in čükV, esV -- loanpy should be able to handle "V"? Btw: Maybe this is causing the BIPA errors?
  • quote: "Berta suggested a T starting point for the word. However, its first documented appearance is fr the mid-17ʰ c. – certainly a weak point in a T etymology." -- Could use this to confirm the idea that col year can serve as a filter
  • böszörmény -- insert comment later that "WOT" must be just missing as a typo from the pdf
  • bǖbāy: explain why omitted from data
  • explain in readme why dʹårmåt, tábor, teng were not taken into the data set

col Frequency

I'm trying to add a column to forms.csv that counts how often each prosodic structure occurs in total in the column "ProsodicStructure" of forms.csv, and I can't figure out how to add this to the lexibank script. It's similar to this issue: I can't add the info from within the loop because I can count the number of occurrences only after the loop has ended. But somehow I can't manage to start a second loop at the bottom where I insert this info. Is there some kind of workaround for this, @LinguList?
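A common workaround for this kind of problem is a two-pass approach: collect the rows in a list during the first loop, count once the loop has finished, and only then attach the counts and write the rows. The sketch below shows just that logic with plain dictionaries. The rows, the function name add_frequency and the column name Frequency are made up for illustration, and how the finished rows are handed to the lexibank writer is not shown here.

```python
from collections import Counter

# Hypothetical rows collected in the main loop instead of being written at once.
rows = [
    {"ID": "f1", "ProsodicStructure": "CVCV"},
    {"ID": "f2", "ProsodicStructure": "CVC"},
    {"ID": "f3", "ProsodicStructure": "CVCV"},
]


def add_frequency(rows, key="ProsodicStructure", target="Frequency"):
    """Second pass: count every value of `key`, then attach the count to each row."""
    counts = Counter(row[key] for row in rows)
    return [{**row, target: counts[row[key]]} for row in rows]


for row in add_frequency(rows):
    print(row)
```

The counting costs one extra pass over the data, which is negligible for a data set of this size.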
