Getting started, about loanpydatahub/streitberggothic

Comments (26)

martino-vic commented on September 26, 2024 1

I'm still trying to follow the blogpost, I think it would be a cool skill to be able to create these on my own, but I have just started so I have to see how it is going and will write an update here as soon as I can (probably today still), thanks for the great support!

Edit: Connected issues:

from streitberggothic.

LinguList commented on September 26, 2024 1

Your glottolog repository probably needs to be freshly installed / downloaded, since you have a glottolog version with local changes. No idea why, but you had best git clone glottolog again and point to this fresh clone. Or, to just avoid this step, run with the --dev flag, which skips glottolog entirely, so this step won't throw an error.
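To check up front whether a Glottolog clone has local modifications that would block a checkout, a minimal sketch along these lines might help. It assumes git is on the PATH; the names parse_porcelain and local_changes are made up for this example and are not part of cldfbench or cldfcatalog.

```python
import subprocess


def parse_porcelain(output):
    """Extract file paths from `git status --porcelain` output.

    Each line is a two-character status code, a space, then the path.
    """
    return [line[3:] for line in output.splitlines() if line.strip()]


def local_changes(repo_dir):
    """Return paths with uncommitted changes in the clone at repo_dir."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return parse_porcelain(result.stdout)
```

If local_changes(...) returns a non-empty list for the Glottolog clone, checking out a release tag is likely to abort with the "local changes would be overwritten" error reported in this thread.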

LinguList commented on September 26, 2024 1

Closing this, because the code is running without error now.

martino-vic commented on September 26, 2024

Still getting the same error. If I run
cd C:\Users\Viktor\OneDrive\PhD\concepticon
concepticon --repos=C:\Users\Viktor\OneDrive\PhD\concepticon\concepticon-data map_concepts C:\Users\Viktor\OneDrive\Git\streitberggothic\raw\Van_Loon-2004-3659.tsv
I get the following error written into test.tsv:

On Windows you must specify an output file since printing to the terminal may not work
--
usage: concepticon map_concepts [-h] [--reference-list REFLIST]
[--full-search] [--language LANGUAGE]
[--skip_multimatch] [--output OUTPUT]
CONCEPTLIST

Attempt an automatic mapping for a new concept list.
--
 
Notes
-----
In order for the automatic mapping to work, the new list has to be
well-formed, i.e. in line with the requirments of Concepticon
(GLOSS/ENGLISH column, see also CONTRIBUTING.md).
 
positional arguments:
CONCEPTLIST           Path to (or ID of) concept list in TSV format
 
options:
-h, --help            show this help message and exit
--reference-list REFLIST
Another concept list to be used as reference for the
gloss mapping (default: None)
--full-search         select between approximate search (default) and full
search (default: False)
--language LANGUAGE   specify your desired language for mapping (default:
en)
--skip_multimatch
--output OUTPUT       specify output file (default: None)

LinguList commented on September 26, 2024

So would you not just have to specify an output file via --output ?

martino-vic commented on September 26, 2024

Yes, now it worked, thanks :) I had tried the --output flag earlier but had put it in the wrong place: concepticon --repos --output map_concepts instead of concepticon --repos map_concepts --output
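Why the position of the flag matters can be illustrated with a toy parser. This is only a sketch that assumes the CLI behaves like a standard argparse parser with subcommands; the program name "toy" and the helper names are hypothetical, and this is not concepticon's actual parser code.

```python
import argparse


def build_parser():
    """Toy parser: one global option and one subcommand with its own option."""
    parser = argparse.ArgumentParser(prog="toy")
    parser.add_argument("--repos")  # global option, goes BEFORE the subcommand
    sub = parser.add_subparsers(dest="command", required=True)
    mapper = sub.add_parser("map_concepts")
    mapper.add_argument("conceptlist")
    mapper.add_argument("--output")  # subcommand option, goes AFTER its name
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        return None


# --output after the subcommand name is accepted.
good = try_parse(["--repos", "data", "map_concepts", "list.tsv", "--output", "out.tsv"])
# --output before the subcommand name is rejected by the top-level parser.
bad = try_parse(["--repos", "data", "--output", "out.tsv", "map_concepts", "list.tsv"])
```

In this toy setup the top-level parser does not know --output, so it only works once the subcommand name has been given.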

I have now uploaded concepts.tsv, but I don't know how to resolve the conflicts. For the multiple mappings it is impossible for me to decide which one is correct, so do I pick a random one in those cases? And the ones with question marks, do I really have to delete them all? I can see that some of the meanings seem nonsensical (like Aai -> ???), but some are perfectly fine and somehow weren't recognised (like Nachkomme -> ???). Do I have to add those manually? See concepticon/concepticon-data#1157

martino-vic commented on September 26, 2024

still experimenting, current update:

C:\Users\Viktor\OneDrive\Git\streitberggothic>cldfbench lexibank.makecldf lexibank_streitgerggothic.py --concepticon=C:\Users\Viktor\OneDrive\PhD\lexibank\concepticon\concepticon\concepticon-data --glottolog=C:\Users\Viktor\OneDrive\PhD\lexibank\glottolog --clts=C:\Users\Viktor\OneDrive\PhD\lexibank\clts --concepticon-version=v2.5.0 --glottolog-version=v4.5 --clts-version=v2.2.0

Traceback (most recent call last):
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\Scripts\cldfbench.exe\__main__.py", line 7, in <module>
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\lib\site-packages\cldfbench\__main__.py", line 68, in main
    stack.enter_context(
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\lib\contextlib.py", line 492, in enter_context
    result = _cm_type.__enter__(cm)
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\lib\site-packages\cldfcatalog\catalog.py", line 69, in __enter__
    self.checkout(self.tag)
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\lib\site-packages\cldfcatalog\repository.py", line 105, in checkout
    return self.repo.git.checkout(spec)
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\lib\site-packages\git\cmd.py", line 638, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\lib\site-packages\git\cmd.py", line 1183, in _call_process
    return self.execute(call, **exec_kwargs)
  File "C:\Users\Viktor\AppData\Local\Programs\Python\Python310\lib\site-packages\git\cmd.py", line 983, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(1)
  cmdline: git checkout v4.5
  stderr: 'error: Your local changes to the following files would be overwritten by checkout:
        languoids/tree/atla1278/volt1241/kwav1236/nato1234/lele1262/likp1239/sekp1241/md.ini
        languoids/tree/atla1278/volt1241/nort3149/buak1234/adam1257/goul1243/goul1244/zank1234/kula1285/fani1244/md.ini
        languoids/tree/aust1307/mala1545/cele1242/grea1299/east2488/sout2928/bung1268/east2489/east2490/baho1237/md.ini
        languoids/tree/aust1307/mala1545/cele1242/grea1299/east2488/sout2928/muna1246/nucl1573/muni1256/buso1238/md.ini
        languoids/tree/aust1307/mala1545/cele1242/grea1299/tomi1242/sout2925/damp1237/md.ini
        languoids/tree/aust1307/mala1545/cele1242/kail1255/nort2898/kail1254/comm1248/bara1371/md.ini
        languoids/tree/aust1307/mala1545/cent2237/cent2245/cent2254/east2466/east2741/seti1249/beng1287/md.ini
        languoids/tree/aust1307/mala1545/cent2237/east2712/ocea1241/admi1239/east2459/manu1262/east2460/koro1314/bowa1234/papi1254/md.ini
        languoids/tree/aust1307/mala1545/cent2237/east2712/ocea1241/admi1239/east2459/sout2879/loup1244/balu1257/md.ini
        languoids/tree/aust1307/mala1545/cent2237/east2712/ocea1241/sout3173/newc1243/main1286/sout3313/extr1245/nume1242/md.ini
        languoids/tree/aust1307/mala1545/cent2237/east2712/ocea1241/west2818/nort3206/sarm1241/sarm1242/anus1238/anus1237/md.ini
        languoids/tree/aust1307/mala1545/cent2237/east2712/sout2850/sout3229/bomb1263/bedo1237/md.ini
        languoids/tree/aust1307/mala1545/cent2237/east2712/sout2850/sout3229/morm1235/md.ini
        languoids/tree/aust1307/mala1545/cent2237/east2712/sout2850/sout3229/raja1255/ambe1249/biga1238/md.ini
        languoids/tree/aust1307/mala1545/cent2237/east2712/sout2850/sout3229/raja1255/asss1237/md.ini
        languoids/tree/aust1307/mala1545/cham1312/md.ini
        languoids/tree/aust1307/mala1545/nort3238/meso1254/sout2905/md.ini
        languoids/tree/aust1307/mala1545/nort3253/sara1342/puna1279/puna1280/aput1239/sian1255/md.ini
        languoids/tree/aust1307/mala1545/sout2923/nort2894/pitu1237/dakk1238/md.ini
        languoids/tree/aust1307/mala1545/sout2923/ramp1244/bada1260/beso1237/md.ini
        languoids/tree/aust1307/mala1545/sout2923/ramp1244/seko1241/pana1302/budo1241/md.ini
        languoids/tree/coch1271/yuma1250/gene1244/paii1252/paip1241/md.ini
        languoids/tree/geel1240/bura1294/bura1276/md.ini
        languoids/tree/geel1240/demi1242/md.ini
        languoids/tree/inan1242/duri1243/md.ini
        languoids/tree/indo1319/clas1257/indo1320/indo1321/indo1324/chit1278/kala1372/md.ini
        languoids/tree/iroq1247/nort2947/moha1257/moha1258/md.ini
        languoids/tree/japo1237/ryuk1243/ryuk1244/miya1259/md.ini
        languoids/tree/kawe1237/nort1506/qawa1238/md.ini
        languoids/tree/lake1255/east2500/tawo1244/md.ini
        languoids/tree/lake1255/tari1255/east2502/dout1239/dout1240/md.ini
        languoids/tree/nakh1245/dagh1238/avar1255/andi1254/botl1243/botl1242/md.ini
        languoids/tree/pama1250/dese1234/ngum1251/ngum1256/jaru1256/jaru1254/md.ini
        languoids/tree/sali1297/sali1298/md.ini
        languoids/tree/sape1238/md.ini
        languoids/tree/sino1245/kuki1245/kuki1246/oldk1252/chir1298/chir1283/md.ini
        languoids/tree/sino1245/mish1241/idum1241/md.ini
        languoids/tree/taik1256/kamt1241/daic1238/daic1237/cent2251/wenm1239/sapa1255/sout3184/sout2743/blac1269/thai1259/md.ini
        languoids/tree/taik1256/kamt1241/daic1238/daic1237/cent2251/wenm1239/sapa1255/sout3184/sout2743/shan1276/assa1264/aito1238/md.ini
        languoids/tree/taik1256/kamt1241/daic1238/daic1237/cent2251/wenm1239/sapa1255/sout3184/sout2743/shan1276/assa1264/kham1290/md.ini
        languoids/tree/toro1256/tora1268/coas1312/bone1255/md.ini
        languoids/tree/tupi1275/mawe1252/awet1245/tupi1276/tupi1281/waya1271/zoee1241/zoee1240/md.ini
        languoids/tree/tuuu1241/kwii1241/nuuu1241/md.ini
        languoids/tree/ural1272/saam1281/west2390/cent2240/nort2671/md.ini
        languoids/tree/west2604/kara1499/md.ini
Please commit your changes or stash them before you switch branches.
Aborting'

and if I rerun it I get:

warning: templates not found in C:/Users/Viktor/.git-template
error: remote origin already exists.
Everything up-to-date

martino-vic commented on September 26, 2024

Running cldfbench lexibank.makecldf lexibank_streitberggothic.py --concepticon=C:\Users\Viktor\OneDrive\PhD\lexibank\concepticon\concepticon\concepticon-data --glottolog=C:\Users\Viktor\OneDrive\PhD\lexibank\glottolog --clts=C:\Users\Viktor\OneDrive\PhD\lexibank\clts --concepticon-version=v2.5.0 --glottolog-version=v4.5 --clts-version=v2.2.0 I'm getting the output:

ERROR:
Invalid dataset spec: <lexibank.dataset> lexibank_streitberggothic.py

This is the Python script that seems to cause the error, but I'm not quite sure how.

Maybe I should also mention that I ipa-transcribed and cleaned the data during preprocessing with this script from this xml-file (cf. the html version)

Update: Currently reading through the documentation of cldfbench to understand the mechanics, got an error during the tutorial: cldf/cldfbench#73

martino-vic commented on September 26, 2024

Update: Managed to push the repo via GitHub Desktop. I assume the validation part with the yml-file is not part of my todo-list anymore, right? Otherwise next up for me would be:

taking out df.explode from xml2csv.py since pylexibank covers that part
adding parameters.csv
adding replacements through form_spec
transcribing to IPA
creating an orthography profile, improving transcriptions, checking the data
maybe try pysem one day to add links to concepticon
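What "splitting cells and applying replacements" amounts to can be sketched in plain Python. This is an illustration only: the function split_forms and the sample cells are made up, and pylexibank's FormSpec does considerably more than this (and may apply its steps in a different order).

```python
import re


def split_forms(value, separators=",", first_form_only=False, replacements=()):
    """Split a raw cell into cleaned forms.

    Plain-Python illustration only; not pylexibank's implementation.
    """
    # Apply simple string replacements before splitting.
    for old, new in replacements:
        value = value.replace(old, new)
    # Build a character class from the separator characters and split on it.
    parts = re.split("[" + re.escape(separators) + "]", value)
    forms = [part.strip() for part in parts if part.strip()]
    return forms[:1] if first_form_only else forms
```

For example, split_forms("atta, fadar", first_form_only=True) keeps only the first of the two comma-separated forms, which is roughly the job df.explode was doing by hand in the preprocessing script.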

LinguList commented on September 26, 2024

Nice. For the orthoprofile, you may not need to create IPA first, but can use the orthoprofile to convert to IPA.
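The idea of letting the profile do the conversion can be sketched as a greedy longest-match lookup. The mini-profile below is hypothetical and purely illustrative; it is not the Gothic profile mentioned in this thread, and real orthography profile tooling handles more cases than this.

```python
# Hypothetical mini-profile: grapheme -> IPA. Illustrative values only,
# not the Gothic profile discussed in this thread.
PROFILE = {"ei": "iː", "þ": "θ", "a": "a", "i": "i", "s": "s", "t": "t"}


def to_ipa(word, profile=PROFILE):
    """Segment word greedily (longest grapheme first) and map to IPA.

    Characters missing from the profile are kept and flagged with '?'.
    """
    longest = max(len(grapheme) for grapheme in profile)
    segments, pos = [], 0
    while pos < len(word):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(longest, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in profile:
                segments.append(profile[chunk])
                pos += size
                break
        else:
            # No grapheme matched: flag the character for manual checking.
            segments.append("?" + word[pos])
            pos += 1
    return segments
```

Flagged segments show which graphemes still need a line in the profile, so the profile and the transcription improve together instead of transcribing to IPA in a separate step first.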

LinguList commented on September 26, 2024

If you check this website, which provides a JS implementation of orthoprofiles, https://digling.org/calc/profile/, you will find that we already have one Gothic profile there. You could compare it with your data. The profiles are here: https://github.com/orthograpy/orthograpy

martino-vic commented on September 26, 2024

I got stuck again while creating parameters.csv. My first assumption was to simply modify the current Script by adding args.writer.objects['ParameterTable'].append({...}) to the loop but that gave me the error pycldf.dataset.SchemaError: 'Dataset has no table "ParameterTable"'. So I decided to abandon the cldfbench tutorial and try it with the pylexibank one from the blogpost again. But add_languages() doesn't work, it just gets stuck and throws no error. This is the corresponding Script:

import pathlib

from pylexibank import Dataset as BaseDataset
from pylexibank import Language


class CustomLanguage(Language):
    pass


class Dataset(BaseDataset):
    dir = pathlib.Path(__file__).parent
    id = "streitberggothic"
    
    language_class = CustomLanguage

    def cmd_makecldf(self, args):

        # add bib
        args.writer.add_sources()
        args.log.info("added sources")
        
        # add language
        print(self.languages)
        args.writer.add_languages()
        print("hi")

It does write sources.bib correctly to the folder cldf, prints INFO running _cmd_makecldf on streitberggothic ... INFO added sources to the console. Then it prints [OrderedDict([('ID', 'goth1244'), ('Name', 'Gothic'), ('Macroarea', 'Eurasia'), ('Latitude', '46.9304'), ('Longitude', '29.9786'), ('Glottocode', 'goth1244'), ('ISO639P3code', 'got'), ('Countries', 'UA'), ('Family_ID', 'indo1319'), ('Language_ID', '0')])] and then it gets stuck, i.e. "hi" never gets printed.

I tried looking into the source code to understand what exactly is happening, but it's difficult for me to understand and I guess that's also not the intended way to fix this, so I thought I'd write yet another comment here. From what I see, the function add_languages() is a loop where add_language() is applied to multiple languages. Since I have only one language in my data, I thought add_language() might work instead, and then got the following error: ValueError: invalid CLDF identifier LanguageTable-ID:. I found the line in the source code that triggers this but can't figure out what is happening. I also tried to keep the wrapper from the tutorial around class CustomLanguage, and also tried to delete class CustomLanguage altogether, but it all leads to the same results. Sorry for writing so many comments and thanks for bearing with me.

LinguList commented on September 26, 2024

from clldutils.misc import slug

concepts = {}
for i, concept in enumerate(self.raw_dir.read_csv("concepts.tsv", delimiter="\t", dicts=True)):
    idx = str(i+1)+"_"+slug(concept["gloss"])
    args.writer.add_concept(ID=idx, Name=concept["gloss"])
    concepts[concept["gloss"]] = idx

The dictionary concepts maps each gloss to its concept ID. You then pass that ID as Parameter_ID to args.writer.add_form or args.writer.add_form_with_segments.
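The lookup pattern can be tried out without pylexibank. In this sketch simple_slug is only a rough stand-in for clldutils.misc.slug, and build_concept_ids, the sample glosses, and the form row are hypothetical.

```python
import re


def simple_slug(text):
    """Rough stand-in for clldutils.misc.slug: lowercase ASCII letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_concept_ids(glosses):
    """Map each gloss to an ID of the form '<running number>_<slug>'."""
    return {
        gloss: f"{number}_{simple_slug(gloss)}"
        for number, gloss in enumerate(glosses, start=1)
    }


concepts = build_concept_ids(["Nachkomme", "big house"])
# A form row then points at its concept through Parameter_ID.
form_row = {"Language_ID": "Gothic", "Parameter_ID": concepts["big house"]}
```

Keeping the gloss-to-ID dictionary around means the forms never need to rebuild the ID string themselves, so the numbering scheme can change in one place.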

LinguList commented on September 26, 2024

Adding the concepts (= parameters) should be done with the lexibank commands, i.e. add_concept (as is also done in the example file for Vietic languages).

martino-vic commented on September 26, 2024

I pushed the current state of affairs to this repository. The good news is that "added concept" does get printed to my console now, which is an improvement, but then again only empty files with headers get written to the folder cldf if I run the following:

import pathlib

from clldutils.misc import slug
from pylexibank import Dataset as BaseDataset
from pylexibank import Language
from pylexibank import FormSpec
import attr

class CustomLanguage(Language):
    pass

class Dataset(BaseDataset):
    dir = pathlib.Path(__file__).parent
    id = "streitberggothic"
    
    language_class = CustomLanguage
    
    form_spec = FormSpec(separators=",", first_form_only=True)

    def cmd_makecldf(self, args):
        # add bib
        args.writer.add_sources()
        args.log.info("added sources")

        # add concept
        concepts = {}
        for i, concept in enumerate(self.concepts):
            idx = f"{i}_{slug(concept['sense'])}"
            concepts[concept["sense"]] = idx
            args.writer.add_concept(
                    ID=idx,
                    Name=concept["sense"]
                    )
        args.log.info("added concepts")

I wish I could run the entire script, as indicated in the tutorial, but adding just one line, namely args.writer.add_languages(), already causes it to get stuck without throwing an error, which makes debugging hard for me. I tried to describe the details as well as I could in the previous comment. I followed the instructions of the blogpost as closely as possible, only making changes where my own data differs from the tutorial's, for example omitting the column "source" in languages.tsv. I still have the feeling it must be something trivial that I'm overlooking, but I've been playing around with this all day today and just can't find out what it is.

LinguList commented on September 26, 2024

You are missing decorators in your language class:

@attr.s
class CustomLanguage(Language):
    pass

But maybe uncomment it and also uncomment the assignment in the Dataset?

martino-vic commented on September 26, 2024

> I also tried to keep the wrapper from the tutorial around class CustomLanguage

Sorry, I meant to say decorator. Adding the decorator doesn't fix the problem. And I left it out on purpose because my data has no extra column "Sources" to add like the Vietic one.

> But maybe uncomment it and also uncomment the assignment in the Dataset?

Ah I think you're looking at the cldfbench_streitberggothic.py script, I'm sorry, that's my bad, I had left it in the folder and it's from the cldfbench-tutorial where it's still commented out, but currently I am running the other script, lexibank_streitberggothic.py

Looks like the problem has something to do with line 279 in cldf.py:

        if (not getattr(self.args, 'dev', False)) and 'Glottocode' in kw \
                and hasattr(self.args, 'glottolog') \
                and kw['Glottocode'] in self.args.glottolog.api.cached_languoids:

This if clause never gets entered, and execution doesn't reach the return statement that comes after it. I'm just in the process of figuring out why.

Update: It takes 6 minutes to reach the return statement after this if-clause and the if-clause itself is still false. The files written to the cldf-folder are still empty.

LinguList commented on September 26, 2024

I think you should follow our script on Vietic, as the cldfbench tutorial is more generic, and lexibank handles many complex aspects directly, so you don't have to bother. In any case: not using the decorator is wrong. But you can avoid this by uncommenting the "language_class = CustomLanguage".

martino-vic commented on September 26, 2024

Yes yes, I'm doing all of those: "language_class = CustomLanguage" is uncommented, the decorator is added, I'm following the script for Vietic as closely as possible. But the if-clause in line 279 in cldf.py takes 6-7 minutes to run on my computer - even if I truncate my data - which I guess shouldn't be the case. Anyways, I'm still in the process of playing around and figuring out why exactly this happens

LinguList commented on September 26, 2024

It is because you load all of glottolog, and this takes time. Run with the `--dev` flag, and this line won't run. For testing, this is what you need. Later, you may have to run it once with glottolog. How long that takes depends on your hardware: it reads many files, so SSDs with fast access help. You do not have to run it every time, though, and should switch it off during development ;)
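Why the flag skips the slow part follows from how `and` short-circuits in the condition quoted earlier in this thread. The sketch below reproduces that condition against a hypothetical stand-in; FakeGlottologAPI, glottocode_known and run are made up for illustration and are not pylexibank or pyglottolog code.

```python
from types import SimpleNamespace


class FakeGlottologAPI:
    """Hypothetical stand-in whose lookup table is expensive to build."""

    def __init__(self):
        self.loaded = False

    @property
    def cached_languoids(self):
        self.loaded = True  # the real catalog would read many files here
        return {"goth1244": "Gothic"}


def glottocode_known(args, kw):
    """Same shape as the condition quoted in this thread."""
    return bool(
        (not getattr(args, "dev", False))
        and "Glottocode" in kw
        and hasattr(args, "glottolog")
        and kw["Glottocode"] in args.glottolog.api.cached_languoids
    )


def run(dev):
    """Evaluate the condition and report whether the catalog was touched."""
    api = FakeGlottologAPI()
    args = SimpleNamespace(dev=dev, glottolog=SimpleNamespace(api=api))
    found = glottocode_known(args, {"Glottocode": "goth1244"})
    return found, api.loaded
```

With dev=True the first operand is already false, so cached_languoids is never accessed; with dev=False the last operand is evaluated and the whole catalog gets loaded.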

martino-vic commented on September 26, 2024

Ahh okay this explains a lot, thanks :)
Update: the conversion seems to be correct now and the new repo has been pushed 🎉

martino-vic commented on September 26, 2024

Okay, now I finished the Vietic tutorial from A-Z and I think the repository is ready. I assume there's still room for improvement, as always, e.g. the CLTS-validation-badge looks a bit sad, so ideas and suggestions for improvement are very welcome; but in general I think that's it, right? Because if yes then I could slowly start writing the blog entry and I'll also keep an eye on pysem, so that I can add the more eclectic links to concepticon once that part is ready :) Thank you again for the great tutorial and all the help and patience with my questions.

LinguList commented on September 26, 2024

Did you push your changes? Then I'd review them in the next few days.

martino-vic commented on September 26, 2024

Yes, I pushed them. Thank you.

martino-vic commented on September 26, 2024

Ah, sorry realised only now that the pysem-tutorial came out already. So I'll tackle that next!

martino-vic commented on September 26, 2024

Alright, I added the links to concepticon. 31% is the most I could squeeze out without manual curation, which is quite good I think. I also improved the orthography with the procedure suggested in #3 and ran a pep8 style check on all scripts. I think the repo should be ready.
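A coverage figure like this can be recomputed from the concept list. The sketch below is a guess at the arithmetic, not the script actually used: the sample rows, the placeholder IDs and the column name CONCEPTICON_ID are assumptions.

```python
import csv
import io

# Made-up sample in the shape of a concepts.tsv; the IDs are placeholders
# and the column names are assumptions.
SAMPLE = (
    "ID\tGLOSS\tCONCEPTICON_ID\n"
    "1\tfather\t101\n"
    "2\tNachkomme\t\n"
    "3\tAai\t\n"
    "4\twater\t102\n"
)


def mapped_share(tsv_text, column="CONCEPTICON_ID"):
    """Return the fraction of rows that have a non-empty value in column."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    mapped = sum(1 for row in rows if (row.get(column) or "").strip())
    return mapped / len(rows) if rows else 0.0
```

On the four sample rows, two have an ID, so mapped_share(SAMPLE) gives 0.5; on the real list the same calculation would give the share of concepts linked to Concepticon.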
