I am currently running into trouble while trying to create a CV structure based on a p

Add structure from existing alignment (lingpy/lingrex)

Comments (46)

FredericBlum commented on September 13, 2024

This is visually misleading due to the way I print the output. There are no spaces, and the alignments look fine in Edictor.

from lingrex.

LinguList commented on September 13, 2024

I think you need to make the immediate strings that are all trimmed.

LinguList commented on September 13, 2024

But in order to make them, you can iterate over the alignment:

for cogid, msa in alms.msa["cogid"].items():
    print(msa["alignment"])
    print(msa)

The msa object that is generated for you here stores the alignments. I think it even automatically retains only those that are not inside brackets!

LinguList commented on September 13, 2024

But I'd ask you to test for this.

LinguList commented on September 13, 2024

Please check also this function, that could otherwise be applied to the alignment object: https://lingpy.org/reference/lingpy.align.html#lingpy.align.sca.Alignments.reduce_alignments

LinguList commented on September 13, 2024

So, please check alms.reduce_alignments and how it behaves, as I do not remember if it returns a list of reduced alignments or if it reduces alignments in the alms.msa["cogid"] attribute, etc.

FredericBlum commented on September 13, 2024

Not much happens. It applies directly to alms and adds a _alignment entry in the Dictionary, but nothing is reduced. Even after looking at the code, I do not understand where this would proceed.

LinguList commented on September 13, 2024

Please use a minimal example and share it here (you can zip it). We only need 3 cogids with 3 alignments that are "to be reduced".

LinguList commented on September 13, 2024

And please check the import statement in the script on align/sca.py in lingpy:

from lingpy.read.qlc import read_msa, normalize_alignment, reduce_alignment

So you have the function in read.qlc!

LinguList commented on September 13, 2024

def reduce_alignment(alignment):
    """
    Function reduces a given alignment.
    
    Notes
    -----
    Reduction here means that the output alignment consists only of those parts
    which have not been marked to be ignored by the user (parts in brackets).
    It requires that all data is properly coded. If reduction fails, this will
    throw a warning, and all brackets are simply removed in the output
    alignment.
    """

    # check for bracket indices in all columns
    cols = misc.transpose(alignment)

    ignore_indices = []
    ignore = False
    for i, col in enumerate(cols):
        reduced_col = sorted(set(col))

        if '(' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = True
            else:
                ignore = False
        elif ')' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = False
            else:
                ignore_indices = []
        elif ignore:
            ignore_indices += [i]

    if ignore_indices:
        new_cols = []
        for i, col in enumerate(cols):
            if i not in ignore_indices:
                new_cols += [col]
    else:
        new_cols = cols

    new_alm = misc.transpose(new_cols)

    for i, alm in enumerate(new_alm):
        for j, char in enumerate(alm):
            if char in '()':
                new_alm[i][j] = '-'

    return new_alm

LinguList commented on September 13, 2024

The alignment here is a simple array, so you can pass it msa["alignment"] directly. This should reduce the alignment, and from the reduced alignment you can then store the tokens as [x for x in alm if x != "-"] for each reduced alignment as well.

It is probably easier to integrate all of this later into the Sites class in lingrex, but it is not that difficult to make a small preprocessing with pure lingpy here.
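To make the idea concrete without needing lingpy installed, here is a small standalone sketch of the same column-wise reduction. It is a simplified re-implementation for illustration only, not lingpy's reduce_alignment: the helper name reduce_columns and the toy alignment are made up, and it assumes every bracket sits in a column of its own.

```python
def reduce_columns(alignment):
    """Drop every column from an all-"(" column through the next all-")" column."""
    columns = list(zip(*alignment))  # transpose rows into columns
    kept, inside = [], False
    for col in columns:
        symbols = set(col)
        if symbols == {"("}:
            inside = True       # bracketed (ignored) part starts here
        elif symbols == {")"}:
            inside = False      # bracketed part ends here
        elif not inside:
            kept.append(col)
    if not kept:
        return [[] for _ in alignment]
    return [list(row) for row in zip(*kept)]  # transpose back into rows


# Hypothetical alignment with one bracketed part at the end.
alignment = [
    ["m", "ã", "-", "(", "n", "a", ")"],
    ["m", "a", "n", "(", "-", "-", ")"],
]
reduced = reduce_columns(alignment)
# Tokens are the reduced rows without gap symbols.
tokens = [[seg for seg in row if seg != "-"] for row in reduced]
print(reduced)
print(tokens)
```

The reduced rows keep their gaps (they are still an alignment), while the tokens drop them.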

LinguList commented on September 13, 2024

from lingpy.read.qlc import reduce_alignment
from lingpy import basictypes

dct = {}
for idx, msa in alms.msa["cogid"].items():
    reduced = reduce_alignment(msa["alignment"])
    for i, row in enumerate(reduced):
        dct[msa["ID"][i]] = row

alms.add_entries("old_tokens", "tokens", lambda x: x)
alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)

# same for alignments

FredericBlum commented on September 13, 2024

minimum_example.zip

Here comes the minimum example based on your code, showcasing the problem. There is no reduction apparently.

LinguList commented on September 13, 2024

You DO realize that you have python lists typed into the alignment column, not space-segmented strings?

LinguList commented on September 13, 2024

['m', 'ã', 'n', '(', '-', ')', 'ã', '-', '(', '-', '-', '-', '-', ')']

LinguList commented on September 13, 2024

That is the first alignment ;-)

FredericBlum commented on September 13, 2024

But isn't that how reduce_alignment works, taking the whole msa["alignment"] as input? If I run the following code:

for idx, msa in alms.msa["cogid"].items():
    for alg in msa["alignment"]:
        print("Alignment:", alg)
        reduce_alignment(alg)

I receive an error like this:

(lingreg) blum@lingn45 example % python regularity.py
Alignment: ['m', 'ã', 'n', '(', '-', ')', 'ã', '-', '(', '-', '-', '-', '-', ')']
Traceback (most recent call last):
  File "/Users/blum/Projects/lingreg/example/regularity.py", line 42, in <module>
    reduce_alignment(alg)
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/read/qlc.py", line 23, in reduce_alignment
    cols = misc.transpose(alignment)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in transpose
    out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in <listcomp>
    out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in <listcomp>
    out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
            ~~~~~~~~~^^^
IndexError: string index out of range

This does not happen if I use the whole msa as input.

LinguList commented on September 13, 2024

    cols = misc.transpose(alignment)

is a function that only pertains to one matrix (2-dim list). The requirement is that the length of all rows is identical per alignment. If it throws an index error, it means one of your rows (your strings, your aligned words) is of different length.

The function clearly does not take an msa-dictionary as input.
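As a rough illustration of what "one matrix (2-dim list)" means here, the following standalone sketch transposes a list of rows and refuses ragged input. The transpose helper below is written for this example and is not lingpy's misc.transpose; it only shows the shape of input that such a function expects.

```python
def transpose(matrix):
    """Turn a list of equally long rows into a list of columns."""
    lengths = {len(row) for row in matrix}
    if len(lengths) > 1:
        raise ValueError("rows differ in length: {0}".format(sorted(lengths)))
    return [list(col) for col in zip(*matrix)]


row = ["m", "ã", "-", "-"]

# A single aligned word has to be wrapped in a list to form a one-row matrix.
single = transpose([row])

# Two aligned words of the same length form a proper matrix.
pair = transpose([row, ["m", "a", "n", "a"]])

# Rows of different length are rejected before any indexing happens.
try:
    transpose([row, ["m", "a"]])
    ragged_error = None
except ValueError as err:
    ragged_error = str(err)

print(single)
print(pair)
print(ragged_error)
```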

LinguList commented on September 13, 2024

The error is again in your data.

In [1]: from lingpy.read.qlc import reduce_alignment

In [3]: reduce_alignment([["1", "(", "-", ")"], ["2", "(", "2", ")"]])
Out[3]: [['1'], ['2']]

LinguList commented on September 13, 2024

To avoid that such errors occur, you must test:

if len(set([len(row) for row in msa["alignment"]])) != 1:
    print("problem in alignment {0}".format(cogid))
else:
    ...
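The same test can be run over all cognate sets at once before any reduction is attempted. The sketch below uses a plain dictionary that only imitates the shape of alms.msa["cogid"] (keys "ID" and "alignment", as in the snippets above); the data and the helper name ragged_cogids are made up for illustration.

```python
# Hypothetical stand-in for alms.msa["cogid"]: cogid -> {"ID": [...], "alignment": [...]}
msas = {
    81: {
        "ID": [1, 2],
        "alignment": [["k", "a", "r", "i"], ["k", "a", "ɾ", "i"]],
    },
    375: {
        "ID": [3, 4],
        # second row is one segment short, so this set is ragged
        "alignment": [["ʂ", "o", "-", "t", "o", "k", "o"], ["ʃ", "u", "t", "a", "k", "u"]],
    },
}


def ragged_cogids(msas):
    """Return the cognate-set ids whose aligned rows differ in length."""
    return [
        cogid
        for cogid, msa in msas.items()
        if len({len(row) for row in msa["alignment"]}) != 1
    ]


problems = ragged_cogids(msas)
for cogid in problems:
    print("problem in alignment {0}".format(cogid))
```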

FredericBlum commented on September 13, 2024

You DO realize that you have python lists typed into the alignment column, not space-segmented strings?

Now I understood what you meant! And why I was so confused. I adapted the CLDF conversion of the dataset so that now space-segmented strings are added, not a python list. Adding the structure based on the reduced alignment works now:

dct = {}
for idx, msa in alms.msa["cogid"].items():
    msa_reduced = []
    for site in msa["alignment"]:
        # print("Alignment:", site)
        reduced = reduce_alignment([site])[0]
        msa_reduced.append(reduced)
    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = "".join(row)

alms.add_entries("old_tokens", "tokens", lambda x: x)
alms.add_entries("tokens", dct, lambda x: [y for y in x], override=True)
alms.add_entries("structure", "tokens", lambda x: " ".join(Sites([x]).soundclasses))

FredericBlum commented on September 13, 2024

However, I now have problems matching alignment, tokens, and structure. I played around with the lambda expressions and come up with the following:

alms.add_entries("tokens", dct, lambda x: [y for y in x if y != "-"], override=True)
alms.add_entries("alignment", dct, lambda x: [y for y in x], override=True)
alms.add_entries("structure", "alignment", lambda x: " ".join(Sites([x]).soundclasses))

Now I get errors that the alignment and the structure do not match:

2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6245
6251 C V + C V C V | ʂ o - t o k o | ʂ o t o k o
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6251
6258 C V + C V C V | ʃ u - t a k u | ʃ u t a k u
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6258
3601 C V + + | m ã - - | m ã
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 3601
3603 C V + + | m ã - - | m ã
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 3603
3605 C V + + | m ã - - | m ã

I strongly suspect that this is due to the "+" in the structure column. I do not understand how I need to adapt the "add_entries" command to successfully match both. Any hints at what could solve this?

FredericBlum commented on September 13, 2024

Adding "0" instead of "+" results in the same error.

LinguList commented on September 13, 2024

Yes, you need to re-compute the structure, sorry.

LinguList commented on September 13, 2024

But since structure is just CV, it is not that difficult:

struc = tokens2class(tokens, "cv")

FredericBlum commented on September 13, 2024

alms.add_entries("tokens", dct, lambda x: "".join(y for y in x if y != "-"), override=True)
alms.add_entries("alignment", dct, lambda x: "".join(y for y in x), override=True)
alms.add_entries("structure", "alignment", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))

cop = get_copar(alms, ref="cogid", structure="structure", min_refs=3)

keeps throwing me the same errors, for all data points.

2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 6258
2558 C V C C V | k a tʃ i | k a tʃ i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2558
2567 C C V 0 V C C V 0 V | k !á/a r !í/i | k !á/a r !í/i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2567
3601 C V 0 0 | m ã - - | m ã
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 3601
3603 C V 0 0 | m ã - - | m ã
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 3603
3605 C V 0 0 | m ã - - | m ã

The data itself looks fine to me:

['Yaminawa', 'woman, wife', 'ʂotokoɸakɨ̃', 'ʂotokoɸakɨ̃', 'ʂ o  t o k o', 'None', 375, 'ʂ o - t o k o', 'C V 0 C V C V']
['Yawanawa', 'young woman', 'ʃutaku_βakɨ', 'ʃutaku_βakɨ', 'ʃ u  t a k u', 'None', 375, 'ʃ u - t a k u', 'C V 0 C V C V']
['Amawaka', 'yam', 'kari', 'kari', 'k a r i', 'II', 81, 'k a r i', 'C V C V']
['Chakobo', 'yam', 'ˈkari', 'ˈkari', 'k a r i', 'None', 81, 'k a r i', 'C V C V']
['Chaninawa', 'yam', 'kaɾi', 'kaɾi', 'k a ɾ i', 'None', 81, 'k a ɾ i', 'C V C V']

All three of tokens, alignment, and structure are space-segmented strings, not lists.

FredericBlum commented on September 13, 2024

The current setup has another fundamental problem that does not even surface yet: segments such as "ts" are separated when adding the structure. And it seems like the algorithm has problems with the slash annotation as well:

2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 6258
2558 C V C C V | k a tʃ i | k a tʃ i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2558
2567 C C V 0 V C C V 0 V | k !á/a r !í/i | k !á/a r !í/i

Should I get back to lists?

LinguList commented on September 13, 2024

@tarotis, if the warning says that alignment and structure do not match, it means they don't match. So even if the data looks fine to you, it is wrong, and you should have a look where the problem lies.

LinguList commented on September 13, 2024

And the data are not fine. I mean, check this string 'ʂ o  t o k o', it has two spaces! If you have a 0 in the CV sound class conversion, it means lingpy does not know the sound. This points to a problem in the data.

LinguList commented on September 13, 2024

And your problem is this line:

alms.add_entries("tokens", dct, lambda x: "".join(y for y in x if y != "-"), override=True)

It should be:

alms.add_entries("tokens", dct, lambda x: " ".join([y for y in x.split() if y != "-"]), override=True)

Assuming that your alignment is a string, space-segmented!
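A quick way to see the difference between the two lambdas is to run both on a plain string. This is only a sketch on a made-up alignment string; no lingpy is involved.

```python
alignment = "k a tʃ - i"

# Iterating over the string itself yields single characters, spaces included,
# so dropping "-" leaves the two spaces that surrounded the gap.
by_character = "".join(y for y in alignment if y != "-")

# Splitting first yields whole segments, so "tʃ" stays one unit and the
# result is cleanly space-segmented.
by_segment = " ".join([y for y in alignment.split() if y != "-"])

print(repr(by_character))
print(repr(by_segment))
```

The character-wise version reproduces exactly the kind of double space seen in 'ʂ o  t o k o'.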

LinguList commented on September 13, 2024

    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = "".join(row)

This should be:

        dct[msa["ID"][i]] = row

Then you have a list and not a string, which also does not really make sense.

FredericBlum commented on September 13, 2024

The spaces were in the tokens, not in alignments/structure, so they should not have caused any problems. I have removed them, thanks for highlighting that. They were introduced when the joining got out of hand while I was trying to fix this.

The 0's in the structure get inserted based on the gaps, not from any sound, which I guess I did not communicate well. The data produced by this command:

alms.add_entries("structure", dct, lambda x: [token2class(y, "cv") for y in x])

Alignment: ['m', 'ã', '-', '-']
Structure: ['C', 'V', '0', '0']

returns the following error :

2023-06-07 10:58:33,449 [WARNING] alignment and structure do not match in 3603
3605 C V 0 0 | m ã - - | m ã

So, the basic question is: Assuming that the presence of "-" in the alignments is correct (which I do) - what is the correct way of representing them in the structures column?

LinguList commented on September 13, 2024

The misunderstanding here is that the structure mirrors the tokens, not the alignment. Since the alignment can always be changed, the tokens are the reference point for the structure, and the check whether the structure fits the alignment is done by adding gaps internally where they are needed.
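To illustrate the relationship, here is a minimal standalone sketch. The toy_cv helper is a crude stand-in for a real sound-class conversion (it is not lingpy's tokens2class), and project_structure only imitates the idea of adding gaps to the structure where the alignment has them; it is not the code lingrex runs internally.

```python
VOWELS = set("aeiouãɨ")  # toy inventory, for illustration only


def toy_cv(tokens):
    """Map each token to "V" or "C" by its first character (hypothetical helper)."""
    return ["V" if tok[0] in VOWELS else "C" for tok in tokens]


def project_structure(structure, alignment, gap="-"):
    """Insert gaps into a token-based structure so it lines up with the alignment."""
    labels = iter(structure)
    projected = [gap if seg == gap else next(labels) for seg in alignment]
    if next(labels, None) is not None:
        raise ValueError("structure is longer than the ungapped alignment")
    return projected


alignment = ["m", "ã", "-", "-"]
tokens = [seg for seg in alignment if seg != "-"]   # gaps removed
structure = toy_cv(tokens)                          # one label per token
projected = project_structure(structure, alignment)

print(tokens)
print(structure)
print(projected)
```

The stored structure has one entry per token; the gapped version only exists for the comparison with the alignment.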

LinguList commented on September 13, 2024

So the error in the line is:

alms.add_entries("structure", "alignment", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))

which should be

alms.add_entries("structure", "tokens", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))

assuming that tokens is already the new entry.

LinguList commented on September 13, 2024

But even there, the lambda is a bit problematic, better use:

alms.add_entries("structure", "tokens", lambda x: tokens2class(x, "cv"))

Structure is expected to be a list internally, and it will be explicitly checked.

LinguList commented on September 13, 2024

I thought I made this clear with my example, where I used tokens and not alignment.

FredericBlum commented on September 13, 2024

Spent another two hours on this, without success (but I made small progress). Please let's go through this step by step and make sure that I am using the correct formats. I am starting to seriously doubt myself, spending so many hours on this rather small problem, but with some fundamental concepts behind it.

0. I have space-segmented strings for my reduced alignments, stored in a dictionary.
1. My tokens are space-segmented strings. As I build them from the dictionary, I get a list when calling dict() that I have to join. I also eliminate gaps.

alms.add_entries("tokens", dct, lambda x: " ".join([y for y in x if y != "-"]), override=True)

2. I create the structure based on those new tokens. The structure is a list. As tokens2class requires a list as input, I split the tokens based on space.

alms.add_entries("structure", "tokens", lambda x: tokens2class(x.split(" "), "cv"))

3. I use the new, reduced alignments.

alms.add_entries("alignment", dct, lambda x: " ".join([y for y in x]), override=True)

Tokens: a j a
Alignment: - a j a
Structure: ['V', 'C', 'V']
---
Tokens: a j a
Alignment: - a j a
Structure: ['V', 'C', 'V']

Which leads to the rather new error:

cop = CoPaR(alms, segments="tokens", transcription="ipa", ref="cogid", structure="structure", min_refs=2)

Traceback (most recent call last):                                                                                                                                                           
  File "/Users/blum/Projects/lingreg/example/regularity.py", line 77, in <module>
    cop.get_sites()
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 298, in get_sites
    positions = self.positions_from_prostrings(cogid, _wlid, _alms, _strucs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 219, in positions_from_prostrings
    row = [x[i] for x in strucs if x[i] != "-"]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 219, in <listcomp>
    row = [x[i] for x in strucs if x[i] != "-"]
                                   ~^^^
IndexError: list index out of range

But we are some 16 code lines advanced, so I assume this is the correct way to progress. But where to now?

I have tried to add print-statements to the copar.py code to see where the error comes from, but I do not understand this part of the code. Some list index cannot be accessed - so there seems to be something wrong with the formats of my data. But where? Did I turn the wrong way with any of my assumptions 0-3?

I attach a new minimum example.

example.zip

LinguList commented on September 13, 2024

But this example is again with erroneous alignments, which are preceded by a space!

LinguList commented on September 13, 2024

If you check the file in edictor, you would see this directly.

LinguList commented on September 13, 2024

from lingpy import Wordlist, Alignments
from lingrex.copar import CoPaR
from lingrex.util import prep_wordlist
from lingpy.read.qlc import reduce_alignment
from lingpy.sequence.sound_classes import tokens2class
from lingpy import basictypes
from lingrex.util import add_structure

data = Wordlist("minimum_data.tsv")
wordlist = prep_wordlist(data)
alms = Alignments(wordlist, ref="cogid", transcription="tokens")


dct = {}
for idx, msa in alms.msa["cogid"].items():
    msa_reduced = []
    for site in msa["alignment"]:
        reduced = reduce_alignment([site])[0]
        msa_reduced.append(reduced)
    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = row

alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)
#alms.add_entries("ipa", dct, lambda x: "".join([y for y in x if y != "-"]), override=True)
alms.add_entries("alignment", dct, lambda x: basictypes.lists(x), override=True)
#alms.add_entries("structure", "tokens", lambda x:
#                 basictypes.lists(tokens2class(x, "cv")))
add_structure(alms)
alms.add_alignments()

for x in alms:
    print("Tokens:", alms[x, "tokens"])
    print("Alignment:", alms[x, "alignment"])
    print("Structure:", alms[x, "structure"])
    print(alms[x])
    print("---")

alms.output("tsv", filename="tmp")

cop = CoPaR(alms, segments="tokens", transcription="tokens", ref="cogid", structure="structure")
cop.get_sites()

cop2 = CoPaR("tmp.tsv", segments="tokens", transcription="tokens", ref="cogid",
             structure="structure")
cop2.get_sites()

This works and illustrates the problem.

LinguList commented on September 13, 2024

You MUST save the file before loading it in CoPaR. We have always done this, since the internal representation of alignments also needs to be recalculated here; this does not happen automatically, and saving and reloading triggers it.

LinguList commented on September 13, 2024

You also had wrong representations of things as strings. And I think my remark on your file holds, even if lingpy does strip off spaces at the end and the beginning.

FredericBlum commented on September 13, 2024

you MUST save the file before loading in copar. We have always done this, since the internal representation of alignments also needs to be recalculated here, but it is not, so saving triggers this.

Thanks for sticking with me through this, that did the trick. If you agree, I'd propose to create a PR modifying the docstring of CoPaR, which currently reads as this:

class CoPaR(Alignments):
    """Correspondence Pattern Recognition class

    Parameters
    ----------
    wordlist : ~lingpy.basic.wordlist.Wordlist
        A wordlist object which should have a column for segments and a column
        for cognate sets. Since the class inherits from LingPy's
        Alignments-class, the same kind of data should be submitted.

For me, this reads as if it takes a python object (Wordlist/Alignment), not a file. Is this due to my reading, or due to a potentially confusing description?

LinguList commented on September 13, 2024

This should definitely be changed, but I should also see if I cannot fix this internally, since it should then either throw an error if one does not load from file, or one should make sure to fix the problem with the types.

LinguList commented on September 13, 2024

So it is an issue that is annoying in lingrex and we should find ways to avoid it in general. The passing of a wordlist to classes derived from wordlists like Alignments and CoPaR is generally difficult and has been questioned, specifically since we have one init-function for all.

LinguList commented on September 13, 2024

So we can say: lingrex should for now at least fix the issue above and get the example file working for both cases, not just for one ;-)
