Comments (46)
This is visually misleading due to the way I print the output. There are no spaces, and the alignments look fine in Edictor.
from lingrex.
I think you need to build intermediate strings that are all trimmed.
from lingrex.
But in order to make them, you can iterate over the alignment:
for cogid, msa in alms.msa["cogid"].items():
    print(msa["alignment"])
    print(msa)
The msa object that is generated for you here stores the alignments. I think it even automatically retains only those parts that are not inside brackets!
from lingrex.
But I'd ask you to test for this.
from lingrex.
Please check also this function, that could otherwise be applied to the alignment object: https://lingpy.org/reference/lingpy.align.html#lingpy.align.sca.Alignments.reduce_alignments
from lingrex.
So, please check alms.reduce_alignments and how it behaves, as I do not remember if it returns a list of reduced alignments or if it reduces the alignments in the alms.msa["cogid"] attribute, etc.
from lingrex.
Not much happens. It applies directly to alms and adds an _alignment entry to the dictionary, but nothing is reduced. Even after looking at the code, I do not understand where the reduction is supposed to happen.
from lingrex.
Please use a minimal example and share it here (you can zip it). We only need 3 cogids with 3 alignments that are "to be reduced".
from lingrex.
And please check the import statement in the script align/sca.py in lingpy:
from lingpy.read.qlc import read_msa, normalize_alignment, reduce_alignment
So you have the function in read.qlc!
from lingrex.
def reduce_alignment(alignment):
    """
    Function reduces a given alignment.
    Notes
    -----
    Reduction here means that the output alignment consists only of those parts
    which have not been marked to be ignored by the user (parts in brackets).
    It requires that all data is properly coded. If reduction fails, this will
    throw a warning, and all brackets are simply removed in the output
    alignment.
    """
    # check for bracket indices in all columns
    cols = misc.transpose(alignment)
    ignore_indices = []
    ignore = False
    for i, col in enumerate(cols):
        reduced_col = sorted(set(col))
        if '(' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = True
            else:
                ignore = False
        elif ')' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = False
            else:
                ignore_indices = []
        elif ignore:
            ignore_indices += [i]
    if ignore_indices:
        new_cols = []
        for i, col in enumerate(cols):
            if i not in ignore_indices:
                new_cols += [col]
    else:
        new_cols = cols
    new_alm = misc.transpose(new_cols)
    for i, alm in enumerate(new_alm):
        for j, char in enumerate(alm):
            if char in '()':
                new_alm[i][j] = '-'
    return new_alm
from lingrex.
The alignment here is a simple array, so it is like the msa["alignment"], which you can pass to it. This should reduce the alignment, and from the reduced alignment you can then also store the tokens [x for x in alm if x != "-"] for each reduced alignment.
It is probably easier to integrate all of this later into the Sites class in lingrex, but it is not that difficult to make a small preprocessing with pure lingpy here.
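To make the expected input shape concrete, here is a minimal pure-Python sketch of the bracket reduction. It is a simplified illustration and not lingpy's implementation: `reduce_rows` is a hypothetical name, `zip` stands in for `misc.transpose`, the fallback for badly coded brackets is left out, and the toy data is invented. The input is one alignment, that is, a list of equally long rows.

```python
def reduce_rows(alignment):
    """Drop all columns from a full '(' column to the next full ')' column."""
    keep, ignore = [], False
    for col in zip(*alignment):  # iterate over columns, not rows
        symbols = set(col)
        if symbols == {"("}:
            ignore = True        # opening bracket in every row: start skipping
        elif symbols == {")"}:
            ignore = False       # closing bracket in every row: stop skipping
        elif not ignore:
            keep.append(col)
    # transpose back to rows
    return [list(row) for row in zip(*keep)]


# toy alignment for one cognate set (invented data)
msa = [
    ["m", "ã", "(", "-", ")"],
    ["m", "a", "(", "n", ")"],
]
reduced = reduce_rows(msa)
# tokens are the reduced rows without gap symbols
tokens = [[seg for seg in row if seg != "-"] for row in reduced]
print(reduced)
print(tokens)
```

The point of the sketch is that the function works column by column, so it needs all rows of a cognate set at once rather than a single row.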
from lingrex.
from lingpy.read.qlc import reduce_alignment
from lingpy import basictypes
dct = {}
for idx, msa in alms.msa["cogid"].items():
    reduced = reduce_alignment(msa["alignment"])
    for i, row in enumerate(reduced):
        dct[msa["ID"][i]] = row
alms.add_entries("old_tokens", "tokens", lambda x: x)
alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)
# same for alignments
from lingrex.
Here comes the minimum example based on your code, showcasing the problem. There is no reduction apparently.
from lingrex.
You DO realize that you have python lists typed into the alignment column, not space-segmented strings?
from lingrex.
['m', 'ã', 'n', '(', '-', ')', 'ã', '-', '(', '-', '-', '-', '-', ')']
from lingrex.
That is the first alignment ;-)
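If a column contains stringified Python lists like the one above, one possible way to turn them into space-segmented strings is sketched below. This is only an illustration: `to_segmented` is a hypothetical helper and not part of lingpy or lingrex, and it assumes every list element is a string.

```python
import ast


def to_segmented(cell):
    """Turn "['m', 'ã', '-']" into "m ã -"; leave other strings untouched."""
    cell = cell.strip()
    if cell.startswith("[") and cell.endswith("]"):
        # literal_eval parses the list safely, without executing code
        return " ".join(ast.literal_eval(cell))
    return cell


print(to_segmented("['m', 'ã', 'n', '(', '-', ')']"))
print(to_segmented("m ã n"))
```

Fixing the conversion that produces the file is preferable to patching cells afterwards; the helper only shows what the target format looks like.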
from lingrex.
But isn't that how reduce_alignment works, taking the whole msa["alignment"] as input? If I run the following code:
for idx, msa in alms.msa["cogid"].items():
    for alg in msa["alignment"]:
        print("Alignment:", alg)
        reduce_alignment(alg)
I receive an error like this:
(lingreg) blum@lingn45 example % python regularity.py
Alignment: ['m', 'ã', 'n', '(', '-', ')', 'ã', '-', '(', '-', '-', '-', '-', ')']
Traceback (most recent call last):
File "/Users/blum/Projects/lingreg/example/regularity.py", line 42, in <module>
reduce_alignment(alg)
File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/read/qlc.py", line 23, in reduce_alignment
cols = misc.transpose(alignment)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in transpose
out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in <listcomp>
out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in <listcomp>
out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
~~~~~~~~~^^^
IndexError: string index out of range
This does not happen if I use the whole msa as input.
from lingrex.
The call cols = misc.transpose(alignment) operates on a single matrix (a two-dimensional list). It requires that all rows of an alignment have the same length. If it throws an IndexError, it means one of your rows (your strings, your aligned words) has a different length.
The function clearly does not take an msa dictionary as input.
from lingrex.
The error is again in your data.
In [1]: from lingpy.read.qlc import reduce_alignment
In [3]: reduce_alignment([["1", "(", "-", ")"], ["2", "(", "2", ")"]])
Out[3]: [['1'], ['2']]
from lingrex.
To avoid such errors, you must test:
if len(set(len(row) for row in msa["alignment"])) != 1:
    print("problem in alignment {0}".format(cogid))
else:
    ...
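The same check can be collected into a small report over all cognate sets. The sketch below is pure Python and only assumes a dictionary shaped like alms.msa["cogid"], i.e. cognate ID mapped to a dictionary with an "alignment" list of rows; `ragged_alignments` is a hypothetical helper name and the sample data is invented.

```python
def ragged_alignments(msas):
    """Return {cogid: row lengths} for alignments whose rows differ in length."""
    bad = {}
    for cogid, msa in msas.items():
        lengths = [len(row) for row in msa["alignment"]]
        if len(set(lengths)) > 1:
            bad[cogid] = lengths
    return bad


# invented sample: cognate set 2 has one row that is too short
sample = {
    1: {"alignment": [["k", "a", "r", "i"], ["k", "a", "ɾ", "i"]]},
    2: {"alignment": [["m", "ã", "-", "-"], ["m", "ã", "n"]]},
}
for cogid, lengths in ragged_alignments(sample).items():
    print("problem in alignment {0}: row lengths {1}".format(cogid, lengths))
```

Running such a check before any reduction makes it clear whether an IndexError comes from the data or from the way the function is called.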
from lingrex.
You DO realize that you have python lists typed into the alignment column, not space-segmented strings?
Now I understood what you meant! And why I was so confused. I adapted the CLDF conversion of the dataset so that now space-segmented strings are added, not a python list. Adding the structure based on the reduced alignment works now:
dct = {}
for idx, msa in alms.msa["cogid"].items():
    msa_reduced = []
    for site in msa["alignment"]:
        # print("Alignment:", site)
        reduced = reduce_alignment([site])[0]
        msa_reduced.append(reduced)
    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = "".join(row)
alms.add_entries("old_tokens", "tokens", lambda x: x)
alms.add_entries("tokens", dct, lambda x: [y for y in x], override=True)
alms.add_entries("structure", "tokens", lambda x: " ".join(Sites([x]).soundclasses))
from lingrex.
However, I now have problems matching alignment, tokens, and structure. I played around with the lambda expressions and came up with the following:
alms.add_entries("tokens", dct, lambda x: [y for y in x if y != "-"], override=True)
alms.add_entries("alignment", dct, lambda x: [y for y in x], override=True)
alms.add_entries("structure", "alignment", lambda x: " ".join(Sites([x]).soundclasses))
Now I get errors that the alignment and the structure do not match:
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6245
6251 C V + C V C V | ʂ o - t o k o | ʂ o t o k o
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6251
6258 C V + C V C V | ʃ u - t a k u | ʃ u t a k u
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6258
3601 C V + + | m ã - - | m ã
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 3601
3603 C V + + | m ã - - | m ã
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 3603
3605 C V + + | m ã - - | m ã
I strongly suspect that this is due to the "+" in the structure column. I do not understand how I need to adapt the "add_entries" command to successfully match both. Any hints at what could solve this?
from lingrex.
Adding "0" instead of "+" results in the same error.
from lingrex.
Yes, you need to re-compute the structure, sorry.
from lingrex.
But since structure is just CV, it is not that difficult:
struc = tokens2class(tokens, "cv")
from lingrex.
alms.add_entries("tokens", dct, lambda x: "".join(y for y in x if y != "-"), override=True)
alms.add_entries("alignment", dct, lambda x: "".join(y for y in x), override=True)
alms.add_entries("structure", "alignment", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))
cop = get_copar(alms, ref="cogid", structure="structure", min_refs=3)
keeps throwing me the same errors, for all data points.
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 6258
2558 C V C C V | k a tʃ i | k a tʃ i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2558
2567 C C V 0 V C C V 0 V | k !á/a r !í/i | k !á/a r !í/i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2567
3601 C V 0 0 | m ã - - | m ã
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 3601
3603 C V 0 0 | m ã - - | m ã
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 3603
3605 C V 0 0 | m ã - - | m ã
The data itself looks fine to me:
['Yaminawa', 'woman, wife', 'ʂotokoɸakɨ̃', 'ʂotokoɸakɨ̃', 'ʂ o t o k o', 'None', 375, 'ʂ o - t o k o', 'C V 0 C V C V']
['Yawanawa', 'young woman', 'ʃutaku_βakɨ', 'ʃutaku_βakɨ', 'ʃ u t a k u', 'None', 375, 'ʃ u - t a k u', 'C V 0 C V C V']
['Amawaka', 'yam', 'kari', 'kari', 'k a r i', 'II', 81, 'k a r i', 'C V C V']
['Chakobo', 'yam', 'ˈkari', 'ˈkari', 'k a r i', 'None', 81, 'k a r i', 'C V C V']
['Chaninawa', 'yam', 'kaɾi', 'kaɾi', 'k a ɾ i', 'None', 81, 'k a ɾ i', 'C V C V']
All three of tokens, alignment, and structure are space-segmented strings, not lists.
from lingrex.
The current setup has another fundamental problem that has not even surfaced yet: segments such as "ts" are separated when adding the structure. And it seems like the algorithm has problems with the slash annotation as well:
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 6258
2558 C V C C V | k a tʃ i | k a tʃ i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2558
2567 C C V 0 V C C V 0 V | k !á/a r !í/i | k !á/a r !í/i
Should I get back to lists?
from lingrex.
@tarotis, if the warning says that alignment and structure do not match, it means they don't match. So even if the data looks fine to you, it is wrong, and you should have a look where the problem lies.
from lingrex.
And the data are not fine. I mean, check this string: 'ʂ o t o k o', it has two spaces! If you have a 0 in the CV sound class conversion, it means lingpy does not know the sound. This points to a problem in the data.
from lingrex.
And your problem is this line:
alms.add_entries("tokens", dct, lambda x: "".join(y for y in x if y != "-"), override=True)
It should be:
alms.add_entries("tokens", dct, lambda x: " ".join([y for y in x.split() if y != "-"]), override=True)
Assuming that your alignment is a string, space-segmented!
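The difference between the two lambdas can be reproduced without lingpy. The sketch below is plain Python on an invented space-segmented alignment string: iterating over the string directly visits single characters (including the spaces), whereas splitting first yields whole segments. `tokens_from_alignment` is a hypothetical helper name.

```python
def tokens_from_alignment(alignment):
    """Space-segmented alignment string -> space-segmented tokens without gaps."""
    return " ".join(seg for seg in alignment.split() if seg != "-")


aligned = "k a tʃ - i"  # invented example

# character-wise: the gap is dropped, but both spaces around it survive
wrong = "".join(ch for ch in aligned if ch != "-")
# segment-wise: split on whitespace, drop gaps, join with single spaces
right = tokens_from_alignment(aligned)

print(repr(wrong))
print(repr(right))
```

The character-wise version leaves a doubled space where the gap used to be, which is the kind of stray whitespace pointed out in the token strings earlier in the thread.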
from lingrex.
for i, row in enumerate(msa_reduced):
    dct[msa["ID"][i]] = "".join(row)
This should be:
dct[msa["ID"][i]] = row
Then you have a list and not a string, which also does not really make sense.
from lingrex.
The spaces were in the tokens, not in alignments/structure, so they should not have caused any problems. I have removed them, thanks for highlighting that. They were introduced when my joining got out of hand while I was trying to fix this.
The 0's in the structure get inserted based on the gaps, not from any sound, which I guess I did not communicate well. The data produced by this command:
alms.add_entries("structure", dct, lambda x: [token2class(y, "cv") for y in x])
Alignment: ['m', 'ã', '-', '-']
Structure: ['C', 'V', '0', '0']
returns the following error:
2023-06-07 10:58:33,449 [WARNING] alignment and structure do not match in 3603
3605 C V 0 0 | m ã - - | m ã
So, the basic question is: Assuming that the presence of "-" in the alignments is correct (which I do) - what is the correct way of representing them in the structures column?
from lingrex.
The misunderstanding here is that the structure mimics the tokens, not the alignment. Since the alignment can always be changed, the tokens are the reference point for the structure, and the check whether the structure fits the alignment is done by adding gaps internally where they are needed.
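As a rough illustration of this idea (not lingrex's actual implementation), the sketch below projects a structure that has one symbol per token onto an alignment by copying the gaps over. `gapped_structure` is a hypothetical helper; it raises an error when the structure does not have exactly one symbol per non-gap segment, which is the situation the warning describes.

```python
def gapped_structure(structure, alignment, gap="-"):
    """Insert gaps into a token-based structure so it lines up with an alignment."""
    segments = [seg for seg in alignment if seg != gap]
    if len(structure) != len(segments):
        raise ValueError(
            "structure has {0} symbols but alignment has {1} segments".format(
                len(structure), len(segments)
            )
        )
    symbols = iter(structure)
    # gaps are copied from the alignment, everything else comes from the structure
    return [gap if seg == gap else next(symbols) for seg in alignment]


alignment = ["m", "ã", "-", "-"]
print(gapped_structure(["C", "V"], alignment))

try:
    # a structure computed from the alignment carries symbols for the gaps
    gapped_structure(["C", "V", "0", "0"], alignment)
except ValueError as err:
    print("mismatch:", err)
```

Under this reading, a structure such as C V 0 0 for the alignment m ã - - cannot match, because it contains two symbols for which no token exists.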
from lingrex.
So the error in the line is:
alms.add_entries("structure", "alignment", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))
which should be
alms.add_entries("structure", "tokens", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))
assuming that tokens is already the new entry.
from lingrex.
But even there, the lambda is a bit problematic, better use:
alms.add_entries("structure", "tokens", lambda x: tokens2class(x, "cv"))
Structure is expected to be a list internally, and this will be explicitly checked.
from lingrex.
I thought I made this clear with my example, where I used tokens and not alignment.
from lingrex.
Spent another two hours on this, without success (but I made small progress). Please let's go through this step by step and make sure that I am using the correct formats. I am starting to seriously doubt myself, spending so many hours on this rather small problem, but with some fundamental concepts behind it.
- I have space-segmented strings for my reduced alignments, stored in a dictionary.
- My tokens are space-segmented strings. As I build them from the dictionary, I get a list when calling dict() that I have to join. I also eliminate gaps.
alms.add_entries("tokens", dct, lambda x: " ".join([y for y in x if y != "-"]), override=True)
- I create the structure based on those new tokens. The structure is a list. As tokens2class requires a list as input, I split the tokens based on space.
alms.add_entries("structure", "tokens", lambda x: tokens2class(x.split(" "), "cv"))
- I use the new, reduced alignments.
alms.add_entries("alignment", dct, lambda x: " ".join([y for y in x]), override=True)
Tokens: a j a
Alignment: - a j a
Structure: ['V', 'C', 'V']
---
Tokens: a j a
Alignment: - a j a
Structure: ['V', 'C', 'V']
Which leads to the rather new error:
cop = CoPaR(alms, segments="tokens", transcription="ipa", ref="cogid", structure="structure", min_refs=2)
Traceback (most recent call last):
File "/Users/blum/Projects/lingreg/example/regularity.py", line 77, in <module>
cop.get_sites()
File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 298, in get_sites
positions = self.positions_from_prostrings(cogid, _wlid, _alms, _strucs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 219, in positions_from_prostrings
row = [x[i] for x in strucs if x[i] != "-"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 219, in <listcomp>
row = [x[i] for x in strucs if x[i] != "-"]
~^^^
IndexError: list index out of range
But we got some 16 lines of code further, so I assume this is progress in the right direction. But where to go from here?
- I have tried to add print statements to the copar.py code to see where the error comes from, but I do not understand this part of the code. Some list index cannot be accessed, so there seems to be something wrong with the formats of my data. But where? Did I take a wrong turn with any of my assumptions 0-3?
I attach a new minimum example.
from lingrex.
But this example is again with erroneous alignments, which are preceded by a space!
from lingrex.
If you check the file in edictor, you would see this directly.
from lingrex.
from lingpy import Wordlist, Alignments
from lingrex.copar import CoPaR
from lingrex.util import prep_wordlist
from lingpy.read.qlc import reduce_alignment
from lingpy.sequence.sound_classes import tokens2class
from lingpy import basictypes
from lingrex.util import add_structure
data = Wordlist("minimum_data.tsv")
wordlist = prep_wordlist(data)
alms = Alignments(wordlist, ref="cogid", transcription="tokens")
dct = {}
for idx, msa in alms.msa["cogid"].items():
    msa_reduced = []
    for site in msa["alignment"]:
        reduced = reduce_alignment([site])[0]
        msa_reduced.append(reduced)
    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = row
alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)
#alms.add_entries("ipa", dct, lambda x: "".join([y for y in x if y != "-"]), override=True)
alms.add_entries("alignment", dct, lambda x: basictypes.lists(x), override=True)
#alms.add_entries("structure", "tokens", lambda x:
# basictypes.lists(tokens2class(x, "cv")))
add_structure(alms)
alms.add_alignments()
for x in alms:
    print("Tokens:", alms[x, "tokens"])
    print("Alignment:", alms[x, "alignment"])
    print("Structure:", alms[x, "structure"])
    print(alms[x])
    print("---")
alms.output("tsv", filename="tmp")
cop = CoPaR(alms, segments="tokens", transcription="tokens", ref="cogid", structure="structure")
cop.get_sites()
cop2 = CoPaR("tmp.tsv", segments="tokens", transcription="tokens", ref="cogid",
             structure="structure")
cop2.get_sites()
This works and illustrates the problem.
from lingrex.
You MUST save the file before loading it in CoPaR. We have always done this, since the internal representation of the alignments needs to be recalculated at this point. That does not happen automatically, so saving and reloading triggers it.
from lingrex.
You also had wrong representations of things as strings. And I think my remark on your file holds, even if lingpy does strip off spaces at the end and the beginning.
from lingrex.
you MUST save the file before loading in copar. We have always done this, since the internal representation of alignments also needs to be recalculated here, but it is not, so saving triggers this.
Thanks for sticking with me through this, that did the trick. If you agree, I'd propose to create a PR modifying the docstring of CoPaR, which currently reads as follows:
class CoPaR(Alignments):
    """Correspondence Pattern Recognition class
    Parameters
    ----------
    wordlist : ~lingpy.basic.wordlist.Wordlist
        A wordlist object which should have a column for segments and a column
        for cognate sets. Since the class inherits from LingPy's
        Alignments-class, the same kind of data should be submitted.
For me, this reads as if it takes a python object (Wordlist/Alignment), not a file. Is this due to my reading, or due to a potentially confusing description?
from lingrex.
This should definitely be changed, but I should also see if I cannot fix this internally, since it should then either throw an error if one does not load from file, or one should make sure to fix the problem with the types.
from lingrex.
So it is an issue that is annoying in lingrex and we should find ways to avoid it in general. The passing of a wordlist to classes derived from wordlists like Alignments and CoPaR is generally difficult and has been questioned, specifically since we have one init-function for all.
from lingrex.
So we can say: lingrex should for now at least fix the issue above and get the example file working for both cases, not just for one ;-)
from lingrex.