yannvgn / laserembeddings
LASER multilingual sentence embeddings as a pip package
License: BSD 3-Clause "New" or "Revised" License
I am planning to run multiple embedding jobs in parallel. I would like to know whether I should initialize a Laser object for each thread, or whether I can share one instance across multiple threads.
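Not the author, but the safest assumption is that a Laser instance is not thread-safe, so the usual pattern is one instance per thread via threading.local. A minimal sketch of that pattern; DummyModel is a stand-in here (replace it with laserembeddings.Laser in real code):

```python
import threading

# Stand-in for the (assumed) non-thread-safe model; swap in
# laserembeddings.Laser when using this for real.
class DummyModel:
    def embed_sentences(self, sentences, lang):
        return [len(s) for s in sentences]

_local = threading.local()

def get_model():
    # Lazily build one model per thread, the first time that thread needs it.
    if not hasattr(_local, "model"):
        _local.model = DummyModel()
    return _local.model

def embed(sentences, lang="en"):
    return get_model().embed_sentences(sentences, lang)
```

Each worker thread then just calls embed(); the first call in a given thread pays the model-loading cost once, and threads never share an instance.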
I faced an issue where encoding the same sentence in lists of different lengths yields slightly different embeddings.
Here is some code illustrating what I mean:
from laserembeddings import Laser
import numpy as np
laser = Laser()
a = laser.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser.embed_sentences(["apple"], lang='en')
c = laser.embed_sentences(["apple", "potato", "strawberry"], lang='en')
(a[0]==b[0]).all() # check if all elements are the same
#False
(a[0]==c[0]).all()
#True
np.linalg.norm(a[0]-b[0])
#1.3968409e-07
np.linalg.norm(a[0]-c[0])
#0.0
My goal is to get the same embedding for the sentence "apple" regardless of the size of the input list, but that does not seem possible with the current version of laserembeddings.
I would like to know whether this behavior is intentional or a bug.
Hey, does this support the Persian/Farsi language? What is its code to pass into this function:
embeddings = laser.embed_sentences(
    ['let your neural network be polyglot',
     'use multilingual embeddings!'],
    lang='en')
Hi,
I was just testing different outputs with the 'laserembeddings' PyPI package. One thing I observed is that it doesn't raise an error for an invalid tag (such as 'xx', 'yy', or even single letters like 'x', 'y'). Also, when I tried the Sinhala language (tag='si'), I observed that I get an output embedding even if I change the tag (to a valid or an invalid one), and these outputs are always the same. How can this behavior be explained? How can we verify the results? Or could something be wrong with my setup?
Python==3.7.10
torch==1.8.1+cu101
import numpy as np
from laserembeddings import Laser

laser = Laser()
embeddings = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='si')
embeddings2 = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='y')
embeddings3 = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='en')
embeddings4 = laser.embed_sentences(
    ["A test sentence"],
    lang='si')  # even though the tag is different, we still get a result
comp = embeddings2 == embeddings
comp2 = embeddings2 == embeddings3
print(np.sum(comp))
print(np.sum(comp2))
print(comp.all())
print(embeddings4)
Result:
1024
1024
True
[[2.5131851e-03 4.6637398e-04 3.9160903e-05 ... 1.0697229e-02 1.6339000e-02 1.8368352e-02]]
Hello,
I was going through your work and it is really great. I have used the library in various instances. I just wanted to know how you visualized the LASER embeddings in 2D space. I am very new to machine learning and data science, and it would be a real help if you could explain how you visualized the embeddings. Thanks in advance.
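Not the author, but a common way to put 1024-dimensional LASER vectors on a 2D plot is dimensionality reduction such as PCA (or t-SNE/UMAP). A numpy-only PCA sketch, with random stand-in data in place of a real laser.embed_sentences result:

```python
import numpy as np

# Stand-in for an (n_sentences x 1024) matrix from laser.embed_sentences.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10, 1024)).astype(np.float32)

# PCA by hand: centre the rows, then project onto the top-2 principal
# axes (the first two right singular vectors of the centred matrix).
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ vt[:2].T

print(points_2d.shape)  # (10, 2)
# Then scatter-plot with matplotlib, e.g.
# plt.scatter(points_2d[:, 0], points_2d[:, 1])
```

sklearn.decomposition.PCA (or TSNE) does the same job with less code if scikit-learn is available.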
Hello,
When trying to install the Japanese and Chinese dependencies (pip install laserembeddings[zh,ja]) on Windows, I'm seeing an error:
The same error occurs when trying to install mecab-python3 directly. I've installed MeCab, and the frozen environment is attached.
requirements.txt
EDIT: I have also installed mecab-python-windows but get the same error when installing laserembeddings for Chinese and Japanese.
EDIT 2: This error only occurs when installing for Japanese or Chinese, but it seems to cause an issue with all languages. In English, retrieving embeddings gives the following MeCab-related error:
Thanks in advance :)
Can you provide a solution for this issue?
facebookresearch/LASER#95
I'm currently using version 0.3.3 of bpemb, and yesterday (8 December 2021) subword-nmt got updated to 0.3.8, which causes an attribute error: AttributeError: 'BPECodesAdapter' object has no attribute 'read'.
The quick fix I found was to roll back to subword-nmt 0.3.7.
Hello!
First of all, thank you very much for the repo - it is quite handy!
I just found one moment that seems ambiguous (to me, at least) and may confuse other users.
If I have a list with 10 sentences, then the function
laser_model.embed_sentences(list_of_sents, lang='en')
returns a 10×1024 matrix.
On the other hand, if I provide the language not as a string but as a list with a single string, then the function
laser_model.embed_sentences(list_of_sents, lang=['en'])
returns a 1×1024 matrix.
At first I thought it could be due to some aggregation, like a mean of all 10 vectors or something. However, according to the code, it is clearly due to the zip function. I think it might be a good idea either to add a warning or to raise an error in such a case. Though it is just a suggestion!
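The truncation described above can be illustrated in isolation (a plain-Python sketch of the behaviour, not the library's actual code):

```python
sentences = [f"sentence {i}" for i in range(10)]

# zip stops at the shortest iterable, so pairing 10 sentences with a
# 1-element language list silently keeps only a single pair:
pairs_short = list(zip(sentences, ['en']))
print(len(pairs_short))  # 1

# One language code per sentence keeps all 10 pairs:
pairs_full = list(zip(sentences, ['en'] * len(sentences)))
print(len(pairs_full))  # 10
```

So until a warning is added, the safe rule is: pass lang as a plain string, or as a list exactly as long as the sentence list.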
I'm on OpenSUSE Leap 42 on WSL using Python 3.9.7, pytorch 1.11.0 (cuda version) and laserembeddings 1.1.2.
I'm computing embeddings for a long list of English text sequences (each of length <= 300 characters) using Laser.embed_sentences(). My GPU is an RTX 3080.
The property Laser.bpeSentenceEmbedding.encoder.use_cuda is True, indicating that the GPU is detected and Laser is attempting to use it. GPU memory is reserved as expected.
However, the GPU remains idle, at 0 to 2% utilisation, with the python process using 100% of a CPU thread. The performance is the same when I disable the GPU entirely, as is the CPU utilisation.
This leads me to believe that although something is taking up GPU memory, the inference is being done on the CPU. This is puzzling, as looking at the source code, the model and data are both unambiguously moved to the GPU if use_cuda is enabled.
Assuming it's not a quirk of my setup, I think this is quite a high priority issue as applications of this library on large amounts of data require GPU acceleration to be economical.
/opt/conda/lib/python3.6/site-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
700 unpickler = pickle_module.Unpickler(f, **pickle_load_args)
701 unpickler.persistent_load = persistent_load
--> 702 result = unpickler.load()
703
704 deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)
UnpicklingError: pickle data was truncated
Sometimes the calculation of embeddings is not performed, without any errors or logs.
Is it possible that there is some error inside the embed_sentences() function that is not caught and logged?
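Until that is diagnosed, a defensive check after each call can at least surface silent failures. A sketch; the shape/NaN checks are my own convention, not part of the library:

```python
import numpy as np

def check_embeddings(emb, n_sentences, dim=1024):
    """Raise loudly if the returned embeddings are malformed,
    instead of letting a bad result propagate silently."""
    emb = np.asarray(emb)
    if emb.shape != (n_sentences, dim):
        raise ValueError(f"unexpected embedding shape {emb.shape}")
    if not np.isfinite(emb).all():
        raise ValueError("embeddings contain NaN or inf values")
    return emb

# Stand-in array; with the real library, pass the result of
# laser.embed_sentences(...) here instead.
ok = check_embeddings(np.zeros((3, 1024), dtype=np.float32), 3)
print(ok.shape)  # (3, 1024)
```

The "pickle data was truncated" error itself usually means the downloaded model file is incomplete; re-downloading the models is worth trying first.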
Hi there,
As I'm trying to understand how LASER works, I tried this:
laser.embed_sentences("wer we2dwdfw ewrwer", lang="nl")
and I got a result whose norm is 0.65.
My question is whether it makes sense to talk about out of vocabulary for LASER?
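Not the author, but as I understand it LASER's BPE splits any input string into known subword units, so nothing is strictly out-of-vocabulary; gibberish still maps to some sequence of subwords and hence to some vector. Note also that the raw vectors are not unit-normalised, so comparing norms directly is not very meaningful; a common convention (not something the library does for you) is to L2-normalise first. Sketch with a stand-in vector:

```python
import numpy as np

v = np.array([0.3, 0.4, 0.0], dtype=np.float32)  # stand-in embedding
norm = np.linalg.norm(v)
unit = v / norm  # L2-normalise before cosine-style comparisons

print(round(float(norm), 6))                  # 0.5
print(round(float(np.linalg.norm(unit)), 6))  # 1.0
```

After normalisation, dot products between vectors are cosine similarities, which are easier to interpret than raw norms.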
Hello!
First of all, thank you very much for this light and simple-to-use LASER model fork.
Do you have any plans to update laserembeddings to new LASER2 and LASER3 models?
For a face-to-face comparison of laserembeddings against the original model, the flores-101 or flores-200 dataset can also be used.
Cheers!
I really love your work porting LASER as a Python pip package.
I am new and trying to learn the use of these embeddings.
What I have done so far:
I have a list of sentences in three languages (let's say L_1, L_2, L_3).
final = L_1 + L_2 + L_3
I generated the embeddings as shown below:
import laserembeddings as le

laser = le.Laser()
langs = []
for i in range(len(L_1)):
    langs.append('L_1')
for i in range(len(L_2)):
    langs.append('L_2')
for i in range(len(L_3)):
    langs.append('L_3')
embeddings = laser.embed_sentences(final, lang=langs)
I am assuming that the index of each embedding corresponds to its sentence in final.
Now, for finding the similarity between embeddings, I converted them into gensim KeyedVectors so that we have the flexibility of using functions like similar_by_vector(), etc.
from gensim.models import KeyedVectors

Lang_based_keys = [sent for sent in final]
sent_vecs = KeyedVectors(vector_size=embeddings.shape[1])
sent_vecs.add(Lang_based_keys, embeddings)
But here I am having an issue. Suppose a sentence that was present in final at the time of generating embeddings: "The iPhone SDK, set programming tools developers, enhanced support development iPad".
When I try to see which vectors in the embedding space are closest to the vector of that sentence, as follows:
word = "The iPhone SDK, set programming tools developers, enhanced support development iPad"
sent_vecs.similar_by_vector(sent_vecs[word], topn=100)
what I see is that the closest one given by the model is:
[('Poborsky played minutes, 291 minutes, Czech Republic Euro 2004', 1.0),
....
..]
How is it possible that the similarity is 1.0, which would mean both sentences have the same vector?
Kindly correct me wherever I am wrong.
Thank you.
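One way to rule gensim out is to recompute the cosine similarity directly from the embeddings matrix: if two different sentences really score exactly 1.0, their raw vectors should be identical too. Also note that gensim reports float32 scores, so a displayed 1.0 can be round-off from a value like 0.9999999. A numpy sketch with stand-in vectors (with the real data, compare embeddings[i] and embeddings[j]):

```python
import numpy as np

def cosine(u, v):
    # Plain cosine similarity, computed in float64 for full precision.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-ins for two rows of the embeddings matrix.
u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 2.0, 3.0001])

print(cosine(u, v) == 1.0)   # False: close to 1, but not exactly 1
print(np.array_equal(u, v))  # False: the raw vectors differ
```

Also worth checking: KeyedVectors keys are the sentence strings here, so any duplicate sentences in final would silently collapse onto one key.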
I've recently experienced the error AttributeError: 'BPECodesAdapter' object has no attribute 'read', which comes from the subword_nmt package used during tokenization. Apparently, this package had some recent modifications. It's better to pin the specific version when installing laserembeddings to avoid the problem, i.e. pip install subword-nmt==0.3.7
How can I approach Japanese support on Windows?
What kinds of issues are there when using Japanese on Windows?
Thanks!!
Hello and thank you for this library!
I have a question regarding how different sentence lengths are treated. Here's the code I ran:
from laserembeddings import Laser as ls
from laserembeddings.preprocessing import Tokenizer

lang = "nl"
tokenizer = Tokenizer(lang)
sent_to_embed = tokenizer.tokenize(my_big_sentence)  # my_big_sentence is actually a whole document
# len(sent_to_embed.split(" ")) == 14037

laser = ls()
laser_extended = ls(embedding_options={"max_tokens": 40000, "max_sentences": 400})
t = sent_to_embed.split(" ")
out = []
for split_at in range(100, len(t), 5000):
    print(split_at)
    out.append({
        "split_at": split_at,
        "default_embedding": laser.embed_sentences(" ".join(t[:split_at]), lang),
        "extended_embedding": laser_extended.embed_sentences(" ".join(t[:split_at]), lang),
    })
out.append({
    "split_at": len(t),
    "default_embedding": laser.embed_sentences(" ".join(t), lang),
    "extended_embedding": laser_extended.embed_sentences(" ".join(t), lang),
})
Then I computed the cosine similarity between the last embedding (i.e. out[-1]) and the other ones; the result is in the plot below.
As you can see, one can't differentiate the results from the two LASER instances (laser and laser_extended). Is this expected? I also get the very same result with max_tokens = 200. I would have expected that the result stops changing once the number of tokens exceeds this parameter.