yannvgn / laserembeddings
LASER multilingual sentence embeddings as a pip package
License: BSD 3-Clause "New" or "Revised" License
I am planning to run multiple embedding jobs in parallel. I would like to know whether I should initialize a Laser object for each thread, or whether I can share one instance across multiple threads.
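Not the author, but the safest assumption is that a Laser instance is not thread-safe, so the usual pattern is one instance per thread via threading.local. A minimal sketch of that pattern; DummyModel is a stand-in here (replace it with laserembeddings.Laser in real code):

```python
import threading

# Stand-in for the (assumed) non-thread-safe model; swap in
# laserembeddings.Laser when using this for real.
class DummyModel:
    def embed_sentences(self, sentences, lang):
        return [len(s) for s in sentences]

_local = threading.local()

def get_model():
    # Lazily build one model per thread, the first time that thread needs it.
    if not hasattr(_local, "model"):
        _local.model = DummyModel()
    return _local.model

def embed(sentences, lang="en"):
    return get_model().embed_sentences(sentences, lang)
```

Each worker thread then just calls embed(); the first call in a given thread pays the model-loading cost once, and threads never share an instance.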
I faced an issue where encoding the same sentence in lists of different lengths yields slightly different embeddings.
Here is some code illustrating what I mean:
from laserembeddings import Laser
import numpy as np
laser = Laser()
a = laser.embed_sentences(["apple", "banana", "clementina"], lang='en')
b = laser.embed_sentences(["apple"], lang='en')
c = laser.embed_sentences(["apple", "potato", "strawberry"], lang='en')
(a[0]==b[0]).all() # check if all elements are the same
#False
(a[0]==c[0]).all()
#True
np.linalg.norm(a[0]-b[0])
#1.3968409e-07
np.linalg.norm(a[0]-c[0])
#0.0
My goal is to get the same embedding for the sentence "apple" regardless of the size of the input list, but that does not seem possible with the current version of laserembeddings.
I would like to know whether this behavior is intentional or a bug.
Hey, does this support the Persian/Farsi language? What is its code to pass into this function:
embeddings = laser.embed_sentences(
    ['let your neural network be polyglot',
     'use multilingual embeddings!'],
    lang='en')
Hi,
I was just testing different outputs with the 'laserembeddings' PyPI package. One thing I observed is that it doesn't raise an error for an invalid tag (such as 'xx', 'yy', or even single letters like 'x', 'y'). Also, when I tried the Sinhala language (tag='si'), I observed that I get an output embedding even if I change the tag (to a valid or an invalid one), and these outputs are always the same. How can this behavior be explained? How can we verify the results? Or could something be wrong with my setup?
Python==3.7.10
torch==1.8.1+cu101
import numpy as np
from laserembeddings import Laser

laser = Laser()
embeddings = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='si')
embeddings2 = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='y')
embeddings3 = laser.embed_sentences(
    ["අත්සන් කළේ චරිත හේරත්"],
    lang='en')
embeddings4 = laser.embed_sentences(
    ["A test sentence"],
    lang='si')  # even though the tag is different, we still get a result
comp = embeddings2 == embeddings
comp2 = embeddings2 == embeddings3
print(np.sum(comp))
print(np.sum(comp2))
print(comp.all())
print(embeddings4)
Result:
1024
1024
True
[[2.5131851e-03 4.6637398e-04 3.9160903e-05 ... 1.0697229e-02 1.6339000e-02 1.8368352e-02]]
Hello,
I was going through your work and it is really great. I have used the library in various instances. I just wanted to know how you visualized the LASER embeddings in 2D space. I am very new to machine learning and data science, and it would be a real help if you could explain how you visualized the embeddings. Thanks in advance.
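Not the author, but a common way to put 1024-dimensional LASER vectors on a 2D plot is dimensionality reduction such as PCA (or t-SNE/UMAP). A numpy-only PCA sketch, with random stand-in data in place of a real laser.embed_sentences result:

```python
import numpy as np

# Stand-in for an (n_sentences x 1024) matrix from laser.embed_sentences.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10, 1024)).astype(np.float32)

# PCA by hand: centre the rows, then project onto the top-2 principal
# axes (the first two right singular vectors of the centred matrix).
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ vt[:2].T

print(points_2d.shape)  # (10, 2)
# Then scatter-plot with matplotlib, e.g.
# plt.scatter(points_2d[:, 0], points_2d[:, 1])
```

sklearn.decomposition.PCA (or TSNE) does the same job with less code if scikit-learn is available.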
Hello,
When trying to install the Japanese and Chinese dependencies (pip install laserembeddings[zh,ja]) on Windows, I'm seeing an error:
The same error occurs when trying to install mecab-python3 directly. I've installed MeCab, and the frozen environment is attached.
requirements.txt
EDIT: I have also installed mecab-python-windows but get the same error when installing laserembeddings for Chinese and Japanese.
EDIT 2: This error only occurs when installing for Japanese or Chinese, but it seems to cause an issue with all languages. In English, retrieving embeddings gives the following MeCab-related error:
Thanks in advance :)
Can you provide a solution for this issue?
facebookresearch/LASER#95
I'm currently using version 0.3.3 of bpemb, and yesterday (8 December 2021) subword-nmt got updated to 0.3.8, which causes an attribute error: AttributeError: 'BPECodesAdapter' object has no attribute 'read'.
The quick fix I found was to roll back to subword-nmt 0.3.7.
Hello!
First of all, thank you very much for the repo - it is quite handy!
I just found one moment that seems ambiguous (to me, at least) and may confuse other users.
If I have a list with 10 sentences, then the function
laser_model.embed_sentences(list_of_sents, lang='en')
returns a 10×1024 matrix.
On the other hand, if I provide the language not as a string but as a list with a single string, then the function
laser_model.embed_sentences(list_of_sents, lang=['en'])
returns a 1×1024 matrix.
At first I thought it could be due to some aggregation, like a mean of all 10 vectors or something. However, according to the code, it is clearly due to the zip function. I think it might be a good idea either to add a warning or to raise an error in such a case. Though it is just a suggestion!
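The truncation described above can be illustrated in isolation (a plain-Python sketch of the behaviour, not the library's actual code):

```python
sentences = [f"sentence {i}" for i in range(10)]

# zip stops at the shortest iterable, so pairing 10 sentences with a
# 1-element language list silently keeps only a single pair:
pairs_short = list(zip(sentences, ['en']))
print(len(pairs_short))  # 1

# One language code per sentence keeps all 10 pairs:
pairs_full = list(zip(sentences, ['en'] * len(sentences)))
print(len(pairs_full))  # 10
```

So until a warning is added, the safe rule is: pass lang as a plain string, or as a list exactly as long as the sentence list.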
I'm on OpenSUSE Leap 42 on WSL using Python 3.9.7, pytorch 1.11.0 (cuda version) and laserembeddings 1.1.2.
I'm computing embeddings for a long list of English text sequences (each of length <= 300 characters) using Laser.embed_sentences(). My GPU is an RTX 3080.
The property Laser.bpeSentenceEmbedding.encoder.use_cuda is True, indicating that the GPU is detected and Laser is attempting to use it. GPU memory is reserved as expected.
However, the GPU remains idle, at 0 to 2% utilisation, with the python process using 100% of a CPU thread. The performance is the same when I disable the GPU entirely, as is the CPU utilisation.
This leads me to believe that although something is taking up GPU memory, the inference is being done on the CPU. This is puzzling, as looking at the source code, the model and data are both unambiguously moved to the GPU if use_cuda is enabled.
Assuming it's not a quirk of my setup, I think this is quite a high priority issue as applications of this library on large amounts of data require GPU acceleration to be economical.
/opt/conda/lib/python3.6/site-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
700 unpickler = pickle_module.Unpickler(f, **pickle_load_args)
701 unpickler.persistent_load = persistent_load
--> 702 result = unpickler.load()
703
704 deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)
UnpicklingError: pickle data was truncated
Sometimes the calculation of embeddings is not performed, without any errors or logs.
Is it possible that there is some error inside the embed_sentences() function that is not caught and logged?
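Until that is diagnosed, a defensive check after each call can at least surface silent failures. A sketch; the shape/NaN checks are my own convention, not part of the library:

```python
import numpy as np

def check_embeddings(emb, n_sentences, dim=1024):
    """Raise loudly if the returned embeddings are malformed,
    instead of letting a bad result propagate silently."""
    emb = np.asarray(emb)
    if emb.shape != (n_sentences, dim):
        raise ValueError(f"unexpected embedding shape {emb.shape}")
    if not np.isfinite(emb).all():
        raise ValueError("embeddings contain NaN or inf values")
    return emb

# Stand-in array; with the real library, pass the result of
# laser.embed_sentences(...) here instead.
ok = check_embeddings(np.zeros((3, 1024), dtype=np.float32), 3)
print(ok.shape)  # (3, 1024)
```

The "pickle data was truncated" error itself usually means the downloaded model file is incomplete; re-downloading the models is worth trying first.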
Hi there,
As I'm trying to understand how LASER works, I tried this:
laser.embed_sentences("wer we2dwdfw ewrwer", lang="nl")
and I got a result whose norm is 0.65.
My question is whether it makes sense to talk about out of vocabulary for LASER?
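Not the author, but as I understand it LASER's BPE splits any input string into known subword units, so nothing is strictly out-of-vocabulary; gibberish still maps to some sequence of subwords and hence to some vector. Note also that the raw vectors are not unit-normalised, so comparing norms directly is not very meaningful; a common convention (not something the library does for you) is to L2-normalise first. Sketch with a stand-in vector:

```python
import numpy as np

v = np.array([0.3, 0.4, 0.0], dtype=np.float32)  # stand-in embedding
norm = np.linalg.norm(v)
unit = v / norm  # L2-normalise before cosine-style comparisons

print(round(float(norm), 6))                  # 0.5
print(round(float(np.linalg.norm(unit)), 6))  # 1.0
```

After normalisation, dot products between vectors are cosine similarities, which are easier to interpret than raw norms.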
Hello!
First of all, thank you very much for this light and simple-to-use LASER model fork.
Do you have any plans to update laserembeddings to new LASER2 and LASER3 models?
For a face-to-face comparison of laserembeddings against the original model, the flores-101 or flores-200 dataset can also be used.
Cheers!
I really love your work porting LASER as a Python pip package.
I am new and trying to learn the use of these embeddings.
What I have done so far:
I have a list of sentences in three languages (let's say L_1, L_2, L_3).
final = L_1 + L_2 + L_3
I generated the embeddings as shown below:
import laserembeddings as le

laser = le.Laser()
langs = []
for i in range(len(L_1)):
    langs.append('L_1')
for i in range(len(L_2)):
    langs.append('L_2')
for i in range(len(L_3)):
    langs.append('L_3')
embeddings = laser.embed_sentences(final, lang=langs)
I am assuming that the index of each embedding corresponds to its sentence in final.
Now, for finding the similarity between embeddings, I converted them into gensim KeyedVectors so that we have the flexibility of using functions like similar_by_vector(), etc.
from gensim.models import KeyedVectors

Lang_based_keys = [sent for sent in final]
sent_vecs = KeyedVectors(vector_size=embeddings.shape[1])
sent_vecs.add(Lang_based_keys, embeddings)
But here I am having an issue. Suppose a sentence that was present in final at the time of generating embeddings: "The iPhone SDK, set programming tools developers, enhanced support development iPad".
When I try to see which vectors in the embedding space are closest to the vector of that sentence, as follows:
word = "The iPhone SDK, set programming tools developers, enhanced support development iPad"
sent_vecs.similar_by_vector(sent_vecs[word], topn=100)
what I see is that the closest one given by the model is:
[('Poborsky played minutes, 291 minutes, Czech Republic Euro 2004', 1.0),
....
..]
How is it possible that the similarity is 1.0, which would mean both sentences have the same vector?
Kindly correct me wherever I am wrong.
Thank you.
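One way to rule gensim out is to recompute the cosine similarity directly from the embeddings matrix: if two different sentences really score exactly 1.0, their raw vectors should be identical too. Also note that gensim reports float32 scores, so a displayed 1.0 can be round-off from a value like 0.9999999. A numpy sketch with stand-in vectors (with the real data, compare embeddings[i] and embeddings[j]):

```python
import numpy as np

def cosine(u, v):
    # Plain cosine similarity, computed in float64 for full precision.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-ins for two rows of the embeddings matrix.
u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 2.0, 3.0001])

print(cosine(u, v) == 1.0)   # False: close to 1, but not exactly 1
print(np.array_equal(u, v))  # False: the raw vectors differ
```

Also worth checking: KeyedVectors keys are the sentence strings here, so any duplicate sentences in final would silently collapse onto one key.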
I've recently experienced the error AttributeError: 'BPECodesAdapter' object has no attribute 'read', which comes from the subword_nmt package used during tokenization. Apparently, this package had some recent modifications. It's better to pin the specific version when installing laserembeddings to avoid the problem, i.e. pip install subword-nmt==0.3.7
How can I approach Japanese support on Windows?
What kinds of issues are there when using Japanese on Windows?
Thanks!!
Hello and thank you for this library!
I have a question regarding how different sentence lengths are treated. Here's the code I ran:
from laserembeddings import Laser as ls
from laserembeddings.preprocessing import Tokenizer

lang = "nl"
tokenizer = Tokenizer(lang)
sent_to_embed = tokenizer.tokenize(my_big_sentence)  # my_big_sentence is actually a whole document
# len(sent_to_embed.split(" ")) == 14037

laser = ls()
laser_extended = ls(embedding_options={"max_tokens": 40000, "max_sentences": 400})
t = sent_to_embed.split(" ")
out = []
for split_at in range(100, len(t), 5000):
    print(split_at)
    out.append({
        "split_at": split_at,
        "default_embedding": laser.embed_sentences(" ".join(t[:split_at]), lang),
        "extended_embedding": laser_extended.embed_sentences(" ".join(t[:split_at]), lang),
    })
out.append({
    "split_at": len(t),
    "default_embedding": laser.embed_sentences(" ".join(t), lang),
    "extended_embedding": laser_extended.embed_sentences(" ".join(t), lang),
})
Then I computed the cosine similarity between the last embedding (i.e. out[-1]) and the other ones; the result is in the plot below.
As you can see, one can't differentiate the results from the two LASER instances (laser and laser_extended). Is this expected? I also get the very same result with max_tokens = 200. I would have expected that the result stops changing once the number of tokens exceeds this parameter.