bnosac / word2vec

Distributed Representations of Words using word2vec

License: Apache License 2.0

R 19.77% C++ 77.57% CMake 2.35% C 0.31%

embeddings natural-language-processing r-package word2vec

word2vec's People

qingshuinayan lucaz01 randef1ned koheiw

word2vec's Issues

compare the similarity of two words

Hello,
I have two questions. Could you help me?

How do I compare the similarity of two words with word2vec?
For example, the similarity of "negro" and "struggle", since this indicates how the idea of the struggle of blacks for liberation evolved.
How do I compare the similarity of two groups of words with word2vec?
For example, the degree of similarity between c("negro", "negroes", "negro") and c("revolution", "movement"). This is a comparison between two sets of similar concepts.

I found a solution in the wordVectors package, but that package does not support Chinese, so I would like to ask how to do this with the word2vec package.
The repository below contains example code for the wordVectors package that you can check.

https://github.com/kkdey/Black_magazines
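
Both comparisons can be sketched in base R once the word vectors are available as a matrix with one row per word. The matrix below is a toy example with made-up values; the assumption is that as.matrix(model) on a trained model gives a matrix of the same layout. Averaging the vectors of a group before comparing is one common choice, not something the package prescribes.

```r
# Toy embedding matrix with invented values: one row per word.
# Assumption: as.matrix(model) returns a matrix shaped like this.
emb <- rbind(
  negro      = c(0.9, 0.1, 0.3),
  negroes    = c(0.8, 0.2, 0.4),
  struggle   = c(0.7, 0.3, 0.1),
  revolution = c(0.2, 0.9, 0.5),
  movement   = c(0.3, 0.8, 0.6)
)

# Cosine similarity between two numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# (1) Similarity of two words
sim_words <- cosine(emb["negro", ], emb["struggle", ])

# (2) Similarity of two groups: compare the mean vector of each group.
# unique() drops repeated words so they are not counted twice.
group_vector <- function(emb, words) {
  colMeans(emb[unique(words), , drop = FALSE])
}
sim_groups <- cosine(
  group_vector(emb, c("negro", "negroes", "negro")),
  group_vector(emb, c("revolution", "movement"))
)
```

Because only the matrix is needed, the same sketch applies to Chinese tokens as long as they appear as row names.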

How to add words to dictionary

Thank you for developing this package! I am trying to make predictions using the model built with the function word2vec. However, the keywords I am interested in are not part of the dictionary. So when trying the following line:

lookslike_in <- predict(model_in, c("invertebrates", "macrofauna", "meiofauna", "meiobentho"), type = "nearest", top_n = 5)

I get the error message:
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: macrofauna

My question is: how can I add words to the dictionary?

Thank you in advance,
Hadassa
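
A word2vec model only holds vectors for words that were part of its training vocabulary, so a term such as macrofauna cannot be appended afterwards without retraining on text that contains it. A minimal sketch of a workaround is to filter the query terms against the vocabulary before calling predict(). The `vocabulary` vector here is hypothetical; the assumption is that, with a trained model, it could be taken from rownames(as.matrix(model_in)).

```r
# Hypothetical vocabulary; assumed to come from rownames(as.matrix(model_in))
vocabulary <- c("invertebrates", "meiofauna", "benthos", "sediment")

keywords <- c("invertebrates", "macrofauna", "meiofauna", "meiobentho")

known   <- intersect(keywords, vocabulary)  # safe to pass to predict()
missing <- setdiff(keywords, vocabulary)    # these would trigger the error

if (length(missing) > 0) {
  message("Not in the dictionary: ", paste(missing, collapse = ", "))
}
```

If a missing keyword does occur in the training text, lowering `min_count` when training may bring it into the vocabulary.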

model is error

(1) The first call works:
sentences.join=c("子夏问曰巧笑倩兮美目盼兮素 ","以为绚兮何谓也子曰绘事后素曰礼")
model <- word2vec(x = sentences.join, type="cbow",min_count = 1,window =8 )

(2) The second call fails:
sentences.join=paste(c("子夏问曰巧笑倩兮美目盼兮素") ,c("以为绚兮何谓也子曰绘事后素曰礼"),sep=" ")
model <- word2vec(x = sentences.join, type="cbow",min_count = 1,window =8 )

Training failed: fileMapper: 子夏问曰巧笑倩兮美目盼兮素以为绚兮何谓也子曰绘事后素曰礼 - No such file or directory

After the two strings are pasted into a single string, training fails. Why?
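
The two calls differ in the length of `x`: c() keeps two separate texts, while paste() returns one string. The error message suggests that a character vector of length one is interpreted as the path to a training file; that reading is an assumption based on the message, sketched here in base R.

```r
a <- "子夏问曰巧笑倩兮美目盼兮素"
b <- "以为绚兮何谓也子曰绘事后素曰礼"

as_vector <- c(a, b)                 # two elements: two separate texts
as_string <- paste(a, b, sep = " ")  # one element: a single string

n_vector <- length(as_vector)
n_string <- length(as_string)

# Assumption: a length-one x is looked up as a file on disk, which would
# explain the "No such file or directory" part of the error.
is_file <- file.exists(as_string)
```

Under that assumption, passing the texts as a vector with at least two elements, as in (1), avoids the file lookup.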

add unit tests

Add unit tests using the R package tinytest:

- to test if embeddings based on the list approach and the file-based approach are the same for different settings of the algorithm
- to test if the embeddings are the same across platforms
- to test if the embeddings stay the same when the internals are changed
- to test if the resulting dimension of the embedding is correct
- to evaluate inconsistent input data
- to evaluate if the expected tokens are in the data
- to evaluate different hyperparameters of the model

Avoid adding complex package dependencies in Suggests, to prevent continuous integration, CRAN build and dependency issues.

not chinese

file_in contains the text shown in the picture.
model <- word2vec(x = file_in, type = "cbow", dim = 15, iter = 20)
lookslike <- predict(model, c( "鹰"), type = "nearest", top_n = 5)
lookslike

Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 鹰

But 鹰 does appear in the text shown in the picture. Can you provide an example in Chinese?

avoid reencoding when writing out files

See issue #6

Switch to writeLines(text = x, con = filehandle_train, useBytes = TRUE); otherwise Windows re-encodes the text when writing it out.
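
A minimal base R sketch of the idea: write UTF-8 strings with useBytes = TRUE and check that the bytes read back are unchanged. The sample strings are made up (the first is the character 鹰 written as an escape), and the temporary file stands in for the training file.

```r
# Made-up sample text; \u escapes always produce UTF-8 strings
x <- c("\u9e70", "caf\u00e9")

path <- tempfile(fileext = ".txt")
con  <- file(path, open = "wt")
# useBytes = TRUE writes the strings byte for byte, without translating
# them to the native encoding first
writeLines(text = x, con = con, useBytes = TRUE)
close(con)

back <- readLines(path, encoding = "UTF-8")

# Compare raw bytes so the check does not depend on the locale
same_bytes <- identical(lapply(back, charToRaw), lapply(x, charToRaw))
```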

only print out last parts of a file name in case of loading from file

If someone passes a single long text string to word2vec(), the error message becomes unwieldy, because the whole string is printed as if it were a file name:

word2vec(paste(sample(letters, 100000, replace = TRUE), collapse = ""), dim = 2)

Limit the error message to the first 2500 characters?
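
One way the truncation could look, as a sketch only: `shorten()` is a hypothetical helper, not a function of the package, and the 2500-character limit is taken from the question above.

```r
# Hypothetical helper (not part of the package): shorten a long string
# before it is pasted into an error message
shorten <- function(x, max_chars = 2500L) {
  if (nchar(x) <= max_chars) {
    return(x)
  }
  paste0(substr(x, 1L, max_chars), "...")
}

long_input <- paste(sample(letters, 100000, replace = TRUE), collapse = "")
msg <- shorten(long_input)

nchar(msg)      # 2500 characters plus the three dots: 2503
shorten("abc")  # short input is returned unchanged
```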

Expand functionality to different word embedding files

Although there is a read.wordvectors function that can read in a plain text file with vectors, the predict.word2vec function only works on 'model' objects, which cannot be created from these word vector files.

Would it be possible to have the predict.word2vec function work on only the embedding matrix? This way, it would be possible to use it for all types of word vector models, e.g. trained with fasttext.
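
To illustrate the request, a nearest-neighbour lookup needs nothing beyond the embedding matrix. The sketch below uses base R only; `nearest()` is a hypothetical helper, the matrix values are made up, and the assumption is that a matrix read with read.wordvectors has the words as row names.

```r
# Toy embedding matrix with invented values, words as row names
emb <- rbind(
  cat   = c(1.0, 0.1, 0.0),
  dog   = c(0.9, 0.2, 0.1),
  car   = c(0.0, 1.0, 0.3),
  truck = c(0.1, 0.9, 0.4)
)

# Hypothetical helper: the top_n most similar words by cosine similarity
nearest <- function(emb, word, top_n = 5L) {
  unit <- emb / sqrt(rowSums(emb^2))   # scale every row to unit length
  sims <- drop(unit %*% unit[word, ])  # cosine similarity to `word`
  sims <- sort(sims[names(sims) != word], decreasing = TRUE)
  head(data.frame(term = names(sims), similarity = unname(sims),
                  stringsAsFactors = FALSE), top_n)
}

res <- nearest(emb, "cat", top_n = 2)
```

With these toy values, dog ranks first and truck second for the query cat.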

Training `word2vec` model fails on Fedora Linux

Training any word2vec() model fails on Fedora 37 with the binary from the iucar/cran COPR repository. I first reported the problem there, but the maintainer makes clear that it is a bug in the word2vec package. He has posted some first insights in the issue.

list of improvements

allow to pass a list of integers instead of tokens to the word2vec function
see how to remove the embedding of </s>
abandon file-based approach
speed up for Xptr's like quanteda objects to avoid copying data?
other speed improvements
progress bar
functionalities for downstream processing
- plotting or functionalities in https://github.com/bnosac/textplot
- downstream topic modelling like https://github.com/bnosac/ETM or as a replacement of SVD's for semi-supervised stuff
- embeddings on sentencepiece/tokenisers.bpe tokenised data
- pretrained models
- further input to torch models
- deeper integration of the similarities like https://github.com/bnosac/doc2vec or https://koheiw.github.io/LSX

Using pre-trained vectors with word2vec

Hi,

Although you mention it's a possibility, I can't find a clear code example showing how to use a downloaded pre-trained model on a local corpus of text with the R word2vec package.
Can you help me with that?
Thank you!
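
As a rough illustration of what loading pre-trained vectors involves, the sketch below parses a file in the word2vec text format with base R. The file content is made up, and the base R parsing only stands in for the package's own reader (read.wordvectors is named elsewhere on this page); it is not the package's implementation.

```r
# Write a tiny file in the word2vec text format (made-up values):
# the first line holds "<number of words> <dimension>", then one word per line
path <- tempfile(fileext = ".txt")
writeLines(c("3 2",
             "king 0.5 0.1",
             "queen 0.4 0.2",
             "apple -0.3 0.9"), path)

# Header: vocabulary size and vector dimension
header <- scan(path, what = integer(), nlines = 1, quiet = TRUE)

# Body: first column is the word, the remaining columns are the vector
raw <- read.table(path, skip = 1, quote = "", comment.char = "",
                  stringsAsFactors = FALSE)
emb <- as.matrix(raw[, -1])
dimnames(emb) <- list(raw[[1]], NULL)

emb["queen", ]
```

Once the vectors are in a matrix like `emb`, words from a local corpus can be looked up by row name.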

paragraph2vec / doc2vec

Consider renaming doc2vec to phrase2vec and incorporating paragraph2vec, also known as doc2vec, from https://www.github.com/jwijffels/doc2vec

avoid file

Allow building models directly from a character vector instead of loading from a file,
as the file-based approach is sometimes annoying with small data, especially with UTF-8.

similarity measure

I think typically NLP applications use cosine similarity to measure proximity between vectors, especially in the context of embeddings. The fact that word2vec::word2vec_similarity() does not implement cosine similarity might surprise quite a few users. I would recommend switching to cosine similarity.

library(magrittr)  # provides the %>% pipe used below

m <- matrix(data = c(1,2,3,1,2,3,1,4,99), nrow = 3, ncol = 3, byrow = TRUE)

## here are two efficient ways to implement cosine similarity
## from https://stats.stackexchange.com/questions/31565/compute-a-cosine-dissimilarity-matrix-in-r

## (1) scale each row to unit length, then take all pairwise dot products
res1 <- m %>% 
  {. / sqrt(rowSums(. * .))} %>% 
  {. %*% t(.)}

## (2) pairwise dot products divided by the product of the row norms
cos.sim <- function(ma, mb){
  mat <- tcrossprod(ma, mb)
  t1 <- sqrt(apply(ma, 1, crossprod))
  t2 <- sqrt(apply(mb, 1, crossprod))
  mat / outer(t1, t2)
}

res2 <- cos.sim(m, m)

res1
res2

## compare with

word2vec::word2vec_similarity(c(1,2,3),c(1,2,3))

## or

word2vec::word2vec_similarity(c(0.1,0.2,0.3),c(0.1,0.2,0.3))

Supporting a list of tokens

Hi @jwijffels

Your word2vec looks great. I wanted to use word2vec to generate word vectors for my LSX package, so I wonder if you have a plan to support a list of tokens as an input.

As far as I understand, I need to convert quanteda's tokens object (list of token IDs) to character strings to pass that to word2vec(). I think it is more efficient to feed texts to the C++ functions without writing them to temporary files.

## Current 
toks_char <- stringi::stri_c_list(as.list(toks), sep = " ")
w2v <- word2vec(x = toks_char, dim = 100, iter = 20, threads = 8, split = c(" ", "\n"))

## My ideal
w2v <- word2vec(x = as.list(toks), dim = 100, iter = 20, threads = 8, split = NULL)

If you like the idea, I think I can contribute.