bnosac / word2vec
Distributed Representations of Words using word2vec
License: Apache License 2.0
Hello,
I have a problem; can you help me?
I found a solution in a package called wordVectors, but that package does not support Chinese, so I would like to ask how to do the same thing with this word2vec package.
Below is the example code from the wordVectors package for you to check.
Thank you for developing this package! I am trying to make predictions using the model built with the function word2vec. However, the keywords I am interested in are not part of the dictionary. So when trying the following line:
lookslike_in <- predict(model_in, c("invertebrates", "macrofauna", "meiofauna", "meiobentho"), type = "nearest", top_n = 5)
I get the error message:
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: macrofauna
My question is: how can I add words to the dictionary?
Thank you in advance,
Hadassa
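The error message indicates that the term is missing from the model vocabulary. One possible workaround, short of retraining on text that contains the terms, is to query only the terms the model knows. A minimal sketch in base R: the helper name split_by_vocabulary and the toy vocabulary are made up for illustration, and the commented lines assume that as.matrix() on the model returns the embedding matrix with terms as row names.

```r
# Hypothetical helper (not part of the word2vec package):
# split query terms into those present in / absent from a vocabulary.
split_by_vocabulary <- function(terms, vocabulary) {
  known <- terms %in% vocabulary
  list(known = terms[known], unknown = terms[!known])
}

# Toy vocabulary, made up for this example.
vocabulary <- c("invertebrates", "meiofauna", "benthos")
terms <- c("invertebrates", "macrofauna", "meiofauna", "meiobentho")

checked <- split_by_vocabulary(terms, vocabulary)
checked$known    # terms that can be passed to predict()
checked$unknown  # terms that would trigger the error

# With a trained model (assumption, see lead-in):
# vocabulary <- rownames(as.matrix(model_in))
# predict(model_in, checked$known, type = "nearest", top_n = 5)
```

Terms that end up in the unknown group can only get a vector if they occur often enough in the training text, so a lower min_count or a larger corpus may be worth trying.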
(1) The first call works:
sentences.join=c("子 夏 问 曰 巧 笑 倩 兮 美 目 盼 兮 素 ","以 为 绚 兮 何 谓 也 子 曰 绘 事 后 素 曰 礼")
model <- word2vec(x = sentences.join, type="cbow",min_count = 1,window =8 )
(2) The second call fails:
sentences.join=paste(c("子 夏 问 曰 巧 笑 倩 兮 美 目 盼 兮 素") ,c("以 为 绚 兮 何 谓 也 子 曰 绘 事 后 素 曰 礼"),sep=" ")
model <- word2vec(x = sentences.join, type="cbow",min_count = 1,window =8 )
Training failed: fileMapper: 子 夏 问 曰 巧 笑 倩 兮 美 目 盼 兮 素 以 为 绚 兮 何 谓 也 子 曰 绘 事 后 素 曰 礼 - No such file or directory
After combining the two strings into a single string with paste(), training fails. Why?
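The difference between the two inputs can be checked without training anything. c() keeps two elements, while paste() with sep joins them into one string, and the "No such file or directory" message suggests that a single string is being treated as a file path. A small base R check:

```r
first <- "子 夏 问 曰 巧 笑 倩 兮 美 目 盼 兮 素"
second <- "以 为 绚 兮 何 谓 也 子 曰 绘 事 后 素 曰 礼"

as_vector <- c(first, second)                 # two documents
as_string <- paste(first, second, sep = " ")  # one long string

length(as_vector)  # number of elements passed in case (1)
length(as_string)  # number of elements passed in case (2)
```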
Add unit tests using the R package tinytest.
Avoid adding complex package dependencies in Suggests, to prevent continuous integration, CRAN build and dependency issues.
file_in contains the text shown in this picture.
model <- word2vec(x = file_in, type = "cbow", dim = 15, iter = 20)
lookslike <- predict(model, c( "鹰"), type = "nearest", top_n = 5)
lookslike
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 鹰
But 鹰 does occur in that text. Can you provide an example in Chinese?
See issue #6
Switch to using writeLines(text = x, con = filehandle_train, useBytes = TRUE); otherwise Windows re-encodes the text.
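A rough base R sketch of that idea, assuming the text is already UTF-8: writing with useBytes = TRUE passes the bytes through unchanged, and reading back with encoding = "UTF-8" should return the same strings. The temporary file and variable names are made up for the example.

```r
# UTF-8 sample text, written with \u escapes so the script itself stays ASCII.
x <- c("\u5b50 \u590f \u95ee \u66f0", "\u7ed8 \u4e8b \u540e \u7d20")

path <- tempfile(fileext = ".txt")
filehandle_train <- file(path, open = "wb")
# useBytes = TRUE: write the strings byte for byte, without re-encoding
# them to the native encoding of the session.
writeLines(text = x, con = filehandle_train, useBytes = TRUE)
close(filehandle_train)

# Read the file back and mark the lines as UTF-8.
back <- readLines(path, encoding = "UTF-8")
identical(back, x)
```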
If someone tries to build a word2vec model from a single text string, the resulting error message will be unwieldy:
word2vec(paste(sample(letters, 100000, replace = TRUE), collapse = ""), dim = 2)
Limit the error message to its first 2500 characters?
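One way to cap the echoed input, sketched in base R. The helper truncate_for_message is hypothetical and not part of the package; the 2500-character limit is simply the value suggested above.

```r
# Hypothetical helper: shorten long strings before putting them in a message.
truncate_for_message <- function(x, max_chars = 2500L) {
  ifelse(nchar(x) > max_chars,
         paste0(substr(x, 1L, max_chars), "..."),
         x)
}

set.seed(1)
long_text <- paste(sample(letters, 100000, replace = TRUE), collapse = "")
short <- truncate_for_message(long_text)
nchar(short)  # 2500 kept characters plus the 3-character "..." marker
```

Short inputs pass through unchanged, so the helper could be applied to every path or text before it is pasted into an error message.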
Although there is a read.wordvectors function that can read in a plain text file with vectors, the predict.word2vec function only works on 'model' objects, which cannot be created from these word vector files.
Would it be possible to have the predict.word2vec function work on only the embedding matrix? This way, it would be possible to use it for all types of word vector models, e.g. trained with fasttext.
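Until something like that exists, nearest neighbours can be computed directly on an embedding matrix. A minimal base R sketch using cosine similarity, assuming a numeric matrix with one row per term and the terms as row names; nearest_terms is a hypothetical helper, not the package's predict method, and the toy matrix is made up.

```r
# Hypothetical helper: top_n most similar terms by cosine similarity.
nearest_terms <- function(embedding, term, top_n = 5L) {
  stopifnot(term %in% rownames(embedding))
  # Scale every row to unit length so dot products equal cosine similarities.
  unit <- embedding / sqrt(rowSums(embedding^2))
  sims <- drop(unit %*% unit[term, ])
  sims <- sims[names(sims) != term]  # do not return the query term itself
  head(sort(sims, decreasing = TRUE), top_n)
}

# Toy 2-dimensional embedding, made up for illustration.
embedding <- matrix(
  c( 1.0, 0.0,
     0.9, 0.1,
     0.0, 1.0,
    -1.0, 0.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("a", "b", "c", "d"), NULL)
)

nearest_terms(embedding, "a", top_n = 2)
```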
Training any word2vec() model fails on Fedora 37 with the binary from the iucar/cran COPR repository. I first reported the problem there, but the maintainer makes clear that it is a bug in the word2vec package. He has posted some first insights in the issue.
Hi,
Although you mention it's a possibility, I can't find a clear code example showing how to use a downloaded pre-trained model on a local corpus of text with the R word2vec package.
Can you help me with that?
Thank you!
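A possible starting point, as a sketch only: it assumes the downloaded model is a binary file that read.word2vec() can load and that the terms of interest are in its vocabulary. The wrapper name embed_with_pretrained and the file name in the usage comment are made up.

```r
# Hypothetical wrapper around two functions of the word2vec package:
# load a pre-trained binary model and look up embeddings for some terms.
embed_with_pretrained <- function(path, terms) {
  model <- word2vec::read.word2vec(file = path, normalize = TRUE)
  predict(model, newdata = terms, type = "embedding")
}

# Usage (file name is a placeholder for the model you downloaded):
# emb <- embed_with_pretrained("pretrained_model.bin", c("bus", "toilet"))
# dim(emb)
```

The terms would come from your own corpus after tokenising it; terms missing from the pre-trained vocabulary will not have a usable vector.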
Consider renaming doc2vec to phrase2vec and incorporate the paragraph2vec also known as doc2vec from https://www.github.com/jwijffels/doc2vec
Allow building models directly from a character vector instead of loading from a file, since the file route is sometimes annoying with small data, especially with UTF-8.
I think typically NLP applications use cosine similarity to measure proximity between vectors, especially in the context of embeddings. The fact that word2vec::word2vec_similarity() does not implement cosine similarity might surprise quite a few users. I would recommend switching to cosine similarity.
library(magrittr)  # provides the %>% pipe used below
m <- matrix(data = c(1,2,3,1,2,3,1,4,99), nrow = 3, ncol = 3, byrow = TRUE)
## here are two efficient ways to implement cosine similarity
## from https://stats.stackexchange.com/questions/31565/compute-a-cosine-dissimilarity-matrix-in-r
## (1) scale each row to unit length, then take all pairwise dot products
res1 <- m %>%
  {. / sqrt(rowSums(. * .))} %>%
  {. %*% t(.)}
## (2) pairwise dot products divided by the outer product of the row norms
cos.sim <- function(ma, mb){
  mat <- tcrossprod(ma, mb)
  t1 <- sqrt(apply(ma, 1, crossprod))
  t2 <- sqrt(apply(mb, 1, crossprod))
  mat / outer(t1, t2)
}
res2 <- cos.sim(m, m)
res1
res2
## compare with
word2vec::word2vec_similarity(c(1,2,3),c(1,2,3))
## or
word2vec::word2vec_similarity(c(0.1,0.2,0.3),c(0.1,0.2,0.3))
Hi @jwijffels
Your word2vec looks great. I wanted to use word2vec to generate word vectors for my LSX package, so I wonder if you have a plan to support a list of tokens as an input.
As far as I understand, I need to convert quanteda's tokens object (list of token IDs) to character strings to pass that to word2vec(). I think it is more efficient to feed texts to the C++ functions without writing them to temporary files.
## Current
toks_char <- stringi::stri_c_list(as.list(toks), sep = " ")
w2v <- word2vec(x = toks_char, dim = 100, iter = 20, threads = 8, split = c(" ", "\n"))
## My ideal
w2v <- word2vec(x = as.list(toks), dim = 100, iter = 20, threads = 8, split = NULL)
If you like the idea, I think I can contribute.