word2vec's People

Contributors

jwijffels, koheiw

word2vec's Issues

compare the similarity of two words

Hello,
I have two questions. Can you help me?

  1. How do I compare the similarity of two words with word2vec?
    For example, the similarity of "negro" and "struggle", which would let me trace how the idea of the Black struggle for liberation evolved.
  2. How do I compare the similarity of two groups of words with word2vec?
    For example, the similarity between c("negro", "negroes", "negro") and c("revolution", "movement"). This compares two sets of related concepts.

I found a solution in the wordVectors package, but that package does not support Chinese, so I would like to ask how to do this with the word2vec package.
The example code that uses the wordVectors package is linked below.

https://github.com/kkdey/Black_magazines
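
Both comparisons can be computed from an embedding matrix with plain cosine similarity. The sketch below uses base R only. The matrix values and words are made up for illustration, and averaging the vectors of a group before comparing is one common choice, not necessarily what wordVectors does. With a trained model, a matrix of this shape (one row per word) could presumably be obtained with as.matrix(model).

```r
# Toy embedding matrix: one row per word (values are invented placeholders).
emb <- rbind(
  struggle   = c(0.7, 0.3, 0.1),
  liberation = c(0.6, 0.4, 0.2),
  revolution = c(0.2, 0.9, 0.5),
  movement   = c(0.3, 0.8, 0.6)
)

# Cosine similarity between two numeric vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

# 1. Similarity of two words.
sim_words <- cosine(emb["struggle", ], emb["revolution", ])

# 2. Similarity of two groups of words: compare the mean vector of each group.
#    unique() drops repeated words so they are not counted twice.
group_vector <- function(words) colMeans(emb[unique(words), , drop = FALSE])
sim_groups <- cosine(group_vector(c("struggle", "liberation")),
                     group_vector(c("revolution", "movement")))

sim_words
sim_groups
```

Both values lie between -1 and 1, and higher values mean the words (or groups) point in more similar directions.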

How to add words to dictionary

Thank you for developing this package! I am trying to make predictions using the model built with the function word2vec. However, the keywords I am interested in are not part of the dictionary. So when trying the following line:

lookslike_in <- predict(model_in, c("invertebrates", "macrofauna", "meiofauna", "meiobentho"), type = "nearest", top_n = 5)

I get the error message:
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: macrofauna

My question is: how can I add words to the dictionary?

Thank you in advance,
Hadassa
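
A trained model can only look up words that ended up in its vocabulary, so one way to avoid the error is to check the keywords against the vocabulary before calling predict(). The sketch below uses base R with a hypothetical vocabulary vector. With a real model, the vocabulary could presumably be taken from rownames(as.matrix(model_in)). Words that are missing because they are rare in the corpus may appear after retraining with a lower min_count. Words that never occur in the training text cannot be looked up at all.

```r
# Hypothetical vocabulary, standing in for rownames(as.matrix(model_in)).
vocabulary <- c("invertebrates", "meiofauna", "sediment", "benthic")

keywords <- c("invertebrates", "macrofauna", "meiofauna", "meiobentho")

known   <- intersect(keywords, vocabulary)  # safe to pass to predict()
missing <- setdiff(keywords, vocabulary)    # these would trigger the error

known
missing
```

Only the words in known would then be passed on, for example predict(model_in, known, type = "nearest", top_n = 5).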

model is error

(1) The first version works:
sentences.join=c("子 夏 问 曰 巧 笑 倩 兮 美 目 盼 兮 素 ","以 为 绚 兮 何 谓 也 子 曰 绘 事 后 素 曰 礼")
model <- word2vec(x = sentences.join, type="cbow",min_count = 1,window =8 )

(2) The second version fails:
sentences.join=paste(c("子 夏 问 曰 巧 笑 倩 兮 美 目 盼 兮 素") ,c("以 为 绚 兮 何 谓 也 子 曰 绘 事 后 素 曰 礼"),sep=" ")
model <- word2vec(x = sentences.join, type="cbow",min_count = 1,window =8 )

Training failed: fileMapper: 子 夏 问 曰 巧 笑 倩 兮 美 目 盼 兮 素 以 为 绚 兮 何 谓 也 子 曰 绘 事 后 素 曰 礼 - No such file or directory

After combining the two strings into a single string with paste(), training fails. Why?
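
The difference between the two calls is the length of the input. The "No such file or directory" message suggests that the single pasted string was interpreted as a file name. This is an assumption based on the error text, not a confirmed description of the package internals. The base R sketch below only shows how c() and paste() differ:

```r
a <- "子 夏 问 曰 巧 笑 倩 兮 美 目 盼 兮 素"
b <- "以 为 绚 兮 何 谓 也 子 曰 绘 事 后 素 曰 礼"

two_sentences <- c(a, b)                 # character vector with 2 elements
one_string    <- paste(a, b, sep = " ")  # character vector with 1 element

length(two_sentences)
length(one_string)
```

If that reading is right, keeping the sentences as separate elements of the vector, as in version (1), avoids the problem.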

add unit tests

Add unit tests using R package tinytest

  • to test if embeddings based on list approach and file based approach are the same for different settings of the algorithm
  • to test if the embeddings are the same across platforms
  • to test if the embeddings stay the same when the internals are changed
  • to test if the resulting dimension of the embedding is correct
  • to evaluate inconsistent input data
  • to evaluate if the expected tokens are in the data
  • to evaluate different hyperparameters of the model

Avoid adding complex package dependencies in Suggests, to prevent continuous integration, CRAN build, and dependency issues.

not chinese

file_in contains the text shown in the screenshot.
model <- word2vec(x = file_in, type = "cbow", dim = 15, iter = 20)
lookslike <- predict(model, c( "鹰"), type = "nearest", top_n = 5)
lookslike

Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 鹰

But 鹰 does appear in that text. Can you provide an example in Chinese?

Expand functionality to different word embedding files

Although there is a read.wordvectors function that can read in a plain text file with vectors, the predict.word2vec function only works on 'model' objects, which cannot be created from these word vector files.

Would it be possible to have the predict.word2vec function work on only the embedding matrix? This way, it would be possible to use it for all types of word vector models, e.g. trained with fasttext.
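
Until something like this is supported, a nearest-neighbour lookup can be done directly on an embedding matrix in base R. The sketch below is an illustration, not the package's implementation. The function name nearest_words and the toy matrix are hypothetical, and it assumes a numeric matrix with the words as row names, whichever tool produced the vectors.

```r
# Return the top_n rows of `emb` most similar (by cosine) to `word`.
nearest_words <- function(emb, word, top_n = 5) {
  stopifnot(word %in% rownames(emb))
  # Scale every row to unit length so a dot product equals cosine similarity.
  unit <- emb / sqrt(rowSums(emb^2))
  sims <- drop(unit %*% unit[word, ])
  sims <- sims[names(sims) != word]        # leave out the query word itself
  sims <- sort(sims, decreasing = TRUE)
  head(data.frame(term = names(sims), similarity = unname(sims),
                  stringsAsFactors = FALSE), top_n)
}

# Toy embedding matrix with invented values.
emb <- rbind(
  cat   = c(1.0, 0.1, 0.0),
  dog   = c(0.9, 0.2, 0.0),
  car   = c(0.0, 0.1, 1.0),
  truck = c(0.1, 0.0, 0.9)
)

nearest_words(emb, "cat", top_n = 2)
```

In this toy data, "dog" comes out as the closest neighbour of "cat".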

list of improvements

  • allow to pass a list of integers instead of tokens to the word2vec function
  • see how to remove the embedding of </s>
  • abandon file-based approach
  • speed up for Xptr's like quanteda objects to avoid copying data?
  • other speed improvements
  • progress bar
  • functionalities for downstream processing
    - plotting or functionalities in https://github.com/bnosac/textplot
    - downstream topic modelling like https://github.com/bnosac/ETM or as a replacement of SVD's for semi-supervised stuff
    - embeddings on sentencepiece/tokenisers.bpe tokenised data
    - pretrained models
    - further input to torch models
    - deeper integration of the similarities like https://github.com/bnosac/doc2vec or https://koheiw.github.io/LSX

Using pre-trained vectors with word2vec

Hi,

Although you mention it's a possibility, I can't find a clear code example showing how to use a downloaded pre-trained model on a local corpus of text with the R word2vec package.
Can you help me with that?
Thank you!

avoid file

Allow building models directly from a character vector instead of loading from a file,
since the file-based approach is sometimes annoying with small data, especially with UTF-8.

similarity measure

I think typically NLP applications use cosine similarity to measure proximity between vectors, especially in the context of embeddings. The fact that word2vec::word2vec_similarity() does not implement cosine similarity might surprise quite a few users. I would recommend switching to cosine similarity.

library(magrittr)  # provides the %>% pipe used for res1

m <- matrix(data = c(1,2,3,1,2,3,1,4,99), nrow = 3, ncol = 3, byrow = TRUE)

## here are two efficient ways to implement cosine similarity
## from https://stats.stackexchange.com/questions/31565/compute-a-cosine-dissimilarity-matrix-in-r

## 1. normalise each row to unit length, then take all pairwise dot products
res1 <- m %>% 
  {. / sqrt(rowSums(. * .))} %>% 
  {. %*% t(.)}

## 2. pairwise dot products divided by the product of the row norms
cos.sim=function(ma, mb){
  mat=tcrossprod(ma, mb)
  t1=sqrt(apply(ma, 1, crossprod))
  t2=sqrt(apply(mb, 1, crossprod))
  mat / outer(t1,t2)
}

res2 <- cos.sim(m,m)

res1
res2

## compare with

word2vec::word2vec_similarity(c(1,2,3),c(1,2,3))

## or

word2vec::word2vec_similarity(c(0.1,0.2,0.3),c(0.1,0.2,0.3))
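
The two vectors in the calls above differ only in scale. The base R sketch below shows why that matters. It does not reproduce what word2vec_similarity() computes internally. It only contrasts an unnormalised dot product, which changes with the scale of the vectors, with cosine similarity, which does not.

```r
x1 <- c(1, 2, 3)
x2 <- c(0.1, 0.2, 0.3)

dot    <- function(a, b) sum(a * b)
cosine <- function(a, b) dot(a, b) / sqrt(dot(a, a) * dot(b, b))

dot(x1, x1)     # 14
dot(x2, x2)     # 0.14, up to floating point error
cosine(x1, x1)  # 1
cosine(x2, x2)  # 1, up to floating point error
```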

Supporting a list of tokens

Hi @jwijffels

Your word2vec looks great. I wanted to use word2vec to generate word vectors for my LSX package, so I wonder if you have a plan to support a list of tokens as an input.

As far as I understand, I need to convert quanteda's tokens object (list of token IDs) to character strings to pass that to word2vec(). I think it is more efficient to feed texts to the C++ functions without writing them to temporary files.

## Current 
toks_char <- stringi::stri_c_list(as.list(toks), sep = " ")
w2v <- word2vec(x = toks_char, dim = 100, iter = 20, threads = 8, split = c(" ", "\n"))

## My ideal
w2v <- word2vec(x = as.list(toks), dim = 100, iter = 20, threads = 8, split = NULL)

If you like the idea, I think I can contribute.
