
Comments (3)

jwijffels avatar jwijffels commented on August 18, 2024

The package always starts from building a model based on a file.
If you can construct a file in the following format (see the StarSpace README, https://github.com/facebookresearch/StarSpace/blob/master/README.md), you can build a model with useWeight = TRUE:

word_1:wt_1 word_2:wt_2 ... word_k:wt_k __label__1:lwt_1 ... __label__r:lwt_r
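As a minimal sketch of what one such line could look like, the base R snippet below builds a weighted line and writes it to a temporary file. The tokens, the label name STATISTIEK and all weights are made up for illustration; only the word:weight and __label__name:weight layout comes from the format above.

```r
# Hypothetical toy input: tokens, label and weights are invented for illustration.
tokens        <- c("Kunt", "u", "cijfers", "meedelen")
token_weights <- c(0, 0, 1, 0)
labels        <- c("STATISTIEK")
label_weights <- c(1)

# Build "word:weight" pairs followed by "__label__name:weight" pairs.
features <- paste(tokens, token_weights, sep = ":")
targets  <- paste0("__label__", labels, ":", label_weights)
line     <- paste(c(features, targets), collapse = " ")

# Write one example per line to a temporary training file.
path <- tempfile(fileext = ".txt")
writeLines(line, con = path)
print(readLines(path))
```

A file written this way is what you would then pass to the model-building step, together with useWeight = TRUE.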

It might as well be that you are looking for something called word mover's distance (http://proceedings.mlr.press/v37/kusnerb15.pdf)? While I was working on the R package doc2vec (https://www.bnosac.be/index.php/blog/103-doc2vec-in-r and https://github.com/bnosac/doc2vec), I saw that the C++ backend there also allows you to provide weights for certain words, but I removed that functionality last week in order to comply with CRAN policies.
The R package text2vec from @dselivanov has a function called RelaxedWordMoversDistance, into which you can plug the embeddings coming from any of the R packages ruimtehol, text2vec, word2vec or doc2vec.

And nothing stops you from calculating a different embedding for each document yourself, using whichever linear combination you like of the word vectors that come out of these different packages.
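To make the linear-combination idea concrete, here is a small base R sketch that computes a document embedding as a weighted average of word vectors. The embedding matrix, the weights and the helper name weighted_doc_embedding are all hypothetical; in practice the word vectors would be extracted from a model trained with one of the packages mentioned above.

```r
# Toy word embeddings (made-up numbers), one row per word.
embeddings <- rbind(
  Kunt     = c(0.1, 0.3),
  u        = c(0.0, 0.2),
  cijfers  = c(0.9, 0.5),
  meedelen = c(0.4, 0.1)
)
# Hypothetical per-word importance weights.
weights <- c(Kunt = 0.5, u = 0.5, cijfers = 3, meedelen = 1)

# Hypothetical helper: weighted average of the vectors of the known tokens.
weighted_doc_embedding <- function(tokens, embeddings, weights) {
  known <- tokens[tokens %in% rownames(embeddings)]
  w <- weights[known]
  stopifnot(length(w) > 0, !anyNA(w), sum(w) > 0)
  w <- w / sum(w)  # normalise so the weights sum to 1
  as.numeric(crossprod(w, embeddings[known, , drop = FALSE]))
}

doc_vec <- weighted_doc_embedding(c("Kunt", "u", "cijfers", "meedelen", "?"),
                                  embeddings, weights)
print(doc_vec)
```

Tokens without a word vector (here the question mark) are simply dropped; other choices, such as TF-IDF weights, fit the same pattern.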

from ruimtehol.

guivivi avatar guivivi commented on August 18, 2024

Hi Jan, many thanks for the insights.

Regarding creating the file with weights, I think I have been able to do it. Following the second example of embed_sentencespace, the idea is to add a column with the weights and paste it onto the tokens. Here is an illustration for the case where I want to highlight the importance of the word 'cijfers':

library(udpipe)
library(dplyr)  # provides %>%, filter() and mutate()
data(dekamer, package = "ruimtehol")
x <- udpipe(dekamer$question, "dutch", tagger = "none", parser = "none", trace = 100)
x <- x[, c("doc_id", "sentence_id", "sentence", "token")]

# Keep a single sentence and give weight 1 to 'cijfers', 0 to all other tokens
x <- x %>% 
  filter(doc_id == "doc115", sentence_id == "7") %>%
  mutate(weight = ifelse(token == "cijfers", 1, 0))
x
doc_id  sentence_id  sentence                  token     weight
doc115  7            Kunt u cijfers meedelen?  Kunt      0
doc115  7            Kunt u cijfers meedelen?  u         0
doc115  7            Kunt u cijfers meedelen?  cijfers   1
doc115  7            Kunt u cijfers meedelen?  meedelen  0
doc115  7            Kunt u cijfers meedelen?  ?         0

x <- split(x, f = x$doc_id)
x <- sapply(x, FUN = function(tokens) {
  sentences <- split(tokens, tokens$sentence_id)
  sentences <- sapply(sentences, FUN = function(x) paste(x$token, ":", x$weight, sep = "", 
                                                         collapse = " "))
  paste(sentences, collapse = "\t")
})  
x
"Kunt:0 u:0 cijfers:1 meedelen:0 ?:0"

For anyone interested, the extended function is available at:
https://www.uv.es/vivigui/docs/embed_sentencespace_weighted.R

Basically, I have added the paste(x$token, ":", x$weight, sep = "", collapse = " ") call shown above and the condition
stopifnot(all(c("doc_id", "sentence_id", "token", "weight") %in% colnames(x)))

I have run a couple of tests with embed_sentencespace_weighted(..., useWeight = TRUE) and it does indeed seem to take the added weights into account.

Please correct me if my procedure is wrong.

I am now going to read up on the word mover's distance, a concept that is new to me.


jwijffels avatar jwijffels commented on August 18, 2024

Looks correct to me

