Dear Jan, Many thanks for this outstanding package.

Option of weighting words about ruimtehol HOT 3 OPEN

guivivi commented on August 18, 2024

Option of weighting words


Comments (3)

jwijffels commented on August 18, 2024

The package always starts by building a model from a file.
If you can construct a file in the following format (see the StarSpace README: https://github.com/facebookresearch/StarSpace/blob/master/README.md), you can build a model by specifying useWeight = TRUE:

word_1:wt_1 word_2:wt_2 ... word_k:wt_k __label__1:lwt_1 ... __label__r:lwt_r
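As a minimal sketch of producing one line in that layout with base R only: the helper name format_weighted_line, the weights and the label FINANCE below are made up for illustration and are not part of ruimtehol or StarSpace.

```r
# Hypothetical helper (not part of ruimtehol): format one training example
# as "word:weight ... __label__x:weight".
format_weighted_line <- function(tokens, weights, labels = character(0),
                                 label_weights = rep(1, length(labels)),
                                 label_prefix = "__label__") {
  stopifnot(length(tokens) == length(weights),
            length(labels) == length(label_weights))
  # word_i:wt_i pairs
  words <- paste(tokens, weights, sep = ":")
  # __label__j:lwt_j pairs, only if labels were supplied
  labs <- if (length(labels) > 0) {
    paste0(label_prefix, labels, ":", label_weights)
  } else {
    character(0)
  }
  paste(c(words, labs), collapse = " ")
}

line <- format_weighted_line(
  tokens  = c("Kunt", "u", "cijfers", "meedelen"),
  weights = c(0.5, 0.5, 2, 1),   # made-up weights
  labels  = "FINANCE",           # made-up label
  label_weights = 1
)
line
# "Kunt:0.5 u:0.5 cijfers:2 meedelen:1 __label__FINANCE:1"

# One such line per training example can then be written to a text file
writeLines(line, tempfile(fileext = ".txt"))
```

Each line of the resulting file is one training example; whether the weights are actually used depends on training with useWeight = TRUE as described above.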

It might as well be that you are looking for something called word mover distance (http://proceedings.mlr.press/v37/kusnerb15.pdf). While I was working on the R package doc2vec (https://www.bnosac.be/index.php/blog/103-doc2vec-in-r and https://github.com/bnosac/doc2vec), the C++ backend there also allowed providing weights for certain words, but I removed that functionality last week in order to comply with CRAN policies.
The R package text2vec from @dselivanov has a function called RelaxedWordMoversDistance, into which you can plug the embeddings coming from any of the R packages ruimtehol, text2vec, word2vec or doc2vec.

And nothing stops you from calculating a different embedding for each document yourself, using any linear combination of the word vectors that come out of these packages.
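A minimal base R sketch of such a linear combination, here a weighted average of word vectors. The embedding matrix holds made-up values standing in for whatever word vectors one of the packages returns, and weighted_doc_embedding is a hypothetical helper, not a function from any of these packages.

```r
# Toy embedding matrix: one row per word, values invented for illustration.
embeddings <- matrix(
  c(1, 0,
    0, 1,
    2, 2),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("kunt", "cijfers", "meedelen"), NULL)
)

# Hypothetical helper: weighted average of the word vectors of one document.
# Tokens without an embedding are dropped before averaging.
weighted_doc_embedding <- function(tokens, weights, embeddings) {
  stopifnot(length(tokens) == length(weights))
  known   <- tokens %in% rownames(embeddings)
  tokens  <- tokens[known]
  weights <- weights[known]
  if (length(tokens) == 0 || sum(weights) == 0) {
    return(rep(NA_real_, ncol(embeddings)))
  }
  vectors <- embeddings[tokens, , drop = FALSE]
  # Multiplying a matrix by a vector scales each row by its weight
  colSums(vectors * weights) / sum(weights)
}

doc_vec <- weighted_doc_embedding(
  tokens  = c("kunt", "u", "cijfers", "meedelen"),
  weights = c(1, 1, 3, 1),   # made-up weights, 'cijfers' counts three times
  embeddings = embeddings
)
doc_vec
# 0.6 1.0
```

With real data, the rows of the embedding matrix would be the word vectors extracted from the fitted model, and the weights could come from any scheme, for example tf-idf.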


guivivi commented on August 18, 2024

Hi Jan, many thanks for the insights.

Regarding creating the file with weights, I think I have been able to do it. Following the second example of embed_sentencespace, the idea is to add a column with the weights and paste it onto the tokens. Here is an illustration for the case where I want to highlight the importance of the word 'cijfers':

library(udpipe)
library(dplyr)  # provides %>%, filter() and mutate()
data(dekamer, package = "ruimtehol")
x <- udpipe(dekamer$question, "dutch", tagger = "none", parser = "none", trace = 100)
x <- x[, c("doc_id", "sentence_id", "sentence", "token")]

# Keep a single sentence and give weight 1 to 'cijfers', 0 to all other tokens
x <- x %>%
  filter(doc_id == "doc115", sentence_id == "7") %>%
  mutate(weight = ifelse(token == "cijfers", 1, 0))
x
 doc_id sentence_id                 sentence    token weight
 doc115           7 Kunt u cijfers meedelen?     Kunt      0
 doc115           7 Kunt u cijfers meedelen?        u      0
 doc115           7 Kunt u cijfers meedelen?  cijfers      1
 doc115           7 Kunt u cijfers meedelen? meedelen      0
 doc115           7 Kunt u cijfers meedelen?        ?      0

x <- split(x, f = x$doc_id)
x <- sapply(x, FUN = function(tokens) {
  sentences <- split(tokens, tokens$sentence_id)
  # Build "token:weight" pairs, separated by spaces within a sentence
  sentences <- sapply(sentences, FUN = function(x) {
    paste(x$token, ":", x$weight, sep = "", collapse = " ")
  })
  # Sentences of the same document are separated by tabs
  paste(sentences, collapse = "\t")
})
x
"Kunt:0 u:0 cijfers:1 meedelen:0 ?:0"

For anyone interested, the extended function is available at:
https://www.uv.es/vivigui/docs/embed_sentencespace_weighted.R

Basically, I have added the paste(x$token, ":", x$weight, sep = "", collapse = " ") call shown above and the condition
stopifnot(all(c("doc_id", "sentence_id", "token", "weight") %in% colnames(x)))

I have run a couple of tests with embed_sentencespace_weighted(..., useWeight = TRUE) and it does indeed seem to take the added weights into account.

Please correct me if I am wrong in my procedure.

I am now going to read up on word mover distance, a concept that was unknown to me so far.


jwijffels commented on August 18, 2024

Looks correct to me
