What function to use when checking similarity between documents about ruimtehol

rdatasculptor commented on August 18, 2024

What function to use when checking similarity between documents

from ruimtehol.

Comments (23)

jwijffels commented on August 18, 2024 1

embed_articlespace is for the setting where you have a bunch of articles (e.g. Wikipedia articles) and a new text, and you want to see which article that new text resembles most. It also fits similar settings, e.g. where you have a knowledge base of answers to questions and want to see which of these answers a new question resembles most.

And yes, an example is given in that presentation.

jwijffels commented on August 18, 2024 1

word embeddings (as in embed_wordspace) are just a bunch of numbers which are similar for words which are used in the neighbourhood of one another.
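To make "a bunch of numbers which are similar" concrete, here is a minimal base R sketch. The three vectors are invented for illustration and do not come from ruimtehol or any trained model; cosine similarity is one common way to compare embeddings, assumed here rather than taken from the package.

```r
# Toy 3-dimensional "embeddings"; real ones are learned and much longer.
# These numbers are made up purely for illustration.
emb <- rbind(
  doctor   = c(0.9, 0.1, 0.2),
  hospital = c(0.8, 0.2, 0.3),
  banana   = c(0.1, 0.9, 0.7)
)

# Cosine similarity: dot product divided by the product of the vector lengths.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim_close <- cosine(emb["doctor", ], emb["hospital", ])  # words used in similar contexts
sim_far   <- cosine(emb["doctor", ], emb["banana", ])    # unrelated words

print(c(close = sim_close, far = sim_far))
```

Vectors that point in nearly the same direction score close to 1, which is the sense in which related words end up "similar".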

jwijffels commented on August 18, 2024 1

Use the same code as shown in my answer to question 1. So provide a character string where words are separated by spaces and sentences are separated by the tab separator, as in predict(model, "wat was de precieze oorzaak van de technische problemen \t wat viel er in panne \t welke dienst heeft u gebeld")
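A small sketch of how such a query string could be built when the new document is held as a vector of sentences. The variable names new_doc and query are hypothetical; the predict call is copied from this thread and left as a comment because it needs a fitted model and the allarticles object.

```r
# A new document as a vector of sentences, words already separated by spaces.
new_doc <- c("wat was de precieze oorzaak van de technische problemen",
             "wat viel er in panne",
             "welke dienst heeft u gebeld")

# Join the sentences into one string, using " \t " between sentences.
query <- paste(new_doc, collapse = " \t ")

# With a fitted model and the knowledge-base texts, the call from this thread is:
# scores <- predict(model, query, basedoc = allarticles$text)
```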

jwijffels commented on August 18, 2024 1

This is a binary of ruimtehol 0.2.2 for R 3.5.1 on Windows: http://www.datatailor.be/ruimtehol_0.2.2.zip

rdatasculptor commented on August 18, 2024

Okay, I think I figured it out thanks to the presentation in your post https://www.bnosac.be/index.php/blog/86-neural-text-modelling-with-r-package-ruimtehol.

rdatasculptor commented on August 18, 2024

Thanks!

jwijffels commented on August 18, 2024

@rdatasculptor out of curiosity, for what text classification / article recommendation exercise have you applied the model?

rdatasculptor commented on August 18, 2024

@jwijffels I put some texts with personality labels in the model. I have to do more research, but I am starting to believe ruimtehol seems to perform better than Watson :-)

jwijffels commented on August 18, 2024

Ok, thanks for the input. It's indeed a Swiss army knife if you tune the hyperparameters such that it learns something.

rdatasculptor commented on August 18, 2024

yes it is! I am still trying to understand what word embeddings are or how they are calculated exactly in ruimtehol. This field is rather new to me, but very interesting.

rdatasculptor commented on August 18, 2024

it seems a very strong way of making a representation of the content and meaning of texts

rdatasculptor commented on August 18, 2024

Two additional questions:

1. In your presentation you use the variable allarticles$text. I guess that's the same as dekamer$x?
2. If I want to look for similar documents and the input is a document as well (meaning more than one sentence), I understand I should use embed_articlespace. The input of this function is one sentence at a time. How do I deal with a document with more than one sentence as input?

jwijffels commented on August 18, 2024

About question 1
I see. That presentation is a knitr document. On page 24 it also had the following but it was not shown due to the printing of head(knowledgebase)

allarticles <- data.table::setDT(knowledgebase)
allarticles <- allarticles[, list(sentence = paste(token, collapse = " ")), by = list(doc_id, sentence_id)]
allarticles <- allarticles[, list(text = paste(sentence, collapse = " \t ")), by = list(doc_id)]

About question 2. The input to embed_articlespace is (see documentation of that function)

a data.frame with sentences containing the columns doc_id, sentence_id and token. The doc_id is just an article or document identifier, the sentence_id column is a character field which contains words which are separated by a space and should not contain any tab characters

If you have several sentences per article, the data would look as follows.

> library(udpipe)
> x <- udpipe(c("You have a question. Go to the doctor.", "Margareth Thatcher is a former PM of the UK. She is blablabla"), "english")[, c("doc_id", "sentence_id", "token")]
> x
 doc_id sentence_id     token
   doc1           1       You
   doc1           1      have
   doc1           1         a
   doc1           1  question
   doc1           1         .
   doc1           2        Go
   doc1           2        to
   doc1           2       the
   doc1           2    doctor
   doc1           2         .
   doc2           1 Margareth
   doc2           1  Thatcher
   doc2           1        is
   doc2           1         a
   doc2           1    former
   doc2           1        PM
   doc2           1        of
   doc2           1       the
   doc2           1        UK
   doc2           1         .
   doc2           2       She
   doc2           2        is
   doc2           2 blablabla
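For readers without data.table, the same collapsing of a token table into one tab-separated text per document can be sketched in base R. The toy table below is hypothetical and only mimics the doc_id, sentence_id and token columns shown above; it is an illustration, not the code from the presentation.

```r
# Toy token table with the same three columns as the udpipe output above.
x <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(1, 1, 2, 2, 1, 1),
  token       = c("You", "ask", "Go", "home", "She", "is"),
  stringsAsFactors = FALSE
)

# Step 1: glue the tokens of each sentence together with single spaces.
sentences <- aggregate(token ~ doc_id + sentence_id, data = x,
                       FUN = paste, collapse = " ")

# Keep sentences in reading order before joining them per document.
sentences <- sentences[order(sentences$doc_id, sentences$sentence_id), ]

# Step 2: glue the sentences of each document together with " \t ".
docs <- aggregate(token ~ doc_id, data = sentences,
                  FUN = paste, collapse = " \t ")
names(docs)[2] <- "text"

print(docs)
```

The result has one row per doc_id, with sentences inside the text column separated by a tab.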

rdatasculptor commented on August 18, 2024

Thank you for your answers!
Regarding question 2, I must admit I made an error. I meant the predict function for checking which documents are most similar to a given sentence: scores <- predict(model, "wat was de precieze oorzaak van de technische problemen", basedoc = allarticles$text). What if I want to find the documents most similar to a complete document instead of one sentence?

rdatasculptor commented on August 18, 2024

Thanks again! It is completely clear now.

jwijffels commented on August 18, 2024

Note: it should be " \t ", not "\t", to separate the sentences

allarticles <- data.table::setDT(knowledgebase)
allarticles <- allarticles[, list(sentence = paste(token, collapse = " ")), by = list(doc_id, sentence_id)]
allarticles <- allarticles[, list(text = paste(sentence, collapse = " \t ")), by = list(doc_id)]

rdatasculptor commented on August 18, 2024

Okay thanks! I altered my code. You are really helpful.

rdatasculptor commented on August 18, 2024

Hi Jan
Following issue #22, should I now use "\t" as the sentence separator instead of " \t " (after updating to the latest GitHub version, of course)?

jwijffels commented on August 18, 2024

Yes, correct!
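If texts were already built with the older " \t " separator, a one-line conversion along these lines might be enough. This is a sketch in base R with hypothetical variable names; which separator applies depends on the installed version, so check the package documentation for yours.

```r
# A text built with the older separator (spaces around the tab).
old_text <- "wat was de oorzaak \t wat viel er in panne"

# Drop any spaces around each tab so sentences are split by a bare tab.
new_text <- gsub(" *\t *", "\t", old_text)
```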

rdatasculptor commented on August 18, 2024

I guess there's no easy way to install the GitHub version without having to use RTools? I still work in a restricted network, unfortunately.

jwijffels commented on August 18, 2024

Yes, you need RTools on Windows. Which version of R on Windows are you on?

rdatasculptor commented on August 18, 2024

3.5.1

rdatasculptor commented on August 18, 2024

thanks!
