Comments (23)

jwijffels commented on August 18, 2024

embed_articlespace is for the setting where you have a set of articles (e.g. Wikipedia articles) and a new text, and you want to see which article that new text resembles most. It also fits similar settings, e.g. where you have a knowledge base of answers to questions and a new question, and you want to see which of these answers the question resembles most.

And yes, an example is given in that presentation.

from ruimtehol.

jwijffels commented on August 18, 2024

Word embeddings (as in embed_wordspace) are just vectors of numbers which are similar for words that are used in the neighbourhood of one another.
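To make "similar numbers" concrete, here is a small sketch in base R. The three vectors are made up by hand for illustration and do not come from a trained ruimtehol model; the `cosine` helper is also just an illustrative name. Cosine similarity is one common way to compare embedding vectors.

```r
# Cosine similarity between two numeric vectors
cosine <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hand-made toy "embeddings" (hypothetical values, 3 dimensions)
emb <- rbind(
  doctor     = c(0.9, 0.1, 0.3),
  nurse      = c(0.8, 0.2, 0.4),
  parliament = c(0.1, 0.9, 0.2)
)

# Words used in similar contexts should get a higher similarity
sim_close <- cosine(emb["doctor", ], emb["nurse", ])
sim_far   <- cosine(emb["doctor", ], emb["parliament", ])
print(c(doctor_nurse = sim_close, doctor_parliament = sim_far))
```

With real embeddings the vectors would have many more dimensions, but the comparison works the same way.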

jwijffels commented on August 18, 2024

Use the same code as shown in the answer I gave on question 1. So provide a character string where words are separated by spaces and sentences are joined with the tab separator, as in predict(model, "wat was de precieze oorzaak van de technische problemen \t wat viel er in panne \t welke dienst heeft u gebeld")
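A minimal sketch of building such a query string in base R, assuming the sentences of the new document are already available as a character vector. The names `sentences` and `query` are illustrative, and the predict call is left commented out because it needs a trained model and the allarticles object.

```r
# Sentences of the new document, one element per sentence
sentences <- c(
  "wat was de precieze oorzaak van de technische problemen",
  "wat viel er in panne",
  "welke dienst heeft u gebeld"
)

# Join the sentences with a tab surrounded by spaces, as in the call above
query <- paste(sentences, collapse = " \t ")
print(query)

# With a trained model this string would then be passed on, e.g.
# scores <- predict(model, query, basedoc = allarticles$text)
```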

jwijffels commented on August 18, 2024

This is a binary of ruimtehol 0.2.2 for R 3.5.1 on Windows: http://www.datatailor.be/ruimtehol_0.2.2.zip

rdatasculptor commented on August 18, 2024

Okay, I think I figured it out thanks to the presentation in your post https://www.bnosac.be/index.php/blog/86-neural-text-modelling-with-r-package-ruimtehol.

rdatasculptor commented on August 18, 2024

Thanks!

jwijffels commented on August 18, 2024

@rdatasculptor out of curiosity, for what text classification / article recommendation exercise have you applied the model?

rdatasculptor commented on August 18, 2024

@jwijffels I put some texts with personality labels in the model. I have to do more research, but I am starting to believe ruimtehol performs better than Watson :-)

jwijffels commented on August 18, 2024

Ok, thanks for the input. It's indeed a Swiss army knife if you tune the hyperparameters such that it learns something.

rdatasculptor commented on August 18, 2024

Yes it is! I am still trying to understand what word embeddings are and how exactly they are calculated in ruimtehol. This field is rather new to me, but very interesting.

rdatasculptor commented on August 18, 2024

It seems a very strong way of representing the content and meaning of texts.

rdatasculptor commented on August 18, 2024

Two additional questions:

  1. In your presentation you use the variable allarticles$text. I guess that's the same as the dekamer$x?
  2. If I want to look for similar documents and the input is a document as well (meaning more than one sentence), I understand I should use embed_articlespace. The input of this function is one sentence at a time. How do I deal with an input document that has more than one sentence?

jwijffels commented on August 18, 2024

About question 1:
I see. That presentation is a knitr document. Page 24 also contained the following code, but it was not shown because of the printing of head(knowledgebase):

allarticles <- data.table::setDT(knowledgebase)
allarticles <- allarticles[, list(sentence = paste(token, collapse = " ")), by = list(doc_id, sentence_id)]
allarticles <- allarticles[, list(text = paste(sentence, collapse = " \t ")), by = list(doc_id)]

About question 2: the input to embed_articlespace is (see the documentation of that function)

a data.frame with sentences containing the columns doc_id, sentence_id and token. The doc_id is just an article or document identifier, the sentence_id column is a character field which contains words which are separated by a space and should not contain any tab characters.
If you have several sentences per article, that would look as follows.

> library(udpipe)
> x <- udpipe(c("You have a question. Go to the doctor.", "Margareth Thatcher is a former PM of the UK. She is blablabla"), "english")[, c("doc_id", "sentence_id", "token")]
> x
 doc_id sentence_id     token
   doc1           1       You
   doc1           1      have
   doc1           1         a
   doc1           1  question
   doc1           1         .
   doc1           2        Go
   doc1           2        to
   doc1           2       the
   doc1           2    doctor
   doc1           2         .
   doc2           1 Margareth
   doc2           1  Thatcher
   doc2           1        is
   doc2           1         a
   doc2           1    former
   doc2           1        PM
   doc2           1        of
   doc2           1       the
   doc2           1        UK
   doc2           1         .
   doc2           2       She
   doc2           2        is
   doc2           2 blablabla
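As a rough sketch of how a token table like this collapses into one tab-separated string per document, here is a base R version of the same two-step aggregation as the data.table code above. It uses a made-up subset of the tokens so that it runs without udpipe. The `sep` variable and the `text` column name are illustrative choices, and which separator your ruimtehol version expects is an assumption to check.

```r
# Toy token table with the same columns as the udpipe output (made-up subset)
x <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(1L, 1L, 2L, 2L, 1L, 1L),
  token       = c("You", "have", "Go", "to", "She", "is"),
  stringsAsFactors = FALSE
)

# Separator between sentences (assumption: adjust to your package version)
sep <- " \t "

# Step 1: one row per sentence, tokens joined by a single space
sentences <- aggregate(token ~ doc_id + sentence_id, data = x,
                       FUN = paste, collapse = " ")
# aggregate() does not keep sentences grouped per document, so sort explicitly
sentences <- sentences[order(sentences$doc_id, sentences$sentence_id), ]

# Step 2: one row per document, sentences joined by the separator
docs <- aggregate(token ~ doc_id, data = sentences,
                  FUN = paste, collapse = sep)
names(docs)[2] <- "text"
print(docs)
```

The resulting `docs$text` plays the same role as allarticles$text in the snippet above.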

rdatasculptor commented on August 18, 2024

Thank you for your answers!
Regarding question 2, I must admit I made an error. I meant the predict function for checking which documents are most similar to a given sentence: scores <- predict(model, "wat was de precieze oorzaak van de technische problemen", basedoc = allarticles$text). What if I want to find the documents most similar to a complete document instead of a single sentence?

rdatasculptor commented on August 18, 2024

Thanks again! It is completely clear now.

jwijffels commented on August 18, 2024

Note: it should be " \t ", not "\t", to separate the sentences:

allarticles <- data.table::setDT(knowledgebase)
allarticles <- allarticles[, list(sentence = paste(token, collapse = " ")), by = list(doc_id, sentence_id)]
allarticles <- allarticles[, list(text = paste(sentence, collapse = " \t ")), by = list(doc_id)]

rdatasculptor commented on August 18, 2024

Okay thanks! I altered my code. You are really helpful.

rdatasculptor commented on August 18, 2024

Hi Jan,
Following issue #22, should I now use "\t" as the sentence separator instead of " \t " (after updating to the latest GitHub version, of course)?

jwijffels commented on August 18, 2024

Yes, correct!

rdatasculptor commented on August 18, 2024

I guess there's no easy way to install the GitHub version without having to use RTools? I still work in a restricted network, unfortunately.

jwijffels commented on August 18, 2024

Yes, you need RTools on Windows. Which version of R on Windows are you on?

rdatasculptor commented on August 18, 2024

3.5.1

rdatasculptor commented on August 18, 2024

thanks!
