Comments (6)

cderv commented on June 3, 2024

You can decide which value the recipe returns by changing the last step of the chain.
For example, to return a parsed document:

library(promises)
library(crrri)

dump_DOM <- function(url, file = "") {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% { 
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
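      # Parse the HTML string; the second positional argument of read_html()
      # is the encoding, so "\n" does not belong here
      rvest::read_html(html)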
    }) 
  })
}

html <- dump_DOM(url = "http://www.ardata.fr/post/")
library(rvest)
html %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "

Created on 2021-03-11 by the reprex package (v1.0.0.9002)

You could also return the HTML text directly and parse it afterwards:

library(promises)
library(crrri)

dump_DOM <- function(url, file = "") {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% { 
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      result$result$value
    }) 
  })
}

html <- dump_DOM(url = "http://www.ardata.fr/post/")
library(rvest)
read_html(html) %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "

Created on 2021-03-11 by the reprex package (v1.0.0.9002)

from crrri.

cderv commented on June 3, 2024

This works fine with rvest. Here is an example that writes the DOM to a file and reads it back:

library(promises)
library(crrri)

dump_DOM <- function(url, file = "") {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% { 
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
      cat(html, "\n", file = file)
    }) 
  })
}

html <- dump_DOM(url = "http://www.ardata.fr/post/", "test.html")
#> Running "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" \
#>   --no-first-run --headless \
#>   "--user-data-dir=C:\Users\chris\AppData\Local\r-crrri\r-crrri\chrome-data-dir-xbpnjxhj" \
#>   "--remote-debugging-port=9222" --disable-gpu --no-sandbox

library(rvest)
#> Loading required package: xml2
html <- read_html("test.html")
html %>% html_node("title") %>% html_text()
#> [1] "Blog | ArData "

Created on 2021-03-11 by the reprex package (v1.0.0.9002)

markwsac commented on June 3, 2024

Got it, thanks. Please find my code below:

  z <- b$Runtime$evaluate('document.documentElement.outerHTML')
  mydf <- z$result$value

Last question: can we use rvest on this? The result does not seem to be XML, so the following does not work:
mydf %>% rvest::html_nodes("[id$='_hcontainer']")

cderv commented on June 3, 2024

For now, crrri is rather low-level and you need to create the recipe yourself.
I believe chrome_read_html() is equivalent to the dumpDOM() function we gave as an example in the README: https://github.com/RLesur/crrri#transpose-chrome-remote-interface-js-scripts-dump-the-dom

It uses the same expression that you found and evaluated.

The result should be HTML, so rvest or xml2 can be used on it. A reproducible example would make it easier to see the issue.
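
As a minimal sketch of that last point, assuming the value returned by Runtime$evaluate() is a plain character string: the string has to be parsed with read_html() before html_nodes() can select anything from it. The html_string below is a hypothetical stand-in for the real page source, not actual output from Chrome.

```r
library(rvest)

# Hypothetical stand-in for the string held in result$result$value
html_string <- paste0(
  "<html><head><title>Demo</title></head><body>",
  "<div id='main_hcontainer'>Hello</div>",
  "<div id='other'>Skip</div>",
  "</body></html>"
)

# A character string is not a document yet, so parse it first
doc <- read_html(html_string)

# Select the elements whose id ends with "_hcontainer"
nodes <- html_nodes(doc, "[id$='_hcontainer']")
container_text <- html_text(nodes)
print(container_text)
```

Calling html_nodes() on the raw string directly fails because there is no method for character input, which may be what happened in the snippet above.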

cderv commented on June 3, 2024

To clarify my thoughts: having these functions in crrri directly does not feel like the best option if we want to keep this package centered around the Chrome Remote Interface.
We had the idea of creating a separate package that would contain recipes like dumpDOM(), but we have not found the time to start it yet.

markwsac commented on June 3, 2024

Thanks, this is great. Is it possible to do this without saving the result as an HTML file ("test.html")?
