gesistsa / adaR
Wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++
Home Page: https://gesistsa.github.io/adaR/
License: Other
Yeah, I am looking at removing all of these:
https://github.com/schochastics/adaR/blob/14692c751ac2fdbc97caa5b491357788d51eb7f6/R/parse.R#L20-L42
Originally posted by @chainsawriot in #47 (comment)
urltools has a function url_decode. We don't want to mask that.
adaR::ada_url_parse("https://www.hk01.com/zone/1/港聞")
#> $href
#> [1] "https://www.hk01.com/zone/1/%E6%B8"
#>
#> $protocol
#> [1] "https:"
#>
#> $username
#> [1] ""
#>
#> $password
#> [1] ""
#>
#> $host
#> [1] "www.hk01.com"
#>
#> $hostname
#> [1] "www.hk01.com"
#>
#> $port
#> [1] ""
#>
#> $pathname
#> [1] "/zone/1/%E6%B8"
#>
#> $search
#> [1] ""
#>
#> $hash
#> [1] ""
Created on 2023-09-22 with reprex v2.0.2
checking C++ specification
Not all R platforms support C++17
might haunt us on CRAN
Sometimes, I might want to have those %.
ada_url_parse <- function(url, decode = TRUE) {
    url <- stringi::stri_enc_toutf8(url)
    url_parsed <- Rcpp_ada_parse(url, nchar(url, type = "bytes"))
    if (isTRUE(decode)) {
        return(lapply(url_parsed, URLdecode))
    }
    return(url_parsed)
}
ada_url_parse("https://www.google.co.jp/search?q=ドイツ")$search
## [1] "?q=ドイツ"
ada_url_parse("https://www.google.co.jp/search?q=ドイツ", decode = FALSE)$search
## [1] "?q=%E3%83%89%E3%82%A4%E3%83%84"
Prepare for release:
git pull
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
revdepcheck::revdep_check(num_workers = 4)
Update cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
Simply with Vectorize?
The main purpose of the package is to wrap ada-url. But there might be other features we could support (which are needed for webtrackR).
One such feature is public suffix extraction via PSL. There is a package for that but it is not on CRAN.
The list is accessible as a textfile here and here.
We may not be able to implement something fast, but a simple lookup-ish thing should be possible
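A simple lookup of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions: `lookup_suffix` and the tiny `suffixes` vector are hypothetical and not part of adaR, and the real Public Suffix List contains thousands of entries plus wildcard and exception rules that are ignored here.

```r
# Hypothetical mini suffix list for illustration only; the real Public Suffix
# List is far larger and has wildcard/exception rules not handled here.
suffixes <- c("com", "de", "uk", "co.uk", "jp", "co.jp")

# Return the longest known suffix that ends each hostname, or NA if none matches.
lookup_suffix <- function(hostname, suffix_list = suffixes) {
  vapply(hostname, function(h) {
    if (is.na(h)) return(NA_character_)
    labels <- strsplit(tolower(h), ".", fixed = TRUE)[[1]]
    # Try candidates from longest to shortest: "a.b.c", then "b.c", then "c"
    for (i in seq_along(labels)) {
      candidate <- paste(labels[i:length(labels)], collapse = ".")
      if (candidate %in% suffix_list) return(candidate)
    }
    NA_character_
  }, character(1), USE.NAMES = FALSE)
}

lookup_suffix(c("subsub.sub.domain.co.uk", "www.google.de", "localhost"))
```

The function expects bare hostnames, so a URL would first need its hostname extracted. Checking candidates from longest to shortest is what makes "co.uk" win over "uk".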
adaR::ada_url_parse("bit.ly/32G1ciy")
#> href protocol username password host hostname port pathname search
#> 1 bit.ly/32G1ciy <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> hash
#> 1 <NA>
urltools::url_parse("bit.ly/32G1ciy")
#> scheme domain port path parameter fragment
#> 1 <NA> bit.ly <NA> 32G1ciy <NA> <NA>
Created on 2023-09-25 with reprex v2.0.2
not sure if/how to handle this
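One possible way to handle scheme-less input, sketched here purely as an assumption and not as adaR's behaviour, is to prepend a default scheme before parsing. The helper name `add_default_scheme` and the choice of "https" as the default are hypothetical.

```r
# Prepend a default scheme when the input has no "scheme://" prefix, so that a
# WHATWG-style parser receives an absolute URL. NA inputs are left as NA.
add_default_scheme <- function(url, scheme = "https") {
  # Scheme syntax: a letter followed by letters, digits, "+", "-" or "."
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)
  ifelse(is.na(url) | has_scheme, url, paste0(scheme, "://", url))
}

add_default_scheme(c("bit.ly/32G1ciy", "http://kobe.jp", NA))
```

This only recognises the "scheme://" form, so inputs such as "mailto:someone@example.com" would also get a prefix. Whether guessing a scheme is acceptable at all is a design decision, since it changes what the user passed in.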
adaR::ada_get_domain("http://sub.google.de/test")
#> [1] NA
This is due to public_suffix() not being able to parse URLs with paths
adaR::public_suffix("http://google.de/test")
#> [1] NA
psl::public_suffix("http://google.de/test")
#> [1] "de/test"
Not sure if this should be NA instead of de. But then again, psl also fails.
For now, we fix ada_get_domain internally and reconsider public_suffix in #54
A problem in #17 is the decoding of IDN/punycode.
There are pointers in the ada code to support that, but this needs further investigation.
adaR::ada_url_parse("http://kobe.jp")
#> href protocol username password host hostname port pathname
#> 1 http://kobe.jp/ http: kobe.jp kobe.jp /
#> search hash
#> 1
adaR::public_suffix("http://kobe.jp")
#> [1] "jp.kobe.jp"
Created on 2023-09-26 with reprex v2.0.2
Probably a fringe use case, but the other day I tried to read the HTML data from the root of a website and thought ada_get_domain would get me there.
adaR::ada_get_domain("https://github.com/schochastics/adaR/issues") |>
rvest::read_html()
#> Error: 'github.com' does not exist in current working directory ('/tmp/RtmpWgmD8k/reprex-95ac10e83d89-wax-mouse').
Unfortunately, without the protocol the domain is recognised as a local path. It would be fantastic if there was a function to get to the base name. This is roughly the behaviour I would expect.
ada_get_basename <- function(x) {
    sub(adaR::ada_get_pathname(x), "", x, fixed = TRUE)
}
ada_get_basename("https://github.com/schochastics/adaR/issues") |>
    rvest::read_html()
#> {html_document}
#> <html lang="en" data-a11y-animated-images="system" data-a11y-link-underlines="true">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body class="logged-out env-production page-responsive header-overlay hom ...
Created on 2023-10-05 with reprex v2.0.2
Thanks for considering!
R> adaR::ada_url_parse(NA)
*** caught segfault ***
address (nil), cause 'memory not mapped'
Traceback:
1: Rcpp_ada_parse(url, nchar(url))
2: adaR::ada_url_parse(NA)
Hi!
I've encountered a bug in the adaR::ada_set_* function family related to pathname processing.
In cases where a URL is in punycode (domain starting with xn--), using adaR's set family of functions changes the pathname encoding, and I don't know how to prevent (or revert) this behavior.
For example:
examples <- c(
    "http://xn--53-6kcainf4buoffq.xn--p1ai/pood/junior-electrical-engineer-jobs-remote.html",
    "http://xn--80abb0biooohbv.xn--p1ai/",
    "http://xn--alicantesueo-khb.com/insomnio",
    "https://normal-url.com/this-path-will-be-fine",
    "http://xn--53-6kcainf4buoffq.xn--p1ai/this-path-will-not-be-fine"
)
pathnames <- adaR::ada_get_pathname(examples, decode = FALSE)
result_pathnames <- adaR::ada_set_pathname(examples, pathnames, decode = FALSE)
will return:
result_pathnames
[1] "http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html"
[2] "http://xn--80abb0biooohbv.xn--p1ai/"
[3] "http://xn--alicantesueo-khb.com/insomnio"
[4] "https://normal-url.com/this-path-will-be-fine"
[5] "http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be"
Notice the 1st and 5th URLs. They are corrupted even though adaR::ada_get_pathname(examples, decode = FALSE) returns the correct output:
pathnames
[1] "/pood/junior-electrical-engineer-jobs-remote.html"
[2] "/"
[3] "/insomnio"
[4] "/this-path-will-be-fine"
[5] "/this-path-will-not-be-fine"
The same behavior is present even when pathname isn't changed, for example:
hostnames <- adaR::ada_get_hostname(examples, decode = FALSE)
result_hostnames <- adaR::ada_set_hostname(examples, hostnames, decode = FALSE)
result_hostnames
[1] "http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html"
[2] "http://xn--80abb0biooohbv.xn--p1ai/"
[3] "http://xn--alicantesueo-khb.com/insomnio"
[4] "https://normal-url.com/this-path-will-be-fine"
[5] "http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be"
It's also worth noting that hostnames looks different (it is decoded from punycode), but the function call above didn't change the hostname at all.
hostnames
[1] "поверкадома53.рф" "бамбукхутор.рф" "alicantesueño.com" "normal-url.com" "поверкадома53.рф"
My sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Warsaw
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] adaR_0.3.1
loaded via a namespace (and not attached):
[1] compiler_4.3.0 tools_4.3.0 rstudioapi_0.15.0 yaml_2.3.8 Rcpp_1.0.11 triebeard_0.4.1 renv_0.17.3
First release:
usethis::use_cran_comments()
Proofread Title: and Description:
Check that all exported functions have @return and @examples
Check that Authors@R: includes a copyright holder (role 'cph')
Prepare for release:
git pull
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
kindly requested by webtrack team:
ada_get_domain("https://subsub.sub.domain.co.uk")
#> domain.co.uk
Just glueing some existing functions
https://github.com/schochastics/adaR/blob/b2eb3e4662423b53db979541777da0e5847a7b69/R/parse.R#L13
str(ada_url_parse("https://www.google.co.jp/search?q=ドイツ"))
# Named chr [1:10] "https://www.google.co.jp/search?q=ドイツ" "https:" "" "" "www.google.co.jp"
# "www.google.co.jp" "" "/search" "?q=ドイツ" ...
# - attr(*, "names")= chr [1:10] "href" "protocol" "username" "password" ...
simplify should be FALSE and the result coerced to a data.frame again; or there may be a better way.
ada_url_parse <- function(url, decode = TRUE) {
    url <- utf8::as_utf8(url)
    # url_parsed <- Rcpp_ada_parse(url, nchar(url, type = "bytes"))
    url_parsed <- as.data.frame(do.call("rbind", lapply(url, function(x) Rcpp_ada_parse(x, nchar(x, type = "bytes")))))
    if (isTRUE(decode)) {
        url_parsed <- apply(url_parsed, 2, function(x) utils::URLdecode(x), simplify = FALSE)
        return(as.data.frame(url_parsed))
    }
    return(url_parsed)
}
> @chainsawriot yes they should probably give `NA_logical_`. You wanna do that?
Yeah.
Originally posted by @chainsawriot in #3 (comment)
Runtime is OK, but given how fast ada-url is by itself, there is room for improvement in (a) the R/C++ interface and (b) the URL encoding used to fix UTF-8 support (see #1).
Sorry. Fix is coming.
ada_has_credentials(NULL) ## will kill the R process
adaR::url_decode2(NA)
#> [1] "NA"
Created on 2023-09-27 with reprex v2.0.2
Fixing this can also enable decoding in C++.
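At the R level, the expected behaviour for missing values can be sketched as follows. This is a minimal illustration using base R's utils::URLdecode; the name `decode_keep_na` is hypothetical and the sketch does not reflect how url_decode2 is actually implemented.

```r
# Decode percent-encoded strings element by element, keeping NA as a real
# missing value instead of turning it into the string "NA".
decode_keep_na <- function(x) {
  x <- as.character(x)
  out <- rep(NA_character_, length(x))
  ok <- !is.na(x)
  # utils::URLdecode is applied one element at a time to the non-missing values
  out[ok] <- vapply(x[ok], utils::URLdecode, character(1), USE.NAMES = FALSE)
  out
}

decode_keep_na(c("a%20b", NA))
```

Separating the missing values before decoding means the decoder never sees NA, which is the same guard a C++ implementation would need before touching the string data.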
data_raw has the relevant scripts
Importing stringi just for stri_enc_toutf8 seems excessive