rmi-pacta / pacta.data.scraping
Scrapes data from various web sources needed for PACTA
Home Page: https://rmi-pacta.github.io/pacta.data.scraping
License: Other
{pacta.data.preparation} had a number of issues related to the functions that have been extracted here. Because one cannot transfer an issue from a private repo to a public repo, we should manually transfer the info from the following issues here.
https://github.com/RMI-PACTA/pacta.data.preparation/issues/282
https://github.com/RMI-PACTA/pacta.data.preparation/issues/278
https://github.com/RMI-PACTA/pacta.data.preparation/issues/167 now #4
https://github.com/RMI-PACTA/pacta.data.preparation/issues/166
https://github.com/RMI-PACTA/pacta.data.preparation/issues/164
https://github.com/RMI-PACTA/pacta.data.preparation/issues/172
Implement robust testing for these functions with a mocking package, so that tests do not fail whenever internet access or a web server response hiccups.
related: https://github.com/RMI-PACTA/pacta.data.preparation/issues/167
great resource: https://books.ropensci.org/http-testing/index.html
using one of (in no particular order):
https://docs.ropensci.org/vcr/
https://docs.ropensci.org/webmockr/
https://enpiar.com/r/httptest/
https://enpiar.com/httptest2/
https://webfakes.r-lib.org
The function get_ishares_index currently pulls indices data from ishares.com. Currently, the function parameter timestamp is used to determine what data we should be pulling.
When given the argument 2021Q4, it converts this to the last date in the quarter, which is "20211231".
This worked well for 2021Q4 data, however it seems that for 2022Q4 there isn't any data for 20221231 (there does seem to be data for 20221230).
Perhaps we should make this function more flexible, and parameterize the date itself?
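One way to approach the date problem described above is to derive a short list of candidate dates from the quarter instead of a single hard-coded quarter end. The following is a minimal sketch, not the package's implementation: quarter_end_candidates() is a hypothetical helper, and it assumes that skipping weekends is a good enough first guess (it does not know about market holidays, so the caller would still need to try each candidate until one returns data).

```r
# Hypothetical helper, not part of pacta.data.scraping.
# Returns candidate "YYYYMMDD" dates, newest first, skipping weekends.
quarter_end_candidates <- function(timestamp, n_back = 5L) {
  stopifnot(grepl("^[0-9]{4}Q[1-4]$", timestamp))
  year <- as.integer(substr(timestamp, 1, 4))
  quarter <- as.integer(substr(timestamp, 6, 6))

  # The quarter ends one day before the first day of the following month.
  next_month <- quarter * 3L + 1L
  if (next_month > 12L) {
    next_month <- 1L
    year <- year + 1L
  }
  quarter_end <- as.Date(sprintf("%04d-%02d-01", year, next_month)) - 1

  # Walk backwards from the quarter end and keep Monday to Friday only.
  days <- quarter_end - (seq_len(n_back) - 1L)
  weekdays_only <- days[as.integer(format(days, "%u")) <= 5L]
  format(weekdays_only, "%Y%m%d")
}

candidates_2021 <- quarter_end_candidates("2021Q4")
candidates_2022 <- quarter_end_candidates("2022Q4")
candidates_2022[1]
#> [1] "20221230"
```

For 2021Q4 the first candidate is still "20211231", so existing behaviour would be preserved where the quarter end falls on a weekday.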
get_index_regions() sometimes receives an error page, but it does not recognize this and continues to return a data frame with a single row, which could lead to false/incorrect results. We should check the response from loading the webpage to make sure it's legitimate, possibly retry a few times with a timeout to load the proper data, and if that fails return an error instead of a data frame with a single row.
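The retry-then-fail behaviour suggested above could look roughly like this. It is a sketch under stated assumptions: retry_until_valid() is a hypothetical helper, the validity rule (more than one row) is only inferred from the single-row symptom described in the issue, and fake_fetch() stands in for the real scraping call so the example runs without network access.

```r
# Hypothetical helper, not part of pacta.data.scraping.
# Calls fetch() up to max_tries times and returns the first valid result.
retry_until_valid <- function(fetch, is_valid, max_tries = 3L, wait_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    # Treat an R error from fetch() the same as an invalid result.
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error") && isTRUE(is_valid(result))) {
      return(result)
    }
    if (attempt < max_tries) {
      Sys.sleep(wait_seconds)
    }
  }
  stop("No valid response after ", max_tries, " attempts.", call. = FALSE)
}

# Stand-in for the real scraper: an "error page" twice, then real data.
calls <- 0L
fake_fetch <- function() {
  calls <<- calls + 1L
  if (calls < 3L) {
    data.frame(region = "error")
  } else {
    data.frame(region = c("Europe", "Asia", "Americas"))
  }
}

regions <- retry_until_valid(
  fetch = fake_fetch,
  is_valid = function(x) is.data.frame(x) && nrow(x) > 1L,
  wait_seconds = 0
)
```

Passing the validity check in as a function keeps the retry logic independent of what a "legitimate" page looks like for any particular source.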
Likely must be co-ordinated across many repos. Off the top of my head, this probably warrants at least looking at pacta.data.preparation, portfolio.allocate, pacta.portfolio.report, workflow.data.preparation, and workflow.transition.monitor.
cc: @cjyetman any other you can think of?
MSCI has changed their website layout, and the function get_index_regions now errors because of this.
pacta.data.scraping/R/get_index_regions.R
Lines 15 to 18 in bb5c269
When unable to retrieve data, the function does not raise an error, but silently returns an empty table, which causes downstream processing errors.
url <-
paste0(
"https://www.ishares.com/uk/individual/en/products/",
"251813/ishares-global-corporate-bond-ucits-etf/"
)
name <- "iShares Global Corporate Bond UCITS ETF <USD (Distributing)>"
as_of_date <- "20231231"
pacta.data.scraping::get_ishares_index_data(url, name, as_of_date)
#> # A tibble: 0 × 3
#> # ℹ 3 variables: base_url <chr>, index_name <chr>, as_of_date <chr>
Created on 2024-02-26 with reprex v2.0.2
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.2 (2023-10-31)
#> os macOS Sonoma 14.2
#> system aarch64, darwin23.0.0
#> ui unknown
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Belgrade
#> date 2024-02-26
#> pandoc 3.1.7 @ /opt/homebrew/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.2)
#> countrycode 1.5.0 2023-05-30 [1] CRAN (R 4.3.1)
#> curl 5.2.0 2023-12-08 [1] CRAN (R 4.3.2)
#> digest 0.6.34 2024-01-11 [1] CRAN (R 4.3.2)
#> dplyr 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
#> evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.2)
#> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.2)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.2)
#> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.2)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.1)
#> glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.2)
#> htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.2)
#> httr 1.4.7 2023-08-15 [1] CRAN (R 4.3.1)
#> jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.2)
#> knitr 1.45 2023-10-30 [1] CRAN (R 4.3.2)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
#> logger 0.2.2 2021-10-19 [1] CRAN (R 4.3.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.2)
#> pacta.data.scraping 0.1.0.9000 2024-01-18 [1] Github (RMI-PACTA/pacta.data.scraping@9d5f632)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.1)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.1)
#> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.2)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.1)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.1)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.1)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.1)
#> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.2)
#> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1)
#> rvest 1.0.3 2022-08-19 [1] CRAN (R 4.3.1)
#> selectr 0.4-2 2019-11-20 [1] CRAN (R 4.3.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1)
#> stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.2)
#> stringr 1.5.1 2023-11-14 [1] CRAN (R 4.3.1)
#> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.2)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.1)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.2)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.2)
#> withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.2)
#> xfun 0.41 2023-11-01 [1] CRAN (R 4.3.2)
#> xml2 1.3.6 2023-12-04 [1] CRAN (R 4.3.2)
#> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2)
#>
#> [1] /opt/homebrew/lib/R/4.3/site-library
#> [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
With debugonce(), I have determined that this is because the aaData object returned by the server is empty.
[ins] Browse[2]> readLines(data_path)
[1] "{\"aaData\":[]}"
Warning message:
In readLines(data_path) :
incomplete final line found on '/var/folders/g8/1kf0nz093f3gmy0t9yhtwbq00000gn/T//Rtmp17d66K/file1300659d0db32'
Relates to RMI-PACTA/workflow.prepare.pacta.indices#68
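Since the empty result traces back to an empty aaData array, one possible guard is to check the parsed payload before building the table. This is a minimal sketch, not the package's code: parse_aa_data() is a hypothetical helper, and it assumes the payload is the JSON text shown in the debugger output above. It uses jsonlite, which appears in the session info.

```r
# Hypothetical helper, not part of pacta.data.scraping.
# Fails loudly when the server returns no rows instead of passing on an empty table.
parse_aa_data <- function(json_text) {
  parsed <- jsonlite::fromJSON(json_text, simplifyVector = FALSE)
  if (length(parsed$aaData) == 0L) {
    stop(
      "Server returned an empty aaData object; no index data available.",
      call. = FALSE
    )
  }
  parsed$aaData
}

# The payload seen in the debugger now produces an error, not an empty tibble.
outcome <- tryCatch(
  parse_aa_data("{\"aaData\":[]}"),
  error = function(e) conditionMessage(e)
)
```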
Currently this function saves the HTML files (both the data URL and the page URL) as temporary files within the body of get_ishares_index_data. It would be good to save the source data explicitly for every index pulled.
Relates to #162
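Saving the source explicitly, as proposed above, might be sketched as follows. Everything here is an assumption for illustration: save_source() is a hypothetical helper, and the file naming scheme (sanitized index name plus date) and the output directory are not taken from the package.

```r
# Hypothetical helper, not part of pacta.data.scraping.
# Writes the raw downloaded text to a predictable path and returns that path.
save_source <- function(content, index_name, as_of_date, output_dir) {
  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
  # Replace anything that is not a letter or digit so the name is file-system safe.
  safe_name <- gsub("[^A-Za-z0-9]+", "_", index_name)
  safe_name <- gsub("^_+|_+$", "", safe_name)
  path <- file.path(output_dir, paste0(safe_name, "_", as_of_date, ".json"))
  writeLines(content, path)
  path
}

saved <- save_source(
  content = "{\"aaData\":[]}",
  index_name = "iShares Global Corporate Bond UCITS ETF <USD (Distributing)>",
  as_of_date = "20231231",
  output_dir = file.path(tempdir(), "ishares_sources")
)
```

Returning the path lets the caller record where each index's source was stored alongside the scraped data.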