pacta.data.scraping's People

Contributors

alexaxthelm, cjyetman, dependabot[bot], jdhoffa

pacta.data.scraping's Issues

import issues from {pacta.data.preparation}

implement testing with mocking

Implement robust testing for these functions with a mocking package, so that tests no longer fail because of intermittent problems with internet access or web server responses.

related: https://github.com/RMI-PACTA/pacta.data.preparation/issues/167

great resource: https://books.ropensci.org/http-testing/index.html

using one of (in no particular order):
https://docs.ropensci.org/vcr/
https://docs.ropensci.org/webmockr/
https://enpiar.com/r/httptest/
https://enpiar.com/httptest2/
https://webfakes.r-lib.org
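
As a rough illustration of the idea behind mocking, here is a minimal base R sketch. It does not use any of the packages listed above. Instead it relies on plain dependency injection: the scraper takes its download step as an argument, so a test can swap in a canned response. The function `get_index_names()`, its `fetch` argument, and the example URL are hypothetical and not part of pacta.data.scraping.

```r
# Hypothetical scraper (not part of pacta.data.scraping). The `fetch` argument
# is an assumption made for testability: by default it reads from the network,
# but a test can pass a function that returns a canned response instead.
get_index_names <- function(url, fetch = function(u) readLines(u, warn = FALSE)) {
  lines <- trimws(fetch(url))
  lines[nzchar(lines)] # drop empty lines
}

# A fake downloader that never touches the network.
fake_fetch <- function(u) c("MSCI World", "", "  S&P 500 ")

index_names <- get_index_names("https://example.invalid/indices", fetch = fake_fetch)
print(index_names)
```

The packages listed above achieve a similar effect at the HTTP layer, without requiring the function under test to expose its download step.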

Allow a more flexible choice of date in `get_ishares_index`

The function get_ishares_index currently pulls index data from ishares.com. Its timestamp parameter determines which data is pulled.

When given the argument 2021Q4, it converts this to the last date in the quarter, which is "20211231".

This worked well for 2021Q4. For 2022Q4, however, there does not seem to be any data for 20221231 (there does seem to be data for 20221230).

Perhaps we should make this function more flexible, and parameterize the date itself?
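
One possible shape for this, sketched in base R under stated assumptions: 2022-12-31 fell on a Saturday, so stepping back to the last weekday of the quarter yields 20221230. All three helper names below are hypothetical and not part of pacta.data.scraping. A weekday is also not guaranteed to be a trading day (public holidays are ignored), which is one more reason to let the caller pass an explicit date.

```r
# Hypothetical helpers, not part of pacta.data.scraping.

# Last calendar day of a quarter given as "YYYYQn", e.g. "2022Q4".
quarter_end_date <- function(timestamp) {
  stopifnot(grepl("^[0-9]{4}Q[1-4]$", timestamp))
  year <- as.integer(substr(timestamp, 1, 4))
  quarter <- as.integer(substr(timestamp, 6, 6))
  # First day of the month after the quarter ends, minus one day.
  next_month <- quarter * 3 + 1
  if (next_month > 12) {
    year <- year + 1
    next_month <- 1
  }
  as.Date(sprintf("%04d-%02d-01", year, next_month)) - 1
}

# Step back over Saturday/Sunday. "%u" is the ISO weekday (1 = Monday, 7 = Sunday).
last_weekday_on_or_before <- function(date) {
  while (as.integer(format(date, "%u")) > 5) {
    date <- date - 1
  }
  date
}

# Use an explicit date (Date or "YYYY-MM-DD") if given, else derive it
# from the quarter. Returns the "YYYYMMDD" string form used in the issue.
resolve_as_of_date <- function(timestamp = NULL, as_of_date = NULL) {
  if (is.null(as_of_date)) {
    as_of_date <- last_weekday_on_or_before(quarter_end_date(timestamp))
  }
  format(as.Date(as_of_date), "%Y%m%d")
}

print(resolve_as_of_date("2021Q4"))
print(resolve_as_of_date("2022Q4"))
print(resolve_as_of_date(as_of_date = "2022-12-29"))
```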

check that `get_index_regions()` received proper results before returning

get_index_regions() sometimes receives an error page but does not recognize it, and goes on to return a data frame with a single row, which could lead to incorrect results. We should validate the response from loading the webpage. If it is not valid, we could retry a few times with a timeout, and if the proper data still cannot be loaded, return an error instead of a data frame with a single row.
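
A minimal base R sketch of the validate-and-retry idea. The names `fetch_with_retry()` and `looks_like_real_data()` are hypothetical, and the "more than one row" check is only an assumption about what a legitimate result looks like; the real check would depend on what get_index_regions() actually parses.

```r
# Hypothetical helper, not part of pacta.data.scraping.
# Calls `fetch()` up to `max_tries` times, pausing between attempts, and
# returns the first result that passes `is_valid()`. Errors otherwise.
fetch_with_retry <- function(fetch, is_valid, max_tries = 3, wait_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result) && is_valid(result)) {
      return(result)
    }
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop("No valid response after ", max_tries, " attempts.", call. = FALSE)
}

# Assumed validity rule: a real result has more than one row.
looks_like_real_data <- function(x) is.data.frame(x) && nrow(x) > 1

# Simulated flaky source: two "error page" results, then proper data.
attempts <- 0
flaky_fetch <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) {
    data.frame(region = "error page")
  } else {
    data.frame(region = c("Europe", "Asia"))
  }
}

regions <- fetch_with_retry(flaky_fetch, looks_like_real_data, wait_seconds = 0)
print(regions)
```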

Consider changing the name of the `index_regions` dataset

DANGEROUS

Likely must be co-ordinated across many repos. Off the top of my head, probably warrants at least looking at:
pacta.data.preparation, portfolio.allocate, and pacta.portfolio.report
and
workflow.data.preparation and workflow.transition.monitor

cc: @cjyetman any others you can think of?

`get_ishares_index_data()` continues without error while returning an empty table

When unable to retrieve data, the function does not raise an error, but silently returns an empty table, which causes downstream processing errors.

url <-
paste0(
  "https://www.ishares.com/uk/individual/en/products/",
  "251813/ishares-global-corporate-bond-ucits-etf/"
)
name <- "iShares Global Corporate Bond UCITS ETF <USD (Distributing)>"
as_of_date <- "20231231"
pacta.data.scraping::get_ishares_index_data(url, name, as_of_date)
#> # A tibble: 0 × 3
#> # ℹ 3 variables: base_url <chr>, index_name <chr>, as_of_date <chr>

Created on 2024-02-26 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.2
#>  system   aarch64, darwin23.0.0
#>  ui       unknown
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Belgrade
#>  date     2024-02-26
#>  pandoc   3.1.7 @ /opt/homebrew/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package             * version    date (UTC) lib source
#>  cli                   3.6.2      2023-12-11 [1] CRAN (R 4.3.2)
#>  countrycode           1.5.0      2023-05-30 [1] CRAN (R 4.3.1)
#>  curl                  5.2.0      2023-12-08 [1] CRAN (R 4.3.2)
#>  digest                0.6.34     2024-01-11 [1] CRAN (R 4.3.2)
#>  dplyr                 1.1.4      2023-11-17 [1] CRAN (R 4.3.2)
#>  evaluate              0.23       2023-11-01 [1] CRAN (R 4.3.2)
#>  fansi                 1.0.6      2023-12-08 [1] CRAN (R 4.3.2)
#>  fastmap               1.1.1      2023-02-24 [1] CRAN (R 4.3.2)
#>  fs                    1.6.3      2023-07-20 [1] CRAN (R 4.3.2)
#>  generics              0.1.3      2022-07-05 [1] CRAN (R 4.3.1)
#>  glue                  1.7.0      2024-01-09 [1] CRAN (R 4.3.2)
#>  htmltools             0.5.7      2023-11-03 [1] CRAN (R 4.3.2)
#>  httr                  1.4.7      2023-08-15 [1] CRAN (R 4.3.1)
#>  jsonlite              1.8.8      2023-12-04 [1] CRAN (R 4.3.2)
#>  knitr                 1.45       2023-10-30 [1] CRAN (R 4.3.2)
#>  lifecycle             1.0.4      2023-11-07 [1] CRAN (R 4.3.1)
#>  logger                0.2.2      2021-10-19 [1] CRAN (R 4.3.1)
#>  magrittr              2.0.3      2022-03-30 [1] CRAN (R 4.3.2)
#>  pacta.data.scraping   0.1.0.9000 2024-01-18 [1] Github (RMI-PACTA/pacta.data.scraping@9d5f632)
#>  pillar                1.9.0      2023-03-22 [1] CRAN (R 4.3.1)
#>  pkgconfig             2.0.3      2019-09-22 [1] CRAN (R 4.3.1)
#>  purrr                 1.0.2      2023-08-10 [1] CRAN (R 4.3.2)
#>  R.cache               0.16.0     2022-07-21 [1] CRAN (R 4.3.1)
#>  R.methodsS3           1.8.2      2022-06-13 [1] CRAN (R 4.3.1)
#>  R.oo                  1.25.0     2022-06-12 [1] CRAN (R 4.3.1)
#>  R.utils               2.12.2     2022-11-11 [1] CRAN (R 4.3.1)
#>  R6                    2.5.1      2021-08-19 [1] CRAN (R 4.3.1)
#>  reprex                2.0.2      2022-08-17 [1] CRAN (R 4.3.1)
#>  rlang                 1.1.3      2024-01-10 [1] CRAN (R 4.3.2)
#>  rmarkdown             2.25       2023-09-18 [1] CRAN (R 4.3.1)
#>  rvest                 1.0.3      2022-08-19 [1] CRAN (R 4.3.1)
#>  selectr               0.4-2      2019-11-20 [1] CRAN (R 4.3.1)
#>  sessioninfo           1.2.2      2021-12-06 [1] CRAN (R 4.3.1)
#>  stringi               1.8.3      2023-12-11 [1] CRAN (R 4.3.2)
#>  stringr               1.5.1      2023-11-14 [1] CRAN (R 4.3.1)
#>  styler                1.10.2     2023-08-29 [1] CRAN (R 4.3.1)
#>  tibble                3.2.1      2023-03-20 [1] CRAN (R 4.3.2)
#>  tidyselect            1.2.0      2022-10-10 [1] CRAN (R 4.3.1)
#>  utf8                  1.2.4      2023-10-22 [1] CRAN (R 4.3.2)
#>  vctrs                 0.6.5      2023-12-01 [1] CRAN (R 4.3.2)
#>  withr                 3.0.0      2024-01-16 [1] CRAN (R 4.3.2)
#>  xfun                  0.41       2023-11-01 [1] CRAN (R 4.3.2)
#>  xml2                  1.3.6      2023-12-04 [1] CRAN (R 4.3.2)
#>  yaml                  2.3.8      2023-12-11 [1] CRAN (R 4.3.2)
#> 
#>  [1] /opt/homebrew/lib/R/4.3/site-library
#>  [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Using debugonce(), I determined that this happens because the aaData object returned by the server is empty.


[ins] Browse[2]> readLines(data_path)
[1] "{\"aaData\":[]}"
Warning message:
In readLines(data_path) :
  incomplete final line found on '/var/folders/g8/1kf0nz093f3gmy0t9yhtwbq00000gn/T//Rtmp17d66K/file1300659d0db32'

Relates to RMI-PACTA/workflow.prepare.pacta.indices#68
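
One way to fail loudly in this situation, sketched under assumptions: the helper `check_aadata()` below is hypothetical and not part of pacta.data.scraping; it only uses `jsonlite::fromJSON()` (jsonlite appears in the session info above) to parse the payload and stops when `aaData` is missing or empty.

```r
# Hypothetical helper, not part of pacta.data.scraping.
# Parses the JSON text returned by the server and raises an error when the
# `aaData` element is missing or empty, instead of passing it downstream.
check_aadata <- function(json_text, url = NA_character_) {
  parsed <- jsonlite::fromJSON(json_text)
  if (is.null(parsed$aaData) || length(parsed$aaData) == 0) {
    stop("Server returned an empty `aaData` payload for: ", url, call. = FALSE)
  }
  parsed$aaData
}

# The payload seen in the debugging session above triggers the error.
outcome <- tryCatch(
  check_aadata("{\"aaData\":[]}", url = "https://www.ishares.com/..."),
  error = function(e) conditionMessage(e)
)
print(outcome)
```

With a check like this, the failure surfaces at the point of retrieval rather than as a downstream processing error.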
