Comments (9)
For now, I will try to get rdwd on CRAN despite the size of the package.
The index is used to select data and obtain URLs, see the package structure diagram. I would rather not generate it on the fly, as running createIndex() takes several minutes for the full process.
I update the indexes irregularly, and there is an option to generate a query-based current version if needed; see the fileIndex documentation page.
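A minimal sketch of that index-based selection, following the example from the rdwd documentation (station name and arguments are just an example):
library(rdwd)
# the bundled fileIndex lets selectDWD() resolve download URLs offline:
link <- selectDWD("Potsdam", res="daily", var="kl", per="recent")
file <- dataDWD(link, read=FALSE, dir=tempdir())  # download only
clim <- readDWD(file, varnames=TRUE)              # then read into a data.frame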
Hi Berry,
I had just written a quick script yesterday to download DWD climate data and thought "Hey that would make for a good package!", when I stumbled across your package which looks awesome! Looking forward to checking it out in detail!
On topic, 3 ideas (the first one is sketched in code after this list):
- Turn the columns c("res", "var", "per") into integer indices and map them in separate data.frames. Then write an internal function that joins the tables together when fileIndex is needed. Potential size decrease: 10 %.
- Only store the paths and speed up your createIndex() function (e.g. add fixed = TRUE to the calls to strsplit()). Then add a function to .onLoad() that prepares the dataset once on package load. Potential size decrease: 20 %.
- Consider hosting a separate package with only the index files. This will also be kinder to CRAN, as you don't have to resubmit the data every time you update the package without reindexing the database. This could give you more headspace for the future as well.
File sizes in the rdwd/data/ folder:
- fileIndex: 744 KB (2181 KB without resave; 635 KB for fileIndex$path alone; 109 KB without the path column)
- metaIndex: 430 KB
- geoIndex: 222 KB (reduced to 110 KB if the display column is removed)
- gridIndex: 26 KB
- formatIndex: 3 KB
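(A small sketch of how such numbers can be listed from a local checkout; the rdwd/data path is an assumption about the working directory:)
f <- list.files("rdwd/data", full.names=TRUE)  # the shipped .rda files
data.frame(file=basename(f), KB=round(file.size(f)/1024))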
Biggest fileIndex size contributors:
library(rdwd)           # provides fileIndex
library(berryFunctions) # tableColVal() and seqPal()
tab <- table(fileIndex[,2:1])  # counts per var/res combination
tab[tab==0] <- NA
tableColVal(tab, digits=0, palette=seqPal(logbase=1.15), nameswidth=0.21)
table(fileIndex[fileIndex$var=="precipitation", c("per","res")])
#>             res
#> per          1_minute 10_minutes 5_minutes hourly
#>   historical   217274       3267    158216   1043
#>   meta_data      1145       1155      1145      0
#>   now             960        963       937      0
#>   recent          964        980       943    984
Do I get this right that fewer stations are providing 5-min values than 1-min values (e.g. for recent, 943 vs. 964)?
Making use of {xts}, to be precise xts::period.apply(), aggregating values to 5-min sums is a matter of seconds in the end.
So, what exactly is the unique selling point here (from DWD's point of view)?
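For illustration, a minimal sketch of that aggregation; the toy series stands in for real 1-minute data (which would come from rdwd):
library(xts)
# toy 1-minute precipitation series, one hour long
precip <- xts(runif(60), order.by=as.POSIXct("2022-05-17", tz="UTC") + 60*(0:59))
# sum into 5-minute bins with period.apply(), as suggested above
precip5 <- period.apply(precip, INDEX=endpoints(precip, on="minutes", k=5), FUN=sum)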
Yes, you see that correctly. As the 5_min data is new, this may yet be expanded.
I do not know why they provide this data; I cannot find information in the changelog. I presume they may have different data sources or methods.
Currently, I feel rdwd should have the full index of all available files and leave the discussion of usefulness to the individual user.
I fear that just taking the 5-minute files out of the index is not a very future-proof method: other datasets might expand and still bring the package over the 5 MB limit.
I played around a little bit, some testing of rvest and reprex included...
library(magrittr)  # provides the pipe; rvest, stringr and dplyr are called with ::

## get station ids, 1 min
html <- rvest::read_html("https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/recent/")
parts <- html %>% rvest::html_elements("a") %>% rvest::html_text2() %>% stringr::str_split("_")
df <- parts[2:length(parts)] %>% unlist() %>% matrix(ncol=4, byrow = TRUE) %>% data.frame(stringsAsFactors = FALSE)
#> Warning in matrix(., ncol = 4, byrow = TRUE): data length [3857] is not a
#> sub-multiple or multiple of the number of rows [965]
stations_1min <- df[["X3"]][1:(length(df[["X3"]])-2)]  # drop the two trailing non-station entries
length(stations_1min)
#> [1] 963
## get station ids, 5 min
html <- rvest::read_html("https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/5_minutes/precipitation/recent/")
parts <- html %>% rvest::html_elements("a") %>% rvest::html_text2() %>% stringr::str_split("_")
df <- parts[2:length(parts)] %>% unlist() %>% matrix(ncol=4, byrow = TRUE) %>% data.frame(stringsAsFactors = FALSE)
stations_5min <- df[["X3"]][2:length(df[["X3"]])]  # drop the leading non-station entry
length(stations_5min)
#> [1] 943
## get setdiffs: station ids in one listing but not the other
dplyr::setdiff(stations_1min, stations_5min)
#> [1] "00071" "00410" "00430" "01473" "01578" "02184" "02292" "02556" "03147"
#> [10] "03552" "04623" "05468" "05614" "05616" "05646" "05758" "06186" "06276"
#> [19] "06312" "07099"
dplyr::setdiff(stations_5min, stations_1min)
#> character(0)
Created on 2022-05-17 by the reprex package (v2.0.1)
So basically, it seems that at the moment there is no advantage of the 5-min values over the 1-min ones... I can imagine data being included here that was digitized in the course of MUNSTAR, but that would only explain additional historical observations, I assume. Hm, but who knows.
However, I also get your "not my business" point: not addressing this issue in general won't help in the long term.
I'm not that familiar with your internal structure, to be honest (so no idea if I'm even of help here), but I assume you basically indexed the content downloadable via rdwd in order to facilitate function calls or the like?
If so, is the index updated regularly? And would it also be possible to build it dynamically on the fly, based on the query issued (combination of product/parameter/resolution/quality, ...), without having to store everything?
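(As noted at the top of the thread, a query-based current index is already possible; a minimal sketch with rdwd's indexFTP(), where the folder value is just an example:)
library(rdwd)
# list the files of a single FTP subfolder on demand, without the full index
links <- indexFTP(folder="monthly/kl", dir=tempdir())
head(links)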
81% of the fileIndex is sub-hourly precipitation. I think a different index concept is needed only there.
# share of index rows that are sub-hourly precipitation:
mean(fileIndex$var=="precipitation" & fileIndex$res != "hourly")