Code Monkey home page Code Monkey logo

gridsample's Introduction

GitHub Mastodon Twitter ResearchGate Google Scholar LinkedIn YouTube

  • I am currently a Doctoral Researcher at the Department of Digital and Computational Demography, Laboratory of Migration and Mobility at the Max Planck Institute for Demographic Research (MPIDR) in Rostock and at Sociodemography Research Group at the Universitat Pompeu Fabra (UPF) in Barcelona.

  • I analyse spatial digital trace data to study human mobility patterns and the implications for disease dynamics and inequalities.

  • My latest paper is:

    • Kotov, E., & Denecke, E. (2024). Expanding the Lifespan of Software for Demographic Analysis with Containers: An Application of Spatial Sampling. The Denominator. Population Dynamics Lab. DOI:10.6069/WY8K-D973.

    • See all papers.

  • I worked on many R packages: I developed the rJavaEnv package to manage Java environment for Java-dependent R packages, wikimapR for getting data using Wikimapia API, enhanced rang for reproducible research, and made contributions to layer, OpenTripPlanner, geofacet, esri2sf and more R packages and other open source software.

  • I used to teach reproducible data analysis in R and computational spatial morphology to urban researchers and planners.

Research interests

  • Intra-urban and inter-regional mobility, commuting and migration; determinants, external effects and spillover effects

  • Implications of human mobility for disease dynamics and inequalities

Metrics

gridsample's People

Contributors

e-kotov avatar nrukt00 avatar nrukt00vt avatar

gridsample's Issues

raster reprojection generates non-integer strata labels and leads to an error

gridsample/R/gs_sample.r

Lines 276 to 279 in d8a7fb9

if (sp::proj4string(strata_raster) != "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0") {
strata_raster <- projectRaster(strata_raster,
crs = sp::CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
}

This reprojection code generates non-integer values that lead to an error.

library(gridsample)
library(raster)
#> Loading required package: sp


population_raster <- raster(system.file("extdata", "RWA_ppp_2010_adj_v2.tif",
                                        package="gridsample"))
plot(population_raster)

data(RWAshp)
proj4string(RWAshp) <- CRS(proj4string(population_raster))
strata_raster <- rasterize(RWAshp,population_raster,field = "STL.2")
plot(strata_raster)

total_pop <- cellStats(population_raster, stat = "sum")
urban_pop_value <- total_pop * .16
pop_df <- data.frame(index = 1:length(population_raster[]),
                     pop = population_raster[])
pop_df <- pop_df[!is.na(pop_df$pop), ]
pop_df <- pop_df[order(pop_df$pop,decreasing = T), ]
pop_df$cumulative_pop <- cumsum(pop_df$pop)
pop_df$urban <- 0
pop_df$urban[which(pop_df$cumulative_pop <= urban_pop_value)] <- 1
urban_raster <- population_raster >= min(subset(pop_df,urban == 1)$pop)
values(urban_raster)[is.na(values(urban_raster))] <- 0 # setting all NA to 0 for the unfixed version of the package where NAs also cause critical error
plot(urban_raster)

psu_polygons <- gs_sample(
  population_raster = population_raster,
  strata_raster = strata_raster,
  urban_raster = urban_raster,
  cfg_desired_cell_size = NA,
  cfg_hh_per_stratum = 416,
  cfg_hh_per_urban = 26,
  cfg_hh_per_rural = 26,
  cfg_min_pop_per_cell = 0.01,
  cfg_max_psu_size = 10,
  cfg_pop_per_psu = 610,
  cfg_sample_rururb = TRUE,
  cfg_sample_spatial = FALSE,
  # cfg_sample_spatial_scale = 100,
  output_path = tempdir(),
  sample_name = "rwanda_psu")

#> [1] "Beginning gs_sample() Process:"
#> [1] "Randomly Stratifying PSU Cell IDs:"
#> [1] "Processing stratum ID 6 ..."
#> [1] "Processing stratum ID 6 ..."
#> [1] "Processing stratum ID 6 ..."
#> [1] "Processing stratum ID 6 ..."
#> [1] "Processing stratum ID 6 ..."
#> [1] "Processing stratum ID 1 ..."
#> [1] "Processing stratum ID 1 ..."
#> [1] "Processing stratum ID 1 ..."
#> [1] "Processing stratum ID 1 ..."
#> [1] "Processing stratum ID 4 ..."
#> [1] "Processing stratum ID 4 ..."
#> [1] "Processing stratum ID 4 ..."
#> [1] "Processing stratum ID 1.44650786031469 ..."
#> Error in sample.int(length(x), size, replace, prob): invalid first argument

Created on 2023-11-29 by the reprex package (v0.3.0)

implement sampling with a quadtree algorithm?

"[Population sampling frame using a quadtree algorithm] it allows an accurate calculation of sampling probability weights compared to GridSample". (Qader et al. 2020)

Perhaps worth updating and reimplementing the code (Qader et al. 2020) from the article supplement in gridsample itself or in a separate package:

## Quadtree-based decomposition of rasters (gridded population dataset)
## Required packages
install.packages("raster")
install.packages("maptools")
install.packages("rgdal")
install.packages("sp")
install.packages("doSNOW")
## Set directory
setwd("C:/")
## Define the raster data (Gridded population data)
RasataData=raster("Raster data")
## Define the shape file to identfy the adminstrative regions or settelment types. 
reg=readOGR("shape file")
RasterData.Region1=subset(reg, ID=="First Region name")
RasterData.Region1=subset(reg, grepl("Second Region name", ID))
Awdal.rast=crop(som, som.Awdal)
Banaadir.rast=crop(som, som.Banaadir)
## Regions to process
provs <- as.character(reg$ID)
## help function: split images into quads
splitim <- function(im){
  r <- nrow(im)
  c <- ncol(im)
  dr <- round(r/2)
  dc <- round(c/2)
    quads <- list()
  quads[[1]] <- im[1:dr, 1:dc, drop=FALSE]
  quads[[2]] <- im[(dr+1):r, 1:dc, drop=FALSE]
  quads[[3]] <- im[(dr+1):r, (dc+1):c, drop=FALSE]
  quads[[4]] <- im[1:dr, (dc+1):c, drop=FALSE]
    return(quads)
}
## Pre-define parameters for population size and area
minSize <- 3500 # Maximum population size
gridSize <- 32 # maximum grid dimension
### main loop 
for(p in provs[1:18]){
  print(p)
    grid_c<- crop(som, subset(reg, ID==p))
  grid<-mask(grid_c, subset(reg, ID==p))
  #grid=mask(grid, UR)
    dim(grid)
    iml <- list() # store the list of images to process
  iml[[1]] <- grid # first in the image list is the full grid
  res <- FALSE # results, start false to begin with a split
  res2 <- FALSE
  imS <- list() # output list of split grids
    # main processing loop
  while (prod(res)==0 | prod(res2)==0) {
    iml <- unlist(lapply(iml, splitim))
        res <- unlist(lapply(iml, function(i){ cellStats(i, stat='sum') < minSize } ))
    res2 <- unlist(lapply(iml, function(i){ max(dim(i)) < gridSize } ))
        imS <- c(imS, iml[which(res==1 & res2==1)])
    iml <- iml[which(res==0 | res2==0)]
  }
   ## drop subscenes with no data
  keep <- lapply(1:length(imS), function(i){ cellStats(imS[[i]],'mean')>0 & !is.nan(cellStats(imS[[i]], 'mean')) })
   imS <- imS[unlist(keep)==T]
  
  ## output split grids
  if(!dir.exists(paste0( "subscenes"))){dir.create(paste0( "subscenes"))}
  if(!dir.exists(paste0( "subscenes/", p,"/"))){dir.create(paste0( "subscenes/", p,"/"))}
  
  ## output extent polygons
  plylist <- lapply(1:length(imS), 
                    function(i){ 
                      if(!is.null(imS[[i]])){
                        pop=cellStats(imS[[i]], sum)
                        e <- extent(imS[[i]])
                        sp <- as(e, 'SpatialPolygons')
                        crs(sp) <- crs(imS[[i]])
                        polydf <- SpatialPolygonsDataFrame(sp, data=data.frame(id=i, pop=pop))
                        return(polydf)}
                    })
  ## merge all to a single feature
  joinpoly <- do.call(bind, plylist) 
    if(!dir.exists(paste0( "subshapes"))){dir.create(paste0( "subshapes"))}
  if(!dir.exists(paste0( "subshapes/", p,"/"))){dir.create(paste0( "subshapes/", p,"/"))}
  shapefile(joinpoly, paste0( "subshapes/", p, "/", p, "_ply_all3.shp"), overwrite=T)
  pdf(paste0( "subshapes/", p, "/", p, "_plot.pdf"))
  plot(grid, main=p)
  plot(joinpoly, add=T)
  plot(reg, add=T)
  dev.off()
  }

References

Qader, S.H., Lefebvre, V., Tatem, A.J. et al. Using gridded population and quadtree sampling units to support survey sample design in low-income settings. Int J Health Geogr 19, 10 (2020). https://doi.org/10.1186/s12942-020-00205-5

gs_sample fails if the raster contains NA, even the original vignette is not reproducible

NA values in rasters lead to critical errors:

sessionInfo()

#> R version 4.0.1 (2020-06-06)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.0.1 magrittr_1.5 tools_4.0.1 htmltools_0.5.0
#> [5] yaml_2.2.1 stringi_1.4.6 rmarkdown_2.2 highr_0.8
#> [9] knitr_1.28 stringr_1.4.0 xfun_0.14 digest_0.6.25
#> [13] rlang_0.4.6 evaluate_0.14

library(gridsample)
library(raster)
#> Loading required package: sp
population_raster <- raster(system.file("extdata", "RWA_ppp_2010_adj_v2.tif",
                                        package="gridsample"))
plot(population_raster)

data(RWAshp)
proj4string(RWAshp) <- CRS(proj4string(population_raster))
strata_raster <- rasterize(RWAshp,population_raster,field = "STL.2")
plot(strata_raster)

total_pop <- cellStats(population_raster, stat = "sum")
urban_pop_value <- total_pop * .16
pop_df <- data.frame(index = 1:length(population_raster[]),
                     pop = population_raster[])
pop_df <- pop_df[!is.na(pop_df$pop), ]
pop_df <- pop_df[order(pop_df$pop,decreasing = T), ]
pop_df$cumulative_pop <- cumsum(pop_df$pop)
pop_df$urban <- 0
pop_df$urban[which(pop_df$cumulative_pop <= urban_pop_value)] <- 1
urban_raster <- population_raster >= min(subset(pop_df,urban == 1)$pop)
plot(urban_raster)

psu_polygons <- gs_sample(
  population_raster = population_raster,
  strata_raster = strata_raster,
  urban_raster = urban_raster,
  cfg_desired_cell_size = NA,
  cfg_hh_per_stratum = 416,
  cfg_hh_per_urban = 26,
  cfg_hh_per_rural = 26,
  cfg_min_pop_per_cell = 0.01,
  cfg_max_psu_size = 10,
  cfg_pop_per_psu = 610,
  cfg_sample_rururb = TRUE,
  cfg_sample_spatial = FALSE,
  # cfg_sample_spatial_scale = 100,
  output_path = tempdir(),
  sample_name = "rwanda_psu"
)

#> [1] "Beginning gs_sample() Process:"
#> [1] "Randomly Stratifying PSU Cell IDs:"
#> [1] "Processing stratum ID 6 ..."
#> [1] "Processing stratum ID 6 ..."
#> [1] "Processing stratum ID 6 ..."
#> [1] "Processing stratum ID 6 ..."
#> [1] "Processing stratum ID 6 ..."
#> Error in while (total_stratum < cfg_hh_per_stratum) {: missing value where TRUE/FALSE needed

Created on 2023-11-29 by the reprex package (v0.3.0)

The problematic code part is

gridsample/R/gs_sample.r

Lines 411 to 418 in d8a7fb9

if (total_stratum < cfg_hh_per_stratum) {
while (total_stratum < cfg_hh_per_stratum) {
newid <- sample(psu_data[selected == 0]$id, 1)
total_stratum <- total_stratum +
(urban_vec[newid] == 0) * cfg_hh_per_rural +
(urban_vec[newid] == 1) * cfg_hh_per_urban
psu_data[id == newid]$selected <- 1
}

When urban_vec contains NAs, the code breaks, because NA is then written into total_stratum.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.