
rhdx's Introduction

rhdx

Project Status: Active - Initial development is in progress, but there has not yet been a stable, usable release suitable for the public. License: MIT

rhdx is an R client for the Humanitarian Data Exchange (HDX) platform.

Introduction

The Humanitarian Data Exchange (HDX) platform is an open platform for easily finding and analyzing humanitarian data.

Installation

This package is not yet on CRAN; to install it, you will need the remotes package. You can get rhdx from GitLab or GitHub (mirror):

## install.packages("remotes")
remotes::install_gitlab("dickoa/rhdx")
remotes::install_github("dickoa/rhdx")

rhdx: A quick tutorial

library("rhdx")

The first step is usually to connect to HDX using the set_rhdx_config function, and to check the configuration using get_rhdx_config:

set_rhdx_config(hdx_site = "prod")
get_rhdx_config()
## <HDX Configuration>
##   HDX site: prod
##   HDX site url: https://data.humdata.org/
##   HDX API key:

Now that we are connected to HDX, we can search for datasets using search_datasets, access a resource within a dataset page with the get_resource function, and finally read the data directly into the R session using read_resource. The magrittr pipe operator %>% is also supported:

library(tidyverse)
search_datasets("ACLED Mali", rows = 2) %>% ## search dataset in HDX, limit the results to two rows
  pluck(1) %>% ## select the first dataset
  get_resource(1) %>% ## pick the first resource
  read_resource() ## read this HXLated data into R
## # A tibble: 2,516 x 30
##    data_id   iso event_id_cnty event_id_no_cnty event_date  year
##  *   <dbl> <dbl> <chr>                    <dbl> <date>     <dbl>
##  1 2942561   466 MLI2605                   2605 2019-01-26  2019
##  2 2942562   466 MLI2606                   2606 2019-01-26  2019
##  3 2942557   466 MLI2601                   2601 2019-01-25  2019
##  4 2942558   466 MLI2602                   2602 2019-01-25  2019
##  5 2942559   466 MLI2603                   2603 2019-01-25  2019
##  6 2942560   466 MLI2604                   2604 2019-01-25  2019
##  7 2942555   466 MLI2599                   2599 2019-01-24  2019
##  8 2942556   466 MLI2600                   2600 2019-01-24  2019
##  9 2942553   466 MLI2597                   2597 2019-01-23  2019
## 10 2942554   466 MLI2598                   2598 2019-01-23  2019
## # … with 2,506 more rows, and 24 more variables:
## #   time_precision <dbl>, event_type <chr>, actor1 <chr>,
## #   assoc_actor_1 <chr>, inter1 <dbl>, actor2 <chr>,
## #   assoc_actor_2 <chr>, inter2 <dbl>, interaction <dbl>,
## #   region <chr>, country <chr>, admin1 <chr>, admin2 <chr>,
## #   admin3 <chr>, location <chr>, latitude <dbl>,
## #   longitude <dbl>, geo_precision <dbl>, source <chr>,
## #   source_scale <chr>, notes <chr>, fatalities <dbl>,
## #   timestamp <dbl>, iso3 <chr>

read_resource will not work with every resource on HDX; so far the following formats are supported: csv, xlsx, xls, json, geojson, zipped shapefile, kmz, zipped geodatabase and zipped geopackage. I will consider adding more data types in the future; feel free to file an issue if it doesn’t work as expected or if you want to add support for a format.
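Because a dataset can mix supported and unsupported formats, it can help to check each resource's declared format with get_formats before calling read_resource. A minimal sketch follows; format labels on HDX are not fully consistent, so the lower-casing and the exact label list here are assumptions:

```r
library(rhdx)

set_rhdx_config(hdx_site = "prod")
ds <- pull_dataset("acled-data-for-mali")

## formats declared for each resource in the dataset
fmts <- tolower(get_formats(ds))

## formats read_resource currently supports (per the paragraph above)
readable <- c("csv", "xlsx", "xls", "json", "geojson",
              "zipped shapefile", "kmz",
              "zipped geodatabase", "zipped geopackage")

## index of the first readable resource, if any
idx <- which(fmts %in% readable)[1]
if (!is.na(idx)) {
  df <- read_resource(get_resource(ds, idx))
}
```

This avoids a hard error when the first resource in a dataset happens to be a PDF or another unsupported format.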

Reading dataset directly

We can also use pull_dataset to directly read and access a dataset object.

pull_dataset("acled-data-for-mali") %>%
  get_resource(1) %>%
  read_resource()
## # A tibble: 3,990 x 31
##    data_id   iso event_id_cnty event_id_no_cnty event_date  year
##      <dbl> <dbl> <chr>                    <dbl> <date>     <dbl>
##  1 7173324   466 MLI4111                   4111 2020-07-31  2020
##  2 7173322   466 MLI4109                   4109 2020-07-29  2020
##  3 7173323   466 MLI4110                   4110 2020-07-29  2020
##  4 7173423   466 MLI4107                   4107 2020-07-28  2020
##  5 7173761   466 MLI4108                   4108 2020-07-28  2020
##  6 7173702   466 MLI4104                   4104 2020-07-27  2020
##  7 7173732   466 MLI4103                   4103 2020-07-27  2020
##  8 7173319   466 MLI4102                   4102 2020-07-27  2020
##  9 7173320   466 MLI4105                   4105 2020-07-27  2020
## 10 7173321   466 MLI4106                   4106 2020-07-27  2020
## # … with 3,980 more rows, and 25 more variables:
## #   time_precision <dbl>, event_type <chr>,
## #   sub_event_type <chr>, actor1 <chr>, assoc_actor_1 <chr>,
## #   inter1 <dbl>, actor2 <chr>, assoc_actor_2 <chr>,
## #   inter2 <dbl>, interaction <dbl>, region <chr>,
## #   country <chr>, admin1 <chr>, admin2 <chr>, admin3 <chr>,
## #   location <chr>, latitude <dbl>, longitude <dbl>,
## #   geo_precision <dbl>, source <chr>, source_scale <chr>,
## #   notes <chr>, fatalities <dbl>, timestamp <dbl>, iso3 <chr>

A step-by-step tutorial for getting data with rhdx

Connect to a server

In order to connect to HDX, we can use the set_rhdx_config function:

set_rhdx_config(hdx_site = "prod")

Search datasets

Once a server is chosen, we can search for datasets using search_datasets. In this case we will limit the results to just two (via the rows parameter).

list_of_ds <- search_datasets("displaced Nigeria", rows = 2)
list_of_ds
## [[1]]
## <HDX Dataset> 4fbc627d-ff64-4bf6-8a49-59904eae15bb
##   Title: Nigeria - Internally displaced persons - IDPs
##   Name: idmc-idp-data-for-nigeria
##   Date: 01/01/2009-12/31/2016
##   Tags (up to 5): displacement, idmc, population
##   Locations (up to 5): nga
##   Resources (up to 5): displacement_data, conflict_data, disaster_data

## [[2]]
## <HDX Dataset> 4adf7874-ae01-46fd-a442-5fc6b3c9dff1
##   Title: Nigeria Baseline Assessment Data [IOM DTM]
##   Name: nigeria-baseline-data-iom-dtm
##   Date: 01/31/2018
##   Tags (up to 5): adamawa, assessment, baseline-data, baseline-dtm, bauchi
##   Locations (up to 5): nga
##   Resources (up to 5): DTM Nigeria Baseline Assessment Round 21, DTM Nigeria Baseline Assessment Round 20, DTM Nigeria Baseline Assessment Round 19, DTM Nigeria Baseline Assessment Round 18, DTM Nigeria Baseline Assessment Round 17

Choose the dataset you want to manipulate in R; in this case, we will take the first one.

The result of search_datasets is a list of HDX datasets, and you can manipulate it like any other list in R. We can use purrr::pluck to select the element we want from the list; here it is the first one.

ds <- pluck(list_of_ds, 1)
ds
## <HDX Dataset> 4fbc627d-ff64-4bf6-8a49-59904eae15bb
##   Title: Nigeria - Internally displaced persons - IDPs
##   Name: idmc-idp-data-for-nigeria
##   Date: 01/01/2009-12/31/2016
##   Tags (up to 5): displacement, idmc, population
##   Locations (up to 5): nga
##   Resources (up to 5): displacement_data, conflict_data, disaster_data

List all resources in the dataset

With our dataset in hand, the next step is to list all of its resources. If you are not familiar with CKAN terminology, resources refer to the actual files shared on a dataset page that you can download. Each dataset page contains one or more resources.

get_resources(ds)
## [[1]]
## <HDX Resource> f57be018-116e-4dd9-a7ab-8002e7627f36
##   Name: displacement_data
##   Description: Internally displaced persons - IDPs (new displacement associated with conflict and violence)
##   Size:
##   Format: JSON

## [[2]]
## <HDX Resource> 6261856c-afb9-4746-b340-9cf531cbd38f
##   Name: conflict_data
##   Description: Internally displaced persons - IDPs (people displaced by conflict and violence)
##   Size:
##   Format: JSON

## [[3]]
## <HDX Resource> b8ff1f4b-105c-4a6c-bf54-a543a486ab7e
##   Name: disaster_data
##   Description: Internally displaced persons - IDPs (new displacement associated with disasters)
##   Size:
##   Format: JSON

Choose a resource to download and read

For this example, we are looking for the displacement data, which is the first resource on the dataset page. We can use pluck on the list of resources, or the helper function get_resource(dataset, resource_index), to select the resource we want to use. The selected resource can then be downloaded and stored for later use, or read directly into your R session using the read_resource function. This resource is a JSON file, so it can be read using the jsonlite package; we added a simplify_json option to get a vector or a data.frame when possible instead of a list.

idp_nga_rs <- get_resource(ds, 1)
idp_nga_df <- read_resource(idp_nga_rs, simplify_json = TRUE, download_folder = tempdir())
idp_nga_df
## # A tibble: 11 x 7
##    ISO3  Name   Year `Conflict Stock… `Conflict New D…
##    <chr> <chr> <dbl>            <dbl>            <dbl>
##  1 NGA   Nige…  2009               NA             5000
##  2 NGA   Nige…  2010               NA             5000
##  3 NGA   Nige…  2011               NA            65000
##  4 NGA   Nige…  2012               NA            63000
##  5 NGA   Nige…  2013          3300000           471000
##  6 NGA   Nige…  2014          1075000           975000
##  7 NGA   Nige…  2015          2096000           737000
##  8 NGA   Nige…  2016          1955000           501000
##  9 NGA   Nige…  2017          1707000           279000
## 10 NGA   Nige…  2018          2216000           541000
## 11 NGA   Nige…  2019          2583000           248000
## # … with 2 more variables: `Disaster New Displacements` <dbl>,
## #   `Disaster Stock Displacement` <dbl>

Using magrittr pipe

All these operations can be chained using the pipe %>%, which allows for a powerful grammar to easily get humanitarian data into R.

library(tidyverse)

set_rhdx_config(hdx_site = "prod")

idp_nga_df <-
  search_datasets("displaced Nigeria", rows = 2) %>%
  pluck(1) %>%
  get_resource(1) %>% ## get the first resource
  read_resource(simplify_json = TRUE, download_folder = tempdir()) ## the file will be downloaded in a temporary directory

idp_nga_df
## # A tibble: 11 x 7
##    ISO3  Name   Year `Conflict Stock… `Conflict New D…
##    <chr> <chr> <dbl>            <dbl>            <dbl>
##  1 NGA   Nige…  2009               NA             5000
##  2 NGA   Nige…  2010               NA             5000
##  3 NGA   Nige…  2011               NA            65000
##  4 NGA   Nige…  2012               NA            63000
##  5 NGA   Nige…  2013          3300000           471000
##  6 NGA   Nige…  2014          1075000           975000
##  7 NGA   Nige…  2015          2096000           737000
##  8 NGA   Nige…  2016          1955000           501000
##  9 NGA   Nige…  2017          1707000           279000
## 10 NGA   Nige…  2018          2216000           541000
## 11 NGA   Nige…  2019          2583000           248000
## # … with 2 more variables: `Disaster New Displacements` <dbl>,
## #   `Disaster Stock Displacement` <dbl>

Meta

rhdx's People

Contributors

dickoa


rhdx's Issues

download admin area cod (common operational dataset) for any country

I want to write some code allowing users to download the admin area COD for any country and admin level.

I'm coming up against a few inconsistencies in HDX tags and formats that make this tricky.

I've put all this in one issue for now, in case there is a better way that you can point me to. I can break these up into individual issues if that helps.

What I'm trying to do:

  1. write a query that returns a single dataset for the admin area COD
  2. identify the shapefile resource (assuming that is commonest)
  3. get list of layers
  4. identify and download layer for a specified admin level

Current issues (for examples, see the code below):

  1. I'm struggling to get a query that reliably returns just the one dataset
  2. shapefiles are sometimes tagged as 'zipped shapefile' sometimes 'zipped shapefiles'
  3. sometimes the zipfile contains a subfolder that stops it being opened by sf

Thanks.

    iso3clow <- 'nga'
    #iso3clow <- 'mli'
    level <- 2

    #nigeria does return single result
    #mali returns two datasets first one is population
    querytext <- paste0('vocab_Topics:("common operational dataset - cod" AND "gazetteer" NOT "baseline population") AND groups:', iso3clow)

    rhdx::set_rhdx_config()
    datasets_list <- rhdx::search_datasets(fq = querytext)

    #query needs to return a single dataset (with multiple resources)
    ds <- datasets_list[[1]]

    #get list of resources
    list_of_rs <- rhdx::get_resources(ds)
    list_of_rs

    #selecting resource
    #nigeria "zipped shapefiles"
    #mali "zipped shapefile"
    ds_id <- which( rhdx::get_formats(ds) %in% c("zipped shapefiles","zipped shapefile"))

    rs <- rhdx::get_resource(ds, ds_id)

    # find which layers in file
    mlayers <- rhdx::get_resource_layers(rs, download_folder=getwd())


    #error for nigeria
    #<HDX Resource> aa69f07b-ed8e-456a-9233-b20674730be6
    #Name: nga_adm_osgof_20190417_SHP.zip
    #Format: ZIPPED SHAPEFILES
    #Error: This (spatial) data format is not yet supported
    #in hdx resources.r
    # supported_geo_format <- c("geojson", "zipped shapefile", "zipped geodatabase",
    #                           "zipped geopackage", "kmz", "zipped kml")
    #added "zipped shapefiles" option to supported_geo_format in my local branch of rhdx
    #now I get
    #Cannot open data source /vsizip/C:/rsprojects/afriadmin/nga_adm_osgof_20190417_shp.zip
    #Error in CPL_get_layers(dsn, options, do_count) : Open failed.
    #can I open a layer from the downloaded file directly ?
    #using default should open the first layer
    sflayer <- rhdx::read_resource(rs, download_folder=getwd())
    plot(sf::st_geometry(sflayer))
    #no this also fails
    #seemingly because there is a subfolder within the zip
    #aha, nigeria is in a folder within the zip and mali isn't so nigeria fails and mali works
    #is there a way of detecting and dealing with this ?

    # later read layer using layername
    # this relies on all country layers having adm* in their names    
    layername <- mlayers$name[ grep(paste0("adm",level),mlayers$name) ]

    sflayer <- read_resource(rs, layer = layername, download_folder = getwd())
    
    #test plotting
    plot(sf::st_geometry(sflayer)) 
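One possible workaround for the subfolder problem above, sketched here rather than offered as a fix in rhdx itself: unzip the downloaded archive manually and search for the .shp file recursively before handing it to sf. The helper name is hypothetical, and it assumes zipfile points at an already-downloaded zipped shapefile:

```r
library(sf)

## assumption: 'zipfile' is the path to a locally downloaded zipped shapefile
read_zipped_shp <- function(zipfile) {
  exdir <- tempfile("shp")
  unzip(zipfile, exdir = exdir)
  ## recursive = TRUE so a .shp nested inside a subfolder is still found
  shp <- list.files(exdir, pattern = "\\.shp$",
                    full.names = TRUE, recursive = TRUE)
  if (length(shp) == 0) stop("no .shp file found in ", zipfile)
  sf::st_read(shp[1])
}
```

This sidesteps the /vsizip/ open failure by letting list.files handle whatever folder layout the archive uses.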

identify and download healthsites data in a query

Hi Ahmadou,

I see that hdx has monthly updated healthsites data by country. I wonder if this could be an easier way of getting at the healthsites data.

I've been trying to do something like this to be able to identify the healthsites data but haven't been able to get it to work yet.

library(rhdx)
rhdx::set_rhdx_config()
querytext <- 'Name:"Kenya-healthsites-shp"'
datasets_list <- rhdx::search_datasets(query = querytext)
#so far not returning any results

Any tips appreciated.
Thanks,
Andy

Error in basename(self$data$url) : path too long

remotes::install_gitlab("dickoa/rhxl") ## rhdx dependency
remotes::install_gitlab("dickoa/rhdx") ## github mirror also available
install.packages("gifski")
library(tidyverse)
library(sf)
library(rhdx)
library(gganimate)
set_rhdx_config()
wca <- pull_dataset("west-and-central-africa-administrative-boundaries-levels") %>%
  get_resource(1) %>%
  read_resource(folder = "/data")

reading layer: wca_adm0

glimpse(wca)
g5_ab <- wca %>%
  filter(admin0Pcod %in% c("BF", "ML", "NE", "MR", "TD"))
g5_ab %>%
  ggplot() +
  geom_sf() +
  theme_minimal()
solr_query <- "organization:acled AND groups:(mli OR bfa OR tcd OR mrt OR ner)"
g5_acled <- search_datasets(query = "conflict data",
                            fq = solr_query)
g5_acled <- g5_acled[1:5] ## pick the first 5; the 6th is the Africa-wide dataset

Create a helper function to read resources from each dataset:

read_acled_data <- function(dataset) {
  dataset %>%
    get_resource(1) %>%
    read_resource(force_download = TRUE)
}
g5_acled_data <- map_df(g5_acled, read_acled_data)

read HERA data from xls or ; delimited csv

Hi Ahmadou,

I'm trying to read the HERA subnational covid data via rhdx.
https://data.humdata.org/organization/hera-humanitarian-emergency-response-africa

The csv data are delimited by ;

I tried this, which does find the data but ignores delim and reads everything into a single column:

df1 <- search_datasets("hera", rows = 2) %>% 
  pluck(1) %>% ## select the first dataset
  get_resource(2) %>% ## 2nd resource is csv
  read_resource(delim=';') 

I also tried reading the xls but that seems to put the column names into the first row.

df1 <- search_datasets("hera", rows = 2) %>% 
  pluck(1) %>% ## select the first dataset
  get_resource(1) %>% ## 1st resource is xls
  read_resource() 

Thanks! Andy
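Until read_resource honours a delimiter argument, one hedged workaround is to let rhdx download the file into a known folder and then re-read it with base R's read.csv using sep = ";". The list.files lookup assumes the csv lands directly in download_folder, which may not hold for every resource:

```r
library(rhdx)

set_rhdx_config(hdx_site = "prod")

dl <- tempdir()
ds <- search_datasets("hera", rows = 2)[[1]]
rs <- get_resource(ds, 2)  ## 2nd resource is the csv

## trigger the download; the mis-parsed single-column result is discarded
invisible(try(read_resource(rs, download_folder = dl), silent = TRUE))

## assumption: the downloaded csv now sits directly in 'dl'
csv_path <- list.files(dl, pattern = "\\.csv$", full.names = TRUE)[1]
df1 <- read.csv(csv_path, sep = ";", check.names = FALSE)
```

readr::read_delim(csv_path, delim = ";") would work equally well if the tidyverse is already loaded.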

get_resource_layers() problem

Hi Ahmadou,

I'm trying to get the number of layers in a resource and I get this error.

pull_dataset("administrative-boundaries-cod-mli") %>%
     get_resource(3) %>%
     get_resource_layers()

Error in self$download(folder = download_folder, quiet = quiet_download, :
formal argument "folder" matched by multiple actual arguments

I suspect this may be due to the argument being specified as both folder and download_folder (but I'm not big on these class things).

To show that getting the first layer of the resource does work:

pull_dataset("administrative-boundaries-cod-mli") %>%
  get_resource(3) %>%
  read_resource(download_folder = getwd()) -> malishp

plot(sf::st_geometry(malishp))

Accessing .zip file

Hi Ahmadou,

I'm currently trying to access Facebook's mobility data but am having trouble.

dat <- search_datasets("movement-range-maps", rows = 1) %>%
  pluck(1) %>% # select first result from search
  get_resource(1) %>%
  ???

It seems like their data is currently zipped. Is there a way to access the file? Once I have it, I know how to unzip in R and get the data (which is .txt formatted) but I'm not sure how to get the .zip file itself. Thanks!
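A possible workaround, not an official rhdx API: the "basename(self$data$url)" error reported in another issue above suggests resources keep their download URL in a data$url field. Assuming that field is accessible on the resource object, the zip can be fetched and unpacked with base R:

```r
library(rhdx)

set_rhdx_config(hdx_site = "prod")
ds <- search_datasets("movement-range-maps", rows = 1)[[1]]
rs <- get_resource(ds, 1)

## assumption: the resource object exposes its download URL as rs$data$url
url <- rs$data$url
zipfile <- file.path(tempdir(), basename(url))
download.file(url, zipfile, mode = "wb")

## unpack and list the extracted files (the .txt data among them)
files <- unzip(zipfile, exdir = tempdir())
```

From there the extracted .txt file can be read with read.delim or readr as usual.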
