petrbouchal / czso

Use Open Data from the Czech Statistical Office in R

Home Page: https://petrbouchal.xyz/czso

License: Other

R 96.13% JavaScript 0.71% CSS 3.16%

czech-republic open-data statistics r czso rstats czech-statistical-office dataset rstats-package

czso

The goal of czso is to provide direct, programmatic, hassle-free access from R to open data provided by the Czech Statistical Office (CZSO).

This is done by

providing direct access from R to the catalogue of open CZSO datasets, eliminating the hassle from data discovery. Normally this is done through the CZSO’s product catalogue, which is unfortunately a bit clunky, or data.gov.cz, which is not a natural starting point for many.
providing a function to load a specific dataset to R directly from the CZSO’s datastore, eliminating the friction of copying a URL, downloading, unzipping etc.

Additionally, the package provides access to metadata on datasets and to codelists (číselníky) as a special case of datasets listed in the catalogue.

Installation

You can install the package from CRAN:

install.packages("czso")

You can install the latest in-development release from GitHub with:

remotes::install_github("petrbouchal/czso", ref = remotes::github_release())

or the latest version with:

remotes::install_github("petrbouchal/czso")

I also keep binaries in a drat repo, which you can install from with:

install.packages("czso", repos = "https://petrbouchal.xyz/drat")

Example

Say you are looking for a dataset whose title refers to wages (mzda/mzdy).

First, retrieve the list of available CZSO datasets:

library(czso)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(stringr))

catalogue <- czso_get_catalogue()

Now search for your terms of interest in the dataset titles:

catalogue %>% 
  filter(str_detect(title, "[Mm]zd[ay]")) %>% 
  select(dataset_id, title, description)
#> # A tibble: 2 × 3
#>   dataset_id title                                                   description
#>   <chr>      <chr>                                                   <chr>      
#> 1 110080     Průměrná hrubá měsíční mzda a medián mezd v krajích     Datová sad…
#> 2 110079     Zaměstnanci a průměrné hrubé měsíční mzdy podle odvětví Datová sad…

You could also search the descriptions or keywords, which are retrieved into the catalogue as well.
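As a minimal sketch of searching titles and descriptions together, the example below uses a small made-up data frame in place of the real catalogue. The rows and the helper name search_catalogue() are illustrative assumptions, not part of the package, and base R's grepl() is used so that the sketch runs offline:

```r
# Toy stand-in for the output of czso_get_catalogue(); all rows are invented
# (diacritics dropped so the sketch does not depend on the session encoding)
toy_catalogue <- data.frame(
  dataset_id = c("000001", "000002", "000003"),
  title = c("Prumerna hruba mesicni mzda",
            "Index prumyslove produkce",
            "Zamestnanci podle odvetvi"),
  description = c("Datova sada obsahuje udaje o mzdach",
                  "Datova sada obsahuje casovou radu",
                  "Udaje o zamestnancich a jejich mzdy"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where the pattern matches the title OR the description
search_catalogue <- function(catalogue, pattern) {
  hit <- grepl(pattern, catalogue$title) | grepl(pattern, catalogue$description)
  catalogue[hit, c("dataset_id", "title", "description")]
}

wage_hits <- search_catalogue(toy_catalogue, "[Mm]zd")
```

With the real catalogue, the same OR condition could be placed inside dplyr's filter() in the pipeline shown earlier.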

The dataset_id of the required dataset is now visible; use it to get the dataset:

czso_get_table("110080")
#> # A tibble: 1,080 × 14
#>    idhod  hodnota stapro_kod SPKVANTIL_cis SPKVANTIL_kod POHLAVI_cis POHLAVI_kod
#>    <chr>    <dbl> <chr>      <chr>         <chr>         <chr>       <chr>      
#>  1 73662…   21782 5958       7636          Q5            <NA>        <NA>       
#>  2 73662…   25625 5958       <NA>          <NA>          <NA>        <NA>       
#>  3 73662…   28431 5958       <NA>          <NA>          102         1          
#>  4 73662…   22133 5958       <NA>          <NA>          102         2          
#>  5 73662…   23533 5958       7636          Q5            102         1          
#>  6 73662…   19731 5958       7636          Q5            102         2          
#>  7 74595…   26033 5958       <NA>          <NA>          <NA>        <NA>       
#>  8 74595…   28873 5958       <NA>          <NA>          102         1          
#>  9 74595…   22496 5958       <NA>          <NA>          102         2          
#> 10 74595…   21997 5958       7636          Q5            <NA>        <NA>       
#> # ℹ 1,070 more rows
#> # ℹ 7 more variables: rok <int>, uzemi_cis <chr>, uzemi_kod <chr>,
#> #   STAPRO_TXT <chr>, uzemi_txt <chr>, SPKVANTIL_txt <chr>, POHLAVI_txt <chr>

You can retrieve the schema for the dataset:

czso_get_table_schema("110080")
#> # A tibble: 14 × 5
#>    name          titles        `dc:description`                required datatype
#>    <chr>         <chr>         <chr>                           <lgl>    <chr>   
#>  1 idhod         idhod         "unikátní identifikátor údaje … TRUE     string  
#>  2 hodnota       hodnota       "zjištěná hodnota"              TRUE     number  
#>  3 stapro_kod    stapro_kod    "kód statistické proměnné ze s… TRUE     string  
#>  4 spkvantil_cis spkvantil_cis "kód číselníku pro kvantil"     TRUE     string  
#>  5 spkvantil_kod spkvantil_kod "kód položky z číselníku pro k… TRUE     string  
#>  6 pohlavi_cis   pohlavi_cis   "kód číselníku pro pohlaví"     TRUE     string  
#>  7 pohlavi_kod   pohlavi_kod   "kód položky číselníku pro poh… TRUE     string  
#>  8 rok           rok           "rok referenčního období ve fo… TRUE     number  
#>  9 uzemi_cis     uzemi_cis     "kód číselníku pro referenční … TRUE     string  
#> 10 uzemi_kod     uzemi_kod     "kód položky číselníku pro ref… TRUE     string  
#> 11 uzemi_txt     uzemi_txt     "text položky z číselníku pro … TRUE     string  
#> 12 stapro_txt    stapro_txt    "text statistické proměnné"     TRUE     string  
#> 13 spkvantil_txt spkvantil_txt "text položky číselníku pro kv… TRUE     string  
#> 14 pohlavi_txt   pohlavi_txt   "text položky číselníku pro po… TRUE     string

and download the documentation in PDF:

czso_get_dataset_doc("110080", action = "download", format = "pdf")
#> ✔ Downloaded <https://www.czso.cz/documents/62353418/171419376/110080-22dds.pdf> to '110080-22dds.pdf'

If you are interested in linking this data to different data, you might need the NUTS codes for regions. Seeing that the lines with regional breakdown list uzemi_cis as "100", you can get that codelist (číselník):

czso_get_codelist(100)
#> # A tibble: 15 × 11
#>    kodjaz akrcis  kodcis chodnota zkrtext text  admplod admnepo cznuts kod_ruian
#>    <chr>  <chr>   <chr>  <chr>    <chr>   <chr> <chr>   <chr>   <chr>  <chr>    
#>  1 CS     KRAJ_N… 100    3000     Extra-… Extr… 2004-0… 9999-0… CZZZZ  <NA>     
#>  2 CS     KRAJ_N… 100    3018     Hl. m.… Hlav… 2001-0… 9999-0… CZ010  19       
#>  3 CS     KRAJ_N… 100    3026     Středo… Stře… 2001-0… 9999-0… CZ020  27       
#>  4 CS     KRAJ_N… 100    3034     Jihoče… Jiho… 2001-0… 9999-0… CZ031  35       
#>  5 CS     KRAJ_N… 100    3042     Plzeňs… Plze… 2001-0… 9999-0… CZ032  43       
#>  6 CS     KRAJ_N… 100    3051     Karlov… Karl… 2001-0… 9999-0… CZ041  51       
#>  7 CS     KRAJ_N… 100    3069     Ústeck… Úste… 2001-0… 9999-0… CZ042  60       
#>  8 CS     KRAJ_N… 100    3077     Libere… Libe… 2001-0… 9999-0… CZ051  78       
#>  9 CS     KRAJ_N… 100    3085     Králov… Král… 2001-0… 9999-0… CZ052  86       
#> 10 CS     KRAJ_N… 100    3093     Pardub… Pard… 2001-0… 9999-0… CZ053  94       
#> 11 CS     KRAJ_N… 100    3107     Kraj V… Kraj… 2001-0… 9999-0… CZ063  108      
#> 12 CS     KRAJ_N… 100    3115     Jihomo… Jiho… 2001-0… 9999-0… CZ064  116      
#> 13 CS     KRAJ_N… 100    3123     Olomou… Olom… 2001-0… 9999-0… CZ071  124      
#> 14 CS     KRAJ_N… 100    3131     Zlínsk… Zlín… 2001-0… 9999-0… CZ072  141      
#> 15 CS     KRAJ_N… 100    3140     Moravs… Mora… 2001-0… 9999-0… CZ080  132      
#> # ℹ 1 more variable: zkrkraj <chr>

You would then need to do a bit of manual work to join this codelist onto the data.
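One possible way to do that join is sketched below with toy data frames and base R's merge(). It assumes that uzemi_cis in the data corresponds to kodcis in the codelist and uzemi_kod to chodnota; that pairing is inferred from the printed output and not confirmed by the package documentation. The wage values and the third row are invented for illustration:

```r
# Toy stand-in for the wage table; values are invented
wages <- data.frame(
  hodnota   = c(30000, 28000, 29000),
  uzemi_cis = c("100", "100", "97"),
  uzemi_kod = c("3018", "3026", "19"),
  stringsAsFactors = FALSE
)

# Two rows of the regional codelist, copied from the printed codelist output
regions <- data.frame(
  kodcis   = c("100", "100"),
  chodnota = c("3018", "3026"),
  cznuts   = c("CZ010", "CZ020"),
  stringsAsFactors = FALSE
)

# Left join: keep every data row, attach the NUTS code where the codelist has a match
wages_nuts <- merge(
  wages,
  regions[, c("kodcis", "chodnota", "cznuts")],
  by.x = c("uzemi_cis", "uzemi_kod"),
  by.y = c("kodcis", "chodnota"),
  all.x = TRUE
)
```

Rows whose territory comes from a different codelist (the third row here) keep an NA in cznuts, which makes unmatched records easy to spot.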

A note about “tables” and “datasets”

In the parlance of the official open data catalogue, a dataset can have multiple distributions (typically multiple formats of the same data). These are called resources in the internals, and manifest as tables in this package. Some metainformation, such as the documentation, is the property of a dataset, while other metainformation, such as the schema, is the property of a table. Hence the function names in this package. This is to keep things organised even if the CZSO almost always provides only one table per dataset and appends new data to it over time.

Data sources

The catalogue is drawn from https://data.gov.cz through the SPARQL endpoint.

The data and specific metadata are then accessed via the package_show endpoint of the CZSO API at (example) https://vdb.czso.cz/pll/eweb/package_show?id=290038r19.

Credit and notes

The package is not created or endorsed by the Czech Statistical Office, though they, as well as the open data team at the Ministry of Interior, deserve credit for getting the data out there.
The package relies on the data.gov.cz catalogue of open data and on the CZSO’s local catalogue.
NB: The robots.txt at the domain hosting the CZSO’s catalogue prohibits robots from accessing it; while this may be an inappropriate/erroneous setting for what is in essence a data API, this package tries to honor the spirit of that setting by only accessing the API once per czso_get_table() call, relying on a different system for czso_get_catalogue(). Hence, do not use this package for harvesting large numbers of datasets from the CZSO.
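
If the same dataset is needed repeatedly within one session, one way to avoid extra requests is to cache results locally. The following is only a sketch under stated assumptions: get_table_cached() and the fake fetcher are hypothetical names and are not part of the package.

```r
# Session-level cache keyed by dataset ID (hypothetical, not part of czso)
.table_cache <- new.env(parent = emptyenv())

# Call `fetch` only if the ID has not been retrieved in this session yet
get_table_cached <- function(dataset_id, fetch) {
  if (!exists(dataset_id, envir = .table_cache, inherits = FALSE)) {
    assign(dataset_id, fetch(dataset_id), envir = .table_cache)
  }
  get(dataset_id, envir = .table_cache, inherits = FALSE)
}

# Offline demonstration with a fake fetcher that counts how often it is called
n_calls <- 0
fake_fetch <- function(id) {
  n_calls <<- n_calls + 1
  data.frame(dataset_id = id, hodnota = 1)
}

first  <- get_table_cached("110080", fake_fetch)
second <- get_table_cached("110080", fake_fetch)
```

With the package installed, get_table_cached("110080", czso::czso_get_table) would contact the CZSO API at most once per session for that ID.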

Acknowledgments

Thanks to @jakubklimek and @martinnecasky for helping me figure out the SPARQL endpoint on the Czech National Open Data Catalogue.

The logo

An homage to the CZSO’s work in releasing its data in an open format, something that is not necessarily in its DNA.

It alludes to the shades of the country reflected in the tabular data provided. By interspersing the comma symbol into the name of the package, it refers to both integration between statistics and open data and the slight disruption that the world of statistics undergoes when that integration happens.

Contributing / code of conduct

Please note that the ‘czso’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

czso's Issues

Switch to using data.gov.cz API instead of CSV of all datasets

Create logo :)

Add function for showing definition of an indicator/all indicators in a data frame

handle XML schemas in czso_get_table_schema()

list extracted files when zip contains multiple

Automatically convert dates in files

Graceful fail in case of vdb.czso.cz not available

Recently the VDB server of ČSÚ was unavailable for an extended period of time, which triggered timeout failures in CI workflows relying on {czso}.

This behaviour of {czso} may not be fully compliant with the CRAN policy requiring packages to fail gracefully when an internet resource is unavailable.

I have a function written for this, and I would be happy to propose a pull request to {czso} if this would be agreeable to the maintainers (the code is ready, so little extra work would be involved).

For your information, the code is available at https://github.com/jlacko/RCzechia/blob/master/R/ok_to_proceed.R
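
To illustrate what a graceful failure could look like, here is a minimal sketch using base R's tryCatch(). It is not the ok_to_proceed() function linked above; fetch_gracefully() and failing_fetch() are hypothetical names used only for this example.

```r
# Run a download function; on any error, report it and return NULL
# instead of stopping the calling code
fetch_gracefully <- function(fetch, ...) {
  tryCatch(
    fetch(...),
    error = function(e) {
      message("Resource not available: ", conditionMessage(e))
      NULL
    }
  )
}

# Offline demonstration: a fetcher that always fails, standing in for an unreachable server
failing_fetch <- function(id) stop("could not resolve host")

result <- fetch_gracefully(failing_fetch, "110080")
```

Callers can then check is.null(result) and skip the remaining steps, which keeps examples and vignettes from erroring when the server is down.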

Switch messaging to {cli}

parameterise CSV separator in czso_get_table()

Move to Github Actions

add codelist translation function

Typo in message in czso_get_table()

Incorporate text pattern matching in czso_get_catalogue()

compatibility with macOS 12 (Monterey)

I strongly suspect that a possible incompatibility of {czso} with macOS 12 (released 2021-10-25) was the root cause of the recent archival of RCzechia. I used to call czso::czso_get_table("SLDB-VYBER") in the vignette, which led to unexplained cryptography exceptions at the TLS level (I suspect, but cannot be 100% certain, that they are related to the certificate on the CZSO servers accessed via https).

I was unable to reproduce the issue reliably, let alone resolve it, but I have been able to sidestep it by replacing the czso call with a download of a static file (zvcr034.xls).

I don't have access to a Monterey machine and thus I am not in a position to offer help with the topic :(

Problems with the download of some tables

An error message appears when downloading some tables from the database. Most downloads work correctly, but the industrial output dataset, for example, fails as shown below.

catalogue <- czso::czso_get_catalogue()

catalogue %>%
  filter(str_detect(title, "[Pp]růmysl")) %>%
  select(dataset_id, title, description)
# A tibble: 1 × 3
#   dataset_id title                     description
# 1 150196     Index průmyslové produkce Datová sada obsahuje časovou řadu s údaji o vývo…

# Dataset download
prumysl <- czso_get_table("150196")
Error: lexical error: invalid character inside string.
418/198519054/150196-23dds.htm ","temporal_start":"2000-01-0
(right here) ------^

Add checks that data has > 0 rows

Integrate VDB public database

Add citation functions

see petrbouchal/statnipokladna#49

Implement date-specific codelist loading

Add print method metadata (alongside dataset?)

Switch API interfacing to {httr2}

Election results datasets missing from catalogue

Problem

At least two datasets listed on data.gov.cz are missing from the output of czso_get_catalogue():

Cause

These are missing from the CZSO catalogue extract that the czso_get_catalogue() function relies on. For complicated technical reasons related to backward compatibility, the catalogue is not drawn directly from data.gov.cz. These datasets are added by CZSO to the data.gov.cz catalogue manually, so they are not in this extract.

Workaround

They can, however, still be loaded using czso_get_table():

czso_get_table("ps2021pst4p")

czso_get_table("ps2021pst4")

In similar situations, one might want to inspect this listing to see if a plausible-looking ID does in fact exist. Any ID listed in this file can be fed to czso_get_table().

Add dest_dir param to czso_get_table() and czso_get_dataset()

Expose functions for {targets} workflow

Desired end state

tar_url(czso_url, czso_get_url(id, resource_num))
tar_file(czso_csv, czso_get_csv(url))
tar_target(czso_table, czso_load_table(czso_csv))

czso_get_table() does not send User-Agent

switch to pkgdown 1.5.x when available

In-session caching

Use usethis UI functions consistently

rename functions to _czso_

Switch to matomo tracking

Add print method for catalogue entry

Switch to newer CZSO LKOD API

Lexical error / invalid character when calling for codebook No. 1035 / DELCASI

A call for the codebook 1035 / DELCASI errors out.

czso::czso_get_codelist("cis1035")
# Error: lexical error: invalid character inside string.
#           du (alfanumerický kód) :  a)	1. znak - čas intervalu, nab
#                      (right here) ------^

At a glance it seems possible that the problem may be related to the formatting of an underlying JSON object (i.e. possibly a problem on the CZSO side), but I did not have the time to look into execution of the underlying code so this is just a guess.
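
If the guess is right and the response contains raw control characters (such as the tab visible in the error above) inside a JSON string, which the JSON format does not allow unescaped, one possible workaround is to replace them before parsing. This is only a sketch of that idea on a made-up string; strip_control_chars() is a hypothetical helper and the actual cause has not been verified.

```r
# Replace raw control characters (tabs, newlines, etc.) with spaces before parsing
strip_control_chars <- function(json_text) {
  gsub("[[:cntrl:]]", " ", json_text)
}

# Made-up JSON text with a literal tab inside a string value
raw_json <- "{\"popis\": \"a)\t1. znak\"}"

clean_json <- strip_control_chars(raw_json)
```

The cleaned text could then be handed to a JSON parser such as jsonlite::fromJSON(). Properly escaped sequences in the source text are left untouched because they contain no raw control characters.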

allow automatic adding of codelists

Add social-graph metadata

detect and read in XML codelists in get_table()

Use schema for parsing tables and column types

Attempt creating link to CZSO documentation file for each dataset

Make proper use of tibble

https://github.com/petrbouchal/statnipokladna/issues/new

Complete schema provider function

Examples fail on travis

return documentation URL visibly

Dataset IDs shown in result of `czso_get_catalogue()` do not work in `czso_get_table()`

Current workaround

run czso_get_catalogue()
identify dataset you need
go to the URL listed in the page column
look at the Kód item at the top of the page - a code in the form of 123456-21
use the first six digits of this code as the dataset_id parameter to czso_get_table()

You can replace steps 1-3 with Google.
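
The last step can be sketched in base R as follows. The helper name code_to_dataset_id() is hypothetical, and the input is the placeholder code format from the list above:

```r
# Keep the first six characters of the catalogue code, e.g. "123456-21" -> "123456"
code_to_dataset_id <- function(code) {
  substr(code, 1, 6)
}

dataset_id <- code_to_dataset_id("123456-21")
```

The resulting string is what would be passed as the dataset_id argument to czso_get_table().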

Issue

This is caused by the new version of the local open data catalogue (LKOD) deployed by the data provider. The field dct:identifier in the National Data Catalogue (NKOD) used to contain the CZSO dataset code, but now contains the full URL of the metadata for the dataset in the LKOD.

Solution

I am talking to the data provider to understand

whether this is the intended end state
whether they can also publish the CZSO dataset ID (Kód) in the NKOD/LKOD responses so that this package can have backward compatibility

Once I hear back, I will either rewrite the code to handle the new catalogue with a breaking change, or rewrite without a breaking change if the original IDs can be provided by the new catalogue, or provide a transitional fallback mechanism that will rely on the old LKOD, but the recommended way will be to use the new IDs as provided by both LKOD and NKOD.
