Hi Ahmadou, I'm trying to read the HERA subnational covid data via r

Aha thanks, I hadn't noticed the xls , <code class="no

read HERA data from xls or ; delimited csv about rhdx HOT 6 CLOSED

dickoa commented on August 26, 2024

read HERA data from xls or ; delimited csv

from rhdx.

Comments (6)

andysouth commented on August 26, 2024 1

OK thanks.

I think it might be useful to be able to cope with different delimiters if possible.

I wasn't sure why the delim=';' doesn't get passed onto read_csv in the ..., which from a quick look at your code is what I expected.

I'm currently trying reading in the xls and using this to correct.

  #some, but not all xls files, read rownames into first column
  if (df1[[1,1]] == 'ID')
  {
    #set column names from 1st row
    names(df1) <- as.character(df1[1,])
    #remove first row
    df1 <- df1[-1,]    
  }

There are some other inconsistencies between countries to patch too.

from rhdx.

andysouth commented on August 26, 2024 1

Aha thanks, I hadn't noticed the xls, xlsx thing. I'll feed this back to HERA.
I'll close this issue now.

from rhdx.

dickoa commented on August 26, 2024

Hi Andy,
I was also bitten by this one recently and I'm thinking about exposing some extra parameters for each reader for extra flexibility. I first wanted to check if HDX/HERA are willing to change the format metadata from CSV to CSV2.
It's not pretty, but I used download_resource(folder = "folder_path") and used read_csv2 but we could support something more generic.
Thanks

from rhdx.

dickoa commented on August 26, 2024

Good suggestion, can you try the latest version I pushed. I think it allow more flexibility for parsing csv now.

library(rhdx)
library(tidyverse)

pull_dataset("mauritania_covid19_subnational") %>%
  get_resources(format = "csv") %>%
  pluck(1) %>%
  read_resource(delim = ";", locale = locale(decimal_mark = ","))

## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   ID = col_double(),
##   DATE = col_character(),
##   ISO_3 = col_character(),
##   PAYS = col_character(),
##   ID_PAYS = col_double(),
##   REGION = col_character(),
##   ID_REGION = col_double(),
##   CONTAMINES = col_double(),
##   DECES = col_double(),
##   GUERIS = col_double(),
##   CONTAMINES_FEMME = col_double(),
##   CONTAMINES_HOMME = col_double(),
##   CONTAMINES_GENRE_NON_SPECIFIE = col_double(),
##   SOURCE = col_character()
## )

## # A tibble: 3,850 x 14
##       ID DATE  ISO_3 PAYS  ID_PAYS REGION ID_REGION CONTAMINES
##    <dbl> <chr> <chr> <chr>   <dbl> <chr>      <dbl>      <dbl>
##  1     1 13/0… MRT   Maur…       6 Adrar         59          0
##  2     2 13/0… MRT   Maur…       6 Assaba        60          0
##  3     3 13/0… MRT   Maur…       6 Brakna        61          0
##  4     4 13/0… MRT   Maur…       6 Dakhl…        62          0
##  5     5 13/0… MRT   Maur…       6 Gorgol        63          0
##  6     6 13/0… MRT   Maur…       6 Guidi…        64          0
##  7     7 13/0… MRT   Maur…       6 Hodh …        65          0
##  8     8 13/0… MRT   Maur…       6 Hodh …        66          0
##  9     9 13/0… MRT   Maur…       6 Inchi…        67          0
## 10    10 13/0… MRT   Maur…       6 Nouak…        68          1
## # … with 3,840 more rows, and 6 more variables: DECES <dbl>,
## #   GUERIS <dbl>, CONTAMINES_FEMME <dbl>,
## #   CONTAMINES_HOMME <dbl>, CONTAMINES_GENRE_NON_SPECIFIE <dbl>,
## #   SOURCE <chr>

I also added locale just in case it's pure csv2.

For xls file, let me have a look, I don't have this issue when I read it can you also re-check reading the xls file with the latest version. This is what I have

pull_dataset("mauritania_covid19_subnational") %>%
  get_resources(format = "xls") %>%
  pluck(1) %>%
  read_resource()
## Reading sheet:  Feuil1 
## # A tibble: 3,850 x 14
##       ID DATE                ISO_3 PAYS  ID_PAYS REGION ID_REGION
##    <dbl> <dttm>              <chr> <chr>   <dbl> <chr>      <dbl>
##  1     1 2020-03-13 00:00:00 MRT   Maur…       6 Adrar         59
##  2     2 2020-03-13 00:00:00 MRT   Maur…       6 Assaba        60
##  3     3 2020-03-13 00:00:00 MRT   Maur…       6 Brakna        61
##  4     4 2020-03-13 00:00:00 MRT   Maur…       6 Dakhl…        62
##  5     5 2020-03-13 00:00:00 MRT   Maur…       6 Gorgol        63
##  6     6 2020-03-13 00:00:00 MRT   Maur…       6 Guidi…        64
##  7     7 2020-03-13 00:00:00 MRT   Maur…       6 Hodh …        65
##  8     8 2020-03-13 00:00:00 MRT   Maur…       6 Hodh …        66
##  9     9 2020-03-13 00:00:00 MRT   Maur…       6 Inchi…        67
## 10    10 2020-03-13 00:00:00 MRT   Maur…       6 Nouak…        68
## # … with 3,840 more rows, and 7 more variables:
## #   CONTAMINES <dbl>, DECES <dbl>, GUERIS <dbl>,
## #   CONTAMINES_FEMME <dbl>, CONTAMINES_HOMME <dbl>,
## #   CONTAMINES_GENRE_NON_SPECIFIE <dbl>, SOURCE <chr>

Thanks

from rhdx.

andysouth commented on August 26, 2024

Thanks, great csv is working well now for all countries, so I can do what I wanted to.

xls is strange, it works for e.g. mauritania (as you show) and senegal

but not e.g. for nigeria, ghana, niger

pull_dataset("ghana_covid19_subnational") %>%
     get_resources(format = "xls") %>%
     pluck(1) %>%
     read_resource()

Error: Not an HDX Resource object!

pull_dataset("niger_covid19_subnational") %>%
     get_resources(format = "xls") %>%
     pluck(1) %>%
     read_resource()

Error: Not an HDX Resource object!

from rhdx.

dickoa commented on August 26, 2024

The issue is the fact they're not consistent, they share xls for some data and xlsx for others, skip a line in one and not other.

pull_dataset("niger_covid19_subnational") %>%
     get_resources()
##  [[1]]
## <HDX Resource> a02fdda1-2c3a-4458-ab24-e2b2de91d1f2 
##   Name: GHA_Subnational_Covid19_HERA.xlsx
##   Description: Subnational data about Covid19 in Ghana - Infected (new cases, gender), Deceased, Recovered.
##   Size: 239636
##   Format: XLSX

## [[2]]
## <HDX Resource> 1a9d2a9b-36a0-46bc-94df-b1a1d4f62bd5 
##   Name: GHA_Subnational_Covid19_HERA.csv
##   Description: Subnational data about Covid19 in Ghana - Infected (new cases, gender), Deceased, Recovered.
##   Size: 386326
##   Format: CSV

## [[3]]
## <HDX Resource> 632fbd66-e9a7-4f75-b8e4-c1074d3372fa 
##   Name: GHA_Subnational_Covid19_HXL_HERA.csv
##   Description: Subnational data about Covid19 in Ghana - Infected (new cases, gender), Deceased, Recovered.
##   Size: 386513
##   Format: CSV

I think HERA or HDX can fix this, it will be easier to have a single format i.e xlsx and not having to skip a line for some countries and not the others. These are good use case for pushing for more consistency in data format.

pull_dataset("ghana_covid19_subnational") %>%
     get_resources(format = "xlsx") %>%
     pluck(1) %>%
     read_resource(skip = 1L)

## Reading sheet:  Feuille 1 
## # A tibble: 4,641 x 14
##       ID DATE  ISO_3 PAYS  ID_PAYS REGION ID_REGION CONTAMINES
##    <dbl> <chr> <chr> <chr>   <dbl> <chr>      <dbl>      <dbl>
##  1     1 12/0… GHA   Ghana       4 Ashan…        26          0
##  2     2 12/0… GHA   Ghana       4 Brong…        27          0
##  3     3 12/0… GHA   Ghana       4 Centr…        28          0
##  4     4 12/0… GHA   Ghana       4 Upper…        29          0
##  5     5 12/0… GHA   Ghana       4 Great…        30          2
##  6     6 12/0… GHA   Ghana       4 North…        31          0
##  7     7 12/0… GHA   Ghana       4 Weste…        32          0
##  8     8 12/0… GHA   Ghana       4 Easte…        33          0
##  9     9 12/0… GHA   Ghana       4 Upper…        34          0
## 10    10 12/0… GHA   Ghana       4 Volta         35          0
## # … with 4,631 more rows, and 6 more variables: DECES <dbl>,
## #   GUERIS <dbl>, CONTAMINES_FEMME <chr>,
## #   CONTAMINES_HOMME <chr>, CONTAMINES_GENRE_NON_SPECIFIE <dbl>,
## #   SOURCE <chr>
## Warning messages:
## 1: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet,  :
##   Expecting numeric in I1445 / R1445C9: got 'Null'
## 2: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet,  :
##   Expecting numeric in I1446 / R1446C9: got 'Null'

from rhdx.

read HERA data from xls or ; delimited csv about rhdx HOT 6 CLOSED

Comments (6)

Related Issues (8)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent