Code Monkey home page Code Monkey logo

Comments (6)

andysouth avatar andysouth commented on August 26, 2024 1

OK thanks.

I think it might be useful to be able to cope with different delimiters if possible.

I wasn't sure why the delim=';' doesn't get passed onto read_csv in the ..., which from a quick look at your code is what I expected.

I'm currently trying reading in the xls and using this to correct.

  #some, but not all xls files, read rownames into first column
  if (df1[[1,1]] == 'ID')
  {
    #set column names from 1st row
    names(df1) <- as.character(df1[1,])
    #remove first row
    df1 <- df1[-1,]    
  }

There are some other inconsistencies between countries to patch too.

from rhdx.

andysouth avatar andysouth commented on August 26, 2024 1

Aha thanks, I hadn't noticed the xls, xlsx thing. I'll feed this back to HERA.
I'll close this issue now.

from rhdx.

dickoa avatar dickoa commented on August 26, 2024

Hi Andy,
I was also bitten by this one recently and I'm thinking about exposing some extra parameters for each reader for extra flexibility. I first wanted to check if HDX/HERA are willing to change the format metadata from CSV to CSV2.
It's not pretty, but I used download_resource(folder = "folder_path") and used read_csv2 but we could support something more generic.
Thanks

from rhdx.

dickoa avatar dickoa commented on August 26, 2024

Good suggestion, can you try the latest version I pushed. I think it allow more flexibility for parsing csv now.

library(rhdx)
library(tidyverse)

pull_dataset("mauritania_covid19_subnational") %>%
  get_resources(format = "csv") %>%
  pluck(1) %>%
  read_resource(delim = ";", locale = locale(decimal_mark = ","))

## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   ID = col_double(),
##   DATE = col_character(),
##   ISO_3 = col_character(),
##   PAYS = col_character(),
##   ID_PAYS = col_double(),
##   REGION = col_character(),
##   ID_REGION = col_double(),
##   CONTAMINES = col_double(),
##   DECES = col_double(),
##   GUERIS = col_double(),
##   CONTAMINES_FEMME = col_double(),
##   CONTAMINES_HOMME = col_double(),
##   CONTAMINES_GENRE_NON_SPECIFIE = col_double(),
##   SOURCE = col_character()
## )

## # A tibble: 3,850 x 14
##       ID DATE  ISO_3 PAYS  ID_PAYS REGION ID_REGION CONTAMINES
##    <dbl> <chr> <chr> <chr>   <dbl> <chr>      <dbl>      <dbl>
##  1     1 13/0… MRT   Maur…       6 Adrar         59          0
##  2     2 13/0… MRT   Maur…       6 Assaba        60          0
##  3     3 13/0… MRT   Maur…       6 Brakna        61          0
##  4     4 13/0… MRT   Maur…       6 Dakhl…        62          0
##  5     5 13/0… MRT   Maur…       6 Gorgol        63          0
##  6     6 13/0… MRT   Maur…       6 Guidi…        64          0
##  7     7 13/0… MRT   Maur…       6 Hodh …        65          0
##  8     8 13/0… MRT   Maur…       6 Hodh …        66          0
##  9     9 13/0… MRT   Maur…       6 Inchi…        67          0
## 10    10 13/0… MRT   Maur…       6 Nouak…        68          1
## # … with 3,840 more rows, and 6 more variables: DECES <dbl>,
## #   GUERIS <dbl>, CONTAMINES_FEMME <dbl>,
## #   CONTAMINES_HOMME <dbl>, CONTAMINES_GENRE_NON_SPECIFIE <dbl>,
## #   SOURCE <chr>

I also added locale just in case it's pure csv2.

For xls file, let me have a look, I don't have this issue when I read it can you also re-check reading the xls file with the latest version. This is what I have

pull_dataset("mauritania_covid19_subnational") %>%
  get_resources(format = "xls") %>%
  pluck(1) %>%
  read_resource()
## Reading sheet:  Feuil1 
## # A tibble: 3,850 x 14
##       ID DATE                ISO_3 PAYS  ID_PAYS REGION ID_REGION
##    <dbl> <dttm>              <chr> <chr>   <dbl> <chr>      <dbl>
##  1     1 2020-03-13 00:00:00 MRT   Maur…       6 Adrar         59
##  2     2 2020-03-13 00:00:00 MRT   Maur…       6 Assaba        60
##  3     3 2020-03-13 00:00:00 MRT   Maur…       6 Brakna        61
##  4     4 2020-03-13 00:00:00 MRT   Maur…       6 Dakhl…        62
##  5     5 2020-03-13 00:00:00 MRT   Maur…       6 Gorgol        63
##  6     6 2020-03-13 00:00:00 MRT   Maur…       6 Guidi…        64
##  7     7 2020-03-13 00:00:00 MRT   Maur…       6 Hodh …        65
##  8     8 2020-03-13 00:00:00 MRT   Maur…       6 Hodh …        66
##  9     9 2020-03-13 00:00:00 MRT   Maur…       6 Inchi…        67
## 10    10 2020-03-13 00:00:00 MRT   Maur…       6 Nouak…        68
## # … with 3,840 more rows, and 7 more variables:
## #   CONTAMINES <dbl>, DECES <dbl>, GUERIS <dbl>,
## #   CONTAMINES_FEMME <dbl>, CONTAMINES_HOMME <dbl>,
## #   CONTAMINES_GENRE_NON_SPECIFIE <dbl>, SOURCE <chr>

Thanks

from rhdx.

andysouth avatar andysouth commented on August 26, 2024

Thanks, great csv is working well now for all countries, so I can do what I wanted to.

xls is strange, it works for e.g. mauritania (as you show) and senegal

but not e.g. for nigeria, ghana, niger

pull_dataset("ghana_covid19_subnational") %>%
     get_resources(format = "xls") %>%
     pluck(1) %>%
     read_resource()

Error: Not an HDX Resource object!

pull_dataset("niger_covid19_subnational") %>%
     get_resources(format = "xls") %>%
     pluck(1) %>%
     read_resource()

Error: Not an HDX Resource object!

from rhdx.

dickoa avatar dickoa commented on August 26, 2024

The issue is the fact they're not consistent, they share xls for some data and xlsx for others, skip a line in one and not other.

pull_dataset("niger_covid19_subnational") %>%
     get_resources()
##  [[1]]
## <HDX Resource> a02fdda1-2c3a-4458-ab24-e2b2de91d1f2 
##   Name: GHA_Subnational_Covid19_HERA.xlsx
##   Description: Subnational data about Covid19 in Ghana - Infected (new cases, gender), Deceased, Recovered.
##   Size: 239636
##   Format: XLSX

## [[2]]
## <HDX Resource> 1a9d2a9b-36a0-46bc-94df-b1a1d4f62bd5 
##   Name: GHA_Subnational_Covid19_HERA.csv
##   Description: Subnational data about Covid19 in Ghana - Infected (new cases, gender), Deceased, Recovered.
##   Size: 386326
##   Format: CSV

## [[3]]
## <HDX Resource> 632fbd66-e9a7-4f75-b8e4-c1074d3372fa 
##   Name: GHA_Subnational_Covid19_HXL_HERA.csv
##   Description: Subnational data about Covid19 in Ghana - Infected (new cases, gender), Deceased, Recovered.
##   Size: 386513
##   Format: CSV

I think HERA or HDX can fix this, it will be easier to have a single format i.e xlsx and not having to skip a line for some countries and not the others. These are good use case for pushing for more consistency in data format.

pull_dataset("ghana_covid19_subnational") %>%
     get_resources(format = "xlsx") %>%
     pluck(1) %>%
     read_resource(skip = 1L)

## Reading sheet:  Feuille 1 
## # A tibble: 4,641 x 14
##       ID DATE  ISO_3 PAYS  ID_PAYS REGION ID_REGION CONTAMINES
##    <dbl> <chr> <chr> <chr>   <dbl> <chr>      <dbl>      <dbl>
##  1     1 12/0… GHA   Ghana       4 Ashan…        26          0
##  2     2 12/0… GHA   Ghana       4 Brong…        27          0
##  3     3 12/0… GHA   Ghana       4 Centr…        28          0
##  4     4 12/0… GHA   Ghana       4 Upper…        29          0
##  5     5 12/0… GHA   Ghana       4 Great…        30          2
##  6     6 12/0… GHA   Ghana       4 North…        31          0
##  7     7 12/0… GHA   Ghana       4 Weste…        32          0
##  8     8 12/0… GHA   Ghana       4 Easte…        33          0
##  9     9 12/0… GHA   Ghana       4 Upper…        34          0
## 10    10 12/0… GHA   Ghana       4 Volta         35          0
## # … with 4,631 more rows, and 6 more variables: DECES <dbl>,
## #   GUERIS <dbl>, CONTAMINES_FEMME <chr>,
## #   CONTAMINES_HOMME <chr>, CONTAMINES_GENRE_NON_SPECIFIE <dbl>,
## #   SOURCE <chr>
## Warning messages:
## 1: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet,  :
##   Expecting numeric in I1445 / R1445C9: got 'Null'
## 2: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet,  :
##   Expecting numeric in I1446 / R1446C9: got 'Null'

from rhdx.

Related Issues (8)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.