Comments (9)

hrbrmstr commented on June 15, 2024

An issue here is that read_csv() uses iconvlist() to test the encoding parameter to locale() and on my Linux and macOS systems, UTF-8-BOM isn't in said list so it won't do the conversion (throws an error).
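A quick way to check this on a given machine, as a minimal sketch: it only asks base R's iconvlist() whether the platform advertises an encoding under that exact name. The variable name bom_supported is just for illustration, and the result differs between systems, so no expected output is shown.

```r
# iconvlist() returns the encoding names known to this platform's iconv.
# bom_supported is TRUE only if "UTF-8-BOM" appears under that exact name.
bom_supported <- "UTF-8-BOM" %in% iconvlist()
bom_supported
```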

This could be useful (modified from the linked gist):

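has_bom <- function(path, encoding="UTF-8") {

  # read (up to) the first 4 bytes of the file as raw bytes
  B <- readBin(path, "raw", 4, 1)
  # compare the leading bytes with the BOM for the requested encoding
  switch(encoding,
       `UTF-8`=B[1]==as.raw(0xef) & B[2]==as.raw(0xbb) & B[3]==as.raw(0xbf),
       `UTF-16`=B[1]==as.raw(0xff) & B[2]==as.raw(0xfe),   # little-endian BOM
       `UTF-16BE`=B[1]==as.raw(0xfe) & B[2]==as.raw(0xff), # big-endian BOM
       { message("Unsupported encoding") ; return(NA) }
  )
}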

has_bom("/tmp/stop_times.txt")
## [1] TRUE

has_bom("/tmp/stops.txt")
## [1] FALSE

Hadley isn't technically correct here when he says that read.csv supports UTF-8-BOM as an encoding parameter. read.csv — well, ultimately the C source for read.table() — just ignores the BOM (try it with and w/o setting that parameter and you'll see that it makes no difference).

You could do something like:

library(readr)  # read_csv(), type_convert()
library(dplyr)  # %>% and as_tibble()

if (has_bom(path)) { # has_bom() from above
  read.csv(path, stringsAsFactors=FALSE) %>%
    type_convert() %>%
    as_tibble()
} else {
  read_csv(path)
}

(adding any other parameters you were using in the gtfs read code for the CSVs) and it should take care of this.

The sneaky thing here is that the column name looks OK after read_csv() reads in a file with a BOM, but look at what the existing gtfs code reports:

Parsed with column specification:
cols(
  `trip_id` = col_integer(),
  arrival_time = col_time(format = ""),
  departure_time = col_time(format = ""),
  stop_id = col_integer(),
  stop_sequence = col_integer(),
  pickup_type = col_integer(),
  drop_off_type = col_integer()
)

the back-ticks around trip_id give it away that something weird is going on and, sure enough, the raw bytes of the column name start with the UTF-8 BOM (ef bb bf):

[1] ef bb bf 74 72 69 70 5f 69 64

So another option is to use make.names() after read_csv() on the column names and check for trip_id really being X.trip_id and just change the column name. That means you don't need the has_bom() check.
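A minimal sketch of the "just fix the column name" idea, using a direct strip of the BOM character instead of make.names(). strip_bom_names() is a hypothetical helper (it is not part of readr or gtfsr), the data frame is made up to mimic the bytes shown above, and it assumes the names arrive as UTF-8 strings.

```r
# Hypothetical helper: drop a leading BOM (U+FEFF) from every column name.
strip_bom_names <- function(df) {
  names(df) <- sub("^\ufeff", "", names(df))
  df
}

# Made-up data frame whose first column name starts with a BOM,
# mimicking what a BOM-prefixed stop_times.txt can produce.
df <- data.frame(a = 1:2, stop_id = 3:4)
names(df)[1] <- "\ufefftrip_id"

# The first three bytes of the name are the UTF-8 BOM.
bom_bytes <- charToRaw(names(df)[1])[1:3]

df <- strip_bom_names(df)
names(df)
```

Because this works on the names after the fact, it needs no has_bom() check and leaves files without a BOM untouched.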

Either way, it was a nice catch by Patrick.

from gtfsr.

dantonnoriega commented on June 15, 2024

I'll be going over the package next week! I'll give this a look.

On Sep 13, 2016 7:32 AM, "Patrick Hausmann" wrote:

Hello,
import_gtfs fails if the encoding of a gtfs file is UTF-8 BOM. This is
because the function uses readr::read_csv with the default encoding in
locale set to UTF-8.

Example:

# source http://www.sardegnamobilita.it/opengovernment/opendata/
# the encoding for stop_times.txt is UTF-8 BOM
r <- import_gtfs("http://www.covimo.de/gtfs/dati_atpss.zip")

Not sure what a solution could be - maybe make locale an argument in the
function call? Most of the time, all files of a gtfs dataset are of one
type (BOM or not BOM).

Thank you for the package!
Patrick


patperu commented on June 15, 2024

Ok, great!
I came across this gist from @hrbrmstr https://gist.github.com/hrbrmstr/be3bf6e2b7e8b06648fd - maybe it could be useful.

patperu commented on June 15, 2024

Wow! Bob, thank you so much for this! I'm always excited about your answers!!

dantonnoriega commented on June 15, 2024

I think I've solved this but before closing, please give it a go!

hrbrmstr commented on June 15, 2024

I felt compelled to update here since I was a bit wrong (I like to admit when I'm wrong :-)

read.csv() itself is not handling the UTF-8-BOM parameter, but R is internally when processing connections:

void set_iconv(Rconnection con)
{
    void *tmp;

    /* need to test if this is text, open for reading to writing or both,
       and set inconv and/or outconv */
    if(!con->text || !strlen(con->encname) ||
       strcmp(con->encname, "native.enc") == 0) {
    con->UTF8out = FALSE;
    return;
    }
    if(con->canread) {
    size_t onb = 50;
    char *ob = con->oconvbuff;
    /* UTF8out is set in readLines() and scan()
       Was Windows-only until 2.12.0, but we now require iconv.
     */
    Rboolean useUTF8 = !utf8locale && con->UTF8out;
    const char *enc =
        streql(con->encname, "UTF-8-BOM") ? "UTF-8" : con->encname;
    tmp = Riconv_open(useUTF8 ? "UTF-8" : "", enc);
    if(tmp != (void *)-1) con->inconv = tmp;
    else set_iconv_error(con, con->encname, useUTF8 ? "UTF-8" : "");
    con->EOF_signalled = FALSE;
    /* initialize state, and prepare any initial bytes */
    Riconv(tmp, NULL, NULL, &ob, &onb);
    con->navail = (short)(50-onb); con->inavail = 0;
    /* libiconv can handle BOM marks on Windows Unicode files, but
       glibc's iconv cannot. Aargh ... */
    if(streql(con->encname, "UCS-2LE") ||
       streql(con->encname, "UTF-16LE")) con->inavail = -2;
    /* Discaard BOM */
    if(streql(con->encname, "UTF-8-BOM")) con->inavail = -3;
    }
    if(con->canwrite) {
    size_t onb = 25;
    char *ob = con->init_out;
    tmp = Riconv_open(con->encname, "");
    if(tmp != (void *)-1) con->outconv = tmp;
    else set_iconv_error(con, con->encname, "");
    /* initialize state, and prepare any initial bytes */
    Riconv(tmp, NULL, NULL, &ob, &onb);
    ob[25-onb] = '\0';
    }
}

I especially like the:

    /* libiconv can handle BOM marks on Windows Unicode files, but
       glibc's iconv cannot. Aargh ... */

comment :-)

That actually gets exposed at the C-level R interface as Rf_set_iconv(), too.

So, anything that can take a connection can actually have the BOM stripped if one does, say, file("example.csv", open="r", encoding="UTF-8-BOM").
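To make that concrete, here is a small self-contained sketch. The temporary file and its contents are made up for illustration; the only thing taken from the discussion above is opening the connection with encoding="UTF-8-BOM" before handing it to a reader.

```r
# Write a tiny CSV that starts with the UTF-8 BOM (ef bb bf).
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xef, 0xbb, 0xbf)),
    charToRaw("trip_id,stop_id\n1,10\n2,20\n")),
  tmp
)

# Open a connection that discards the BOM, then read from it.
con <- file(tmp, open = "r", encoding = "UTF-8-BOM")
df <- read.csv(con, stringsAsFactors = FALSE)
close(con)  # read.csv() does not close a connection it did not open

names(df)
unlink(tmp)
```

read.csv(path, fileEncoding = "UTF-8-BOM") is a shorter way to get the same connection, since read.table() passes fileEncoding on to file().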

Apologies if I led you astray at all.

dantonnoriega commented on June 15, 2024

No apologies needed! Even if your solution was "wrong" in the sense that
read.csv is not really doing the work, it still provided me with a way of
getting the code to be more robust!

Without your insight, I would have likely spent a long while trying to
figure out what was wrong, and likely would have ended up with zero
conceptual understanding, off base entirely, or with a shit solution.

patperu commented on June 15, 2024

Sorry for being late, I'll give it a try during the weekend!
Thank you!

hrbrmstr commented on June 15, 2024

Thanks to this I've got a fledgling pkg: https://github.com/hrbrmstr/bom that will eventually make it more trivial to deal with these crazy BOMs. I haven't looked at how you solved it for gtfsr, yet, but let me know what types of API calls would help in the future. I implemented it in C for speed but may be able to tap into the connections interface on the R side if that would be more helpful.
