Import error if the gtfs-file encoding is `UTF-8 BOM` about gtfsr HOT 9 CLOSED

ropensci-archive commented on June 15, 2024

Import error if the gtfs-file encoding is `UTF-8 BOM`

from gtfsr.

Comments (9)

hrbrmstr commented on June 15, 2024 4

An issue here is that read_csv() uses iconvlist() to validate the encoding parameter passed to locale(). On my Linux and macOS systems, UTF-8-BOM isn't in that list, so it won't do the conversion (it throws an error).
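To check whether this applies on a given machine, the name can be looked up in iconvlist() directly. This is a minimal base R sketch; the variable names are made up for illustration, and the result depends on the platform's iconv, so no particular output is assumed.

```r
# Encoding names that this platform's iconv reports as supported.
known <- iconvlist()

# The comment above reports "UTF-8-BOM" as missing on Linux and macOS;
# other platforms may differ.
bom_listed <- "UTF-8-BOM" %in% known
plain_listed <- any(toupper(known) %in% c("UTF-8", "UTF8"))

print(c(`UTF-8-BOM` = bom_listed, `UTF-8` = plain_listed))
```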

This could be useful (modified from the linked gist):

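has_bom <- function(path, encoding="UTF-8") {

  # Read the first four bytes of the file as raw values
  B <- readBin(path, "raw", 4, 1)
  # Compare them with the byte order mark of the requested encoding
  # (FF FE is the little-endian UTF-16 BOM, FE FF the big-endian one)
  switch(encoding,
       `UTF-8`=B[1]==as.raw(0xef) & B[2]==as.raw(0xbb) & B[3]==as.raw(0xbf),
       `UTF-16`=B[1]==as.raw(0xff) & B[2]==as.raw(0xfe),
       `UTF-16BE`=B[1]==as.raw(0xfe) & B[2]==as.raw(0xff),
       { message("Unsupported encoding") ; return(NA) }
  )
}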

has_bom("/tmp/stop_times.txt")
## [1] TRUE

has_bom("/tmp/stops.txt")
## [1] FALSE
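For anyone without a BOM-encoded feed at hand, the check can be tried on throwaway files. This is a minimal sketch in base R: starts_with_utf8_bom() is a hypothetical helper written for this example (it covers only the UTF-8 case), and the CSV content is invented.

```r
# Hypothetical helper for this sketch: TRUE if the file starts with EF BB BF.
starts_with_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3)
  identical(first, as.raw(c(0xef, 0xbb, 0xbf)))
}

# Two temporary files with the same CSV content, one prefixed with a BOM.
csv_bytes <- charToRaw("trip_id,stop_id\n1,10\n")
with_bom <- tempfile(fileext = ".txt")
without_bom <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), csv_bytes), with_bom)
writeBin(csv_bytes, without_bom)

bom_flags <- c(starts_with_utf8_bom(with_bom),
               starts_with_utf8_bom(without_bom))
print(bom_flags)
```

Comparing with identical() also returns FALSE for files shorter than three bytes, since the raw vector read from them has a different length.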

Hadley isn't technically correct here when he says that read.csv supports UTF-8-BOM as an encoding parameter. read.csv — well, ultimately the C source for read.table() — just ignores the BOM (try it with and w/o setting that parameter and you'll see that it makes no difference).

You could do something like:

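library(readr)    # read_csv(), type_convert()
library(tibble)   # as_tibble()
library(magrittr) # %>%

if (has_bom(path)) { # has_bom() from above
  # base read.csv() for files that start with a BOM
  read.csv(path, stringsAsFactors=FALSE) %>%
    type_convert() %>%
    as_tibble()
} else {
  read_csv(path)
}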

(adding any other parameters you were using in the gtfs read code for the CSVs) and it should take care of this.

The sneaky thing here is that the column name looks OK after read_csv() reads in a file with a BOM, but look at the column specification printed by the existing gtfs code:

Parsed with column specification:
cols(
  `trip_id` = col_integer(),
  arrival_time = col_time(format = ""),
  departure_time = col_time(format = ""),
  stop_id = col_integer(),
  stop_sequence = col_integer(),
  pickup_type = col_integer(),
  drop_off_type = col_integer()
)

the back-ticks around trip_id give away that something weird is going on and, sure enough, the raw bytes of that column name start with the UTF-8 BOM (ef bb bf):

[1] ef bb bf 74 72 69 70 5f 69 64

So another option is to use make.names() after read_csv() on the column names and check for trip_id really being X.trip_id and just change the column name. That means you don't need the has_bom() check.
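A related way to repair the name, sketched here under stated assumptions, is to strip the BOM character from the column names directly instead of going through make.names(). The data frame below is invented for illustration and stands in for whatever the reader returned; only base R is used.

```r
# The BOM as a single character (U+FEFF).
bom <- intToUtf8(0xFEFF)

# A stand-in data frame whose first column name carries a leading BOM.
stop_times <- data.frame(a = 1L, stop_id = 10L)
names(stop_times)[1] <- paste0(bom, "trip_id")

# Remove a leading U+FEFF from every column name; other names are untouched.
names(stop_times) <- sub(paste0("^", bom), "", names(stop_times))
print(names(stop_times))
```

Like the make.names() idea, this needs no has_bom() check, because names without a BOM pass through unchanged.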

Either way, it was a nice catch by Patrick.

dantonnoriega commented on June 15, 2024

I'll be going over the package next week! I'll give this a look.

On Sep 13, 2016 7:32 AM, "Patrick Hausmann" wrote:

Hello,
import_gtfs fails if the encoding of a gtfs file is UTF-8 BOM. This is
because the function uses readr::read_csv with the default encoding in
locale set to UTF-8.

Example:

# source http://www.sardegnamobilita.it/opengovernment/opendata/
# the encoding for stop_times.txt is UTF-8 BOM
r <- import_gtfs("http://www.covimo.de/gtfs/dati_atpss.zip")

Not sure what a solution could be. Maybe make locale an argument in the
function call? Most of the time, all files of a gtfs dataset are of one
type (BOM or not BOM).

Thank you for the package!
Patrick

patperu commented on June 15, 2024

Ok, great!
I came across this gist from @hrbrmstr https://gist.github.com/hrbrmstr/be3bf6e2b7e8b06648fd - maybe it could be useful.

patperu commented on June 15, 2024

Wow! Bob, thank you so much for this! I'm always excited about your answers!!

dantonnoriega commented on June 15, 2024

I think I've solved this but before closing, please give it a go!

hrbrmstr commented on June 15, 2024

I felt compelled to update here since I was a bit wrong (I like to admit when I'm wrong :-)

read.csv() itself is not handling the UTF-8-BOM parameter, but R is internally when processing connections:

void set_iconv(Rconnection con)
{
    void *tmp;

    /* need to test if this is text, open for reading to writing or both,
       and set inconv and/or outconv */
    if(!con->text || !strlen(con->encname) ||
       strcmp(con->encname, "native.enc") == 0) {
        con->UTF8out = FALSE;
        return;
    }
    if(con->canread) {
        size_t onb = 50;
        char *ob = con->oconvbuff;
        /* UTF8out is set in readLines() and scan()
           Was Windows-only until 2.12.0, but we now require iconv.
         */
        Rboolean useUTF8 = !utf8locale && con->UTF8out;
        const char *enc =
            streql(con->encname, "UTF-8-BOM") ? "UTF-8" : con->encname;
        tmp = Riconv_open(useUTF8 ? "UTF-8" : "", enc);
        if(tmp != (void *)-1) con->inconv = tmp;
        else set_iconv_error(con, con->encname, useUTF8 ? "UTF-8" : "");
        con->EOF_signalled = FALSE;
        /* initialize state, and prepare any initial bytes */
        Riconv(tmp, NULL, NULL, &ob, &onb);
        con->navail = (short)(50-onb); con->inavail = 0;
        /* libiconv can handle BOM marks on Windows Unicode files, but
           glibc's iconv cannot. Aargh ... */
        if(streql(con->encname, "UCS-2LE") ||
           streql(con->encname, "UTF-16LE")) con->inavail = -2;
        /* Discaard BOM */
        if(streql(con->encname, "UTF-8-BOM")) con->inavail = -3;
    }
    if(con->canwrite) {
        size_t onb = 25;
        char *ob = con->init_out;
        tmp = Riconv_open(con->encname, "");
        if(tmp != (void *)-1) con->outconv = tmp;
        else set_iconv_error(con, con->encname, "");
        /* initialize state, and prepare any initial bytes */
        Riconv(tmp, NULL, NULL, &ob, &onb);
        ob[25-onb] = '\0';
    }
}

I especially like the:

    /* libiconv can handle BOM marks on Windows Unicode files, but
       glibc's iconv cannot. Aargh ... */

comment :-)

That actually gets exposed at the C-level R interface as Rf_set_iconv(), too.

So, anything that can take a connection can actually have the BOM stripped if one does, say, file("example.csv", open="r", encoding="UTF-8-BOM").
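The connection approach can be tried end to end with a throwaway file. This is a minimal sketch in base R; the file content and variable names are invented for illustration, and read.csv() merely stands in for any reader that accepts a connection.

```r
# Write a small CSV that starts with the UTF-8 BOM bytes EF BB BF.
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)),
           charToRaw("trip_id,stop_id\n1,10\n")), path)

# Open a text-mode connection that asks R to discard a leading BOM.
con <- file(path, open = "r", encoding = "UTF-8-BOM")
stop_times <- read.csv(con, stringsAsFactors = FALSE)
close(con)

print(names(stop_times))
```

Because the BOM is dropped at the connection level, the first column name should come through as plain trip_id with no renaming step afterwards.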

Apologies if I led you astray at all.

dantonnoriega commented on June 15, 2024

No apologies needed! Even if your solution was "wrong" in the sense that
read.csv is not really doing the work, it still provided me with a way of
getting the code to be more robust!

Without your insight, I would have likely spent a long while trying to
figure out what was wrong, and likely would have ended up with zero
conceptual understanding, off base entirely, or with a shit solution.

patperu commented on June 15, 2024

Sorry for being late, I'll give it a try during the weekend!
Thank you!

hrbrmstr commented on June 15, 2024

Thanks to this I've got a fledgling pkg: https://github.com/hrbrmstr/bom that will eventually make it more trivial to deal with these crazy BOMs. I haven't looked at how you solved it for gtfsr, yet, but let me know what types of API calls would help in the future. I implemented it in C for speed but may be able to tap into the connections interface on the R side if that would be more helpful.
