Comments (9)
An issue here is that read_csv() uses iconvlist() to test the encoding parameter to locale(), and on my Linux and macOS systems UTF-8-BOM isn't in that list, so it won't do the conversion (it throws an error instead).
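One way to verify this on a given platform (a quick, hedged check, since the contents of iconvlist() vary by OS and iconv implementation):

```r
# Does this platform's iconv know "UTF-8-BOM" as an encoding name?
# On glibc-based Linux and on macOS this is typically FALSE.
"UTF-8-BOM" %in% iconvlist()
```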
This could be useful (modified from the linked gist):
has_bom <- function(path, encoding = "UTF-8") {
  B <- readBin(path, "raw", 4, 1)
  switch(
    encoding,
    `UTF-8`    = B[1] == as.raw(0xef) & B[2] == as.raw(0xbb) & B[3] == as.raw(0xbf),
    `UTF-16`   = B[1] == as.raw(0xff) & B[2] == as.raw(0xfe),
    `UTF-16BE` = B[1] == as.raw(0xfe) & B[2] == as.raw(0xff),
    { message("Unsupported encoding"); return(NA) }
  )
}
has_bom("/tmp/stop_times.txt")
## [1] TRUE
has_bom("/tmp/stops.txt")
## [1] FALSE
Hadley isn't technically correct here when he says that read.csv supports UTF-8-BOM as an encoding parameter. read.csv — well, ultimately the C source for read.table() — just ignores the BOM (try it with and without setting that parameter and you'll see that it makes no difference).
You could do something like:
library(readr)     # read_csv(), type_convert()
library(tibble)    # as_tibble()
library(magrittr)  # %>%

if (has_bom(path)) {  # has_bom() from above
  read.csv(path, stringsAsFactors = FALSE) %>%
    type_convert() %>%
    as_tibble()
} else {
  read_csv(path)
}
(adding any other parameters you were using in the gtfs read code for the CSVs) and it should take care of this.
The sneaky thing here is that the column name looks OK after read_csv() reads in a file with a BOM, but in the case of what's going on with the existing gtfs code:
Parsed with column specification:
cols(
`trip_id` = col_integer(),
arrival_time = col_time(format = ""),
departure_time = col_time(format = ""),
stop_id = col_integer(),
stop_sequence = col_integer(),
pickup_type = col_integer(),
drop_off_type = col_integer()
)
the back-ticks around trip_id give it away that something weird is going on and, sure enough, the column name is actually this:
[1] ef bb bf 74 72 69 70 5f 69 64
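That byte dump can be reproduced with charToRaw() — a small sketch using a simulated BOM-prefixed name rather than the actual gtfs file:

```r
# Simulate the column name read_csv() produced: a UTF-8 BOM
# ("\ufeff", bytes ef bb bf) glued onto the front of "trip_id".
nm <- "\ufefftrip_id"
charToRaw(nm)
## [1] ef bb bf 74 72 69 70 5f 69 64
```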
So another option is to run make.names() on the column names after read_csv(), check whether trip_id is really X.trip_id, and just change the column name back. That way you don't need the has_bom() check.
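A minimal sketch of that rename-based approach (strip_bom_names() is a hypothetical helper, not part of gtfsr; it assumes the BOM survives as a "\ufeff" prefix on the first column name, as shown above):

```r
# Hypothetical helper: drop a leading UTF-8 BOM from column names,
# so no has_bom() pre-check is needed.
strip_bom_names <- function(df) {
  names(df) <- sub("^\ufeff", "", names(df))
  df
}

d <- data.frame(x = 1)
names(d) <- "\ufefftrip_id"   # simulate a BOM-prefixed name
names(strip_bom_names(d))
## [1] "trip_id"
```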
Either way, it was a nice catch by Patrick.
from gtfsr.
I'll be going over the package next week! I'll give this a look.
On Sep 13, 2016 7:32 AM, "Patrick Hausmann" [email protected] wrote:
Hello,

import_gtfs fails if the encoding of a gtfs file is UTF-8 BOM. This is because the function uses readr::read_csv with the default encoding in locale set to UTF-8. Example:

# source: http://www.sardegnamobilita.it/opengovernment/opendata/
# the encoding for stop_times.txt is UTF-8 BOM
r <- import_gtfs("http://www.covimo.de/gtfs/dati_atpss.zip")

Not sure what a solution could be - maybe make locale an argument in the function call? Most of the time all files of a gtfs dataset are of one type (BOM or not BOM).

Thank you for the package!
Patrick
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#19, or mute the thread
https://github.com/notifications/unsubscribe-auth/AFaR2ZmNE7033VkjQfoFhH0tk7u2MXFGks5qponpgaJpZM4J7kdf
.
Ok, great!
I came across this gist from @hrbrmstr https://gist.github.com/hrbrmstr/be3bf6e2b7e8b06648fd - maybe it could be useful.
Wow! Bob, thank you so much for this! I'm always excited about your answers!!
I think I've solved this but before closing, please give it a go!
I felt compelled to update here since I was a bit wrong (I like to admit when I'm wrong :-). read.csv() itself is not handling the UTF-8-BOM parameter, but R is handling it internally when processing connections:
void set_iconv(Rconnection con)
{
    void *tmp;

    /* need to test if this is text, open for reading or writing or both,
       and set inconv and/or outconv */
    if(!con->text || !strlen(con->encname) ||
       strcmp(con->encname, "native.enc") == 0) {
        con->UTF8out = FALSE;
        return;
    }
    if(con->canread) {
        size_t onb = 50;
        char *ob = con->oconvbuff;
        /* UTF8out is set in readLines() and scan()
           Was Windows-only until 2.12.0, but we now require iconv.
        */
        Rboolean useUTF8 = !utf8locale && con->UTF8out;
        const char *enc =
            streql(con->encname, "UTF-8-BOM") ? "UTF-8" : con->encname;
        tmp = Riconv_open(useUTF8 ? "UTF-8" : "", enc);
        if(tmp != (void *)-1) con->inconv = tmp;
        else set_iconv_error(con, con->encname, useUTF8 ? "UTF-8" : "");
        con->EOF_signalled = FALSE;
        /* initialize state, and prepare any initial bytes */
        Riconv(tmp, NULL, NULL, &ob, &onb);
        con->navail = (short)(50-onb); con->inavail = 0;
        /* libiconv can handle BOM marks on Windows Unicode files, but
           glibc's iconv cannot. Aargh ... */
        if(streql(con->encname, "UCS-2LE") ||
           streql(con->encname, "UTF-16LE")) con->inavail = -2;
        /* Discard BOM */
        if(streql(con->encname, "UTF-8-BOM")) con->inavail = -3;
    }
    if(con->canwrite) {
        size_t onb = 25;
        char *ob = con->init_out;
        tmp = Riconv_open(con->encname, "");
        if(tmp != (void *)-1) con->outconv = tmp;
        else set_iconv_error(con, con->encname, "");
        /* initialize state, and prepare any initial bytes */
        Riconv(tmp, NULL, NULL, &ob, &onb);
        ob[25-onb] = '\0';
    }
}
I especially like the:
/* libiconv can handle BOM marks on Windows Unicode files, but
glibc's iconv cannot. Aargh ... */
comment :-)
That actually gets exposed at the C-level R interface as Rf_set_iconv(), too. So anything that can take a connection can have the BOM stripped if one does, say, file("example.csv", open = "r", encoding = "UTF-8-BOM").
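A quick sketch of that connection trick, writing a throwaway BOM'd CSV to a temp file first (file names and column names here are invented for illustration):

```r
# Build a tiny CSV that starts with a UTF-8 BOM ...
tf <- tempfile(fileext = ".csv")
writeBin(as.raw(c(0xef, 0xbb, 0xbf)), tf)                  # the BOM
cat("trip_id,stop_id\n1,42\n", file = tf, append = TRUE)   # the data

# ... then read it through a connection opened with the
# "UTF-8-BOM" encoding so R discards the BOM for us.
con <- file(tf, open = "r", encoding = "UTF-8-BOM")
df <- read.csv(con)
close(con)   # read.csv leaves an already-open connection open
names(df)
## [1] "trip_id" "stop_id"
```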
Apologies if I led you astray at all.
No apologies needed! Even if your solution was "wrong" in the sense that
read.csv is not really doing the work, it still provided me with a way of
getting the code to be more robust!
Without your insight, I would have likely spent a long while trying to
figure out what was wrong, and likely would have ended up with zero
conceptual understanding, off base entirely, or with a shit solution.
Sorry for being late, I'll give it a try during the weekend!
Thank you!
Thanks to this I've got a fledgling pkg: https://github.com/hrbrmstr/bom that will eventually make it more trivial to deal with these crazy BOMs. I haven't looked at how you solved it for gtfsr yet, but let me know what types of API calls would help in the future. I implemented it in C for speed but may be able to tap into the connections interface on the R side if that would be more helpful.