
readr's Introduction

tidyverse


Overview

The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.

If you’d like to learn how to use the tidyverse effectively, the best place to start is R for Data Science (2e).

Installation

# Install from CRAN
install.packages("tidyverse")
# Install the development version from GitHub
# install.packages("pak")
pak::pak("tidyverse/tidyverse")

If you’re compiling from source, you can run pak::pkg_system_requirements("tidyverse") to see the complete set of system packages needed on your machine.

Usage

library(tidyverse) will load the core tidyverse packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats, and lubridate.

You also get a condensed summary of conflicts with other packages you have loaded:

library(tidyverse)
#> ── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0.9000 ──
#> ✔ dplyr     1.1.3     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You can see conflicts created later with tidyverse_conflicts():

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
tidyverse_conflicts()
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ✖ MASS::select()  masks dplyr::select()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
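
For illustration only (not part of the original README), here is a minimal sketch of how the conflicted package can be used to resolve such a conflict explicitly; conflicts_prefer() is available in recent versions of conflicted:

library(conflicted)
library(dplyr)
filter(mtcars, cyl == 4)        # error: conflicted forces an explicit choice between dplyr::filter and stats::filter
conflicts_prefer(dplyr::filter) # declare the preferred function once per session
filter(mtcars, cyl == 4)        # now unambiguously dplyr::filter()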

And you can check that all tidyverse packages are up-to-date with tidyverse_update():

tidyverse_update()
#> The following packages are out of date:
#>  * broom (0.4.0 -> 0.4.1)
#>  * DBI   (0.4.1 -> 0.5)
#>  * Rcpp  (0.12.6 -> 0.12.7)
#>  
#> Start a clean R session then run:
#> install.packages(c("broom", "DBI", "Rcpp"))

Packages

As well as the core tidyverse, installing this package also installs a selection of other packages that you’re likely to use frequently, but probably not in every analysis. This includes packages for:

  • Working with specific types of vectors:

    • hms, for times.
  • Importing other types of data:

    • feather, for sharing with Python and other languages.
    • haven, for SPSS, SAS and Stata files.
    • httr, for web APIs.
    • jsonlite for JSON.
    • readxl, for .xls and .xlsx files.
    • rvest, for web scraping.
    • xml2, for XML.
  • Modelling:

    • modelr, for modelling within a pipeline.
    • broom, for turning models into tidy data.

Code of Conduct

Please note that the tidyverse project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

readr's People

Contributors

asnr, batpigandme, bearloga, boshek, cderv, dewittpe, dmurdoch, edwindj, ellessenne, gergness, hadley, ironholds, jdblischak, jennybc, jimhester, jrnold, kmillar, krlmlr, lionel-, michaelchirico, michaelquinn32, mkearney, noamross, npjc, romainfrancois, sambrady3, sbearrows, wibeasley, yeedle, yutannihilation


readr's Issues

C++Object install issue using devtools

When trying to install fastread using devtools I get the following error after all the g++ messages scroll through:

installing to /home/devin/R/x86_64-pc-linux-gnu-library/3.1/fastread/libs
** R
** inst
** tests
** preparing package for lazy loading
** help
No man pages found in package  'fastread' 
*** installing help indices
** building package indices
** testing if installed package can be loaded
Warning: class "C++Object" is defined (with package slot 'Rcpp') but no metadata object found to revise subclass information---not exported?  Making a copy in package 'fastread'
* DONE (fastread)

When I try to load it I get the same warning:

> library(fastread)
Warning message:
class "C++Object" is defined (with package slot ‘Rcpp’) but no metadata object found to revise subclass information---not exported?  Making a copy in package ‘fastread’ 

However, no functions are available:

> test_csv <- read_csv("big_csv.csv")
Error: could not find function "read_csv"

For reference, here is my sessionInfo():

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C        
 [3] LC_TIME=C            LC_COLLATE=C        
 [5] LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C           
 [9] LC_ADDRESS=C         LC_TELEPHONE=C      
[11] LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils    
[5] datasets  methods   base     

other attached packages:
[1] fastread_0.0.0 devtools_1.5  

loaded via a namespace (and not attached):
 [1] RCurl_1.95-4.1   Rcpp_0.11.1     
 [3] assertthat_0.1   codetools_0.2-8 
 [5] digest_0.6.4     evaluate_0.5.3  
 [7] httr_0.3         memoise_0.1     
 [9] parallel_3.1.0   stringr_0.6.2   
[11] tools_3.1.0      whisker_0.3-2   
[13] xpose4data_4.4.1

Always add tbl_df class

This is not harmful if dplyr is not loaded, but is helpful if it is (especially for large datasets).
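
A minimal illustration (not from the original issue) of what "always add tbl_df" means in practice; the class vector shown is the one tibble uses:

df <- read.csv("mtcars.csv")                    # plain data.frame (hypothetical file)
class(df) <- c("tbl_df", "tbl", "data.frame")   # prepend the tibble classes
df  # prints with the compact tbl_df method when dplyr/tibble is loaded,
    # and falls back to ordinary data.frame printing otherwise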

arbitrary connection support

We need to leverage this: https://gist.github.com/romainfrancois/6119995

so that we can read a stream from an arbitrary connection. We would still use mmap for speed when it makes sense, but otherwise we can process the stream from a buffered connection.

In a threaded world we can imagine separating the work into:

  • buffering data from the connection
  • processing the data

so that the thread(s) processing the data would not have to wait for input.

But before threads, we can sequentially retrieve data from the connection and process it.

The problem, I suppose, is that we can only read once from some connections, which makes a two-step algorithm (count the number of lines, then allocate the data) difficult ...
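
A rough sketch in R of the pre-threading, sequential approach described above (the file name and chunk size are made up for illustration):

con <- gzfile("2014-01-27.csv.gz", open = "r")  # hypothetical compressed log file
repeat {
  lines <- readLines(con, n = 10000)            # buffer one chunk from the connection
  if (length(lines) == 0) break
  # ... hand the chunk to the parser here ...
}
close(con)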

Should line_delim be a parameter?

Or is it better to just automatically recognise any of the common combinations? I can't think of a use case where the line delimiter matters.

Microbenchmark tweaks

I think things are a bit more readable if you do:

options(digits = 3)
microbenchmark(     
  base = read.csv('factors.csv', header = FALSE, nrows = n, 
    colClasses = rep("factor", 10)), 
  datatable = fread_factors('factors.csv', header = FALSE, 
    sep = ",", nrows = n), 
  fastread = read_csv( 'factors.csv', n, rep( "factor", 10 ) ), 
  times = 10L 
)

i.e. set a smaller number of decimals and label the output explicitly.

I'd also think about testing with 1e5 rows - that's still big enough to get a decent estimate of the speed, but makes the benchmark process a little faster.

Finally, it'd be nice if the benchmarks automatically created the data when needed:

if (!file.exists("factors.csv")) make_factors(nr = 1e5)

Growing vector strategy.

Sometimes we don't know the final size, and it is too expensive or perhaps impossible to calculate it in advance. For example, when we read from several files or connections, I don't think we can realistically assume that users will give us the number of lines in each file.

So we can use std::vector to accumulate data, or perhaps use R_alloc to get a garbage-collected buffer, or both (see http://blog.r-enthusiasts.com/2014/01/21/r-transient-allocator.html, which can easily be extracted out of Rcpp11 if needed).
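
For illustration, here is the amortised-doubling idea sketched in plain R (the real implementation would use std::vector or R_alloc on the C++ side; the function name is made up):

grow_append <- function(buf, used, x) {
  if (used == length(buf)) {
    buf <- c(buf, vector(typeof(buf), length(buf)))  # double the capacity when full
  }
  buf[used + 1] <- x
  list(buf = buf, used = used + 1)
}

state <- list(buf = numeric(8), used = 0)
for (x in runif(100)) state <- grow_append(state$buf, state$used, x)
result <- state$buf[seq_len(state$used)]   # trim to the actual size at the end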

connection pushback

We need to consider connection pushback.

When we read from a connection, there might already be data in the pushback buffer; we need to read this first.

Because we use ->read to buffer data, we need to return unused data to the pushback buffer when we exit. For example, if we read 3 lines, we need to push data back so that the connection resumes at the start of the 4th line.
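
Base R text connections already expose this mechanism; a minimal sketch of the behaviour we need to reproduce:

con <- textConnection(c("a,1", "b,2", "c,3", "d,4"))
lines <- readLines(con, n = 3)   # we buffered 3 lines but only consumed 2
pushBack(lines[3], con)          # return the unused line to the pushback
readLines(con)                   # resumes at the start of "c,3"
close(con)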

read_lines and count_lines should take line delimiter as a parameter

@romainfrancois can you give a hint at how to implement this? It's not clear to me where you need to add to the existing templates.

I'm now thinking exposing the line delimiter is something we'll do for low-level functions like read_lines() and count_lines(), but we don't need to expose it for high-level functions like read_csv() and read_delim().

Decouple MMapReader

There is too much logic in MMapReader; perhaps it should just expose:

  • int get_int
  • double get_double
  • String get_String
  • String get_line

and higher level algorithms can be defined elsewhere, e.g. things like read, get, count_lines, ...

This way we can separate the algorithms from the low-level details, so that supporting other connections would only be a matter of implementing these low-level primitives.

Fast writers

We may also want to consider having fast writers. write.table generates the complete output before saving it to disk, so it's not suitable for saving large files. For a recent problem I had to write

n <- 10000L                    # rows per chunk
m <- floor(nrow(logs) / n)     # number of full chunks
for (i in seq(0, m, by = 1)) {
  start <- i * n + 1
  end <- pmin((i + 1) * n, nrow(logs))

  # append each chunk so the whole table never has to be formatted in memory at once
  write.table(logs[start:end, ], "logs.csv", row.names = FALSE, 
    sep = ",", append = i != 0L, col.names = FALSE, na = "")
  cat(".")
}

That's obviously not ideal.

benchmark reading vs parsing

Here are some simple benchmarks for read_csv on http://cran-logs.rstudio.com/2014/2014-01-27.csv.gz, for now setting all columns to character:

path <- "/tmp/2014-01-27.csv.gz"
junk <- read_csv( file(path, open = "r"), n = 115829, classes = rep("character", 10) )

Note that we assume we know the number of rows (115829) in advance.
The times below are in milliseconds, split into 3 parts:

  • alloc: time R takes to allocate the data. Nothing we can do here.
  • process: parsing the feed and assigning chunks into the vectors
  • read: reading from the connection

     what      time percent
1 process 157.90994   71.38
2   alloc  17.95150    8.11
3    read  45.36878   20.51

It takes more time to process than to read. I think that's a good thing: when we move to threading, it means we can essentially anticipate the reading.

Here is a raw comparison of read_csv and read.csv using a gz connection:

> system.time( read_csv( file(path, open = "r"), n = 115829, classes = rep("character", 10) ) )
       user      system     elapsed
      0.203       0.005       0.207
> system.time( read.csv( file = file(path, open = "r"), nrows = 115829, colClasses = rep("character", 10) ) )
       user      system     elapsed
      1.285       0.010       1.308

Enumerate design principles

When designing fastread functions, it's useful to have some design principles to adhere to. Here are my thoughts:

  • Fail early: fastread functions should fail early, as soon as they discover there is something wrong with the input. fastread should make reasonable default guesses, but if it guesses wrong (e.g. a column turns out to be character, not integer) it shouldn't automatically coerce; it should raise an error indicating what the problem is. The user always knows best about their data, so if a column type (or name) conflicts between the file and the specification, we should throw an error so that the user can resolve the problem. (This also means providing informative diagnostics so they can see exactly where the problem was.)
  • Fast: the goal of fastread is to be fast. If adding a rarely used feature would result in a significant slowdown, we should reconsider how important it is. Users can always fall back to the slow built-in functions if they need to parse an unusual format. fastread should be competitive with the best file readers from other programming languages.

Once we've iterated on these a couple of times, they can go in the readme.

File specification

If the user doesn't supply any classes, we should run something like guess_classes(file, n = 10), which would inspect the first n lines to guess the class of each column:

  • integer and double: straightforward
  • date/time: recognise a few standard unambiguous formats (e.g. %Y-%m-%d) and a selection of the most common ISO 8601 formats
  • factor: never guessed
  • character: anything else

guess_classes() would return a vector of parsers, as well as printing a message informing the user what it guessed. The message would be designed so that you could copy and paste it into a new read_csv call. (The message is always printed to encourage the user to incorporate it into the read_csv call, making code more robust to changes in the underlying data.)

guess_classes("iris.csv")
#> Guessing column types:
#> colclasses(double(), double(), double(), double(), character())

(although obviously those function names don't work because they're already claimed by R)

For user specification we might want a richer form that let you supply a default and exceptions:

read_csv("iris.csv", colclasses(.default = double(), Species = character())

Maybe thinking about it as a column specification would be better. Then you could supply column names:

# Read from first line
colspec(names = firstline(), classes = list(Species = character()), default_class = double())
# Make default names
colspec(names = default(prefix = "X"))
# Supply explicit names (if not present)
colspec(names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"))

The big advantage of assuming that we always have some specification of column names is that we can then specify classes by name, which makes code more robust (it would be an error to specify the class for a column name that doesn't exist in the data).

spec() is probably sufficient for a name, so we'd have a spec() for user-supplied specifications and guess_spec() to determine one automatically. Then we need some way of describing the individual classes. I think it's relatively straightforward to see what each class needs (e.g. format for dates/times, levels for factors); the main problem is coming up with succinct yet clear function names that don't conflict with existing names.

I also wonder if we could deal with missing values through a column specification: double(na.value = 99) etc. The downside is that we'd need a special way of describing a global missing value setting, but it would make the specification more flexible. A complete set of arguments that we could move from read_csv to column specifications is:

  • na.strings
  • strip.white
  • allowEscapes

I'm imagining that the column specifications (e.g. spec_double(), which needs a better name) would generate an RC object that's bound to a C++ object which defines parsing behaviour. Does that make sense?
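
(For comparison only: the column-specification interface readr eventually shipped is close to this sketch.)

library(readr)
read_csv("iris.csv",
  col_types = cols(.default = col_double(), Species = col_character()))
# or, equivalently, as a compact string with one character per column:
read_csv("iris.csv", col_types = "ddddc")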

Fix "C++Object" warning

When building/loading the package I get:

Warning message:
class "C++Object" is defined (with package slot 'Rcpp') but no metadata object found to revise subclass information---not exported? Making a copy in package 'fastread'

Binary data frame file format

Just some ideas about how to store a data frame in a custom binary file. Related to #28.

We need:

  • the number of columns (space for a 32-bit int)
  • the names of the columns (null terminated strings)
  • the R types of the columns (INTSXP, REALSXP, ...), perhaps distinguishing factors, perhaps handling attributes too (times and dates).
  • data for each column. Data should include space for the header of the SEXP and space for the allocator, so that we can do #28.

For STRSXP, I guess we'll have to rematerialize each CHARSXP, so we'd have to come up with some smart way of serializing/deserializing the information.

separate time

For example in the cran logs:

"date","time","size","r_version","r_arch","r_os","package","version","country","ip_id"
"2014-01-27","00:20:07",766603,"3.0.2","i686","linux-gnu","pomp","0.45-8","SE",1
"2014-01-27","00:18:09",429218,"3.0.1","x86_64","mingw32","Cubist","0.0.15","ES",2
"2014-01-27","00:18:11",307604,"3.0.1","x86_64","mingw32","desirability","1.6","ES",2
"2014-01-27","00:18:20",47108,"3.0.1","x86_64","mingw32","fishmove","0.2-1","ES",2
"2014-01-27","00:18:24",57993,"3.0.1","x86_64","mingw32","gamlss.tr","4.2-7","ES",2

The second column is the time of day.

If we were reading both the first and second columns, we could make a POSIXct object directly. But here they are split into date and time.

I can read the date easily enough, either as a Date (#9) or as a POSIXct, but I'm not sure how to represent a time.

I can easily enough read it as a character string.

Perhaps we need something to express "from these two columns, read a date-time"?
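
One workaround, sketched in plain R under the assumption that both columns were read as character (logs stands for the hypothetical data frame built from the CRAN logs):

logs$datetime <- as.POSIXct(
  paste(logs$date, logs$time),       # "2014-01-27" + "00:20:07"
  format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
)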

Source adapters

We can read from arbitrary connections now, and make the data read from the connection available as a Source.

Would be interesting to have a Source implementation that filters data. This FilteredSource would read from the underlying connection, process each line and decide if it keeps it or not.

Interesting filters can be:

  • based on line numbers, for example reading one line in every 10
  • based on string matching: perhaps we can process the line through a regex, or simpler things like starts_with, ends_with, contains

What's interesting about this is that it is independent of the parsing done in read_csv or other consumers of Source.

Of course, depending on the underlying source, it may or may not be possible to calculate the number of lines in advance.
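
The filters described above, sketched as plain R predicates over lines (the function names and input file are made up for illustration):

keep_every_nth <- function(lines, n) lines[seq_along(lines) %% n == 1]    # one line in every n
keep_matching  <- function(lines, pattern) lines[grepl(pattern, lines)]   # regex / starts_with-style match

lines  <- readLines("2014-01-27.csv")             # hypothetical input
wanted <- keep_matching(lines, "\"linux-gnu\"")   # e.g. keep only the linux downloads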

fast readLines

It might be useful to have a fast version of readLines too, especially if you could specify the line break character. I'm sure we could do better than base R.

Fast character count

For guessing line and field delimiters, it would be useful to have some way of quickly counting the number of each non-alphanumeric character in the first n bytes. The result would be a named integer vector, e.g. c("." = 10, "," = 100, "|" = 1, "\n" = 5).
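
A rough R approximation of the idea (the fast version would live in C++); the file name is hypothetical:

first_chunk <- readChar("data.csv", nchars = 1024, useBytes = TRUE)
chars <- strsplit(first_chunk, "", useBytes = TRUE)[[1]]
table(chars[!grepl("[[:alnum:]]", chars)])   # counts of each non-alphanumeric character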

All parsers need invalid argument

e.g. invalid = c("error", "warning", "silent"):

  • error = throw error message with field/line number and abort parsing
  • warning = warning with field/line number and replace with NA
  • silent = just replace with NA.

Read multiple files into one data frame

A common use case is reading multiple files using lapply + read.csv and then combining them into one data frame with rbind/rbind.fill/rbind_list. It would be nice if there were some way to do this in one step to avoid a copy.
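
A sketch of the current two-step approach; later versions of readr also accept a vector of paths directly (with an optional id column naming the source file), which is essentially the one-step version asked for here:

files <- list.files("logs/", pattern = "\\.csv$", full.names = TRUE)   # hypothetical directory
combined <- do.call(rbind, lapply(files, read.csv))                    # read, then bind: one extra copy
# one step in later readr: read_csv(files, id = "path")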

Escaping quotes

Need to be able to pick between doubling (e.g. "a""b") and C-style escaping (e.g. "a\"b").

Input to functions

Maybe all read functions should take:

  • a character vector (possibly of length >1)
  • a connection
  • a list of connections

For connections, we need to use the standard R policy of closing a connection only if we opened it.

This would combine read_csv and read_csv_all into one function.

Question about fastread

Hello Romain,
If I may ask a question about fastread: is there any documentation for fastread?
I use data.table::fread, which is very fast; is there a benchmark?

Kind regards,
Colin

line filtering

It would be useful to supply some sort of line filter function, which could be used to only parse selected lines (e.g. to skip the first n lines, or to skip lines starting with a certain character).

Split on whitespace

For a read.table() equivalent, we need to be able to split on consecutive whitespace. (i.e. it should consume all spaces and tabs until a non-whitespace character is found).
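
In base R terms, the behaviour being asked for is roughly:

strsplit(trimws("  1.5\t  foo   bar "), "[ \t]+")[[1]]
#> [1] "1.5" "foo" "bar"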

Need some way to guess number of columns

I don't think I can do this in R, because it needs to understand the various escaping mechanisms (commas in strings etc.). It would be useful to have a function that reads n lines and parses each line into a character vector (so you'd get a list of length n, containing character vectors that aren't necessarily all the same length).

character vectors as inputs

In the interests of being composable, not just fast, it would be useful to be able to take either a file name or a character vector - and if there are multiple entries in the character vector, to treat each one as a line. That's useful if you need to do some custom preprocessing before you parse the file.
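
read.csv already supports something similar via its text argument; a minimal sketch of the behaviour wanted here:

lines <- c("x,y", "1,a", "2,b")   # e.g. the result of custom preprocessing
read.csv(text = lines)            # each element is treated as one line of input
# readr ended up supporting literal input as well, by wrapping it in I()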

use custom allocators

Suppose we have a binary file of some sort where each vector of the data frame is contiguous. I'm pretty sure we could leverage R's custom allocators (wch/r-source@ebf11b0):

/* R_allocator_t typedef is also declared in Rinternals.h 
   so we guard against random inclusion order */
#ifndef R_ALLOCATOR_TYPE
#define R_ALLOCATOR_TYPE
typedef struct R_allocator R_allocator_t;
#endif

typedef void *(*custom_alloc_t)(R_allocator_t *allocator, size_t);
typedef void  (*custom_free_t)(R_allocator_t *allocator, void *);

struct R_allocator {
    custom_alloc_t mem_alloc; /* malloc equivalent */
    custom_free_t  mem_free;  /* free equivalent */
    void *res;                /* reserved (maybe for copy) - must be NULL */
    void *data;               /* custom data for the allocator implementation */
};

So we would need to come up with a custom_alloc_t function that would return the appropriate location in the mmapped binary file rather than allocating new memory.

Perhaps we can maintain a count of the vectors that come from this mmapped file, and unmap it when mem_free is called on the last of them.

@s-u, does that sound like a potential use case for this feature? It seems that with this we could load data almost instantly (for just a bit more than the cost of mmap, I guess).

Date time parsers

We need to benchmark date-time options. There are two basic types: fixed (e.g. ISO 8601 and Simon's customer) and user-supplied (e.g. "%m-%d-%Y"). For fixed formats, we need to compare hand-rolled solutions with a Boost Spirit Qi parser, and for user-supplied formats we need to compare strftime() and the Boost date-time equivalent.

Repo organisation suggestions

I think it's a bit nicer to make the repo a valid R package from the root directory. This doesn't need many changes:

  • move all directories inside fastread.R into the root dir
  • create a new bench/ dir in the root dir to contain the existing benchmarks

Can't build

Hi.

Interesting package but I can't build it. Compilation log below.

> devtools::install_github("fastread", "hadley")
Installing github repo fastread/master from hadley
Downloading master.zip from https://github.com/hadley/fastread/archive/master.zip
Installing package from /tmp/RtmpX6RYa1/master.zip
arguments 'minimized' and 'invisible' are for Windows only
Installing fastread
'/usr/lib64/R/bin/R' --vanilla CMD INSTALL '/tmp/RtmpX6RYa1/devtools687d5a9f9278/fastread-master' --library='/home/unikum/R/x86_64-unknown-linux-gnu-library/3.1'  \
  --install-tests 

* installing *source* package 'fastread' ...
** libs
g++ -I/usr/include/R/ -DNDEBUG -I../inst/include -D_FORTIFY_SOURCE=2 -I"/home/unikum/R/x86_64-unknown-linux-gnu-library/3.1/Rcpp/include" -I"/home/unikum/R/x86_64-unknown-linux-gnu-library/3.1/BH/include"   -fpic  -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-size=4  -c MultipleConnectionReader.cpp -o MultipleConnectionReader.o
g++ -I/usr/include/R/ -DNDEBUG -I../inst/include -D_FORTIFY_SOURCE=2 -I"/home/unikum/R/x86_64-unknown-linux-gnu-library/3.1/Rcpp/include" -I"/home/unikum/R/x86_64-unknown-linux-gnu-library/3.1/BH/include"   -fpic  -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-size=4  -c RcppExports.cpp -o RcppExports.o
g++ -I/usr/include/R/ -DNDEBUG -I../inst/include -D_FORTIFY_SOURCE=2 -I"/home/unikum/R/x86_64-unknown-linux-gnu-library/3.1/Rcpp/include" -I"/home/unikum/R/x86_64-unknown-linux-gnu-library/3.1/BH/include"   -fpic  -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-size=4  -c count_char.cpp -o count_char.o
count_char.cpp: In function 'std::map<char, int> count_char_from_file(std::string, int, int, std::string, std::string, bool, bool, bool, bool)':
count_char.cpp:71:24: error: no matching function for call to 'std::basic_fstream<char>::basic_fstream(std::string&)'
   std::fstream stream(x);
                        ^
count_char.cpp:71:24: note: candidates are:
In file included from count_char.cpp:3:0:
/usr/include/c++/4.9.0/fstream:819:7: note: std::basic_fstream<_CharT, _Traits>::basic_fstream(const char*, std::ios_base::openmode) [with _CharT = char; _Traits = std::char_traits<char>; std::ios_base::openmode = std::_Ios_Openmode]
       basic_fstream(const char* __s,
       ^
/usr/include/c++/4.9.0/fstream:819:7: note:   no known conversion for argument 1 from 'std::string {aka std::basic_string<char>}' to 'const char*'
/usr/include/c++/4.9.0/fstream:806:7: note: std::basic_fstream<_CharT, _Traits>::basic_fstream() [with _CharT = char; _Traits = std::char_traits<char>]
       basic_fstream()
       ^
/usr/include/c++/4.9.0/fstream:806:7: note:   candidate expects 0 arguments, 1 provided
/usr/include/c++/4.9.0/fstream:779:11: note: std::basic_fstream<char>::basic_fstream(const std::basic_fstream<char>&)
     class basic_fstream : public basic_iostream<_CharT, _Traits>
           ^
/usr/include/c++/4.9.0/fstream:779:11: note:   no known conversion for argument 1 from 'std::string {aka std::basic_string<char>}' to 'const std::basic_fstream<char>&'
/usr/lib64/R/etc/Makeconf:137: recipe for target 'count_char.o' failed
make: *** [count_char.o] Error 1
ERROR: compilation failed for package 'fastread'
* removing '/home/unikum/R/x86_64-unknown-linux-gnu-library/3.1/fastread'
Error: Command failed (1) 
13 stop("Command failed (", status, ")", call. = FALSE) 
12 system_check(r_path, options, c(r_env_vars(), env_vars), ...) 
11 force(code) 
10 in_dir(path, system_check(r_path, options, c(r_env_vars(), env_vars), 
    ...)) 
9 R(paste("CMD INSTALL ", shQuote(built_path), " ", opts, sep = ""), 
    quiet = quiet) 
8 install(pkg_path, quiet = quiet, ...) 
7 install_local_single(bundle, subdir = subdir, before_install = before_install, 
    ...) 
6 (function (url, name = NULL, subdir = NULL, config = list(), 
    before_install = NULL, ...) 
{
    if (is.null(name)) { ... 
5 mapply(install_url_single, url, name, MoreArgs = list(subdir = subdir, 
    config = config, before_install = before_install, ...)) 
4 install_url(conn$url, subdir = conn$subdir, config = conn$auth, 
    before_install = github_before_install, ...) 
3 FUN("fastread"[[1L]], ...) 
2 vapply(repo, install_github_single, FUN.VALUE = logical(1), username, 
    ref, pull, subdir, branch, auth_user, password, auth_token, 
    ..., dependencies = TRUE) 
1 devtools::install_github("fastread", "hadley") 
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=ru_RU.UTF-8       LC_NUMERIC=C               LC_TIME=ru_RU.UTF-8        LC_COLLATE=C               LC_MONETARY=ru_RU.UTF-8    LC_MESSAGES=ru_RU.UTF-8   
 [7] LC_PAPER=ru_RU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] RCurl_1.95-4.1 devtools_1.5   digest_0.6.4   evaluate_0.5.5 httr_0.3       memoise_0.2.1  parallel_3.1.0 stringr_0.6.2  tools_3.1.0    whisker_0.3-2 

~~
Artem

Column name/named parser mismatch.

data <- read_delim(file = "sampled-1000.tsv.log-20130502.gz",
  col_names = FALSE,
  parsers = list(
    V1  = skip_parser(),
    V2  = skip_parser(),
    V3  = character_parser(),
    V4  = skip_parser(),
    V5  = character_parser(),
    V6  = character_parser(),
    V7  = skip_parser(),
    V8  = skip_parser(),
    V9  = character_parser(),
    V10 = skip_parser(),
    V11 = character_parser(),
    V12 = character_parser(),
    V13 = character_parser(),
    V14 = character_parser(),
    V15 = character_parser(),
    V16 = character_parser()
  ),
  quote = "", delim = "\t")

Guessing column specification from first 1 lines of data
No column names in file, using X1-X1
Error: The following named parsers don't match the column names: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16

...I'm probably just missing something, but what's going on here?
