
Comments (7)

hadley commented on June 15, 2024

Run 1-download.r from https://github.com/hadley/cran-logs-dplyr to see this bottleneck in a real problem.

romainfrancois commented on June 15, 2024

Initially, I can come up with something if I can assume:

  • that I know the number of rows in each file
  • that all the files have a compatible format

I can bring over some of the logic from dplyr::rbind_list, e.g. the Promoter etc., to handle type promotion.

When we know the size in advance and there is no type promotion, this is likely to benefit from parallelization.
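For illustration, here is a toy sketch of what type promotion means when stacking columns that do not quite agree; it is a simplified stand-in in plain R, not the actual Promoter logic from dplyr's C++ code:

# find the common type two columns can share without losing information
# (simplified; not dplyr's actual Promoter)
promote_type <- function(x, y) {
  if (is.character(x) || is.character(y)) return("character")
  if (is.double(x)    || is.double(y))    return("double")
  if (is.integer(x)   || is.integer(y))   return("integer")
  "logical"
}

promote_type(1L, 2.5)   # "double"    : the integer column gets widened
promote_type(1L, "a")   # "character" : everything falls back to character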

I can see this becoming a theme across the projects we do together (at least fastread and dplyr). I'll make sure to study the parallel package and perhaps its ancestors (multicore), which probably have interesting nuggets.

hadley commented on June 15, 2024

Yes, I think that sounds like a reasonable approach, and parallelisation is definitely going to be key.

romainfrancois commented on June 15, 2024

One thing to consider is that R has a limit of 128 open connections, so we cannot hold one connection per file when many files participate in this.
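For example, something along these lines hits the cap long before we run out of log files (the cap and the error come from base R's connection table, not from fastread; a few of the 128 slots are reserved, so roughly 125 are usable):

files <- list.files(path, full.names = TRUE)
cons <- list()
try(
  for (f in files) cons[[length(cons) + 1L]] <- file(f, open = "r")
)                            # stops with an "all connections are in use" error
length(cons)                 # how many we managed to open before the cap
for (con in cons) close(con)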

I think it is unlikely that, when we do this (reading from multiple files), we will already know the number of lines in each file. So we'll have to do it in steps:

  • count the number of lines in each file
  • allocate a big enough data frame
  • fill in the data

This probably means we have to run the show from R (i.e. have the loop on the R side); roughly, that loop could look like the sketch below.
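The sketch uses only base R (readLines / read.csv) to show the control flow and assumes the ten columns of the CRAN log files; the real implementation would use count_lines() and the C++ parser instead:

files <- list.files(path, full.names = TRUE)

# pass 1: count data lines per file (file() decompresses .csv.gz transparently),
# subtracting one per file for the header
n_lines <- vapply(files, function(f) length(readLines(f)) - 1L, integer(1))

# pass 2: allocate one big result up front, then fill it chunk by chunk
result <- as.data.frame(matrix(NA, nrow = sum(n_lines), ncol = 10))
offset <- 0L
for (f in files) {
  chunk <- read.csv(f, stringsAsFactors = FALSE)
  result[offset + seq_len(nrow(chunk)), ] <- chunk
  offset <- offset + nrow(chunk)
}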

As an aside, here is counting the number of lines for all the .csv.gz files from the CRAN RStudio logs for 2013:

> system.time({
+   sapply(list.files(path, full.names = TRUE), function(f) {
+     con <- file(f, open = "r")
+     on.exit(close(con))
+     count_lines(con)
+   })
+ })
   user  system elapsed
 10.262   0.166  12.191

12 seconds, which is going to be small compared to parsing the data anyway.

I think it is worth paying that price rather than progressively growing the data. We might be able to divide these timings by using threads.
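For instance, the counting pass fans out naturally over the parallel package (forked worker processes rather than threads, so not available on Windows); a rough sketch reusing count_lines() from above:

library(parallel)

files <- list.files(path, full.names = TRUE)

# same per-file counting as above, spread over forked worker processes
n_lines <- unlist(mclapply(files, function(f) {
  con <- file(f, open = "r")
  on.exit(close(con))
  count_lines(con)
}, mc.cores = 4))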

romainfrancois commented on June 15, 2024

Making progress on read_csv_all. We can read all the .csv.gz files from 2013 into one huge data frame like this:

df_fastread <- read_csv_all(
  files,
  classes = c("Date", "Time", "integer", "string", "string", "string",
              "string", "string", "string", "integer"),
  header = TRUE
)

That took:

   user  system elapsed
 36.318   1.028  41.159

One alternative is to read the data frames into a list and fuse the list with dplyr::rbind_all. Perhaps something like this:

df_rbind_list <- lapply(files, function(file) {
  con <- file(file, open = "r")
  on.exit(close(con))

  read_csv(con,
    classes = c("Date", "Time", "integer", "string", "string", "string",
                "string", "string", "string", "integer"),
    header = TRUE, n = 0
  )
})
rbind_all(df_rbind_list)

This unsurprisingly takes longer:

   user  system elapsed
 51.126   8.359  66.532

One thing to notice is that the initial counting takes a significant amount of time:

   user  system elapsed
 10.242   0.101  10.366

It might be more efficient to use an unzip + mmap strategy so that we only have to uncompress once. At the moment, what I do is:

  • for each file, open a connection, count the number of lines, and close the connection
  • create the big data frame
  • for each file, open a connection and fill the appropriate chunk of the result data frame

So decompression of the data happens twice.

I don't think storing all the data in memory when it is read the first time really is an option, as we would have to do it for all the files, leading to a lot of memory use.

I'll experiment with alternative strategies later:

  • via a single temp file: decompress each input file into the same temp file, then read that file through the mmap source.
  • via several files: each file gets decompressed separately on disk.

The first option is simpler. The second option is easier to parallelise.
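A minimal base-R sketch of the first option, just to show the shape; the real version would do the decompression in C++ and hand the temp file to the mmap source, and the per-file headers have to be dropped when concatenating:

tmp <- tempfile(fileext = ".csv")
out <- file(tmp, open = "w")
for (i in seq_along(files)) {
  lines <- readLines(files[i])     # file() decompresses the .gz transparently
  if (i > 1) lines <- lines[-1]    # keep the header from the first file only
  writeLines(lines, out)
}
close(out)
# the mmap-based source can then read tmp as one big uncompressed csv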

hadley commented on June 15, 2024

Caching to disk would be especially important for url() connections, since the data might change between requests. (That might also allow you to combine url + gzfile, which would be really nice)
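A small sketch of that idea, assuming the remote files are .csv.gz like the CRAN log URLs downloaded by 1-download.r (cache_url() is a hypothetical helper, not a readr function):

cache_url <- function(u, cache_dir = tempdir()) {
  dest <- file.path(cache_dir, basename(u))
  if (!file.exists(dest)) {
    download.file(u, dest, mode = "wb")   # fetch once, keep the .gz on disk
  }
  dest
}

local_path <- cache_url("http://cran-logs.rstudio.com/2013/2013-01-01.csv.gz")
# local_path now behaves like any other .csv.gz file: it can be decompressed,
# mmap'ed, and re-read without hitting the server again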

hadley commented on June 15, 2024

Now that we have fast rbinding in dplyr, I don't think this is an important issue.
