
Comments (7)

hadley commented on June 15, 2024

Run 1-download.r from https://github.com/hadley/cran-logs-dplyr to see this bottleneck in a real problem.

romainfrancois commented on June 15, 2024

Initially, I can come up with something if I can assume:

  • that I know the number of rows in each file
  • that all the files have a compatible format

I can bring over some of the logic from dplyr::rbind_list, e.g. the Promoter etc., to handle type promotion.

When we know the size in advance and there is no type promotion, this is likely to benefit from parallelization.
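For illustration, here is a toy sketch of what type promotion means when stacking columns that do not quite agree; it is a simplified stand-in in plain R, not the actual Promoter logic from dplyr's C++ code:

# find the common type two columns can share without losing information
# (simplified; not dplyr's actual Promoter)
promote_type <- function(x, y) {
  if (is.character(x) || is.character(y)) return("character")
  if (is.double(x)    || is.double(y))    return("double")
  if (is.integer(x)   || is.integer(y))   return("integer")
  "logical"
}

promote_type(1L, 2.5)   # "double"    : the integer column gets widened
promote_type(1L, "a")   # "character" : everything falls back to character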

I can see this becoming a theme across the projects we do together (at least fastread and dplyr). I'll make sure to study the parallel package and perhaps its ancestors (multicore), which probably have interesting nuggets.

hadley commented on June 15, 2024

Yes, I think that sounds like a reasonable approach, and parallelisation is definitely going to be key.

romainfrancois commented on June 15, 2024

One thing to consider is that R has a limit of 128 open connections, so we cannot hold one connection per file when many files participate in this.
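For example, something along these lines hits the cap long before we run out of log files (the cap and the error come from base R's connection table, not from fastread; a few of the 128 slots are reserved, so roughly 125 are usable):

files <- list.files(path, full.names = TRUE)
cons <- list()
try(
  for (f in files) cons[[length(cons) + 1L]] <- file(f, open = "r")
)                            # stops with an "all connections are in use" error
length(cons)                 # how many we managed to open before the cap
for (con in cons) close(con)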

I think it is unlikely that, when we do this (reading from multiple files), we will already know the number of lines in each file. So we'll have to do it in steps:

  • count the number of lines in each file
  • allocate a big enough data frame
  • fill in the data

This probably means we have to run the show from R (i.e. have the loop on the R side); roughly, that loop could look like the sketch below.
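The sketch uses only base R (readLines / read.csv) to show the control flow and assumes the ten columns of the CRAN log files; the real implementation would use count_lines() and the C++ parser instead:

files <- list.files(path, full.names = TRUE)

# pass 1: count data lines per file (file() decompresses .csv.gz transparently),
# subtracting one per file for the header
n_lines <- vapply(files, function(f) length(readLines(f)) - 1L, integer(1))

# pass 2: allocate one big result up front, then fill it chunk by chunk
result <- as.data.frame(matrix(NA, nrow = sum(n_lines), ncol = 10))
offset <- 0L
for (f in files) {
  chunk <- read.csv(f, stringsAsFactors = FALSE)
  result[offset + seq_len(nrow(chunk)), ] <- chunk
  offset <- offset + nrow(chunk)
}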

As an aside, here is counting the number of lines for all the .csv.gz files from the CRAN RStudio logs for 2013:

> system.time({
+   sapply(list.files(path, full.names = TRUE), function(f) {
+     con <- file(f, open = "r")
+     on.exit(close(con))
+     count_lines(con)
+   })
+ })
   user  system elapsed
 10.262   0.166  12.191

12 seconds, which is going to be small compared to parsing the data anyway.

I think it is worth paying that price rather than progressively growing the data. We might be able to divide these timings by using threads.
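For instance, the counting pass fans out naturally over the parallel package (forked worker processes rather than threads, so not available on Windows); a rough sketch reusing count_lines() from above:

library(parallel)

files <- list.files(path, full.names = TRUE)

# same per-file counting as above, spread over forked worker processes
n_lines <- unlist(mclapply(files, function(f) {
  con <- file(f, open = "r")
  on.exit(close(con))
  count_lines(con)
}, mc.cores = 4))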

romainfrancois commented on June 15, 2024

Making progress on read_csv_all. We can read all the .csv.gz files from 2013 into one huge data frame like this:

df_fastread <- read_csv_all(
  files,
  classes = c("Date", "Time", "integer", "string", "string", "string",
              "string", "string", "string", "integer"),
  header = TRUE
)

That took:

   user  system elapsed
 36.318   1.028  41.159

One alternative is to read the data frames into a list and fuse the list with dplyr::rbind_all. Perhaps something like this:

df_rbind_list <- lapply(files, function(file) {
  con <- file(file, open = "r")
  on.exit(close(con))

  read_csv(con,
    classes = c("Date", "Time", "integer", "string", "string", "string",
                "string", "string", "string", "integer"),
    header = TRUE, n = 0
  )
})
rbind_all(df_rbind_list)

This unsurprisingly takes longer:

   user  system elapsed
 51.126   8.359  66.532

One thing to notice is that the initial counting takes a significant amount of time:

   user  system elapsed
 10.242   0.101  10.366

It might be more efficient to use an unzip + mmap strategy so that we only have to uncompress once. At the moment, what I do is:

  • for each file, open a connection, count the number of lines, and close the connection
  • create the big data frame
  • for each file, open a connection and fill the appropriate chunk of the result data frame

So decompression of the data happens twice.

I don't think storing all the data in memory when it is read the first time really is an option, as we would have to do it for all the files, leading to a lot of memory use.

I'll experiment with alternative strategies later:

  • via a single temp file: decompress each input file into the same temp file, then read that file through the mmap source.
  • via several files: each file gets decompressed separately on disk.

The first option is simpler. The second option is easier to parallelise.
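A minimal base-R sketch of the first option, just to show the shape; the real version would do the decompression in C++ and hand the temp file to the mmap source, and the per-file headers have to be dropped when concatenating:

tmp <- tempfile(fileext = ".csv")
out <- file(tmp, open = "w")
for (i in seq_along(files)) {
  lines <- readLines(files[i])     # file() decompresses the .gz transparently
  if (i > 1) lines <- lines[-1]    # keep the header from the first file only
  writeLines(lines, out)
}
close(out)
# the mmap-based source can then read tmp as one big uncompressed csv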

hadley commented on June 15, 2024

Caching to disk would be especially important for url() connections, since the data might change between requests. (That might also allow you to combine url + gzfile, which would be really nice)
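A small sketch of that idea, assuming the remote files are .csv.gz like the CRAN log URLs downloaded by 1-download.r (cache_url() is a hypothetical helper, not a readr function):

cache_url <- function(u, cache_dir = tempdir()) {
  dest <- file.path(cache_dir, basename(u))
  if (!file.exists(dest)) {
    download.file(u, dest, mode = "wb")   # fetch once, keep the .gz on disk
  }
  dest
}

local_path <- cache_url("http://cran-logs.rstudio.com/2013/2013-01-01.csv.gz")
# local_path now behaves like any other .csv.gz file: it can be decompressed,
# mmap'ed, and re-read without hitting the server again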

hadley commented on June 15, 2024

Now that we have fast rbinding in dplyr, I don't think this is an important issue.
