Comments (7)
Run 1-download.r from https://github.com/hadley/cran-logs-dplyr to see this bottleneck in a real problem.
Initially, I can come up with something if I can assume:
- that the number of rows in each file is known
- that all the files have a compatible format
I can bring some of the logic from dplyr::rbind_list (e.g. the Promoter, etc.) to handle type promotion.
When we know the size in advance and there is no type promotion, this is likely to benefit from parallelization.
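For illustration, here is a minimal sketch of the kind of pairwise promotion rule such a Promoter encodes (not dplyr's actual implementation, which lives in C++; the type ladder is an assumption):

# Promote a pair of column types along a
# logical < integer < double < character ladder (assumed for illustration).
promote_type <- function(a, b) {
  ladder <- c("logical", "integer", "double", "character")
  ladder[max(match(a, ladder), match(b, ladder))]
}
promote_type("integer", "double")  # "double"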
I can see this becoming a theme across projects we do together (at least fastread and dplyr). I'll make sure to study the parallel package and perhaps its ancestors (multicore), which probably have interesting nuggets.
Yes, I think that sounds like a reasonable approach, and parallelisation is definitely going to be key.
One thing to consider is that there is a limit of 128 connections, so we cannot open as many connections as there are files participating in this.
I think it is unlikely that, when we do this (reading from multiple files), we will already know the number of lines in each file. So we'll have to do it in steps:
- count the number of lines for each file
- allocate a big enough data frame
- fill the data
This might mean we have to run the show from R (have the loop on the R side).
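A rough sketch of those three steps driven from R, with count_lines() as used below; the two illustrative columns and the use of read.csv() as a stand-in for the fast parser are assumptions:

# Step 1: count the lines in each file (one header line per file assumed).
n <- vapply(files, function(f) {
  con <- file(f, open = "r")
  on.exit(close(con))
  count_lines(con)
}, numeric(1))

# Step 2: allocate a big enough data frame (columns are illustrative).
total <- sum(n) - length(files)
res <- data.frame(date = character(total), size = integer(total),
                  stringsAsFactors = FALSE)

# Step 3: fill the appropriate chunk of the result for each file.
offset <- 0L
for (f in files) {
  d <- read.csv(f, colClasses = c("character", "integer"))
  res[offset + seq_len(nrow(d)), ] <- d
  offset <- offset + nrow(d)
}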
As an aside, here is counting the number of lines for all the .csv.gz files from the CRAN RStudio logs for 2013:
system.time({
  sapply(list.files(path, full.names = TRUE), function(.) {
    con <- file(., open = "r")
    on.exit(close(con))
    count_lines(con)
  })
})
   user  system elapsed
 10.262   0.166  12.191
12 seconds, which is going to be small compared to parsing the data anyway.
I think it is worth paying that price rather than progressively growing the data. We might be able to divide these timings when using threads.
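For instance, something along these lines using the parallel package mentioned above might divide that timing; mc.cores = 4 is an arbitrary choice:

# Count lines for each file in parallel.
# mclapply() forks, so this is POSIX-only (no Windows).
library(parallel)
counts <- mclapply(list.files(path, full.names = TRUE), function(f) {
  con <- file(f, open = "r")
  on.exit(close(con))
  count_lines(con)
}, mc.cores = 4)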
Making progress on read_csv_all. From all the .csv.gz files from 2013, we can read them into a huge data frame like this:
df_fastread <- read_csv_all(files,
  classes = c("Date", "Time", "integer", "string", "string", "string",
              "string", "string", "string", "integer"),
  header = TRUE
)
That took:
   user  system elapsed
 36.318   1.028  41.159
One alternative is to read the data frames into a list and fuse the list with dplyr::rbind_all. Perhaps something like this:
df_rbind_list <- lapply(files, function(file) {
  con <- file(file, open = "r")
  on.exit(close(con))
  read_csv(con,
    classes = c("Date", "Time", "integer", "string", "string", "string",
                "string", "string", "string", "integer"),
    header = TRUE, n = 0
  )
})
rbind_all(df_rbind_list)
This unsurprisingly takes longer:
   user  system elapsed
 51.126   8.359  66.532
One thing to notice is that the initial counting takes a significant amount of time:
   user  system elapsed
 10.242   0.101  10.366
It might be more efficient to use an unzip + mmap strategy so that we only have to uncompress once. At the moment what I use is:
- for each file, open a connection, count the number of lines, close the connection
- create the big data frame
- for each file, open a connection, and fill the appropriate chunk of the result data frame
So decompression of the data happens twice.
I don't think storing all the data in memory when it is read the first time is really an option, as we would have to do it for all the files, leading to a lot of memory use.
I'll experiment with alternative strategies later:
- via a single file: essentially decompressing each input file into the same temp file, and then just reading this file through the mmap source.
- via several files: each file gets decompressed separately on disk.
The first option is simpler. The second option is easier to parallelise.
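The first option could look roughly like this (a sketch; per-file header lines would need handling, which is glossed over here):

# Decompress every input into one shared temp file,
# so the result can be parsed in a single pass through the mmap source.
tmp <- tempfile(fileext = ".csv")
out <- file(tmp, open = "wb")
for (f in files) {
  con <- gzfile(f, open = "rb")
  repeat {
    chunk <- readBin(con, what = "raw", n = 1e6)
    if (length(chunk) == 0L) break
    writeBin(chunk, out)
  }
  close(con)
}
close(out)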
Caching to disk would be especially important for url() connections, since the data might change between requests. (That might also allow you to combine url + gzfile, which would be really nice.)
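For example (the URL follows the public CRAN log naming pattern, and classes is the same vector as above; both are assumptions here):

# Fetch the bytes exactly once, then let the gzfile machinery take over.
url <- "http://cran-logs.rstudio.com/2013/2013-12-31.csv.gz"
tmp <- tempfile(fileext = ".csv.gz")
download.file(url, tmp, mode = "wb")
con <- gzfile(tmp, open = "r")
df <- read_csv(con,
  classes = c("Date", "Time", "integer", "string", "string", "string",
              "string", "string", "string", "integer"),
  header = TRUE
)
close(con)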
Now that we have fast rbinding in dplyr, I don't think this is an important issue.