filearray's Introduction

File-Backed Array for Out-of-memory Computation


Stores large arrays in files to avoid occupying large amounts of memory. Implemented with fast, gigabyte-level multi-threaded reading/writing via OpenMP. Supports multiple non-character data types (double, float, integer, complex, logical, and raw).

Speed comparisons with lazyarray (zstd-compressed out-of-memory array) and in-memory operations. The speed test was conducted on a MacBook Air (M1, 2020, 8GB RAM) with 8 threads. filearray is uniformly faster than lazyarray. Random access has almost the same speed as the native array operation in R. (The actual speed may vary depending on the storage type and memory size.)

Installation

install.packages("filearray")

Install Develop Version

The internal functions are written in C++. To avoid compiling the package, you can install from my personal repository, which is automatically updated every hour. Currently available on Windows and osx (Intel chip) only.

options(repos = c(
    dipterix = 'https://dipterix.r-universe.dev',
    CRAN = 'https://cloud.r-project.org'))

install.packages('filearray')

Alternatively, you can compile from the GitHub repository. This requires proper compilers (rtools on Windows, xcode-select --install on osx, or build-essential on Linux).

# install.packages("remotes")
remotes::install_github("dipterix/filearray")

Basic Usage

Create/load file array

library(filearray)
file <- tempfile()
x <- filearray_create(file, c(100, 100, 100, 100))

# load existing
x <- filearray_load(file)

See more: help("filearray")

Assign & subset array

x[,,,1] <- rnorm(1e6)
x[1:10,1,1,1]

Generics

typeof(x)
max(x, na.rm = TRUE)
apply(x, 3, min, na.rm = TRUE)

val <- x[1,1,5,1]
fwhich(x, val, arr.ind = TRUE)

See more: help("S3-filearray"), help("fwhich")

Map-reduce

Process segments of the array and reduce the results to save memory.

# Identical to sum(x, na.rm = TRUE)
mapreduce(x, 
          map = \(data){ sum(data, na.rm = TRUE) }, 
          reduce = \(mapped){ do.call(sum, mapped) })

See more: help("mapreduce")
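The idea behind map-reduce can be sketched outside R as well. The following is a minimal conceptual sketch in C++, not filearray's actual implementation: `chunked_map_reduce` and `chunked_sum` are hypothetical names, and the data lives in an ordinary in-memory vector purely for illustration. It shows how a summary is computed chunk by chunk and then combined, so a file-backed array would never need the whole array in memory at once.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Conceptual sketch only: NOT filearray's implementation.
// `map` summarizes one chunk; `reduce` combines the per-chunk summaries.
double chunked_map_reduce(
    const std::vector<double>& data,
    std::size_t chunk_size,
    const std::function<double(const double*, std::size_t)>& map,
    const std::function<double(const std::vector<double>&)>& reduce) {
  if (chunk_size == 0) chunk_size = 1;  // guard against an endless loop
  std::vector<double> mapped;
  for (std::size_t start = 0; start < data.size(); start += chunk_size) {
    const std::size_t n = std::min(chunk_size, data.size() - start);
    // With a file-backed array, only this chunk would have to be loaded.
    mapped.push_back(map(data.data() + start, n));
  }
  return reduce(mapped);
}

// Chunk-wise sum, analogous to the R example above.
double chunked_sum(const std::vector<double>& data, std::size_t chunk_size) {
  return chunked_map_reduce(
      data, chunk_size,
      [](const double* chunk, std::size_t n) {
        return std::accumulate(chunk, chunk + n, 0.0);
      },
      [](const std::vector<double>& mapped) {
        return std::accumulate(mapped.begin(), mapped.end(), 0.0);
      });
}
```

Calling `chunked_sum(values, 3)` on the values 1 to 10 processes four chunks (3, 3, 3, and 1 elements) and combines the four partial sums.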

Collapse

Transform data, and collapse (calculate sum or mean) along margins.

a <- x$collapse(keep = 4, method = "mean", transform = "asis")

# equivalent to
b <- apply(x[], 4, mean)

a[1] - b[1]

Available transforms for double/integer numbers are:

  • asis: no transform
  • 10log10: 10 * log10(v)
  • square: v * v
  • sqrt: sqrt(v)

For complex numbers, the transforms are slightly different:

  • asis: no transform
  • 10log10: 10 * log10(|x|^2) (power to decibel unit)
  • square: |x|^2
  • sqrt: |x| (modulus)
  • normalize: x / |x| (unit length)
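As a rough illustration of the complex-number formulas listed above, here is a small C++ sketch. The function names are hypothetical, and it works in double precision for simplicity; it only mirrors the formulas and is not the code filearray runs.

```cpp
#include <cmath>
#include <complex>

using cplx = std::complex<double>;

// 10log10: power in decibels, 10 * log10(|x|^2)
double transform_10log10(const cplx& x) { return 10.0 * std::log10(std::norm(x)); }

// square: squared modulus |x|^2 (std::norm returns the squared magnitude)
double transform_square(const cplx& x) { return std::norm(x); }

// sqrt: modulus |x|
double transform_sqrt(const cplx& x) { return std::abs(x); }

// normalize: x / |x|, a unit-length complex number (undefined for x == 0)
cplx transform_normalize(const cplx& x) { return x / std::abs(x); }
```

For x = 3 + 4i, the modulus is 5, the squared modulus is 25, and the normalized value is 0.6 + 0.8i.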

Notes

I. Notes on precision

  1. complex numbers: In native R, a complex number is a combination of two double numbers, real and imaginary (16 bytes in total). In filearray, the two parts are coerced to float, so each complex number is stored in 8 bytes. This conversion gains speed but loses precision at around the 8th significant digit. For example, 1.00000001 will be stored as 1, and 123456789 will be stored as 123456792 (the first 7 digits are accurate).

  2. float type: Native R does not have a float type; all numeric values are stored in double precision. Since float numbers use half the space, float arrays can be faster when hard drive speed is the bottleneck (see performance comparisons). However, coercing double to float comes at a cost: (a) float has less precision; (b) float has a smaller range ($3.4\times 10^{38}$) than double ($1.8\times 10^{308}$). Use it with caution when the data needs high precision or the maximum is very large.

  3. collapse function: when the data range is large (say x[[1]]=1, but x[[2]]=10^20), the collapse method might lose precision. This is because double only uses 8 bytes of memory. When calculating summations, R internally uses long double to prevent precision loss, but the current filearray implementation uses double, causing floating-point error at around the 16th significant digit.
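The effect of storing values as 4-byte floats can be checked with a few lines of C++. This is a minimal sketch assuming IEEE 754 single precision, which is what mainstream platforms use; `through_float` and `float_max` are hypothetical helper names, not part of filearray.

```cpp
#include <limits>

// Round-trips a double through 4-byte float storage, the same kind of
// coercion that happens when values are written to a float-type array.
double through_float(double value) {
  return static_cast<double>(static_cast<float>(value));
}

// Largest finite float, roughly 3.4e38; larger magnitudes cannot be stored.
double float_max() {
  return static_cast<double>(std::numeric_limits<float>::max());
}
```

With these helpers, `through_float(123456789.0)` returns 123456792 and `through_float(1.00000001)` returns exactly 1, because a float keeps only about 7 significant decimal digits.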

II. Cold-start vs warm-start

As of version 0.1.1, most file read/write operations have been switched from fopen to memory mapping, both to simplify the logic (buffer size, kernel cache...) and to boost the speed of writing and of some types of reading. While the speed of reading large blocks of data dropped from 2.4GB/s to 1.7GB/s, the writing speed rose from 300MB/s to 700MB/s, and the speed of randomly accessing small slices of data rose from 900MB/s to 2.5GB/s. As a result, some functions can reach very high speeds (close to in-memory calls) while using much less memory.

The additional performance improvements brought by the memory mapping approach might be affected by a "cold" start. When reading/writing files, most modern systems cache the files so that they can be loaded faster next time. I personally call the first access, before the files are cached, a cold start. Memory mapping has a little extra overhead during the cold start, resulting in decreased performance (but it's still fast). Accessing the same data after the cold start is called a warm start. When operating with warm starts, filearray is as fast as native R arrays (sometimes even faster due to the indexing method and fewer garbage collections). This means filearray reaches its best performance when the arrays are re-used.

III. Using traditional HDD?

filearray relies on SSDs, especially NVMe SSDs, which allow fast random access to disk addresses. If you use an HDD, filearray can provide only very limited improvement. One personal suggestion: if you are using a Windows machine, you can use software such as PrimoCache, which allows the computer to use RAM as an L2 cache for your files. For OSX, I believe the built-in system has a RAM cache for disk files.

If you use filearray to directly access an HDD, please set the number of threads to 1 via filearray::filearray_threads(1) at startup.

filearray's Issues

Submit to CRAN - filearray

  • Recompile and check spellings
  • Local check + run tests + lint + reverse deps
  • Submit to check on windows
  • Submit to check on Solaris
  • Submit to rhub for other checks
  • Update version, push to Github
  • Submit to CRAN
  • Release on Github

`fmap` trying to use uninitialized variables, causing undefined-behavior

Reported by clang-UBSAN from CRAN

> ### Name: fmap
> ### Title: Map multiple file arrays and save results
> ### Aliases: fmap fmap2 fmap_element_wise
> 
> ### ** Examples
> 
> 
> 
> set.seed(1)
> x1 <- filearray_create(tempfile(), dimension = c(100,20,3))
> x1[] <- rnorm(6000)
> x2 <- filearray_create(tempfile(), dimension = c(100,20,3))
> x2[] <- rnorm(6000)
> 
> # Add two arrays
> output <- filearray_create(tempfile(), dimension = c(100,20,3))
> fmap(list(x1, x2), function(input){
+     input[[1]] + input[[2]]
+ }, output)
save.cpp:50:28: runtime error: signed integer overflow: 4616189618054758400 * 2000 cannot be represented in type 'long'
    #0 0x7f17d676584c in FARR_subset_assign_sequential_bare(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, long const&, SEXPREC*, unsigned int, SEXPREC*, long) /data/gannet/ripley/R/packages/tests-clang-SAN/filearray/src/save.cpp:50:28
    #1 0x7f17d67532ac in FARR_buffer_map(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Rcpp::Function_Impl<Rcpp::PreserveStorage> const&, int const&, int) /data/gannet/ripley/R/packages/tests-clang-SAN/filearray/src/map.cpp:143:13
    #2 0x7f17d66d65b9 in _filearray_FARR_buffer_map /data/gannet/ripley/R/packages/tests-clang-SAN/filearray/src/RcppExports.cpp:273:34
    #3 0x557c1628e7f0 in R_doDotCall /data/gannet/ripley/R/svn/R-devel/src/main/dotcode.c:880:17
    #4 0x557c163f137f in bcEval /data/gannet/ripley/R/svn/R-devel/src/main/eval.c:7682:21
    #5 0x557c163d2855 in Rf_eval /data/gannet/ripley/R/svn/R-devel/src/main/eval.c:748:8
    #6 0x557c16436ad6 in R_execClosure /data/gannet/ripley/R/svn/R-devel/src/main/eval.c
    #7 0x557c164324d3 in Rf_applyClosure /data/gannet/ripley/R/svn/R-devel/src/main/eval.c:1844:16
    #8 0x557c163f71f0 in bcEval /data/gannet/ripley/R/svn/R-devel/src/main/eval.c:7094:12
    #9 0x557c163d2855 in Rf_eval /data/gannet/ripley/R/svn/R-devel/src/main/eval.c:748:8
    #10 0x557c16436ad6 in R_execClosure /data/gannet/ripley/R/svn/R-devel/src/main/eval.c
    #11 0x557c164324d3 in Rf_applyClosure /data/gannet/ripley/R/svn/R-devel/src/main/eval.c:1844:16
    #12 0x557c163d31b8 in Rf_eval /data/gannet/ripley/R/svn/R-devel/src/main/eval.c:871:12
    #13 0x557c164ffab6 in Rf_ReplIteration /data/gannet/ripley/R/svn/R-devel/src/main/main.c:262:2
    #14 0x557c16503140 in R_ReplConsole /data/gannet/ripley/R/svn/R-devel/src/main/main.c:314:11
    #15 0x557c16502f36 in run_Rmainloop /data/gannet/ripley/R/svn/R-devel/src/main/main.c:1192:5
    #16 0x557c16503282 in Rf_mainloop /data/gannet/ripley/R/svn/R-devel/src/main/main.c:1199:5
    #17 0x557c16093d8c in main /data/gannet/ripley/R/svn/R-devel/src/main/Rmain.c:29:5
    #18 0x7f17e6c2954f in __libc_start_call_main (/lib64/libc.so.6+0x2954f) (BuildId: 9c5863396a11aab52ae8918ae01a362cefa855fe)
    #19 0x7f17e6c29608 in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x29608) (BuildId: 9c5863396a11aab52ae8918ae01a362cefa855fe)
    #20 0x557c15fd1244 in _start (/data/gannet/ripley/R/R-clang-SAN/bin/exec/R+0x31b244)

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior save.cpp:50:28 in 
Reference class object of class "FileArray"
Mode: readwrite 
Dimension: 100x20x3 
# of partitions: 3 
Partition size: 1 
Storage type: double (internal size: 8)
Location: /tmp/RtmpXmNcyt/file36aaae68a9e2a2 

The bug is caused by the following code:

filearray/src/save.cpp

Lines 49 to 50 in c99c122

for(part_end = part_start; slice_idx2 > *cum_part; cum_part++, part_end++){}
skip_end = (*cum_part) * unit_partlen - (from + len);

When the buffer size exceeds the array length, slice_idx2 may exceed nparts (the number of partitions), and the pointer cum_part will go beyond the end of the vector (that is why *cum_part becomes 4616189618054758400).

The solution could be, instead of calculating skip_end, to simply replace the following code

filearray/src/save.cpp

Lines 78 to 81 in c99c122

write_len = part_nelem - read_start;
if( part == part_end ){
    write_len -= skip_end;
}

to:

write_len = part_nelem - read_start;

if(nwrite + write_len > len) {
    write_len = len - nwrite;
}
if(write_len <= 0) {
    break;
}
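To make the proposed clamping easier to follow, here is a self-contained sketch of the same idea. It is not the actual save.cpp code: `plan_partition_writes` is a hypothetical helper, it assumes `from` is non-negative, and it only computes how many elements would go to each partition. The point is that the write length is bounded by the number of elements still to be written, so no end-of-range offset has to be read from beyond the partition table.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: for a sequential write of `len` elements starting at
// global offset `from` (assumed >= 0), returns the number of elements written
// to each partition. `part_nelems[i]` is the element count of partition i.
std::vector<int64_t> plan_partition_writes(
    const std::vector<int64_t>& part_nelems, int64_t from, int64_t len) {
  std::vector<int64_t> plan(part_nelems.size(), 0);
  int64_t nwrite = 0;      // elements written so far
  int64_t part_begin = 0;  // global index of the partition's first element
  for (std::size_t part = 0; part < part_nelems.size(); ++part) {
    const int64_t part_nelem = part_nelems[part];
    const int64_t part_stop = part_begin + part_nelem;
    if (from + nwrite < part_stop) {  // write position is inside this partition
      const int64_t read_start = from + nwrite - part_begin;
      int64_t write_len = part_nelem - read_start;
      // Clamp to what is left, as in the proposed fix.
      if (nwrite + write_len > len) {
        write_len = len - nwrite;
      }
      if (write_len <= 0) {
        break;
      }
      plan[part] = write_len;
      nwrite += write_len;
    }
    part_begin = part_stop;
  }
  return plan;
}
```

With three partitions of 2000 elements each, writing 1000 elements from offset 1500 yields 500 elements in each of the first two partitions. A request longer than the array simply stops at the last partition instead of indexing past the end.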

One test fails on PowerPC (Fixed)

@dipterix Could you please look into this?

R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: powerpc-apple-darwin10.0.0d2 (32-bit)

> library(testthat)
> 
> library(filearray)
OpenMP not detected. Using single thread only.
> 
> test_check("filearray")
BIBINPUTS               .:.:/opt/local/Library/Frameworks/R.framework/Resources/share/texmf/bibtex/bib::/opt/local/Library/Frameworks/R.framework/Resources/share/texmf/bibtex/bib:
BSTINPUTS               .:.:/opt/local/Library/Frameworks/R.framework/Resources/share/texmf/bibtex/bst::/opt/local/Library/Frameworks/R.framework/Resources/share/texmf/bibtex/bst:
CCACHE_DIR              /opt/local/var/macports/build/.ccache
COLUMNS                 80
COMMAND_MODE            legacy
DEVELOPER_DIR           /Developer
DISPLAY                 /tmp/launch-zLBRd4/:0
DYLD_FALLBACK_LIBRARY_PATH
                        /opt/local/Library/Frameworks/R.framework/Resources/lib:/opt/local/Library/Frameworks/R.framework/Resources/lib:/opt/local/Library/Frameworks/R.framework/Resources/lib:/opt/local/Library/Frameworks/R.framework/Resources/lib:/opt/local/Library/Frameworks/R.framework/Resources/lib
DYLD_LIBRARY_PATH       /opt/local/lib/libgcc:/opt/local/lib/libgcc:/opt/local/lib/libgcc
EDITOR                  vi
HOME                    /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_R_R-filearray/R-filearray/work/.home
LANG                    en_US.UTF-8
LANGUAGE                C
LC_COLLATE              C
LINES                   24
LN_S                    ln -s
MAKE                    make
NO_PROXY                *.local,169.254/16
PAGER                   /opt/local/bin/less
PATH                    /opt/local/bin:/opt/local/sbin:/bin:/sbin:/usr/bin:/usr/sbin
PWD                     /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_R_R-filearray/R-filearray/work/filearray-0.1.5/filearray.Rcheck/tests
R_ARCH                  
R_BATCH                 
R_BROWSER               /usr/bin/open
R_BZIPCMD               /opt/local/bin/bzip2
R_CMD                   /opt/local/Library/Frameworks/R.framework/Resources/bin/Rcmd
R_DEFAULT_PACKAGES      
R_DOC_DIR               /opt/local/Library/Frameworks/R.framework/Resources/doc
R_ENVIRON               
R_ENVIRON_USER          
R_GZIPCMD               /opt/local/bin/gzip
R_HOME                  /opt/local/Library/Frameworks/R.framework/Resources
R_INCLUDE_DIR           /opt/local/Library/Frameworks/R.framework/Resources/include
R_LIBS                  /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_R_R-filearray/R-filearray/work/filearray-0.1.5/filearray.Rcheck
R_LIBS_SITE             /opt/local/Library/Frameworks/R.framework/Resources/site-library
R_LIBS_USER             /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_R_R-filearray/R-filearray/work/.home/Library/R/Power
                        Macintosh/4.3/library
R_OSTYPE                unix
R_PAPERSIZE             letter
R_PAPERSIZE_USER        letter
R_PDFVIEWER             /usr/bin/open
R_PLATFORM              powerpc-apple-darwin10.0.0d2
R_PRINTCMD              lpr
R_PROFILE               
R_PROFILE_USER          
R_RD4PDF                times,inconsolata,hyper
R_SESSION_TMPDIR        /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_R_R-filearray/R-filearray/work/.tmp/Rtmp8SJbon
R_SHARE_DIR             /opt/local/Library/Frameworks/R.framework/Resources/share
R_STRIP_SHARED_LIB      strip -x
R_STRIP_STATIC_LIB      strip -S
R_SYSTEM_ABI            macos,gcc,gxx,gfortran,gfortran
R_TESTS                 
R_TEXI2DVICMD           /opt/local/bin/texi2dvi
R_UNZIPCMD              /opt/local/bin/unzip
R_VERSION               4.3.1
R_ZIPCMD                /opt/local/bin/zip
SED                     /usr/bin/sed
SHLVL                   5
TAR                     /opt/local/bin/gtar
TESTTHAT                true
TESTTHAT_IS_CHECKING    true
TESTTHAT_PKG            filearray
TEXINPUTS               .:.:/opt/local/Library/Frameworks/R.framework/Resources/share/texmf/tex/latex::/opt/local/Library/Frameworks/R.framework/Resources/share/texmf/tex/latex:
TMPDIR                  /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_R_R-filearray/R-filearray/work/.tmp
USER                    root
_R_CHECK_INTERNALS2_    1
_R_CHECK_LICENSE_       TRUE
_R_CHECK_PACKAGE_NAME_
                        filearray
_R_SHLIB_BUILD_OBJECTS_SYMBOL_TABLES_
                        TRUE
__CF_USER_TEXT_ENCODING
                        0x0:0:0
[ FAIL 1 | WARN 0 | SKIP 3 | PASS 133 ]

══ Skipped tests (3) ═══════════════════════════════════════════════════════════
• On CRAN (3): 'test-collapse.R:87:5', 'test-collapse.R:198:5',
  'test-collapse.R:310:5'

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-map.R:52:5'): map arrays ─────────────────────────────────────
output[] (`actual`) not equal to `b` (`expected`).

actual vs expected
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ 470.729
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ -146.253
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ 732.162
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ 19.434
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ 13947.981
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ 330.581
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ 811.361
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ 715.993
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ -1089.565
- 1968084071671644348527897663031879476955617889699603688900237419077151061474164088289218211399543970587333969889174470151015299072781666953738662667618568086126156341729585445016368989335574697735956656806382217189919174243337515561254819725312.000
+ 2764.628
and 3371190 more ...

[ FAIL 1 | WARN 0 | SKIP 3 | PASS 133 ]
Error: Test failures
Execution halted

P. S. Generally speaking, the usual suspects in such cases are assumed little-endianness and assumed 64-bitness. Less common are 4-byte bool, the IBM format for long double (non-IEEE), and PPC-specific rounding.

Subset filearray into filearray proxy

The original issue: dipterix/lazyarray#3

Is it possible that subsetting a lazyarray again yields a lazyarray?

I am a bit puzzled whether I use your package correctly, e.g.

# `arr` from readme.md
inds <- arr > 0.5 # error
inds <- arr[] > 0.5

During this call, arr[] fully populates the memory, i.e. the whole lazy-aspect is gone?

Original reply:

Hi @chrisdane , the development for this package has been paused in favor of https://github.com/dipterix/filearray , a very similar package that offers better performance and more functions. This package (lazyarray) is still on CRAN because some of my old projects are still depending on it, but soon the migration will complete. I'm sorry for the inconvenience.

Back to your question. It's not straightforward to subset lazyarray/filearray in that way for now because I'm dealing with arrays with sizes of 10GB+. Your proposed operations might need to create a new array on disk. This could very easily fill up the hard disks if not carefully treated.

It's true that once you call [, the data will be loaded into memory, hence the "lazy" aspect goes away.

What I could do, however, is set up some lazily evaluated proxies. The proxies do not evaluate the arrays immediately. Instead, they only evaluate when you subset the arrays:

# No evaluation, inds is just a proxy array
inds <- arr > 0.5

# evaluates `arr>0.5` on the fly
inds[,,1]

# or 
arr[inds]

Does that resolve your problems?
