marshal: Unified API for Marshalling R Objects

Introduction

Some types of R objects can be used only in the R session in which they were created. If used as-is in another R process, such objects often result in an immediate error or in obscure and hard-to-troubleshoot outcomes. Because of this, they cannot be saved to file and re-used at a later time, nor can they be exported to a parallel worker when doing parallel processing. These objects are sometimes referred to as non-exportable or non-serializable objects. For example, assume we load an HTML document using the xml2 package:

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- xml2::read_html(file)

Next, imagine that we save this document object, doc, to file and quit R:

saveRDS(doc, "html.rds")
quit()

Then, if we try to use this saved xml2 object in another R session, we find that it does not work:

doc2 <- readRDS("html.rds")
xml2::xml_length(doc2)
#> Error in xml_length.xml_node(doc2) : external pointer is not valid

This is because xml2 objects only work in the R process that created them.

One solution to this problem is "marshalling": encoding the R object into an exportable representation, which can then be used in another R process to re-create a copy that imitates the original object.

The marshal package provides generic functions marshal() and unmarshal() for marshalling and unmarshalling R objects of certain classes. This makes it possible to save otherwise non-exportable objects to file and use them in a future R session, or to transfer them to another R process and use them there.

Proposed API

The long-term goal with this package is for it to provide a de-facto standard and API for marshalling and unmarshalling objects in R. To achieve this, this package proposes three generic functions:

  1. marshallable() - check whether an R object can be marshalled or not

  2. marshal() - marshal an R object

  3. unmarshal() - reconstruct a marshalled R object
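
As a rough illustration of how three such generics can fit together, the following base-R sketch defines toy S3 generics for a hypothetical file_reader class that wraps an open file connection. All names here (toy_marshallable(), toy_marshal(), toy_unmarshal(), file_reader()) are made up for this example. It is a sketch of the general idea, not how the marshal package is implemented.

```r
# Toy generics; the toy_ prefix marks them as illustrative, not the package API
toy_marshallable <- function(x, ...) UseMethod("toy_marshallable")
toy_marshal <- function(x, ...) UseMethod("toy_marshal")
toy_unmarshal <- function(x, ...) UseMethod("toy_unmarshal")

# By default, we do not know whether an object can be marshalled
toy_marshallable.default <- function(x, ...) NA

# Hypothetical class holding a connection, which is only valid in this session
file_reader <- function(path) {
  structure(list(path = path, con = file(path, open = "r")), class = "file_reader")
}

toy_marshallable.file_reader <- function(x, ...) TRUE

# Keep only what is needed to re-create the object: the file path
toy_marshal.file_reader <- function(x, ...) {
  structure(list(path = x$path), class = c("file_reader_marshalled", "toy_marshalled"))
}

# Re-open the file to get a working copy of the reader
toy_unmarshal.file_reader_marshalled <- function(x, ...) {
  file_reader(x$path)
}

# Round trip through an RDS file
txt <- tempfile(fileext = ".txt")
writeLines(c("alpha", "beta"), txt)
reader <- file_reader(txt)

rds <- tempfile(fileext = ".rds")
saveRDS(toy_marshal(reader), rds)
reader2 <- toy_unmarshal(readRDS(rds))
first_line <- readLines(reader2$con, n = 1L)

close(reader$con)
close(reader2$con)
```

The marshalled form contains only the path, so it is safe to write to disk; the connection itself is never saved.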

Returning to our xml2 object, the marshal package implements S3 marshal() methods for the different xml2 classes that take care of everything for us. We can use this when we save the object:

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- xml2::read_html(file)

saveRDS(marshal::marshal(doc), "html.rds")

quit()

Later, in another R session, we can reconstruct this xml2 HTML document by using:

doc2 <- marshal::unmarshal(readRDS("html.rds"))
xml2::xml_length(doc2)
#> [1] 2

Currently supported packages

In order to test the proposed solution and API, this package will implement S3 marshal() methods for some common R packages and their non-exportable classes. Note that the long-term goal is that these S3 methods should be implemented by these packages themselves, such that the marshal package will only provide a light-weight API.

The "A Future for R: Non-Exportable Objects" vignette has a collection of packages and classes that cannot be exported out of the box. This package has marshalling prototypes for objects from the following packages:

  • caret
  • data.table
  • keras
  • ncdf4
  • parsnip
  • raster
  • rstan
  • terra
  • xgboost
  • XML
  • xml2

It also has implementations that will throw an error for objects from the following packages, because they cannot be marshalled, at least not at the moment:

  • DBI
  • magick
  • parallel

The plan is to improve on this and add support for more R packages and object classes.

Installation

The marshal package is not yet on CRAN. In the meantime, it can be installed from R-universe with:

install.packages("marshal", repos = c("https://henrikbengtsson.r-universe.dev", getOption("repos")))

Pre-release version

To install the pre-release version that is available in Git branch develop on GitHub, use:

remotes::install_github("HenrikBengtsson/marshal", ref = "develop")

This will install the package from source.

marshal's People

Contributors

henrikbengtsson

marshal's Issues

Plan for CRAN release

Thanks for the work on this useful package! We are currently implementing something similar for the mlr3 ecosystem. Is there a plan for when this package will make it to CRAN so we could potentially build upon it?

Package 'marshal': serialize objects

https://en.m.wikipedia.org/wiki/Marshalling_(computer_science)

This package should provide an API for serializing and un-serializing R objects.

It should use S3 generic functions so it can be extended and customized downstream.

It could also have plug-in support for different types of serialization protocols.

Marshalling and de-marshalling come before serialization.

The API should also provide methods that compare objects to known accept and reject lists. It should be possible to update these lists too.

This package should also provide methods to scan for references such as external pointers.
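
A scanner of this kind could be sketched in base R as a recursive walk over list elements and attributes. find_external_pointers() below is a hypothetical helper written for this illustration, not a function from the marshal package, and it deliberately does not descend into environments or closures.

```r
# Hypothetical helper: return the paths at which external pointers are found
find_external_pointers <- function(x, path = "x") {
  hits <- character(0)

  if (typeof(x) == "externalptr") {
    hits <- c(hits, path)
  }

  # Attributes can carry external pointers too
  attrs <- attributes(x)
  for (name in names(attrs)) {
    attr_path <- sprintf('attr(%s, "%s")', path, name)
    hits <- c(hits, find_external_pointers(attrs[[name]], attr_path))
  }

  # Recurse into list elements; unclass() avoids dispatching on `[[` methods
  if (is.list(x)) {
    elements <- unclass(x)
    for (i in seq_along(elements)) {
      elem_path <- sprintf("%s[[%d]]", path, i)
      hits <- c(hits, find_external_pointers(elements[[i]], elem_path))
    }
  }

  hits
}

# A mock data frame with an external pointer stored in an attribute
ptr <- methods::new("externalptr")
tbl <- structure(list(Index = 1:3), class = "data.frame",
                 row.names = 1:3, problems = ptr)

found <- find_external_pointers(tbl, "tbl")
clean <- find_external_pointers(list(a = 1, b = list(c = "text")))
```

Here found holds the path of the offending attribute, while clean is empty because the plain list contains no external pointers.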

DESIGN: global state dict

When marshaling container objects I think it is important that a global hashmap is initialized by the top-level marshal() call, which can be used to preserve reference identities, i.e. by book-keeping which objects were already marshaled.

The code below (hopefully) illustrates this problem, which is a major challenge for R6.

library(marshal)

# this is our custom environment class generator for which we want to implement a custom marshaler
custom_env = function(data)  {
  e = new.env()
  e$data = data
  e$other_fn = function() {
    print("do some stuff")
  }
  class(e) = "custom_env"
  return(e)
}

# because the `$other_fn` can simply be re-created, we don't want to marshal it, i.e. we only have to marshal the `data` field.
registerS3method("marshal", "custom_env", function(x, ...) {
  structure(list(marshaled = x$data), class = c("custom_env_marshaled", "marshaled"))
})
registerS3method("unmarshal", "custom_env_marshaled", function(x, ...) {
  custom_env(x$marshaled)
})

# a problem then arises when the marshal method for container objects redirects work by calling marshal on its
# contents.
container = function(...) {
  structure(list(...), class = "container")
}

# Below, `marshal()` is applied to the same (identical) environment twice
registerS3method("marshal", "container", function(x, ...) {
  structure(list(marshaled = lapply(x, marshal)), class = c("container_marshaled", "marshaled"))
})

registerS3method("unmarshal", "container_marshaled", function(x, ...) {
  do.call(container, args = lapply(x[[1L]], unmarshal))
})

ce = custom_env(1)

cont = structure(list(a = ce, b = ce), class = "container")

cont_rec = unmarshal(marshal(cont))
identical(cont[[1]], cont[[2]])
#> [1] TRUE
identical(cont_rec[[1]], cont_rec[[2]])
#> [1] FALSE

Created on 2024-03-15 with reprex v2.0.2
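
One way to approach the bookkeeping described above is to thread a shared state object through a single top-level call. The base-R sketch below does this for plain environments that have a data field. marshal_with_state() and unmarshal_with_state() are hypothetical helpers written for this illustration, not functions from the marshal package, and the linear identical() search stands in for a real hashmap.

```r
# Hypothetical helper: marshal a list of environments, recording each
# distinct environment only once and emitting back-references for repeats
marshal_with_state <- function(objects) {
  state <- new.env()
  state$seen <- list()
  lapply(objects, function(e) {
    for (id in seq_along(state$seen)) {
      if (identical(state$seen[[id]], e)) {
        return(list(id = id))  # already marshalled: back-reference only
      }
    }
    id <- length(state$seen) + 1L
    state$seen[[id]] <- e
    list(id = id, data = e$data)
  })
}

# Hypothetical helper: rebuild the environments, reusing the one already
# built whenever a record refers to an id that has been seen before
unmarshal_with_state <- function(records) {
  state <- new.env()
  state$built <- list()
  lapply(records, function(rec) {
    if (rec$id <= length(state$built)) {
      return(state$built[[rec$id]])
    }
    e <- new.env()
    e$data <- rec$data
    state$built[[rec$id]] <- e
    e
  })
}

ce <- new.env()
ce$data <- 1
other <- new.env()
other$data <- 2

records <- marshal_with_state(list(a = ce, b = ce, c = other))
# The records are plain lists, so they survive serialization
records <- unserialize(serialize(records, NULL))
rec <- unmarshal_with_state(records)

same_ab <- identical(rec$a, rec$b)
same_ac <- identical(rec$a, rec$c)
```

Because ids are assigned in order of first appearance, a back-reference always follows the record that carries the data, which is what lets the unmarshal step work in a single pass.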

Add can_be_marshalled()

Add a function, can_be_marshalled(), that can be used to test if an object can be marshalled or not.

tibble: A `tbl` may contain an external pointer via attribute `problems` set by readr

A tbl may contain an external pointer via attribute problems, e.g.

spc_tbl_ [25,000 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Index        : num [1:25000] 1 2 3 4 5 6 7 8 9 10 ...
 $ Height_Inches: num [1:25000] 65.8 71.5 69.4 68.2 67.8 ...
 $ Weight_Pounds: num [1:25000] 113 136 153 142 144 ...
 - attr(*, "spec")=
  .. cols(
  ..   Index = col_double(),
  ..   Height_Inches = col_double(),
  ..   Weight_Pounds = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

DESIGN: Save package (versions) instead of unmarshal function

I think it might be worth changing the default marshal() implementation so that it does not store the unmarshal function in the marshalled object, but instead records the required packages (and their versions).
I.e. I am talking about this line here:

res[["unmarshal"]] <- unmarshal_default

This would mean that, instead of only having to implement the marshal.<class> method, one would have to implement both marshal.<class> and unmarshal.<class>. The unmarshal generic could then verify that the packages required to unmarshal the object are loaded (including the package that contains the unmarshal generic) before dispatching to the method.

The advantages of this approach would be:

  • Memory efficiency: saving the required packages (and versions) uses less space than saving the unmarshal function. Especially with the --with-keep.source option, the size of marshaled objects can explode with the current approach (admittedly, one could also remove the srcrefs manually to address the latter problem). This is somewhat reminiscent of the R6 problems we were facing in mlr3; our workaround there is similar to what I am suggesting here, i.e. to store the methods not alongside the objects but in the package.
  • One could even store the package versions to give better error messages and make this part of the standard, e.g. a compatibility matrix stored alongside the package implementing the (un)marshal method that can be consulted when calling (un)marshal. Here we would have to take care of what happens when an object written by package version A is read with package version B, where B < A: version B cannot know whether it can read the format written by version A, because A did not exist when B was released.

The disadvantages would be:

  • The package that implements the (un)marshal methods must be loaded, which is not the case right now.
    However, if this was to become a standard, the package that implements (un)marshal should usually also be the package that implements the actual functionality to do the marshaling.
  • With the current approach, the unmarshal function is guaranteed to be the one that was stored when the object was marshalled.
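
The package check proposed in this issue could look roughly like the base-R sketch below. record_packages() and check_packages() are hypothetical helpers written for this illustration, as is the layout of the marshalled list. The sketch only warns when an older version is installed; what should actually happen in that case is the open question raised above.

```r
# Hypothetical helper: record the versions of the packages needed to unmarshal
record_packages <- function(pkgs) {
  vapply(pkgs, function(pkg) as.character(utils::packageVersion(pkg)), character(1))
}

# Hypothetical helper: verify the recorded packages before dispatching
check_packages <- function(required) {
  for (pkg in names(required)) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop(sprintf("Package '%s' is required to unmarshal this object", pkg))
    }
    installed <- utils::packageVersion(pkg)
    if (installed < required[[pkg]]) {
      warning(sprintf("Object was marshalled with %s %s, but %s is installed",
                      pkg, required[[pkg]], as.character(installed)))
    }
  }
  invisible(TRUE)
}

# A marshalled object stores package names and versions instead of a function
marshalled <- list(
  payload = list(values = 1:3),
  packages = record_packages(c("stats", "utils"))
)

ok <- check_packages(marshalled$packages)
```

A named character vector of versions is only a few bytes, in contrast to a stored closure that may drag along its source references.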

DESIGN: `clone` / `inplace` argument

When marshaling objects that have reference semantics, having a clone / inplace parameter for the (un)marshal generics might be handy.

The pseudocode below illustrates a call to marshal() where cloning is not necessary, and a call to unmarshal() where it is necessary.

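g <- function() {
  # Pseudocode: f() and h() are placeholders, and `clone` is the proposed argument
  x_marshaled <- callr::r(function() {
    x <- f(...)
    # x is discarded when the subprocess exits, so no copy is needed
    marshal(x, clone = FALSE)
  })
  # x_marshaled is returned below, so it must not be modified in place
  x_unmarshaled <- unmarshal(x_marshaled, clone = TRUE)
  y <- h(x_unmarshaled)
  return(list(x_marshaled, y))
}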

To stay on the safe side, marshal methods for objects with reference semantics should always clone by default and not modify the object that is being marshaled in place. Because marshal() is often called right before sending the object to another process, it might be worth optimizing the special case where in-place modifications are allowed (or, more generally, where the object being marshaled is not used any further).
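
To make the trade-off concrete, here is a small base-R sketch of a marshal-like function with a clone argument, operating on a plain environment. toy_marshal_env() and the cache field are hypothetical and chosen only for this illustration. The copy is shallow, so any nested environments would still be shared.

```r
# Hypothetical marshal-like function with the proposed `clone` argument
toy_marshal_env <- function(x, clone = TRUE) {
  target <- if (clone) {
    # Shallow copy: new environment with the same bindings
    list2env(as.list.environment(x, all.names = TRUE), parent = parent.env(x))
  } else {
    x
  }
  # Drop a field that can be re-created on the other side
  if (exists("cache", envir = target, inherits = FALSE)) {
    rm("cache", envir = target)
  }
  target
}

obj <- new.env()
obj$data <- 1:3
obj$cache <- "expensive to ship, cheap to rebuild"

# Default: the original object keeps its cache
m1 <- toy_marshal_env(obj, clone = TRUE)
kept_after_clone <- exists("cache", envir = obj, inherits = FALSE)

# In place: no copy is made, but the original object is modified
m2 <- toy_marshal_env(obj, clone = FALSE)
kept_after_inplace <- exists("cache", envir = obj, inherits = FALSE)
```

With clone = TRUE the caller can keep using obj unchanged; with clone = FALSE the copy is avoided at the cost of stripping obj itself.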
