
scrapbook's Introduction


scrapbook


The scrapbook library records a notebook’s data values and generated visual content as "scraps". Recorded scraps can be read at a future time.

See the scrapbook documentation for more information on how to use scrapbook.

Use Cases

Notebook users may wish to record data produced during a notebook's execution. This recorded data, scraps, can be used at a later time or passed in a workflow to another notebook as input.

Namely, scrapbook lets you:

  • persist data and visual content displays in a notebook as scraps
  • recall any persisted scrap of data
  • summarize collections of notebooks

Python Version Support

This library's long term support target is Python 3.6+. It currently also supports Python 2.7 until Python 2 reaches end-of-life in 2020. After this date, Python 2 support will halt, and only 3.x versions will be maintained.

Installation

Install using pip:

pip install scrapbook

For installing optional IO dependencies, you can specify individual store bundles, like s3 or azure:

pip install scrapbook[s3]

or use all:

pip install scrapbook[all]

Models and Terminology

Scrapbook defines the following items:

  • scraps: serializable data values and visualizations such as strings, lists of objects, pandas dataframes, charts, images, or data references.
  • notebook: a wrapped nbformat notebook object with extra methods for interacting with scraps.
  • scrapbook: a collection of notebooks with an interface for asking questions of the collection.
  • encoders: a registered translator of data to/from notebook storage formats.

scrap model

The scrap model houses a few key attributes in a tuple, including:

  • name: The name of the scrap
  • data: Any data captured by the scrapbook api call
  • encoder: The name of the encoder used to encode/decode data to/from the notebook
  • display: Any display data used by IPython to display visual content
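
For illustration, a scrap recorded with the glue call described below (e.g. sb.glue("hello", "world")) and read back later exposes these attributes directly (a minimal sketch; the exact encoder name depends on which registered encoder scrapbook infers for a string):

import scrapbook as sb

nb = sb.read_notebook('notebook.ipynb')
scrap = nb.scraps["hello"]
scrap.name     # "hello"
scrap.data     # "world"
scrap.encoder  # e.g. "text" or "json", whichever encoder was inferred
scrap.display  # None, since no display output was recorded for this scrap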

API

Scrapbook adds a few basic api commands which enable saving and retrieving data including:

  • glue to persist scraps with or without display output
  • read_notebook reads one notebook
  • scraps provides a searchable dictionary of all scraps by name
  • reglue which copies a scrap from another notebook to the current notebook
  • read_notebooks reads many notebooks from a given path
  • scraps_report displays a report about collected scraps
  • papermill_dataframe and papermill_metrics for backward compatibility for two deprecated papermill features

The following sections provide more detail on these api commands.

glue to persist scraps

Records a scrap (data or display value) in the given notebook cell.

The scrap (recorded value) can be retrieved during later inspection of the output notebook.

"""glue example for recording data values"""
import scrapbook as sb

sb.glue("hello", "world")
sb.glue("number", 123)
sb.glue("some_list", [1, 3, 5])
sb.glue("some_dict", {"a": 1, "b": 2})
sb.glue("non_json", df, 'arrow')

The scrapbook library can be used later to recover scraps from the output notebook:

# read a notebook and get previously recorded scraps
nb = sb.read_notebook('notebook.ipynb')
nb.scraps

scrapbook infers the storage format from the value type, checked against the registered data encoders. Alternatively, the inferred encoding can be overridden by setting the encoder argument to the registered name (e.g. "json") of a particular encoder.
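
For example (a minimal sketch), the same dictionary can be glued with the inferred encoder or pinned explicitly by name:

# let scrapbook infer the encoder from the value type
sb.glue("inferred_dict", {"a": 1})

# or override the inference by naming a registered encoder explicitly
sb.glue("explicit_dict", {"a": 1}, encoder="json")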

This data is persisted by generating a display output with a special media type identifying the content encoding format and data. These outputs are not always visible in notebook rendering but still exist in the document. Scrapbook can then rehydrate the data associated with the notebook in the future by reading these cell outputs.

With display output

To display a named scrap with visible display outputs, you need to indicate that the scrap is directly renderable.

This can be done by toggling the display argument.

# record a UI message along with the input string
sb.glue("hello", "Hello World", display=True)

The call will save the data and the display attributes of the Scrap object, making it visible as well as encoding the original data. This leans on the IPython.core.formatters.format_display_data function to translate the data object into a display and metadata dict for the notebook kernel to parse.
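
When the output notebook is read back, both the encoded data and the captured display output are present on the scrap (a sketch; the exact display payload depends on how IPython formatted the object):

nb = sb.read_notebook('notebook.ipynb')
nb.scraps["hello"].data             # "Hello World"
nb.scraps["hello"].display["data"]  # e.g. {"text/plain": "'Hello World'"}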

Another pattern that can be used is to specify that only the display data should be saved, and not the original object. This is achieved by setting the encoder to be display.

# record an image without the original input object
sb.glue("sharable_png",
  IPython.display.Image(filename="sharable.png"),
  encoder='display'
)

Finally the media types that are generated can be controlled by passing a list, tuple, or dict object as the display argument.

sb.glue("media_as_text_only",
  media_obj,
  encoder='display',
  display=('text/plain',) # This passes [text/plain] to format_display_data's include argument
)

sb.glue("media_without_text",
  media_obj,
  encoder='display',
  display={'exclude': 'text/plain'} # forward to format_display_data's kwargs
)

Like data scraps, display scraps can be retrieved at a later time by accessing the scrap's display attribute, though usually one will just use the Notebook's reglue method (described below).
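
For the manual route, the stored display payload can be re-emitted with IPython's display machinery (a sketch assuming the scrap's display attribute holds the data/metadata dict captured at glue time):

from IPython.display import publish_display_data

import scrapbook as sb

nb = sb.read_notebook('notebook.ipynb')
scrap = nb.scraps["sharable_png"]

# re-emit the captured mime bundle in the current notebook
publish_display_data(
    data=scrap.display["data"],
    metadata=scrap.display.get("metadata", {}),
)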

read_notebook reads one notebook

Reads a Notebook object loaded from the location specified at path. You've already seen how this function is used in the above api call examples, but essentially this provides a thin wrapper over nbformat's NotebookNode with the ability to extract scrapbook scraps.

nb = sb.read_notebook('notebook.ipynb')

This Notebook object adheres to the nbformat JSON schema, allowing access to its required fields.

nb.cells # The cells from the notebook
nb.metadata
nb.nbformat
nb.nbformat_minor

There are a few additional methods provided, most of which are outlined in more detail below:

nb.scraps
nb.reglue

The abstraction also makes saved content available as a dataframe referencing each key and source. More of these methods will be made available in later versions.

# Produces a data frame with ["name", "data", "encoder", "display", "filename"] as columns
nb.scrap_dataframe # Warning: This might be a large object if data or display is large

The Notebook object also has a few legacy functions for backwards compatibility with papermill's Notebook object model. As a result, it can be used to read papermill execution statistics as well as scrapbook abstractions:

nb.cell_timing # List of cell execution timings in cell order
nb.execution_counts # List of cell execution counts in cell order
nb.papermill_metrics # Dataframe of cell execution counts and times
nb.papermill_record_dataframe # Dataframe of notebook records (scraps with only data)
nb.parameter_dataframe # Dataframe of notebook parameters
nb.papermill_dataframe # Dataframe of notebook parameters and cell scraps

The notebook reader relies on papermill's registered iorw to enable access to a variety of sources such as -- but not limited to -- S3, Azure, and Google Cloud.
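
Because of this, the path handed to read_notebook can itself be a remote URL; for example (an illustrative bucket path, with credentials handled by the underlying iorw handlers):

nb = sb.read_notebook('s3://bucket/key/prefix/to/notebook.ipynb')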

scraps provides a name -> scrap lookup

The scraps method allows for access to all of the scraps in a particular notebook.

nb = sb.read_notebook('notebook.ipynb')
nb.scraps # Prints a dict of all scraps by name

This object has a few additional methods as well for convenient conversion and execution.

nb.scraps.data_scraps # Filters to only scraps with `data` associated
nb.scraps.data_dict # Maps `data_scraps` to a `name` -> `data` dict
nb.scraps.display_scraps # Filters to only scraps with `display` associated
nb.scraps.display_dict # Maps `display_scraps` to a `name` -> `display` dict
nb.scraps.dataframe # Generates a dataframe with ["name", "data", "encoder", "display"] as columns

These methods allow for simple use-cases to not require digging through model abstractions.
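
For example, pulling just the recorded values out of a notebook is a single lookup (a sketch that assumes a scrap named "number" was glued as in the earlier examples):

nb = sb.read_notebook('notebook.ipynb')
results = nb.scraps.data_dict
results["number"]  # 123, without digging into the Scrap objects themselves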

reglue copies a scrap into the current notebook

Using reglue, one can take any scrap glue'd into one notebook and glue it into the current one.

nb = sb.read_notebook('notebook.ipynb')
nb.reglue("table_scrap") # This copies both data and displays

Any data or display information will be copied verbatim into the currently executing notebook as though the user called glue again on the original source.

It's also possible to rename the scrap in the process.

nb.reglue("table_scrap", "old_table_scrap")

Finally, if one wishes to attempt a reglue without first checking for existence, raise_on_missing can be set to False so that only a message is displayed on failure.

nb.reglue("maybe_missing", raise_on_missing=False)
# => "No scrap found with name 'maybe_missing' in this notebook"

read_notebooks reads many notebooks

Reads all notebooks located in a given path into a Scrapbook object.

# create a scrapbook named `book`
book = sb.read_notebooks('path/to/notebook/collection/')
# get the underlying notebooks as a list
book.notebooks # Or `book.values`

The path reuses papermill's registered iorw to list and read files from various sources, so non-local URLs can be loaded as well.

# create a scrapbook named `book`
book = sb.read_notebooks('s3://bucket/key/prefix/to/notebook/collection/')

The Scrapbook (book in this example) can be used to recall all scraps across the collection of notebooks:

book.notebook_scraps # Dict of shape `notebook` -> (`name` -> `scrap`)
book.scraps # merged dict of shape `name` -> `scrap`
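
As a sketch of combining these, the per-notebook mapping makes it easy to collect one named scrap from every notebook that recorded it (the scrap name "number" is illustrative):

numbers = {
    nb_name: scraps["number"].data
    for nb_name, scraps in book.notebook_scraps.items()
    if "number" in scraps
}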

scraps_report displays a report about collected scraps

The Scrapbook collection can be used to generate a scraps_report on all the scraps from the collection as a markdown structured output.

book.scraps_report()

This display can filter on scrap and notebook names, as well as enable or disable an overall header for the display.

book.scraps_report(
  scrap_names=["scrap1", "scrap2"],
  notebook_names=["result1"], # matches `/notebook/collections/result1.ipynb` pathed notebooks
  header=False
)

By default the report will only populate with visual elements. To also report on data elements set include_data.

book.scraps_report(include_data=True)

papermill support

Finally, scrapbook provides two backwards-compatible features for deprecated papermill capabilities:

book.papermill_dataframe
book.papermill_metrics

Encoders

Encoders are Encoder objects registered by key name against the encoders.registry object and looked up by that name. To register a new data encoder, simply call:

from scrapbook.encoders import registry as encoder_registry
# add encoder to the registry
encoder_registry.register("custom_encoder_name", MyCustomEncoder())

The encoder class must implement two methods, encode and decode:

class MyCustomEncoder(object):
    def encode(self, scrap):
        # scrap.data is any type, usually specific to the encoder name
        pass  # Return a `Scrap` with `data` type one of [None, list, dict, *six.integer_types, *six.string_types]

    def decode(self, scrap):
        # scrap.data is one of [None, list, dict, *six.integer_types, *six.string_types]
        pass  # Return a `Scrap` with `data` type as any type, usually specific to the encoder name

These methods transform a scrap into a JSON-serializable object representing its contents or location, and load that representation back into the original data object.
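
As a concrete illustration of this contract, here is a minimal sketch of a custom encoder that round-trips picklable objects through a base64 string. The class, its registered name, and the use of the Scrap namedtuple's _replace method are illustrative rather than part of scrapbook's built-in encoder set:

import base64
import pickle

from scrapbook.encoders import registry as encoder_registry


class Base64PickleEncoder(object):
    def encode(self, scrap):
        # serialize the original object into a JSON-safe string for notebook storage
        payload = base64.b64encode(pickle.dumps(scrap.data)).decode("utf-8")
        return scrap._replace(data=payload)

    def decode(self, scrap):
        # rebuild the original object from the stored string
        data = pickle.loads(base64.b64decode(scrap.data))
        return scrap._replace(data=data)


# register under a name, mirroring the registration call shown above
encoder_registry.register("b64pickle", Base64PickleEncoder())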

text

A basic string storage format that saves data as python strings.

sb.glue("hello", "world", "text")

json

sb.glue("foo_json", {"foo": "bar", "baz": 1}, "json")

pandas

sb.glue("pandas_df",pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}), "pandas")

papermill's deprecated record feature

scrapbook provides a robust and flexible recording schema. This library replaces papermill's existing record functionality.

Documentation for papermill record exists on ReadTheDocs. In brief, the deprecated record function:

pm.record(name, value): enables values to be saved with the notebook [API documentation]

pm.record("hello", "world")
pm.record("number", 123)
pm.record("some_list", [1, 3, 5])
pm.record("some_dict", {"a": 1, "b": 2})

pm.read_notebook(notebook): pandas could be used later to recover recorded values by reading the output notebook into a dataframe. For example:

nb = pm.read_notebook('notebook.ipynb')
nb.dataframe

Rationale for Papermill record deprecation

Papermill's record function was deprecated due to these limitations and challenges:

  • The record function didn't follow papermill's pattern of linear execution of a notebook. It was awkward to describe record as an additional feature of papermill, and really felt like describing a second less developed library.
  • Recording / Reading required data translation to JSON for everything. This is a tedious, painful process for dataframes.
  • Reading recorded values into a dataframe would result in unintuitive dataframe shapes.
  • Less modularity and flexibility than other papermill components where custom operators can be registered.

To overcome these limitations in Papermill, a decision was made to create Scrapbook.

scrapbook's People

Contributors

choldgraf, chyzzqo, hoangthienan95, koek67, mseal, rgbkrk, tanguycdls, tirkarthi, trallard, willingc


scrapbook's Issues

Not being able to access parameters when loading the whole directory

Hello, I just started using papermill and scrapbook recently and they are fantastic. I have a small issue now and I'm sorry if it's a little naive.

When I load a single notebook with scrapbook, I can access the parameters I use for papermill

import scrapbook as sb
nbs = sb.read_notebook('path/to/notebook')
nbs.parameters

However, when I load all the notebooks from a directory, I cannot find where the parameters for each notebook are.

book = sb.read_notebooks('result/')

Can I know how to access the parameters of each notebook in the latter case?

Combine scraps and snaps

Scraps vs snaps was confusing people. We'll combine them into just scraps and have some helper methods for filtering on scraps with display attributes.

Implement

pass # TODO: Implement
registry = DataTranslatorRegistry()
registry.register('unicode', UnicodeTranslator())
registry.register('json', JsonTranslator())
# registry.register('arrow', ArrowDataframeTranslator())


This issue was generated by todo based on a TODO comment in 7b3871a when #3 was merged. cc @MSeal.

Should snaps and scraps combine?

Today we have sketch saving snaps and glue saving scraps (see readme for detailed differences). I was thinking that the saved concepts could all be scraps while the glue and sketch methods determine the metadata associated with a scrap to indicate if it should be rendered or not.

This would require some careful functions for backwards compatibility with record methods as they were separated in papermill, but that doesn't mean we have to keep to the abstraction here.

Thoughts?

Remove papermill dependency

We should find a way to make scrapbook's I/O requirements not directly depend on papermill. Perhaps by optionally importing papermill IO if it's found and using a separate I/O registry if running without? The tradeoff between DRY and dependency isolation here needs a good conversation.

Highlight demo fails on Binder with KeyError: "kernel_manager_class"

When I run the highlight demo on binder I get this error in the third cell:

In [3]:  pm.execute_notebook('./highlight_dates.ipynb', './outcomes/highlight_dates_run_one.ipynb', new_dates_one);

Executing: 0%
0/13 [00:00<?, ?cell/s]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/traitlets/traitlets.py in get(self, obj, cls)
    527         try:
--> 528             value = obj._trait_values[self.name]
    529         except KeyError:

KeyError: 'kernel_manager_class'

During handling of the above exception, another exception occurred:

...
ImportError: cannot import name 'AsyncKernelManager' from 'jupyter_client' (/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_client/__init__.py)

Reverse operation for `glue`

I started building[1] on an example from the docs here, but wasn't able to find a method to remove these sample Scraps added with glue:

sb.glue("hello", "world")
sb.glue("number", 123)
sb.glue("some_list", [1, 3, 5])
sb.glue("some_dict", {"a": 1, "b": 2})
sb.glue("non_json", df, 'pandas')

Does a reverse operation to glue exist? If not, will it be possible to add it to enable ungluing of unwanted Scraps from notebooks?


[1] a modeling "metapipeline" composed of several serially dependent "classic" modeling pipelines, all executed (using papermill) on the basis of the same template notebook (factory) and producing a model whose MLflow run ID (generated inside the output notebook) will be passed to the next modeling stage using a Scrap glued to the output notebook (which requires much less boilerplate than Luigi by the way, many thanks!)

Not being able to extract scraps

Hello, I am trying to extract data after using papermill. I have multiple values within the notebook, but none of them are visible.

My script

import papermill as pm
import os
import scrapbook as sb

# Set up files
dir_name = os.path.dirname(os.path.realpath(__file__))
input = os.path.join(dir_name, "test.ipynb")
output = os.path.join(dir_name, "out.ipynb")

# Inject notebook settings
# pm.execute_notebook(
#     input,
#     output,
#     parameters=dict(msg="Bye")
# )

# Import scraps
# sb.glue("result", "world")
nb = sb.read_notebook(output)
# Extract parameters
print(nb.scrap_dataframe)
nb.reglue("result")
# print(nb.scraps.data_dict)

# Clean up not needed, because notebook will just be overridden

And this is the link to the notebook I ran (output.ipynb): https://pastebin.com/raw/mC0ybwy5

Cool project, however, some suggestions for improvements

  1. It is not obvious how to display a saved display (image)...
  2. There is an unnecessary ambiguity with the word 'display' here: it denotes the type of a scrap as well as the output flag used when saving.
  3. As I glue a display, it is always shown twice. plt.ioff() doesn't help.

Here is the snippet for displaying an image:

import base64
import io

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image


def display_image_scrap(nb, key):

    def stringToRGB(base64_string):
        imgdata = base64.b64decode(base64_string)
        image = Image.open(io.BytesIO(imgdata))
        return image

    image = stringToRGB(nb.scraps[key].display['data']['image/png'])
    imsize_inches = (np.array(image.size) / 70).astype('int8').tolist()

    fig = plt.figure(figsize=imsize_inches)
    ax = fig.add_axes([0, 0, 1, 1])
    ax.imshow(np.array(image), interpolation='antialiased')
    plt.axis('off')

due to (3) I had to disable image scraping for the moment. :(

Dev Docs

In conversation with @MSeal I raised an issue that as a new developer it is somewhat difficult to begin contributing to this project. Out of concerns around bus-factor and long term sustainability, we arrived at the idea that we should have better developer documentation.

This came up in part because of #37 and how much is changing as a result of introducing external reference based storage. I in part wouldn't know how to review it because I don't know how it relates to the current way of internally organizing all of the logic.

My proposal is that before making massive architectural changes, we should try to express how the pieces are intended to work together today. That way the changes can be described at a higher level that is distinct from the details of the implementation.

In the nteract meeting, @captainsafia suggested that we create this issue and tag @willingc who has been amazing for helping solidify efforts like these.

scrap.Scrap as a dataclass

Hey there!

While I was fiddling around with scrapbook, I ended up writing a dataclass for scrap.Scrap items.
I noticed in the source that you have a comment that says:

# dataclasses would be nice here...
Scrap = namedtuple("Scrap", ["name", "data", "encoder", "display"])
Scrap.__new__.__defaults__ = (None,)

So I figured you could be interested by the implementation I made! :)
See below:

import collections.abc
import dataclasses
from dataclasses import dataclass
from typing import Any

@dataclass
class Scrap(collections.abc.Mapping):
    name: str
    data: Any
    encoder: str
    display: str = None

    def keys(self):
        return self.__dataclass_fields__.keys() # pylint: disable=no-member

    def __getitem__(self, key: str) -> Any:
        if isinstance(key, str):
            if not hasattr(self, key):
                raise KeyError(key)
            return getattr(self, key)
        else:
            raise TypeError(f"Unsupported key type: {type(key)}")

    def __len__(self):
        return len(self.__dataclass_fields__.keys()) # pylint: disable=no-member

    def __iter__(self):
        return (attr for attr in dataclasses.astuple(self))

    def asdict(self) -> dict:
        return dataclasses.asdict(self)

    def astuple(self) -> tuple:
        return dataclasses.astuple(self)
  • keys, __getitem__, __len__ are for the Mapping protocol, to allow the Scrap to be converted to a dict using the ** operator ({**scrap})
  • __iter__ is to support unpacking the scrap
  • asdict, astuple are for convenience

Incompatible with JupyterLab

I have verified compatibility of the library in Jupyter Notebooks but for some reason when I use JupyterLab on the same kernel I receive the following:

In Cell:
import scrapbook as sb
dir(sb)

Out Cell:
['__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__path__',
'__spec__']

I am working in an Anaconda environment using the following installation method:

In Cell:
!pip install --upgrade nteract-scrapbook

0.3.0 Release

There have been a lot of merged PRs sitting around for a few months. Going to get a 0.3.0 release going soon and make the remote storage change a 0.4 release target (now that other repos are wrapping up releases). Changelog to follow in the next couple of days.

Encoder abstract class and `pickle`/`dill` encoders

@willingc @MSeal
It would be helpful to provide a base class (interface) for encoders so that people don't get confused when trying to implement new encoders. Also, I would like to have a "pickler" as one of the builtin encoders, since oftentimes I need to embed open matplotlib figures for rework later; the use cases aren't limited to matplotlib figures either - it's about picklable objects in general: users should have the freedom to save and restore any variable of their choice in interactive notebooks.

Implementation details follow

  • scrapbook_ext.py
import scrapbook as sb

import scrapbook.encoders
import scrapbook.scraps

import abc

# encoder class interface
class BaseEncoder(abc.ABC):
    def name(self):
        ...

    def encodable(self, data):
        ...

    def encode(self, scrap: sb.scraps.Scrap, **kwargs):
        ...

    def decode(self, scrap: sb.scraps.Scrap, **kwargs):
        ...

# pickle encoder
import base64
import pickle

import functools
# TODO ref https://stackoverflow.com/a/38755760
def pipeline(*funcs):
    return lambda x: functools.reduce(lambda f, g: g(f), list(funcs), x)


class PickleEncoder(BaseEncoder):
    ENCODER_NAME = 'pickle'

    def name(self):
        return self.ENCODER_NAME

    def encodable(self, data):
        # TODO
        return True

    def encode(self, scrap: sb.scraps.Scrap, **kwargs):
        _impl = pipeline(
            functools.partial(pickle.dumps, **kwargs),
            # NOTE .decode() makes sure its a UTF-8 string instead of bytes
            lambda x: base64.b64encode(x).decode()
        )
        return scrap._replace(
            data=_impl(scrap.data)
        )

    def decode(self, scrap: sb.scraps.Scrap, **kwargs):
        _impl = pipeline(
            base64.b64decode,
            functools.partial(pickle.loads, **kwargs)
        )
        return scrap._replace(data=_impl(scrap.data))


# TODO dill encoder
# NOTE dill has a function `.pickles` to check if an object is encodable. so `encodable` does not have to return `True` regardless of data like `PickleEncoder` does

def register():
    sb.encoders.registry.register(PickleEncoder())

Usage examples

  • notebook.ipynb
import scrapbook as sb
import scrapbook_ext as sb_ext
# register the encoder(s); currently required as the above implementation is a separate module
sb_ext.register()

import matplotlib.pyplot as plt

import numpy as np
import io
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots(figsize=(5, 3.5))
ax.plot(t, s)

ax.set(xlabel='time (s)', ylabel='voltage (mV)')
ax.grid()

# glue this figure to the notebook
sb.glue("figure:test", fig)
# sb.glue("figure:test", fig, encoder='pickle')
  • another_notebook.ipynb
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import scrapbook as sb
import scrapbook_ext as sb_ext
# register the encoder(s); currently required as the above implementation is a separate module
sb_ext.register()

nb = sb.read_notebook('notebook.ipynb')
# display the figure
nb.scraps['figure:test'].data

See also: Example project
https://github.com/oakaigh/scrapbook-ext

Operations on scraps

Hello, I'm using Papermill to create a directory with output notebooks. In those notebooks I have used sb.glue("Word Count", df_sum, "display") to glue a data frame, then sb.reglue in another notebook. So in this summary notebook I have reglue("Word Count") for each of the files in the directory, which displays ~70 of the same data frames run on different data using Papermill. I would like to take a sum of the values of a certain row across all of the data frames in this summary notebook that uses reglue. Is there a way you would suggest I go about doing this? I am having trouble because when I reglue I am unsure how I can actually "touch" the data in the reglued data frame to do an operation across many, such as the sum. Thanks in advance!

How do you add custom metadata as you would to display?

When I went to use scrapbook as a replacement for a display call I discovered that you can no longer use custom metadata.

For example, I used to do something like

    display(IPython.display.SVG(flag._repr_mimebundle_()['text/html']), metadata={"filename": flag_def['name']})

To dump the content of a VDOM object so that it could then be extracted automatically as an SVG named flag_def['name'] using nbconvert.

This is a particular use case that happens to reflect some of the aspects of how scrapbook could be used (since I can use this with papermill to change out the meaning of flag and flag_def parametrically).

However, it more generally points to the need for an escape hatch to attach additional metadata to the displayed object that should be associated with the displayed object in a persistent fashion.

If we think it would be helpful to add a metadata keyword arg to the glue function (that would just get passed through), I would be happy to make the contribution.

Pandas DataFrame example raises error "No encoder found "

Running this example from the docs:

import scrapbook as sb
import pandas as pd

df = pd.DataFrame({"a": [1,2], "b": [3,4]})

sb.glue("hello", "world")
sb.glue("number", 123)
sb.glue("some_list", [1, 3, 5])
sb.glue("non_json", df, encoder='pandas')
sb.glue("some_dict", {"a": 1, "b": 2})

Causes this error:

---------------------------------------------------------------------------
ScrapbookMissingEncoder                   Traceback (most recent call last)
<ipython-input-2-730353fe4323> in <module>()
      7 sb.glue("number", 123)
      8 sb.glue("some_list", [1, 3, 5])
----> 9 sb.glue("non_json", df, encoder='pandas')
     10 sb.glue("some_dict", {"a": 1, "b": 2})

2 frames
/usr/local/lib/python3.6/dist-packages/scrapbook/encoders.py in encode(self, scrap, **kwargs)
    110             raise ScrapbookMissingEncoder(
    111                 'No encoder found for "{data_type}" data type!'.format(
--> 112                     data_type=encoder
    113                 )
    114             )

ScrapbookMissingEncoder: No encoder found for "None" data type!

Here is a reproducible Colab Notebook demonstrating this issue.

What happened to pm.display?

I am trying to move us from an old version (0.18) to the most recent and a few things changed :)

The main one that I am struggling with is pm.display.
In our setup we would have a small notebook (ipynb) file, which through the execute_notebook parameters we would provide certain functions + parameters to run. This allows us to create functions that we can properly lint, unit test, etc. and 'inject' in a notebook that we can run with papermill. It's awesome.

These functions that we would inject were using pm.display and pm.record to display and record objects. Mostly plots and dataframes. It is unclear to me exactly how this is supposed to work now. I see that with scrapbook I can glue scraps, which looks like what I need. (Also has the display=True, which is nice) But this glue is more restrictive in what it can and cannot save, and so it can't be used for displaying everything. (e.g. multi index dataframes)

Any help would be appreciated. Thanks!

Problem with reading notebooks directly from S3 Buckets

I am trying to read notebooks in an S3 bucket located in a non-AWS, S3-compatible object store using:

import scrapbook as sb
book = sb.read_notebooks('s3://my-bucket/')

But I get an error saying my Access Key Id does not exist:

ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.

I can access the bucket contents from boto3 successfully.

Am I missing anything here in the configuration with respect to endpoint url?

Possible to use with figures / plots?

I'm playing around with using Scrapbook to take plots in one notebook and display them in another. However, doing so seems to generate an error that doesn't quite make sense to me.

Here's the code I'm using to generate the error:

data = np.random.randn(2, 100)
fig, ax = plt.subplots()
ax.scatter(*data, c=data[1])
sb.glue("fig1", fig, display=True)

This is generating the following validation error:

Scrap (name=fig1) contents do not conform to required type structures: <Figure size 432x288 with 1 Axes> is not of type 'object', 'array', 'boolean', 'string', 'number', 'integer'

I see the same behavior with things like Altair figures:

ch = alt.Chart(data=df).mark_point().encode(
    x='a',
    y='b'
)
sb.glue('altair', ch, display=True)

yields a similar error.

It's strange to me that the figures aren't validated as type "object", but either way, perhaps I am not using scrapbook properly here? Or perhaps I am trying to use scrapbook in a way that is not intended? Let me know if I should change something (and I'm happy to add an example to the docs or something).

Error message: Mime type unknown is not currently supported

I am trying the scrapbook tutorial.

import scrapbook as sb

sb.glue("hello", "world")
sb.glue("number", 123)
sb.glue("some_list", [1, 3, 5])
sb.glue("some_dict", {"a": 1, "b": 2})

This is the output I see

Mime type unknown is not currently supported.
Mime type unknown is not currently supported.
Mime type unknown is not currently supported.
Mime type unknown is not currently supported.

Add name

# TODO: Add name
def load_data(self, storage_type, scrap):
    """
    Finds the register for the given storage_type and loads the scrap into


This issue was generated by todo based on a TODO comment in 4bcfaab. It's been assigned to @MSeal because they committed the code.

Add complete data ref for basic data payload

We need an end-to-end working example for gluing a data reference and recalling it. A binder which enables this will dictate the API improvements needed to enable the functionality.

Logic around interpreting(…encoder=None, display=['any/mimetype'])

This is arising from some confusion I'm having around the current glue displaying api. I am likely to create a few issues like this… I'm guessing most of them will be closed without action.

# TODO: default to 'display' encoder when encoder is None and object is a display object type?
if display is None:
    display = encoder == "display"

The comment seems to say one thing is intended while the behaviour is otherwise: if display is not None and encoder is undefined, the scrap is assumed not to be a display.

However, if I'm defining a display type (so display would be not None but also would not evaluate to False), it would seem to be reasonable to encode it as a display as well (without needing to declare it as the display encoder).

So should the logic earlier include something like

if encoder is None and display:
    encoder = "display"

?

Scrap of type Int not supported

It seems that this doesn't work:

import scrapbook as sb
sb.glue('number', 123)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-876a51da60d5> in <module>
----> 1 sb.glue('number', 123)

~/minimamba3/envs/subnotebook-dev/lib/python3.8/site-packages/scrapbook/utils.py in wrapper(*args, **kwds)
     63         if not is_kernel():
     64             warnings.warn("No kernel detected for '{fname}'.".format(fname=f.__name__))
---> 65         return f(*args, **kwds)
     66 
     67     return wrapper

~/minimamba3/envs/subnotebook-dev/lib/python3.8/site-packages/scrapbook/api.py in glue(name, data, encoder, display, return_output)
     74     if not encoder:
     75         try:
---> 76             encoder = encoder_registry.determine_encoder_name(data)
     77         except NotImplementedError:
     78             if display is not None:

~/minimamba3/envs/subnotebook-dev/lib/python3.8/site-packages/scrapbook/encoders.py in determine_encoder_name(self, data)
     90             if encoder.encodable(data):
     91                 return name
---> 92         raise NotImplementedError(
     93             "Scrap of type {stype} has no supported encoder registered".format(stype=type(data))
     94         )

NotImplementedError: Scrap of type <class 'int'> has no supported encoder registered

Allow saving of dataframes

Surprised that (despite the documentation) support for dataframes doesn't seem to be available - according to the docs you can use the 'arrow' format, but in the code there are a couple of exceptions stating that arrow support is not currently available. I've just used the JSON datatype to save, but obviously not good for larger artifacts.

scrapbook does not handle string records from pm<1.0

I have notebooks that use the papermill.record API. For a smooth transition I want to switch to sb.read_notebook, which is supposed to handle records from both pm.record and sb.glue. I get the error below when there is a string record:

nb = sb.read_notebook('testpm09.ipynb')
nb.scraps
Traceback (most recent call last):
File "", line 1, in
File "[CONDAENVS]:\env\lib\site-packages\scrapbook\models.py", line 175, in scraps
self._scraps = self._fetch_scraps()
File "[CONDAENVS]:\env\lib\site-packages\scrapbook\models.py", line 145, in _fetch_scraps
output_data_scraps = self._extract_output_data_scraps(output)
File "[CONDAENVS]:\env\lib\site-packages\scrapbook\models.py", line 115, in _extract_output_data_scraps
scrap = self._extract_papermill_output_data(sig, payload)
File "[CONDAENVS]:\env\lib\site-packages\scrapbook\models.py", line 109, in _extract_papermill_output_data
return encoder_registry.decode(Scrap(name, data, encoder))
File "[CONDAENVS]:\env\lib\site-packages\scrapbook\encoders.py", line 91, in decode
return loader.decode(scrap, **kwargs)
File "[CONDAENVS]:\env\lib\site-packages\scrapbook\encoders.py", line 125, in decode
scrap = scrap.replace(data=json.loads(scrap.data))
File "[CONDAENVS]:\env\lib\json_init
.py", line 319, in loads
return _default_decoder.decode(s)
File "[CONDAENVS]:\env\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "[CONDAENVS]:\env\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
==================================
pm.record("string","foo")

Python 3 only release change

We need to change the setup.py to indicate the library is python 3.5+ only, and do a pass for any bytes/str checks in the code that should be simplified to the python 3 variant only.

We'll probably pin this to 0.4 release rather than doing a major release since this project is still a beta project without a 1.0 release yet.

All old versions have been dropped from pypi?

I noticed a new release went out today. Was it intended that all old versions were removed from pypi? My scripts started breaking because they can't find version 0.2.0.

I am going to try and update to the newest version. It was just a surprising break.


Consider dynamically importing IPython in scrapbook.api.glue

Hello!

Right now scrapbook (and papermill) incurs a fairly significant cost at import time (2 seconds) because of the transitive dependency on IPython. It would be great if this were incurred only if the code is actually executed. We are looking at things like unit test performance and larger codebases.

Totally understand if this is not super-high priority!

Allow filter arguments to select which notebooks to read by name

Hello, thanks a lot for the package! I have a small suggestion to improve the user experience.

When reading multiple notebooks from a folder, it would be nice to be able to filter which notebooks you want to open. In my use case our ipynb files are quite large (+1Mb), so reading a folder with a large number of notebooks can be slow.

def read_notebooks(path, filter_notebooks=None):
    """
    Returns a Scrapbook including the notebooks read from the
    directory specified by `path`.

    Parameters
    ----------
    path : str
        Path to directory containing notebook `.ipynb` files.
    filter_notebooks: function
       Functions used by filter to filter out notebooks by their name

    Returns
    -------
    scrapbook : object
        A `Scrapbook` object.

    """
    scrapbook = Scrapbook()
    for notebook_path in sorted(filter(filter_notebooks, list_notebook_files(path))):
        fn = os.path.splitext(os.path.basename(notebook_path))[0]
        scrapbook[fn] = read_notebook(notebook_path)
    return scrapbook

My proposal would be a simple filter here in the code. I can open a PR if you want.

If you have a better idea to load our notebooks faster I would be happy to hear any suggestions, but given the nbformat API I'm not sure we can load scrapbook metadata any faster?

Issue with scrapbook glue in R ipynb:

Hi,

I am running scrapbook to persist a pandas dataframe from an R ipynb using the code below.
I created an identical Python notebook that works; the R version just seems to break as shown below.
Any ideas? I would love to run my R and Python notebooks with papermill and
persist the return values with scrapbook.

------------- Notebook using the R ipynb kernel and reticulate to seamlessly use Python ---------

library("reticulate")
use_condaenv("kitchen-nb")
pd <- import("pandas")
py_config()
sb <- import("scrapbook", convert = T)

python: /Users/jacques/opt/anaconda3/envs/kitchen-nb/bin/python
libpython: /Users/jacques/opt/anaconda3/envs/kitchen-nb/lib/libpython3.7m.dylib
pythonhome: /Users/jacques/opt/anaconda3/envs/kitchen-nb:/Users/jacques/opt/anaconda3/envs/kitchen-nb
version: 3.7.0 | packaged by conda-forge | (default, Nov 12 2018, 12:34:36) [Clang 4.0.1 (tags/RELEASE_401/final)]
numpy: /Users/jacques/opt/anaconda3/envs/kitchen-nb/lib/python3.7/site-packages/numpy
numpy_version: 1.21.6
pandas: /Users/jacques/opt/anaconda3/envs/kitchen-nb/lib/python3.7/site-packages/pandas

start <- 1
end <- 10
result <- iris[start:end, ]
result <- r_to_py(result) # will convert R data frame to pandas data frame... 
sb$glue("result", result, encoder <- "pandas")
result

--------- Then later from Python Code I am trying ------------

nb = scrapbook.read_notebook("mnt/wolfstack/notebooks/output-notebook-1-5.ipynb")
nb.scraps

It returns: (nothing)
Scraps()

Jacques

What happens when running `glue` from an ipython console

It is convenient to use an ipython console to quickly look for scraps using sb.read_notebooks, so I wondered what would happen if I try glue outside a notebook environment like this.

When I tried this, there was no error message or anything so my question is:

Should there be an error message? Does anything happen? Should something happen?
