
bw_processing's Introduction

bw-processing

Library for storing numeric data for use in matrix-based calculations. Designed for use with the Brightway life cycle assessment framework.


Read the documentation at https://bw-processing.readthedocs.io/


Table of Contents

  • Background
  • Concepts
  • Policies
  • Install
  • Usage
  • Contributing
  • Maintainers
  • License

Background

For years, the Brightway LCA framework has stored the data used to construct matrices in binary form, as numpy arrays. This package is an evolution of that approach, and adds the following features:

  • Consistent names for row and column fields. Previously, these changed for each matrix, to reflect the role each row or column played in the model. Now they are always the same for all arrays ("row" and "col"), making the code simpler and easier to use.
  • Provision of metadata. Numpy binary files are only data - bw_processing also produces a metadata file following the data package standard. Things like data license, version, and unique id are now explicit and always included.
  • Support for vector and array data. Vector (i.e. only one possible value per input) and array (i.e. many possible values, also called presamples) data are now both natively supported in data packages.
  • Portability. Processed arrays can include metadata that allows for reindexing on other machines, so that processed arrays can be distributed and reused. Before, this was not possible, as integer IDs were randomly assigned on each computer, and would be different from machine to machine or even across Brightway projects.
  • Dynamic data sources. Instead of requiring that data for matrix construction be present and saved on disk, it can now be generated dynamically, either through code running locally or on another computer system. This is a big step towards embedding life cycle assessment in a web of environmental models.
  • Use PyFilesystem2 for file IO. The use of this library allows for data packages to be stored on your local computer, or on many logical or virtual file systems.
  • Simpler handling of numeric values whose sign should be flipped. Sometimes it is more convenient to specify positive numbers in dataset definitions, even though such numbers should be negative when inserted into the resulting matrices. For example, in the technosphere matrix in life cycle assessment, products produced are positive and products consumed are negative, though both values are given as positive in datasets. Brightway used to use a type mapping dictionary to indicate which values in a matrix should have their sign flipped after insertion. Such mapping dictionaries are brittle and inelegant. bw_processing uses an optional boolean vector, called flip, to indicate whether values should be flipped (see the short sketch after this list).
  • Separation of uncertainty distribution parameters from other data. Fitting data to a probability density function (PDF), or an estimate of such a PDF, is only one approach to quantitative uncertainty analysis. We would like to support other approaches, including direct sampling from real data. Therefore, uncertainty distribution parameters are stored separately, only loaded if needed, and are only one way to express quantitative uncertainty.
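
As a toy illustration of the flip idea (plain Numpy, not the library internals), a boolean flip vector simply negates the marked values on insertion:

import numpy as np

data = np.array([1.0, 2.0, 3.0])        # values as given in the datasets
flip = np.array([False, True, False])   # True: flip the sign on insertion
signed = np.where(flip, -data, data)    # values as they end up in the matrix
print(signed)                           # [ 1. -2.  3.]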

Concepts

Data packages

Data objects can be vectors or arrays. Vectors will always produce the same matrix, while arrays have multiple possible values for each element of the matrix. Arrays are a generalization of the presamples library.
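
For intuition, the shapes involved look roughly like this (a sketch, not library code):

import numpy as np

# A vector resource: exactly one value per matrix element (three elements here),
# so matrix construction always produces the same numbers.
vector = np.array([1.0, 2.5, 4.0])                 # shape (3,)

# An array resource: five possible values (columns) for the same three elements;
# one column is selected each time a matrix is built or iterated.
array = np.random.default_rng(42).random((3, 5))   # shape (3, 5)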

Data needed for matrix construction

Vectors versus arrays

Persistent versus dynamic

Persistent data is fixed, and can be completely loaded into memory and used directly or written to disk. Dynamic data is only resolved as the data is used, during matrix construction and iteration. Dynamic data is provided by interfaces - Python code that either generates the data, or wraps data coming from other software. There are many possible use cases for data interfaces, including:

  • Data that is provided by an external source, such as a web service
  • Data that comes from an infinite Python generator
  • Data from another programming language
  • Data that needs processing steps before it can be directly inserted into a matrix

Only the actual numerical values entered into the matrix are dynamic - the matrix index values (and the optional flip vector) are still static, and need to be provided as Numpy arrays when adding dynamic resources.

Interfaces must implement a simple API. Dynamic vectors must support the Python generator API, i.e. implement __next__().

Dynamic arrays must pretend to be Numpy arrays, in that they need to implement .shape and .__getitem__(args).

  • .shape must return a tuple of two integers. The first should be the number of elements returned, though this is not used. The second should be the number of columns available - an integer. This second value can also be None, if the interface is infinite.
  • .__getitem__(args) must return a one-dimensional Numpy array corresponding to the column args[1]. This method is called when one uses code like some_array[:, 20]. In our case, we will always take all rows (the :), so the first value can be ignored.

Here are some example interfaces (also given in bw_processing/examples/interfaces.py):

import numpy as np


class ExampleVectorInterface:
    def __init__(self):
        self.rng = np.random.default_rng()
        self.size = self.rng.integers(2, 10)

    def __next__(self):
        # Generator API: return a fresh random vector of fixed length on each call
        return self.rng.random(self.size)


class ExampleArrayInterface:
    def __init__(self):
        rng = np.random.default_rng()
        # Rows are matrix elements, columns are possible sets of values
        self.data = rng.random((rng.integers(2, 10), rng.integers(2, 10)))

    @property
    def shape(self):
        return self.data.shape

    def __getitem__(self, args):
        # args is (row slice, column index); only the column index is used
        if args[1] >= self.shape[1]:
            raise IndexError
        return self.data[:, args[1]]
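
As a hedged sketch of how such an interface might be attached to a datapackage as a dynamic resource (the names create_datapackage, add_dynamic_vector, and INDICES_DTYPE are taken from the bw_processing API, but exact signatures may differ between versions - check the documentation):

import numpy as np
import bw_processing as bwp

# In-memory datapackage; the interface supplies values, the indices stay static.
dp = bwp.create_datapackage(name="dynamic-example")

interface = ExampleVectorInterface()

# One (row, col) pair per value produced by the interface.
indices = np.array(
    [(i, 0) for i in range(interface.size)], dtype=bwp.INDICES_DTYPE
)

dp.add_dynamic_vector(
    matrix="biosphere_matrix",
    interface=interface,
    indices_array=indices,
    name="example-dynamic-resource",
)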

Interface dehydrating and rehydrating

Serialized datapackages cannot contain executable code, both because of our chosen data formats, and for security reasons. Therefore, when loading a datapackage with an interface, that interface object needs to be reconstituted as Python code - we call this cycle dehydration and rehydration. Dehydration happens automatically when a datapackage is finalized with finalize_serialization(), but rehydration needs to be done manually using rehydrate_interface(). For example:

from fsspec.implementations.zip import ZipFileSystem
from bw_processing import load_datapackage

my_dp = load_datapackage(ZipFileSystem("some-path.zip"))
my_dp.rehydrate_interface("some-resource-name", ExampleVectorInterface())

You can list the dehydrated interfaces present with .dehydrated_interfaces().

You can store useful information for the interface object initialization under the resource key config. This can be used in instantiating an interface if you pass initialize_with_config:

from fsspec.implementations.zip import ZipFileSystem
from bw_processing import load_datapackage
import requests
import numpy as np


class MyInterface:
    def __init__(self, url):
        self.url = url

    def __next__(self):
        return np.array(requests.get(self.url).json())


my_dp = load_datapackage(ZipFileSystem("some-path.zip"))
data_obj, resource_metadata = my_dp.get_resource("some-interface")
print(resource_metadata['config'])
>>> {"url": "example.com"}

my_dp.rehydrate_interface("some-interface", MyInterface, initialize_with_config=True)
# interface is substituted, need to retrieve it again
data_obj, resource_metadata = my_dp.get_resource("some-interface")
print(data_obj.url)
>>> "example.com"

Policies

Data package policies define how the data should be used. Policies apply to the entire data package; you may wish to adjust what is stored in which data packages to get the effect you desire.

There are two policies that apply to all data resources:

sum_intra_duplicates (default: True): What to do if more than one data point for a given matrix element is given within a single vector or array resource. If true, sum these values; otherwise, the last value provided is used.

sum_inter_duplicates (default: False): What to do if data from a given resource overlaps data already present in the matrix. If true, add the given value to the existing value; otherwise, the existing value will be overwritten.
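
As a toy illustration of sum_intra_duplicates (plain Numpy, not the actual matrix builder): two data points target element (0, 0) of a matrix.

import numpy as np

rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
values = np.array([1.0, 2.0, 5.0])

# Behaviour analogous to sum_intra_duplicates=True: duplicate points are summed.
summed = np.zeros((2, 2))
np.add.at(summed, (rows, cols), values)
print(summed[0, 0])     # 3.0

# Behaviour analogous to sum_intra_duplicates=False: the last value provided wins.
last_wins = np.zeros((2, 2))
last_wins[rows, cols] = values
print(last_wins[0, 0])  # 2.0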

There are three policies that apply only to array data resources, where a different column from the array is used in matrix construction each time the array is iterated over:

combinatorial (default: False): If more than one array resource is available, this policy controls whether all possible combinations of columns are guaranteed to occur. If combinatorial is True, we use itertools.combinations to generate column indices for the respective arrays; if False, column indices are either completely random (with replacement) or sequential.

Note that you will get StopIteration if you exhaust all combinations when combinatorial is True.

Note that combinatorial cannot be True if infinite array interfaces are present.

sequential (default: False): Array resources have multiple columns, each of which represents a valid system state. The default behaviour is to choose from these columns at random (with replacement), using an RNG and the data package seed value. If sequential is True, columns in each array will be chosen in order starting from column zero, rewinding to zero if the end of the array is reached.

Note that if combinatorial is True, sequential is ignored; instead, the column indices are generated by itertools.combinations.

Please make sure you understand how combinatorial and sequential interact! There are three possibilities:

  • combinatorial and sequential are both False. Columns are returned at random.

  • combinatorial is False, sequential is True. Columns are returned in increasing numerical order without any interaction between the arrays.

  • combinatorial is True, sequential is ignored: Columns are returned in increasing order, such that all combinations of the different array resources are provided. StopIteration is raised if you try to consume additional column indices.
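
Purely as an illustration of these three cases (this is not the library's implementation; the column counts, and the use of itertools.product here to enumerate every pairing across two hypothetical array resources, are assumptions for the sketch):

from itertools import product

import numpy as np

ncols_a, ncols_b = 2, 3           # columns in two hypothetical array resources
rng = np.random.default_rng(42)   # seeded RNG, as with a data package seed

# combinatorial=False, sequential=False: independent random draws with replacement.
random_pairs = [(rng.integers(ncols_a), rng.integers(ncols_b)) for _ in range(4)]

# combinatorial=False, sequential=True: each array counts up on its own, rewinding at the end.
sequential_pairs = [(i % ncols_a, i % ncols_b) for i in range(4)]

# combinatorial=True: every pairing occurs exactly once; afterwards, StopIteration.
combinatorial_pairs = list(product(range(ncols_a), range(ncols_b)))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]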

Install

Install using pip or conda (channel cmutel). Depends on numpy and pandas (for reading and writing CSVs).

Has no explicit or implicit dependence on any other part of Brightway.

Usage

The main interface for using this library is the Datapackage class. However, instead of creating an instance of this class directly, you should use the utility functions create_datapackage and load_datapackage.

A datapackage is a set of file objects (either in-memory or on disk) that includes a metadata file object and one or more data resource file objects. The metadata file object includes both generic metadata (e.g. when it was created, the data license) and metadata specific to each data resource (how it can be used in calculations, its relationship to other data resources). Datapackages follow the data package standard.

Creating datapackages

Datapackages are created using create_datapackage, which takes the following arguments:

  • dirpath: str or pathlib.Path object. Where the datapackage should be saved. None for in-memory datapackages.
  • name: str: The name of the overall datapackage. Make it meaningful to you.
  • id_: str, optional. A unique id for this package. Automatically generated if not given.
  • metadata: dict, optional. Any additional metadata, such as license and author.
  • overwrite: bool, default False. Overwrite an existing resource with the same dirpath and name.
  • compress: bool, default False. Save to a zipfile, if saving to disk.

Calling this function returns an instance of Datapackage. You still need to add data.
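
A minimal sketch of that workflow, assuming the add_persistent_vector method and INDICES_DTYPE constant from the bw_processing API (argument names may differ between versions - check the documentation):

import numpy as np
import bw_processing as bwp

# In-memory datapackage (no dirpath given).
dp = bwp.create_datapackage(name="example-datapackage")

# One (row, col) index pair per data point in the target matrix.
indices = np.array([(0, 0), (1, 1), (2, 1)], dtype=bwp.INDICES_DTYPE)
data = np.array([1.0, 2.5, 4.0])
flip = np.array([False, False, True])   # third value inserted as negative

dp.add_persistent_vector(
    matrix="technosphere_matrix",
    indices_array=indices,
    data_array=data,
    flip_array=flip,
    name="example-resource",
)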

Contributing

Your contribution is welcome! Please follow the pull request workflow, even for minor changes.

When contributing to this repository with a major change, please first discuss the change you wish to make via issue, email, or any other method with the owners of this repository.

Please note we have a code of conduct; please follow it in all your interactions with the project.

Documentation and coding standards

Maintainers

License

BSD-3-Clause. Copyright 2020 Chris Mutel.

bw_processing's People

Contributors

aleksandra-kim, cmutel, michaelweinold, nikolaj-funartech, renovate[bot], tngtudor


bw_processing's Issues

move to cookiecutterlib

Current

An older version of the brightwaylib cookiecutter is used.

Expected

Use the newer version of the brightwaylib cookiecutter + cruft

MultiMonteCarlo crashes

Hi,

I have the following code to try to run a MultiMonteCarlo:

import bw2data as bd
from bw2calc import LCA
from bw2calc.monte_carlo import MultiMonteCarlo

bd.projects.set_current("ecoinvent_391") 
ei = bd.Database("ecoinvent_391_cutoff")


act = ei.get_node('a8fe0b37705fe611fac8004ca6cb1afd')
act2 = ei.get_node('413bc4617794c6e8b07038dbeca64adb')

method = ('CML v4.8 2016', 'climate change', 'global warming potential (GWP100)')

demands = [{act: 1,
            act2: 10}]

mc = MultiMonteCarlo(demands, method=method)
mc.calculate()

But it crashes with this traceback:

Traceback (most recent call last):
  File ".../bw2calc/monte_carlo.py", line 217, in calculate
    results = pool.map(
              ^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 540, in _handle_tasks
    put(task)
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 205, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_thread.RLock' object

I am not sure if I am supposed to run it differently. MultiLCA does not take use_distributions, and even after copying and changing the MultiLCA code to include it, it still seems to run a normal LCI.

I also tried it with Python 3.9 but get the same error. The error occurs on both Windows and Linux.
Any help is highly appreciated :)

python 3.8 tests fail

Current

Tests with Python 3.8 fail with errors similar to:

__________ test_save_load_parquet_file[indices_vector-vector-indices] __________

arr_fixture_name = 'indices_vector', meta_object = 'vector'
meta_type = 'indices'
tmp_path_factory = TempPathFactory(_given_basetemp=None, _trace=<pluggy._tracing.TagTracerSub object at 0x7f420e731070>, _basetemp=PosixPath('/tmp/pytest-of-runner/pytest-0'), _retention_count=3, _retention_policy='all')
request = <FixtureRequest for <Function test_save_load_parquet_file[indices_vector-vector-indices]>>
    @pytest.mark.parametrize("arr_fixture_name, meta_object, meta_type", ARR_LIST)
    def test_save_load_parquet_file(
        arr_fixture_name, meta_object, meta_type, tmp_path_factory, request
    ):
   
        arr = request.getfixturevalue(arr_fixture_name)  # get fixture from name
        file = tmp_path_factory.mktemp("data") / (arr_fixture_name + ".parquet")
        with file as fp:
            save_arr_to_parquet(
                file=fp, arr=arr, meta_object=meta_object, meta_type=meta_type
            )
>       with file as fp:
tests/io_parquet_helpers.py:39: 

The interesting part to "debug" / fix: E ValueError: I/O operation on closed path

tests/io_parquet_helpers.py:39: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/pathlib.py:1067: in __enter__
    self._raise_closed()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = PosixPath('/tmp/pytest-of-runner/pytest-0/data0/indices_vector.parquet')

    def _raise_closed(self):
>       raise ValueError("I/O operation on closed path")
E       ValueError: I/O operation on closed path

Expected

  • Find a solution to keep bw_processing compatible with Python 3.8, or
  • deprecate 3.8:
    • update the Python version requirement
    • remove 3.8 from testing

Move from `fs` (pyfilesystem) to `fsspec`

As suggested by @cmutel, moving to fsspec would solve two issues with fs (pyfilesystem):

  1. The package is currently not under active development
  2. Code of the sort `if not _WINDOWS_PLATFORM: import grp` causes an error in WASM (Pyodide/JupyterLite-XEUS):

Note that this was patched in the current https://live.brightway.dev release:

import micropip
await micropip.install(
    'https://files.brightway.dev/fs-2.5.1-py2.py3-none-any.whl'
)

Still, it would be great to move to fsspec, which is supported by both Pyodide and emscripten-forge (via the pure Python channel).

This would close:

build fails because of missing dep

Travis build failed.

Here's a hint on why:

...
/lib/python3.9/site-packages/bw_processing/io_helpers.py", line 2, in <module>

    from fs.base import FS

ModuleNotFoundError: No module named 'fs'

but the Azure pipeline went through.

Do we:

  • keep Azure Pipelines only and remove the .travis config, or
  • keep both, and fix the Travis one?

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

This repository currently has no open or pending branches.

Detected dependencies

github-actions
.github/workflows/python-package-deploy.yml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/python-test.yml
  • actions/checkout v4
  • actions/setup-python v5
  • codecov/codecov-action v4
pep621
pyproject.toml
  • setuptools >=68.0


See if a Datapackage object is non mutable or not

For the moment, Datapackage objects are supposed to be immutable. In that case, we should force the finalization of such objects, and test whether an object has been finalized before any operation can be applied to it.

If a Datapackage object is mutable, then introduce such a mechanism.

Code review for `FilteredDatapackage`

I would be very happy to have some help making sure my code is doing what I think it is. In 2.5, we use data packages to store processed arrays and metadata defining generic interfaces to external data sources which return processed arrays. These data packages can (in theory) contain data for multiple matrices, or multiple data resources for the same matrix.

To dispatch data to the correct matrix builders, we use an object called FilteredDatapackage. These objects are created by one and only one method, filter_by_attribute. In order for the code flow to work correctly and not use too much memory, we need FilteredDatapackage to avoid copies wherever possible.

This is where I need help. I think that filter_by_attribute creates a "shallow" copy, e.g. while .resources is a new object (a list), the objects in that list are the same as in the parent Datapackage. But I am not 100% sure, and questions such as whether Numpy creates a view or a copy are not always clear to me. I also don't know how to write tests to ensure my assumptions are correct (of course, one could iterate and check the id() of objects, but is there something else? Maybe also check memory usage?).

In the code review, you may notice that get_resource uses a cache, and that this cache would not be shared across instances of FilteredDatapackage. This is OK, as each data resource (the actual underlying numpy array, which can in theory be very large) would only ever be loaded once, by the matrix constructor for that particular matrix.

Refactor `datapackage.py`

Datapackage has several add_xxx methods that could be refactored to make the code shorter and easier to maintain.

Make parquet file helpers public?

For the moment, all parquet file helpers are private, i.e. they cannot be accessed through the library. Maybe it would be useful to make them accessible so that users can create the needed parquet files outside of the library?

Multiple readings of the same proxied data resource in `FilteredDatapackage` cause errors

This problem occurs only when proxied data like arrays (i.e. proxy=True) are read by more than one FilteredDatapackage. This problem is inherent to the current way that proxied arrays are being used.

We currently use functools.partial to provide different arguments to the different open functions for Numpy, CSV, JSON, etc. If there is only one instance of Datapackage, then the proxy is resolved and replaced with the in-memory data object. However, when two copies of FilteredDatapackage are present, the underlying file or buffer object will be read completely (exhausted) by the first object, and will raise an error when accessed a second time.
