khaeru / genno

Efficient, transparent computation on labelled, N-dimensional data
Home Page: https://genno.rtfd.io
License: GNU General Public License v3.0
This requires writing a fixture that populates a Computer with contents analogous to ixmp.testing.make_dantzig.

This would add .compat.sdmx, including computations like…
- sdmx.model.DataSet into Quantity.
- Quantity into sdmx.model.DataSet.

Some issues to resolve here:
Specifically, iiasa/ixmp#396.
Only generalizable pieces, e.g. related to configuration, plugins, callbacks.
Currently Quantity() is a function with a name that makes it seem like a class. This means it's not possible to do:

if isinstance(foo, Quantity)

…or to use it in type annotations for computation functions. Using a metaclass like QuantityMeta should make it possible to do this.
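A minimal sketch of the metaclass idea, not genno's actual implementation: QuantityMeta overrides __instancecheck__ so that isinstance() accepts any of the concrete classes the Quantity() factory may return. AttrSeries and SparseDataArray here are empty stand-ins for the real genno classes.

```python
class AttrSeries:
    """Stand-in for one concrete Quantity implementation."""

class SparseDataArray:
    """Stand-in for the other concrete Quantity implementation."""

class QuantityMeta(type):
    def __instancecheck__(cls, obj):
        # isinstance(foo, Quantity) is True for any implementation class
        return isinstance(obj, (AttrSeries, SparseDataArray))

class Quantity(metaclass=QuantityMeta):
    def __new__(cls, *args, **kwargs):
        # Factory behaviour: return the currently-selected implementation
        return AttrSeries()

q = Quantity()
print(type(q).__name__)         # → AttrSeries
print(isinstance(q, Quantity))  # → True
```

Because Quantity is now a real class, it can also appear in type annotations for computation functions.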
Consider these cases:
>>> from genno import Quantity, computations
# Case A
>>> computations.add(Quantity(1.0, units="kg"), Quantity(2.0, units="tonne"), Quantity(3.0))
ValueError: Units 'kg' and '' are incompatible
# Case B
>>> computations.add(Quantity(1.0, units="kg"), Quantity(2.0, units="tonne"), Quantity(3.0, units=""))
ValueError: Units 'kg' and '' are incompatible
In (A), collect_units() assigns dimensionless to the last operand. In (B), it is explicitly dimensionless. This arose in iiasa/message_ix#441, where computations.add() is applied to two quantities, one with units, the other dimensionless (because the ixmp parameter handled by ixmp.reporting.computations.data_for_quantity() was empty).

What should the behaviour be? Some possibilities:
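A minimal sketch (not genno's actual code) of the collect_units() fallback described above: operands without recorded units are assumed dimensionless, which is what makes case (A) indistinguishable from case (B). The Q class is a toy stand-in for Quantity.

```python
class Q:
    """Stand-in for genno's Quantity, storing a value and optional units."""
    def __init__(self, value, units=None):
        self.value, self.units = value, units

def collect_units(*args):
    units = []
    for arg in args:
        u = arg.units
        if u is None:
            # Case (A): no units recorded; assume dimensionless ("")
            u = ""
        units.append(u)
    return units

print(collect_units(Q(1.0, "kg"), Q(2.0, "tonne"), Q(3.0)))
# → ['kg', 'tonne', '']
```

After this step, both cases present the same units to the downstream compatibility check, so they raise the same error.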
The dask graph specification uses ‘computation’ for any dict value in the graph. A ‘task’ (a tuple with a callable first element) is one of four kinds of ‘computation’. In contrast, genno uses ‘computation’ for the callables used as the first elements of tasks. This is a little inconsistent; it is also a long word. Consider alternatives.
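To make the terminology concrete, here is an illustration of the dask graph spec using plain Python; get() is a minimal recursive evaluator written for this example, not dask's scheduler.

```python
from operator import add

graph = {
    "x": 1,               # a literal value: one kind of 'computation'
    "y": (add, "x", 10),  # a 'task': a tuple with a callable first element
    "z": (add, "y", "y"),
}

def get(dsk, key):
    """Recursively evaluate `key` in the graph `dsk`."""
    val = dsk[key]
    if isinstance(val, tuple) and callable(val[0]):
        func, *args = val
        # Arguments that name other graph keys are evaluated first
        return func(
            *(get(dsk, a) if isinstance(a, str) and a in dsk else a for a in args)
        )
    return val

print(get(graph, "z"))  # → 22
```

In genno's usage, ‘computation’ refers only to callables like add above, i.e. the first element of each task tuple.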
ixmp commits for only those that affect reporting code (done in #2).

A caching pattern/task would:
Existing code from e.g. khaeru/data or transportenergy/ipcc-wg4-ar6-ch10 could be adapted for this.
pyam 1.7.0 was released on 2022-12-19. Per IAMconsortium/pyam#708, specifically here, keyword arguments to IamDataFrame are fed directly to pandas.DataFrame.to_excel(). (See also the blame for this method; it appears that at some point pyam forced engine="openpyxl" and accepted but ignored the keyword arguments.)

This causes failures in genno.compat.pyam.write_report(), e.g. here:

genno/compat/pyam/computations.py:109: in write_report
    obj.to_excel(path, merge_cells=False)
/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pyam/core.py:2382: in to_excel
    excel_writer = pd.ExcelWriter(excel_writer, **kwargs)
(snip)
>   self._book = Workbook(self._handles.handle, **engine_kwargs)
E   TypeError: Workbook.__init__() got an unexpected keyword argument 'merge_cells'
/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pandas/io/excel/_xlsxwriter.py:216: TypeError

This is because pyam now allows pandas to select xlsxwriter as the engine, and the merge_cells keyword argument is not understood by this engine. The fix is likely to (a) remove the merge_cells=False argument and (b) specify a minimum version of pyam, to avoid the need for genno to handle the shift(s) in behaviour.
This issue is to collect type errors seen in downstream code that uses genno. These can be addressed by changes like those in #53, with reference to the typing and mypy docs.
Addressed in #55:
error: "Quantity" has no attribute "shift"
error: Unsupported operand types for * ("float" and "Quantity")
error: Unsupported operand types for - ("int" and "Quantity")
Others:
…
Include ≥1 example in the built documentation.
A separate issue is to use these extensively to illustrate graphs.
cf. iiasa/message_data#337
Currently this is used internally by .config.parse_config(), but it could be further demonstrated on a documentation page.
Inherited from iiasa/ixmp#191:

xarray 0.13 includes support for converting pd.DataFrame to a pydata/sparse data structure. This should mostly obviate the need for the custom AttrSeries class. A PR should be opened to make the change, test performance, and make any necessary adjustments.

Resources:
- Pint's technical commentary on container class hierarchy: https://pint.readthedocs.io/en/0.11/numpy.html#Technical-Commentary
- AmphoraInc/xarray_mongodb, which integrates xarray, sparse, and pint, for reference (thanks @gidden).

As of genno 1.0, all code is tested with both AttrSeries and SparseDataArray to minimize surprises on switching. #27 should probably be done first.
Nightly tests began to fail with the release of xarray 2022.6.0, e.g. here.
genno/tests/test_computations.py::test_broadcast_map[SparseDataArray-map_values0-kwarg0]
genno/tests/test_computations.py::test_index_to[SparseDataArray]
genno/tests/test_computations.py::test_pow[SparseDataArray]
genno/tests/test_computations.py::test_product0[SparseDataArray]
genno/tests/test_computations.py::test_product[SparseDataArray-dims0-64]
genno/tests/test_computations.py::test_product[SparseDataArray-dims1-8]
genno/tests/test_computations.py::test_product[SparseDataArray-dims2-4]
genno.util.collect_units():

    log.debug(f"{arg} lacks units; assume dimensionless")
As mitigation:
To resolve:
For a quantity <A:x-y-z> with labels on the x dimension like ‘foo’, ‘bar’, and ‘baz’, .aggregate() should accept a wildcard or regular expression like "ba." that would aggregate the labels ‘bar’ and ‘baz’ to a group, but not ‘foo’.
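A sketch (not genno's API) of how such a regex-based group specification could select labels: labels that fully match the pattern form the group, and everything else is left out.

```python
import re

def match_labels(labels, pattern):
    """Return the labels that fully match `pattern`."""
    regex = re.compile(pattern)
    return [label for label in labels if regex.fullmatch(label)]

print(match_labels(["foo", "bar", "baz"], "ba."))  # → ['bar', 'baz']
```

.aggregate() could then sum the matched labels into a single new label on that dimension, as it already does for explicitly-listed groups.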
With the release of matplotlib v3.6.0, nightly tests began to fail due to has2k1/plotnine#619.
Temporary mitigations:
To resolve:
As of #3, there are about 27 lines uncovered out of 1515.
In .computations:
- .product() → .mul()
- .ratio() → .div()
- .sub() for subtraction.

See the operator module in the standard library. Similar names are also used by numpy, etc. This would require shadowing under the old names with deprecation markings. .add() and .pow() are already correct.
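A hypothetical sketch of the shadowing pattern: the old name stays callable but emits a DeprecationWarning and delegates to the new name. mul()/product() here are toy stand-ins, not genno's actual functions.

```python
import warnings

def mul(*quantities):
    """Element-wise product of all arguments (new, operator-style name)."""
    result = quantities[0]
    for q in quantities[1:]:
        result = result * q
    return result

def product(*quantities):
    """Deprecated alias for mul()."""
    warnings.warn(
        "product() is renamed mul()", DeprecationWarning, stacklevel=2
    )
    return mul(*quantities)

print(mul(2, 3, 4))  # → 24
```

Existing downstream code that calls product() keeps working through a deprecation period, while new code uses mul().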
This can be the location for:
- add_task(c: Computer), or similar, for describing computations in c.
- __call__(): the actual callable to be executed.
- __repr__(): a more readable string representation for Computer.describe().

These should be easier to maintain if they are collected, instead of the separate pair of e.g. Computer.convert_pyam (for adding task(s)) and .compat.pyam.computations.as_pyam (the actual callable).

This will also allow reducing the complexity of this code in Computer.add():
Lines 144 to 162 in 43a1702
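A hypothetical sketch of collecting the "add task" and "execute" halves of a computation in one class, as proposed above. AsPyam is an illustrative name, and the Computer is stood in by a plain dict.

```python
class AsPyam:
    """Collects behaviour currently split across two separate functions."""

    def add_task(self, c, key, quantity_key):
        # Describe the computation in the graph `c`
        c[key] = (self, quantity_key)

    def __call__(self, quantity):
        # The actual callable to be executed (toy body for illustration)
        return f"IamDataFrame({quantity})"

    def __repr__(self):
        # Readable form for Computer.describe()
        return "as_pyam"

c = {}
task = AsPyam()
task.add_task(c, "A:iamc", "A:x-y")
func, arg = c["A:iamc"]
print(func(arg))  # → IamDataFrame(A:x-y)
```

Computer.add() could then dispatch uniformly: if an object provides add_task(), call it; otherwise fall back to the current behaviour.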
The Computer can check Computer.modules; if a class provides an add_task() method, call that directly; else, simply instantiate.

If two AttrSeries with non-identical dims (e.g. ("x", "y") and ("y", "x")) are concatenated, the dimensions are not aligned automatically. Add tests & fix.
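A minimal, library-free sketch of the alignment step the fix needs: before concatenating, transpose one operand's keys so its dimension order matches the other's. Here quantities are modelled as dicts mapping coordinate tuples to values; this is illustrative, not AttrSeries code.

```python
def align_dims(dims_a, dims_b, values_b):
    """Reorder each coordinate tuple in `values_b` from dims_b to dims_a order."""
    if set(dims_a) != set(dims_b):
        raise ValueError(f"dimensions differ: {dims_a} vs {dims_b}")
    order = [dims_b.index(d) for d in dims_a]
    return {tuple(key[i] for i in order): v for key, v in values_b.items()}

a = {("x1", "y1"): 1.0}  # dims ("x", "y")
b = {("y2", "x2"): 2.0}  # dims ("y", "x")

# Align b to a's dimension order, then concatenate
merged = {**a, **align_dims(("x", "y"), ("y", "x"), b)}
print(merged)  # → {('x1', 'y1'): 1.0, ('x2', 'y2'): 2.0}
```

For the real AttrSeries (a pd.Series subclass), the analogous step would reorder the MultiIndex levels of one operand before pd.concat().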
Because dask.visualize() is intended for use with dask's own collections/classes, it tries to generate labels suitable for that use-case. These end up being uninformative (e.g. blank) for genno graphs, e.g.:

This could be addressed by some combination of:
Including:
- select
- message_ix.reporting.pyam → genno.compat.pyam

The following should work:

>>> "X:a-b" == Key("X", "ba")
True
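A hypothetical sketch of order-insensitive Key equality, not genno's actual implementation: a Key parses string operands and compares name plus sorted dimensions, so dimension order does not matter.

```python
class Key:
    def __init__(self, name, dims=()):
        self.name, self.dims = name, tuple(dims)

    def __str__(self):
        return f"{self.name}:{'-'.join(self.dims)}"

    def __eq__(self, other):
        # Accept a string like "X:a-b" as the other operand
        if isinstance(other, str):
            name, _, dims = other.partition(":")
            other = Key(name, dims.split("-") if dims else ())
        # Dimension order is irrelevant to equality
        return (self.name, sorted(self.dims)) == (other.name, sorted(other.dims))

print("X:a-b" == Key("X", "ba"))  # → True
```

The str == Key comparison works because str.__eq__ returns NotImplemented for a Key operand, so Python falls back to the reflected Key.__eq__.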
(Transferred from the discussion of iiasa/message-ix-models#25.)

The main question is whether genno's caching option covers the following two features, which are covered by an implementation I recently did using joblib.Memory.

pathlib.Path or a str. That means my function now looks like this: read_and_compute_some_data(file: Union[str, pathlib.Path], ...) -> pd.DataFrame. Here the joblib.Memory caching decorator would simply save a hash of the name of the input file. In a way that's a problem, since I'm not actually interested in the name of my data file but in its contents. For this I have created a small wrapper class InputFile for the filename, which stores a hash of the file's contents. As joblib uses pickle to serialize the data to binary, I have modified the way InputFile is serialized by considering only the contents of the file and not the name.

Minimum working example of caching the contents of input files using joblib.Memory:
from joblib import Memory
from pathlib import Path
import hashlib
import pandas as pd

# Set the directory of the cache in the parent folder of this file
memory = Memory(Path(__file__).parent / ".joblib_cache")


class InputFile:
    def __init__(self, file) -> None:
        self.file = file
        self.hash = self.calc_hash()

    def calc_hash(self) -> str:
        """Generate a hash from the contents of self.file.

        Returns
        -------
        str
            Hexadecimal representation of the file hash.

        Notes
        -----
        For details refer to
        https://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
        """
        with open(self.file, "rb") as f:
            file_hash = hashlib.md5()
            # Read the file in 8192-byte chunks, a multiple of MD5's
            # 64-byte block size
            while chunk := f.read(8192):
                file_hash.update(chunk)
        return file_hash.hexdigest()

    def __getstate__(self) -> dict:
        """Custom __getstate__ for use with joblib.Memory.cache.

        Returns
        -------
        dict
            __dict__ minus the file name.
        """
        # This 'tricks' pickle into only considering the hash of the contents
        # of the file, and not the filename itself, when checking if we have
        # cached results. Of course you could also change it to include the
        # filename as well. A good way might be to use both the filename (just
        # the name, not the entire path) and the hash of the contents. This
        # would possibly also make the cache independent of the user, as it
        # would no longer hash the directory structure where the file is
        # stored.
        state = self.__dict__.copy()
        # Remove the file from the state, as we are just interested in the
        # contents
        del state["file"]
        return state

    def __repr__(self) -> str:
        # This is just so that we get a nice representation of the class,
        # since joblib.Memory also writes a JSON file with the input
        # parameters of the function call
        return f"{self.__class__}: {self.__dict__}"


# Add the decorator to make read_from_file cache-able. Caching this particular
# function might be a bit pointless, but it illustrates the general layout.
@memory.cache
def read_from_file(input_file):
    return pd.read_csv(input_file.file)


if __name__ == "__main__":
    # In the current configuration, the second call to read_from_file would
    # hit the cache if the contents of file1.csv and file2.csv are the same,
    # even though they have different names.
    read_from_file(InputFile("file1.csv"))
    read_from_file(InputFile("file2.csv"))

Additionally, joblib.Memory also saves JSON files in which the input values are stored, which is a nice feature for bookkeeping.