genno's Issues

Add SDMX input/output

This would add .compat.sdmx, including computations like…

  • Convert sdmx.model.DataSet into Quantity. (A rough sketch follows this list.)
  • Perform a specific SDMX query to retrieve data.
  • Convert Quantity into sdmx.model.DataSet.
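As a rough illustration, the first of these might look like the sketch below. The function name dataset_to_quantity is hypothetical, and it assumes sdmx.to_pandas() returns a pandas Series indexed by the data set's dimensions:

import sdmx
from genno import Quantity

def dataset_to_quantity(ds: "sdmx.model.DataSet") -> Quantity:
    """Hypothetical computation: convert an SDMX data set to a Quantity."""
    # sdmx.to_pandas() yields a pd.Series with one index level per dimension
    return Quantity(sdmx.to_pandas(ds))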

Some issues to resolve here:

  1. Quantity.attrs map well to SDMX attributes attached at the level of an entire data set. However, one powerful feature of SDMX is the ability to attach attributes to individual observations. This does not have a natural analogue in the xarray (thus genno) data model.

Make Quantity a full class

Currently Quantity() is a function with a name that makes it seem like a class.

This means it's not possible to do:

if isinstance(foo, Quantity)

…or to use it in type annotations for computation functions.

Using a metaclass like QuantityMeta should make it possible to do this.
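A minimal, self-contained sketch of this approach. AttrSeries and SparseDataArray below are stand-ins for genno's two backing implementations; the exact wiring is hypothetical:

class AttrSeries: ...
class SparseDataArray: ...

class QuantityMeta(type):
    def __instancecheck__(cls, obj):
        # isinstance(x, Quantity) is True for an instance of either backing class
        return isinstance(obj, (AttrSeries, SparseDataArray))

class Quantity(metaclass=QuantityMeta):
    def __new__(cls, *args, **kwargs):
        # Return an instance of the configured backing class
        # (hard-coded to AttrSeries here, purely for illustration)
        return AttrSeries(*args, **kwargs)

assert isinstance(Quantity(), Quantity)  # now works; so do type annotations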

Strict vs. permissive handling of missing/dimensionless units

Consider these cases:

>>> from genno import Quantity, computations
# Case A
>>> computations.add(Quantity(1.0, units="kg"), Quantity(2.0, units="tonne"), Quantity(3.0))
ValueError: Units 'kg' and '' are incompatible
# Case B
>>> computations.add(Quantity(1.0, units="kg"), Quantity(2.0, units="tonne"), Quantity(3.0, units=""))
ValueError: Units 'kg' and '' are incompatible

In (A) collect_units() assigns dimensionless to the last operand. In (B), it is explicitly dimensionless. This arose in iiasa/message_ix#441, where computations.add() is applied to two quantities, one with units, the other dimensionless (because the ixmp parameter handled by ixmp.reporting.computations.data_for_quantity() was empty).

What should the behaviour be?

Some possibilities:

  • In (A), infer that operands with missing units are in the same units as the first/others. Maybe only if the units of the others are mutually consistent?
  • In (B), infer that explicitly dimensionless operands are in the same units as the first/others.
  • Add a (global?) configuration setting to toggle between different behaviours. (What should be the default?) A permissive variant is sketched after this list.
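A sketch of such a permissive variant, using pint. The helper name and the assumption that units are stored in Quantity.attrs["_unit"] are illustrative, not genno's actual API:

import pint

def fill_missing_units(quantities, strict=False):
    """Hypothetical helper: give missing/dimensionless operands the first defined unit."""
    registry = pint.get_application_registry()
    units = [q.attrs.get("_unit") or registry.dimensionless for q in quantities]
    defined = [u for u in units if not u.dimensionless]
    if strict or not defined:
        # Strict behaviour: leave units as-is; add() raises if they are incompatible
        return units
    # Permissive behaviour: assume missing/dimensionless operands share the first defined unit
    return [defined[0] if u.dimensionless else u for u in units]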

Change term ‘computations’?

The dask graph specification uses ‘computation’ for any dict value in the graph. A ‘task’ (a tuple whose first element is callable) is one of four kinds of ‘computation’.

In contrast, genno uses ‘computation’ for callables used as those first elements of tasks. This is a little inconsistent; also it's a long word.

Consider alternatives.

Transfer & refactor initial code from ixmp.reporting

  • Filter ixmp commits for only those that affect reporting code —done in #2
  • Set up packaging —#3
    • setup.{py,cfg}
    • Documentation using Sphinx
  • Set up CI using GitHub Actions
    • lint.yml —#3
    • pytest.yml —#7
    • Add badge
  • Ensure tests all pass —#3
  • Reorganize code into a coherent structure —#3
  • Set up RTD and add badge —#7
  • Set up Codecov and add badge —#7

Add file-based caching

A caching pattern/task would:

  • Understand a configured cache directory.
  • Compute a hash of the arguments and inputs to a particular task.
  • If the corresponding cache file exists, load and return it.
  • Otherwise:
    • Execute the task that generates the data,
    • Cache the result, and
    • Return it.

Existing code, e.g. from khaeru/data or transportenergy/ipcc-wg4-ar6-ch10, could be adapted for this.
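A minimal sketch of this pattern using only the standard library; the function name and hashing scheme are illustrative, not a proposed API:

import hashlib
import pickle
from pathlib import Path

def cached(cache_dir: Path, func, *args, **kwargs):
    """Execute func(*args, **kwargs), caching its result in cache_dir."""
    # Hash the task: the function name plus a representation of its arguments
    key = hashlib.sha1(repr((func.__name__, args, kwargs)).encode()).hexdigest()
    path = cache_dir / f"{key}.pkl"
    if path.exists():
        # Cache hit: load and return the stored result
        return pickle.loads(path.read_bytes())
    # Cache miss: execute the task, cache the result, and return it
    result = func(*args, **kwargs)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(result))
    return result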

Adjust for pyam 1.7.0

pyam 1.7.0 was released on 2022-12-19. Per IAMconsortium/pyam#708 (specifically here), keyword arguments to IamDataFrame.to_excel() are now fed directly through to pandas (ultimately pandas.ExcelWriter, per the traceback below). (See also the blame for this method; it appears that at some point pyam forced engine="openpyxl" and accepted but ignored the keyword arguments.)

This causes failures in genno.compat.pyam.write_report(), e.g. here:

 genno/compat/pyam/computations.py:109: in write_report
    obj.to_excel(path, merge_cells=False)
/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pyam/core.py:2382: in to_excel
    excel_writer = pd.ExcelWriter(excel_writer, **kwargs)

(snip)

>       self._book = Workbook(self._handles.handle, **engine_kwargs)
E       TypeError: Workbook.__init__() got an unexpected keyword argument 'merge_cells'

/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pandas/io/excel/_xlsxwriter.py:216: TypeError

This is because pyam is now allowing pandas to select xlsxwriter as the engine, and the merge_cells keyword argument is not understood by this engine.

The fix is likely to (a) remove the merge_cells keyword argument from write_report() and/or (b) specify a minimum version of pyam, to avoid the need for genno to handle the shift(s) in behaviour.
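One way to implement (a) without also requiring (b) would be to pass merge_cells only for older pyam versions. A sketch, assuming the PyPI distribution name pyam-iamc and that this is the only affected keyword argument:

from importlib.metadata import version
from packaging.version import Version

def write_report_excel(obj, path):
    """Hypothetical variant of write_report(): drop merge_cells on pyam >= 1.7.0."""
    kwargs = {}
    if Version(version("pyam-iamc")) < Version("1.7.0"):
        # Older pyam accepted (and ignored) this keyword argument
        kwargs["merge_cells"] = False
    obj.to_excel(path, **kwargs)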

Improve typing

This issue is to collect type errors seen in downstream code that uses genno. These can be addressed by changes like those in #53, with reference to the typing and mypy docs.

Addressed in #55:

error: "Quantity" has no attribute "shift"
error: Unsupported operand types for * ("float" and "Quantity")
error: Unsupported operand types for - ("int" and "Quantity")

Others:

Document Computer.visualize()

Include ≥1 example in the built documentation.

A separate issue is to use these extensively to illustrate graphs.

Document .add_queue()

Currently this is used internally by .config.parse_config(), but it could be further demonstrated on a documentation page.
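A possible documentation example follows. The queue format of (args, kwargs) 2-tuples is inferred from how .config.parse_config() builds its queue, so treat the details as unverified:

from genno import Computer

c = Computer()
# Each queue item is (args, kwargs) for a single Computer.add() call
c.add_queue([
    (("x", 1.0), {}),
    (("y", 2.0), {}),
])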

Switch default Quantity: AttrSeries → SparseDataArray

Inherited from iiasa/ixmp#191:

xarray 0.13 includes support for converting pd.DataFrame to a pydata/sparse data structure.
This should mostly obviate the need for the custom AttrSeries class.
A PR should be opened to make the change, test performance, and make any necessary adjustments.

Resources:

As of genno 1.0, all code is tested with both AttrSeries and SparseDataArray to minimize surprises on switching.

#27 should probably be done first.

Update for xarray 2022.6.0

Nightly tests began to fail with the release of xarray 2022.6.0 e.g. here.

  • The failing tests are:
    genno/tests/test_computations.py::test_broadcast_map[SparseDataArray-map_values0-kwarg0]
    genno/tests/test_computations.py::test_index_to[SparseDataArray]
    genno/tests/test_computations.py::test_pow[SparseDataArray]
    genno/tests/test_computations.py::test_product0[SparseDataArray]
    genno/tests/test_computations.py::test_product[SparseDataArray-dims0-64]
    genno/tests/test_computations.py::test_product[SparseDataArray-dims1-8]
    genno/tests/test_computations.py::test_product[SparseDataArray-dims2-4]
    
  • These all appear to fail on the f-string formatting of a log message in genno.util.collect_units():
    log.debug(f"{arg} lacks units; assume dimensionless")
    which raises: “RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.”
  • This is an upstream regression: pydata/xarray#6822

As mitigation:

  • SparseDataArray is not the default genno.Quantity class currently. If using AttrSeries (the default), genno remains usable.
  • If using SparseDataArray, use xarray < 2022.6.0

To resolve:

  • Follow the response to the upstream issue.
  • Make any adjustment necessary in genno itself. (One possible local workaround is sketched below.)
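For example, pending the upstream fix, the log call might avoid interpolating the sparse-backed object itself. A sketch; whether this fully avoids densification, and whether a .name attribute is available on the argument, are assumptions:

import logging

log = logging.getLogger(__name__)

def warn_dimensionless(arg):
    # Log the quantity's name rather than its full repr, which may trigger
    # densification of a sparse-backed DataArray under xarray 2022.6.0
    log.debug(f"{getattr(arg, 'name', 'quantity')} lacks units; assume dimensionless")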

Advertise or remove .config.CALLBACKS

#16 added this code, adapted from message_data:

genno/config.py, lines 102 to 103 at 91d906d:

# Also add the callbacks to the queue
queue.extend((("apply", cb), {}) for cb in CALLBACKS)

These "callbacks" are essentially the same as "handlers", simply without any arguments.
Perhaps the two can be merged, and the handles() decorator updated/renamed to cover both use-cases.
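For instance, handles() could register a callback when it is called with no section names. A self-contained sketch, not genno's actual implementation:

HANDLERS = {}
CALLBACKS = []

def handles(*sections):
    """Register a configuration handler for `sections`, or a callback if none are given."""
    def decorator(func):
        if sections:
            for section in sections:
                HANDLERS[section] = func
        else:
            # No section given: treat the function as a callback, applied with no arguments
            CALLBACKS.append(func)
        return func
    return decorator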

Also: add documentation!

Edit documentation

  • Ensure it is self-contained/standalone.
  • Incorporate text from message_ix reporting tutorial.

Add a `Computation` abstract class

This can be the location for:

  • add_task(c: Computer) or similar for describing computations in c.
  • __call__(): the actual callable to be executed.
  • __repr__(): a more readable string representation for Computer.describe().
  • etc.

These should be easier to maintain if they are collected in one place, instead of in separate pairs such as Computer.convert_pyam (for adding task(s)) and .compat.pyam.computations.as_pyam (the actual callable).

This would also allow reducing the complexity of this code in Computer.add():

elif isinstance(data, str) and self.get_comp(data):
    # *data* is the name of a pre-defined computation
    name = data
    if hasattr(self, f"add_{name}"):
        # Use a method on the current class to add. This invokes any
        # argument-handling conveniences, e.g. Computer.add_product()
        # instead of using the bare product() computation directly.
        return getattr(self, f"add_{name}")(*args, **kwargs)
    else:
        # Get the function directly
        func = self.get_comp(name)
        # Rearrange arguments: key, computation function, args, …
        func, kwargs = partial_split(func, kwargs)
        return self.add(args[0], func, *args[1:], **kwargs)
elif isinstance(data, str) and data in dir(self):
    # Name of another method, e.g. 'apply'
    return getattr(self, data)(*args, **kwargs)

The Computer can:

  • Look up the Computation class in Computer.modules.
  • If it has an add_task() method, call that directly; else, simply instantiate.
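A minimal sketch of the proposed abstract class; method names follow the bullets at the top of this issue, while the exact signatures are hypothetical:

from abc import ABC, abstractmethod

class Computation(ABC):
    @abstractmethod
    def __call__(self, *args, **kwargs):
        """The actual callable executed when the task runs."""

    def add_task(self, c, *keys, **kwargs):
        """Describe this computation's task(s) in the Computer `c`."""
        raise NotImplementedError

    def __repr__(self) -> str:
        # A more readable representation for Computer.describe()
        return f"<computation {type(self).__name__}>"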

Extend/override `dask.visualize()`

Because dask.visualize() is intended for use with dask's own collections/classes, it tries to generate labels suitable for that use-case. These end up being uninformative (e.g. blank) for genno graphs, e.g.:

(Screenshot: visualize() output in which node labels are blank/uninformative.)

This could be addressed by some combination of:

  1. Extend genno's classes and objects (cf #30) to present the information expected by dask's labeling and other utilities.
  2. Monkeypatch dask.base.* as necessary to get the desired behaviour. (A sketch follows this list.)
  3. Copy dask's code into genno and modify it to get the desired behaviour.
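A sketch of option 2. That dask's node-labelling helper lives at dask.dot.label with this signature is an assumption; the actual target for a monkeypatch would need to be confirmed against the installed dask version:

import dask.dot

_original_label = dask.dot.label

def label(x, cache=None):
    # Use a callable's name as its node label; otherwise defer to dask's default
    if callable(x):
        return getattr(x, "__name__", repr(x))
    return _original_label(x, cache=cache)

dask.dot.label = label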

Cache based on function code / document caching based on file contents

(Transferred from the discussion of iiasa/message-ix-models#25.)

The main question is whether genno's caching option covers the following two features, which are covered by an implementation I recently did using joblib.Memory.

  1. joblib.Memory caches not only the input values but also the function code itself. This way, if your function code changes but the input stays the same, it won't wrongly think it already has the result cached.
  2. My use case involved reading data from a file, doing some computation, and providing the result as a pandas DataFrame. I provide the function with a filename in the form of a pathlib.Path or a str, so the function looks like read_and_compute_some_data(file: Union[str, pathlib.Path], ...) -> pd.DataFrame. Here the joblib.Memory caching decorator would simply save a hash of the name of the input file. That is a problem, since I'm not actually interested in the name of my data file but in its contents. For this I created a small wrapper class, InputFile, which stores a hash of the file's contents. Since joblib uses pickle to serialize the data to binary, I modified the way InputFile is serialized so that only the contents of the file, and not the name, are considered.

Minimum working example of caching of the content of input files using joblib.Memory:

from joblib import Memory
from pathlib import Path
import hashlib
import pandas as pd

# setting the directory of the cache in the parent folder of the file
memory = Memory(Path(__file__).parent / ".joblib_cache") 

class InputFile:
    def __init__(self, file) -> None:
        self.file = file
        self.hash = self.calc_hash()

    def calc_hash(self) -> str:
        """Generate a hash of the contents of ``self.file``.

        Returns
        -------
        str
            Hexadecimal representation of the MD5 hash of the file contents.

        Notes
        -----
        For details, refer to https://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
        """
        with open(self.file, "rb") as f:
            file_hash = hashlib.md5()
            # Read the file in 8192-byte chunks so that large files are not
            # loaded into memory all at once
            while chunk := f.read(8192):
                file_hash.update(chunk)
        return file_hash.hexdigest()

    def __getstate__(self) -> dict:
        """Custom __getstate__ for use with joblib's Memory.cache.

        Returns
        -------
        dict
            __dict__ minus the file name.
        """
        # This 'tricks' pickle into considering only the hash of the file's
        # contents, and not the filename itself, when checking whether we have
        # cached results. You could also change it to include the filename as
        # well; a good option might be to use both the bare filename (not the
        # entire path) and the hash of the contents. That would also make the
        # cache independent of the user, as it would no longer hash the
        # directory structure where the file is stored.

        state = self.__dict__.copy()
        # Remove the file name from the state; we are only interested in the contents
        del state["file"]
        return state

    def __repr__(self) -> str:
        # Give a readable representation of the class, since joblib.Memory
        # also writes a JSON file with the input parameters of the function call
        return f"{self.__class__}: {self.__dict__}"

# Add the decorator to make read_from_file() cacheable. Caching this particular
# function might be a bit pointless, but it illustrates the general layout.
@memory.cache
def read_from_file(input_file):
    return pd.read_csv(input_file.file)

if __name__ == "__main__":
    # In this configuration, the second call to read_from_file() hits the cache
    # if the contents of file1.csv and file2.csv are the same, even though the
    # files have different names.
    read_from_file(InputFile("file1.csv"))
    read_from_file(InputFile("file2.csv"))

Additionally, joblib.Memory saves JSON files in which the input values are stored, which is a nice feature for bookkeeping.
