khaeru / genno

Efficient, transparent computation on labelled, N-dimensional data
Home Page: https://genno.rtfd.io
License: GNU General Public License v3.0
This requires writing a fixture that populates a Computer with contents analogous to ixmp.testing.make_dantzig.

This would add .compat.sdmx, including computations like…
- sdmx.model.DataSet into Quantity.
- Quantity into sdmx.model.DataSet.

Some issues to resolve here:
Specifically, iiasa/ixmp#396.
Only generalizable pieces, e.g. related to configuration, plugins, callbacks.
Currently Quantity() is a function with a name that makes it seem like a class. This means it's not possible to do:

if isinstance(foo, Quantity)

…or to use it in type annotations for computation functions. Using a metaclass like QuantityMeta should make it possible to do this.
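A minimal sketch of the metaclass idea, not genno's actual implementation: QuantityMeta overrides __instancecheck__ so that isinstance() accepts any of the concrete classes the Quantity() factory may return. AttrSeries and SparseDataArray here are empty stand-ins for the real genno classes.

```python
class AttrSeries:
    """Stand-in for one concrete Quantity implementation."""

class SparseDataArray:
    """Stand-in for the other concrete Quantity implementation."""

class QuantityMeta(type):
    def __instancecheck__(cls, obj):
        # isinstance(foo, Quantity) is True for any implementation class
        return isinstance(obj, (AttrSeries, SparseDataArray))

class Quantity(metaclass=QuantityMeta):
    def __new__(cls, *args, **kwargs):
        # Factory behaviour: return the currently-selected implementation
        return AttrSeries()

q = Quantity()
print(type(q).__name__)         # → AttrSeries
print(isinstance(q, Quantity))  # → True
```

Because Quantity is now a real class, it can also appear in type annotations for computation functions.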
Consider these cases:
>>> from genno import Quantity, computations
# Case A
>>> computations.add(Quantity(1.0, units="kg"), Quantity(2.0, units="tonne"), Quantity(3.0))
ValueError: Units 'kg' and '' are incompatible
# Case B
>>> computations.add(Quantity(1.0, units="kg"), Quantity(2.0, units="tonne"), Quantity(3.0, units=""))
ValueError: Units 'kg' and '' are incompatible
In (A), collect_units() assigns dimensionless to the last operand. In (B), it is explicitly dimensionless. This arose in iiasa/message_ix#441, where computations.add() is applied to two quantities, one with units, the other dimensionless (because the ixmp parameter handled by ixmp.reporting.computations.data_for_quantity() was empty).

What should the behaviour be? Some possibilities:
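A minimal sketch (not genno's actual code) of the collect_units() fallback described above: operands without recorded units are assumed dimensionless, which is what makes case (A) indistinguishable from case (B). The Q class is a toy stand-in for Quantity.

```python
class Q:
    """Stand-in for genno's Quantity, storing a value and optional units."""
    def __init__(self, value, units=None):
        self.value, self.units = value, units

def collect_units(*args):
    units = []
    for arg in args:
        u = arg.units
        if u is None:
            # Case (A): no units recorded; assume dimensionless ("")
            u = ""
        units.append(u)
    return units

print(collect_units(Q(1.0, "kg"), Q(2.0, "tonne"), Q(3.0)))
# → ['kg', 'tonne', '']
```

After this step, both cases present the same units to the downstream compatibility check, so they raise the same error.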
The dask graph specification uses ‘computation’ for any dict value in the graph. A ‘task’ (a tuple with a callable first element) is one of four kinds of ‘computation’. In contrast, genno uses ‘computation’ for the callables used as the first elements of tasks. This is a little inconsistent; it is also a long word. Consider alternatives.
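To make the terminology concrete, here is an illustration of the dask graph spec using plain Python; get() is a minimal recursive evaluator written for this example, not dask's scheduler.

```python
from operator import add

graph = {
    "x": 1,               # a literal value: one kind of 'computation'
    "y": (add, "x", 10),  # a 'task': a tuple with a callable first element
    "z": (add, "y", "y"),
}

def get(dsk, key):
    """Recursively evaluate `key` in the graph `dsk`."""
    val = dsk[key]
    if isinstance(val, tuple) and callable(val[0]):
        func, *args = val
        # Arguments that name other graph keys are evaluated first
        return func(
            *(get(dsk, a) if isinstance(a, str) and a in dsk else a for a in args)
        )
    return val

print(get(graph, "z"))  # → 22
```

In genno's usage, ‘computation’ refers only to callables like add above, i.e. the first element of each task tuple.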
ixmp commits for only those that affect reporting code (done in #2).

A caching pattern/task would:
Existing code from e.g. khaeru/data or transportenergy/ipcc-wg4-ar6-ch10 could be adapted for this.
pyam 1.7.0 was released on 2022-12-19. Per IAMconsortium/pyam#708, specifically here, keyword arguments to IamDataFrame are fed directly to pandas.DataFrame.to_excel(). (See also the blame for this method; it appears that at some point pyam forced engine="openpyxl" and accepted but ignored the keyword arguments.)

This causes failures in genno.compat.pyam.write_report(), e.g. here:

genno/compat/pyam/computations.py:109: in write_report
    obj.to_excel(path, merge_cells=False)
/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pyam/core.py:2382: in to_excel
    excel_writer = pd.ExcelWriter(excel_writer, **kwargs)
(snip)
>   self._book = Workbook(self._handles.handle, **engine_kwargs)
E   TypeError: Workbook.__init__() got an unexpected keyword argument 'merge_cells'
/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pandas/io/excel/_xlsxwriter.py:216: TypeError

This is because pyam now allows pandas to select xlsxwriter as the engine, and the merge_cells keyword argument is not understood by this engine. The fix is likely to (a) remove the merge_cells=False argument and (b) specify a minimum version of pyam, to avoid the need for genno to handle the shift(s) in behaviour.
This issue is to collect type errors seen in downstream code that uses genno. These can be addressed by changes like those in #53, with reference to the typing and mypy docs.
Addressed in #55:
error: "Quantity" has no attribute "shift"
error: Unsupported operand types for * ("float" and "Quantity")
error: Unsupported operand types for - ("int" and "Quantity")
Others:
…
Include ≥1 example in the built documentation.
A separate issue is to use these extensively to illustrate graphs.
cf. iiasa/message_data#337
Currently this is used internally by .config.parse_config(), but it could be further demonstrated on a documentation page.
Inherited from iiasa/ixmp#191:

xarray 0.13 includes support for converting pd.DataFrame to a pydata/sparse data structure. This should mostly obviate the need for the custom AttrSeries class. A PR should be opened to make the change, test performance, and make any necessary adjustments.

Resources:
- Pint's technical commentary on container class hierarchy: https://pint.readthedocs.io/en/0.11/numpy.html#Technical-Commentary
- AmphoraInc/xarray_mongodb, which integrates xarray, sparse, and pint, for reference (thanks @gidden).

As of genno 1.0, all code is tested with both AttrSeries and SparseDataArray to minimize surprises on switching. #27 should probably be done first.
Nightly tests began to fail with the release of xarray 2022.6.0, e.g. here.
genno/tests/test_computations.py::test_broadcast_map[SparseDataArray-map_values0-kwarg0]
genno/tests/test_computations.py::test_index_to[SparseDataArray]
genno/tests/test_computations.py::test_pow[SparseDataArray]
genno/tests/test_computations.py::test_product0[SparseDataArray]
genno/tests/test_computations.py::test_product[SparseDataArray-dims0-64]
genno/tests/test_computations.py::test_product[SparseDataArray-dims1-8]
genno/tests/test_computations.py::test_product[SparseDataArray-dims2-4]
genno.util.collect_units():

    log.debug(f"{arg} lacks units; assume dimensionless")
As mitigation:
To resolve:
For a quantity <A:x-y-z> with labels on the x dimension like ‘foo’, ‘bar’, and ‘baz’, .aggregate() should accept a wildcard or regular expression like "ba." that would aggregate the labels ‘bar’ and ‘baz’ to a group, but not ‘foo’.
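A sketch (not genno's API) of how such a regex-based group specification could select labels: labels that fully match the pattern form the group, and everything else is left out.

```python
import re

def match_labels(labels, pattern):
    """Return the labels that fully match `pattern`."""
    regex = re.compile(pattern)
    return [label for label in labels if regex.fullmatch(label)]

print(match_labels(["foo", "bar", "baz"], "ba."))  # → ['bar', 'baz']
```

.aggregate() could then sum the matched labels into a single new label on that dimension, as it already does for explicitly-listed groups.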
With the release of matplotlib v3.6.0, nightly tests began to fail due to has2k1/plotnine#619.
Temporary mitigations:
To resolve:
As of #3, there are about 27 lines uncovered out of 1515.
In .computations:
- .product() → .mul()
- .ratio() → .div()
- .sub() for subtraction.

See the operator module in the standard library. Similar names are also used by numpy, etc. This would require shadowing under the old names with deprecation markings. .add() and .pow() are already correct.
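A hypothetical sketch of the shadowing pattern: the old name stays callable but emits a DeprecationWarning and delegates to the new name. mul()/product() here are toy stand-ins, not genno's actual functions.

```python
import warnings

def mul(*quantities):
    """Element-wise product of all arguments (new, operator-style name)."""
    result = quantities[0]
    for q in quantities[1:]:
        result = result * q
    return result

def product(*quantities):
    """Deprecated alias for mul()."""
    warnings.warn(
        "product() is renamed mul()", DeprecationWarning, stacklevel=2
    )
    return mul(*quantities)

print(mul(2, 3, 4))  # → 24
```

Existing downstream code that calls product() keeps working through a deprecation period, while new code uses mul().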
This can be the location for:
- add_task(c: Computer), or similar, for describing computations in c.
- __call__(): the actual callable to be executed.
- __repr__(): a more readable string representation for Computer.describe().

These should be easier to maintain if they are collected, instead of the separate pair of e.g. Computer.convert_pyam (for adding task(s)) and .compat.pyam.computations.as_pyam (the actual callable).

This will also allow reducing the complexity of this code in Computer.add():
Lines 144 to 162 in 43a1702
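A hypothetical sketch of collecting the "add task" and "execute" halves of a computation in one class, as proposed above. AsPyam is an illustrative name, and the Computer is stood in by a plain dict.

```python
class AsPyam:
    """Collects behaviour currently split across two separate functions."""

    def add_task(self, c, key, quantity_key):
        # Describe the computation in the graph `c`
        c[key] = (self, quantity_key)

    def __call__(self, quantity):
        # The actual callable to be executed (toy body for illustration)
        return f"IamDataFrame({quantity})"

    def __repr__(self):
        # Readable form for Computer.describe()
        return "as_pyam"

c = {}
task = AsPyam()
task.add_task(c, "A:iamc", "A:x-y")
func, arg = c["A:iamc"]
print(func(arg))  # → IamDataFrame(A:x-y)
```

Computer.add() could then dispatch uniformly: if an object provides add_task(), call it; otherwise fall back to the current behaviour.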
The Computer can check Computer.modules; if a class provides an add_task() method, call that directly; else, simply instantiate.

If two AttrSeries with non-identical dims (e.g. ("x", "y") and ("y", "x")) are concatenated, the dimensions are not aligned automatically. Add tests & fix.
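A minimal, library-free sketch of the alignment step the fix needs: before concatenating, transpose one operand's keys so its dimension order matches the other's. Here quantities are modelled as dicts mapping coordinate tuples to values; this is illustrative, not AttrSeries code.

```python
def align_dims(dims_a, dims_b, values_b):
    """Reorder each coordinate tuple in `values_b` from dims_b to dims_a order."""
    if set(dims_a) != set(dims_b):
        raise ValueError(f"dimensions differ: {dims_a} vs {dims_b}")
    order = [dims_b.index(d) for d in dims_a]
    return {tuple(key[i] for i in order): v for key, v in values_b.items()}

a = {("x1", "y1"): 1.0}  # dims ("x", "y")
b = {("y2", "x2"): 2.0}  # dims ("y", "x")

# Align b to a's dimension order, then concatenate
merged = {**a, **align_dims(("x", "y"), ("y", "x"), b)}
print(merged)  # → {('x1', 'y1'): 1.0, ('x2', 'y2'): 2.0}
```

For the real AttrSeries (a pd.Series subclass), the analogous step would reorder the MultiIndex levels of one operand before pd.concat().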
Because dask.visualize() is intended for use with dask's own collections/classes, it tries to generate labels suitable for that use-case. These end up being uninformative (e.g. blank) for genno graphs, e.g.:

This could be addressed by some combination of:
Including:
- select
- message_ix.reporting.pyam → genno.compat.pyam

The following should work:

>>> "X:a-b" == Key("X", "ba")
True
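A hypothetical sketch of order-insensitive Key equality, not genno's actual implementation: a Key parses string operands and compares name plus sorted dimensions, so dimension order does not matter.

```python
class Key:
    def __init__(self, name, dims=()):
        self.name, self.dims = name, tuple(dims)

    def __str__(self):
        return f"{self.name}:{'-'.join(self.dims)}"

    def __eq__(self, other):
        # Accept a string like "X:a-b" as the other operand
        if isinstance(other, str):
            name, _, dims = other.partition(":")
            other = Key(name, dims.split("-") if dims else ())
        # Dimension order is irrelevant to equality
        return (self.name, sorted(self.dims)) == (other.name, sorted(other.dims))

print("X:a-b" == Key("X", "ba"))  # → True
```

The str == Key comparison works because str.__eq__ returns NotImplemented for a Key operand, so Python falls back to the reflected Key.__eq__.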
(Transferred from the discussion of iiasa/message-ix-models#25.)

The main question is whether genno's caching option covers the following two features, which are covered by an implementation I recently did using joblib.Memory.

pathlib.Path or a str. That means my function now looks like this: read_and_compute_some_data(file: Union[str, pathlib.Path], ...) -> pd.DataFrame. Here the joblib.Memory caching decorator would simply save a hash of the name of the input file. In a way that's a problem, since I'm not actually interested in the name of my data file but in its contents. For this I have created a small wrapper class InputFile for the filename, which stores a hash of the file's contents. As joblib uses pickle to serialize the data to binary, I have modified the way InputFile is serialized by considering only the contents of the file and not the name.

Minimum working example of caching the contents of input files using joblib.Memory:
from joblib import Memory
from pathlib import Path
import hashlib
import pandas as pd

# Set the directory of the cache in the parent folder of this file
memory = Memory(Path(__file__).parent / ".joblib_cache")


class InputFile:
    def __init__(self, file) -> None:
        self.file = file
        self.hash = self.calc_hash()

    def calc_hash(self) -> str:
        """Generate a hash from the contents of self.file.

        Returns
        -------
        str
            Hexadecimal representation of the file hash.

        Notes
        -----
        For details refer to
        https://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
        """
        with open(self.file, "rb") as f:
            file_hash = hashlib.md5()
            # Read the file in 8192-byte chunks, a multiple of MD5's
            # 64-byte block size
            while chunk := f.read(8192):
                file_hash.update(chunk)
        return file_hash.hexdigest()

    def __getstate__(self) -> dict:
        """Custom __getstate__ for use with joblib.Memory.cache.

        Returns
        -------
        dict
            __dict__ minus the file name.
        """
        # This 'tricks' pickle into only considering the hash of the contents
        # of the file, and not the filename itself, when checking if we have
        # cached results. Of course you could also change it to include the
        # filename as well. A good way might be to use both the filename (just
        # the name, not the entire path) and the hash of the contents. This
        # would possibly also make the cache independent of the user, as it
        # would no longer hash the directory structure where the file is
        # stored.
        state = self.__dict__.copy()
        # Remove the file from the state, as we are just interested in the
        # contents
        del state["file"]
        return state

    def __repr__(self) -> str:
        # This is just so that we get a nice representation of the class,
        # since joblib.Memory also writes a JSON file with the input
        # parameters of the function call
        return f"{self.__class__}: {self.__dict__}"


# Add the decorator to make read_from_file cache-able. Caching this particular
# function might be a bit pointless, but it illustrates the general layout.
@memory.cache
def read_from_file(input_file):
    return pd.read_csv(input_file.file)


if __name__ == "__main__":
    # In the current configuration, the second call to read_from_file would
    # hit the cache if the contents of file1.csv and file2.csv are the same,
    # even though they have different names.
    read_from_file(InputFile("file1.csv"))
    read_from_file(InputFile("file2.csv"))

Additionally, joblib.Memory also saves JSON files in which the input values are stored, which is a nice feature for bookkeeping.