pydata / xarray
N-D labeled arrays and datasets in Python
Home Page: https://xarray.dev
License: Apache License 2.0
This would allow you to have coordinates which are offsets from a time coordinate, which comes in handy when dealing with forecast data: the 'time' coordinate might be the forecast run time, and you then want a 'lead' coordinate which is an offset from the run time.
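A minimal sketch of the data layout this implies, using plain numpy/pandas (the names and values here are illustrative; the offset arithmetic between coordinates is the feature being requested):

import numpy as np
import pandas as pd

# Hypothetical forecast run times and lead-time offsets.
run_times = pd.date_range('2014-03-03', periods=4, freq='6H')
leads = pd.to_timedelta(np.arange(0, 49, 6), unit='h')

# The requested behavior: valid_time = run_time + lead, via broadcasting.
valid_times = run_times.values[:, None] + leads.values[None, :]
print(valid_times.shape)  # (4, 9)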
Fixed-width arrays are not terribly useful, so pandas's choice seems perfectly reasonable to me. This would let us use np.nan for missing values and borrow functions like pandas.isnull.
ds = xray.open_dataset('http://motherlode.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_0p5deg/files/GFS_Global_0p5deg_20140303_0000.grib2', decode_cf=False)
In [4]: ds['lat'].dtype
Out[4]: dtype('O')
This makes serialization fail.
Relevant discussion: #153
Dataset reduce methods (#131) suggested to me that it would be nice to support applying functions which map over all data arrays in a dataset. The signature of Dataset.apply could be modeled after GroupBy.apply and the implementation would be similar to #137 (but simpler).
For example, I should be able to write ds.apply(np.mean).
Note: It's still worth having #137 as a separate implementation because it can do some additional validation for dimensions and skip variables where the aggregation doesn't make sense.
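A minimal sketch of the idea as a free function, assuming a noncoordinates mapping of data arrays (the name that appears in xray's Dataset repr); the real method would presumably return a new Dataset rather than a plain dict:

def dataset_apply(ds, func, **kwargs):
    # Hypothetical sketch of Dataset.apply: map func over every
    # noncoordinate array in the dataset and collect the results.
    return {name: func(array, **kwargs)
            for name, array in ds.noncoordinates.items()}

Usage would then be dataset_apply(ds, np.mean), mirroring the proposed ds.apply(np.mean).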
This would move the xray interface closer to pandas, which is probably a good thing.
To promote separation of concerns, the indexing_mode logic from Variable (formerly XArray) should be moved into a separate array wrapper class. This will make things easier to test and also make it easier to write backends which support different types of array indexing.
This should be patterned off of pandas.DataFrame.dropna.
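A sketch of the usage this might imply, by analogy with pandas (the dim and how arguments are assumptions mirroring DataFrame.dropna, not a settled API):

# Hypothetical: drop each 'time' slice containing any missing value.
cleaned = ds.dropna(dim='time', how='any')
# Or drop a slice only if every value in it is missing.
cleaned = ds.dropna(dim='time', how='all')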
Note: this issue is mostly a TODO list for myself, but input would also be welcome.
In my latest PR #12, Dataset is a bit of a mess. It has redundant properties (variables, dimensions, indices), which make it unnecessarily complex to work with. Furthermore, the "datastore" abstraction is leaky, because not all datastores support all the operations (e.g., deleting or renaming variables).
I would like to simplify Dataset by restricting its contents to (ordered?) dictionaries of variables and attributes:
- The dimensions attribute still exists, but is read-only (in the public API) and always equivalent to the map from all variable dimensions to their shapes. It will no longer be possible to create a dimension without a variable (a default index like np.arange(data.shape[axis]) can be created instead).
- Coordinate variables are converted into pandas.Index objects when stored in a dataset. These are numpy.ndarray subclasses (or at least API compatible) but are immutable.
- variables will be read-only. To modify a dataset's variables, use item syntax on the dataset (i.e., dataset['foo'] = foo or del dataset['foo']), which will validate and perform appropriate conversions:
  - Coordinate variables are converted into pandas.Index objects (as noted above).
  - If a DatasetArray is assigned, its contents are merged into the dataset, with an exception raised if the merge cannot be done safely. (To assign only the Array object, use the appropriate property of the dataset array [1].)
  - If the value is not an Array or DatasetArray, it can be an iterable of the form (dimensions, data) or (dimensions, data, attributes), which is unpacked as the arguments to create a new Array.
- Remove the now-redundant methods set_dimension, set_variable, create_variable, etc.
- Open question: whether the attributes mapping should be called attributes or metadata.
Creating a new dataset should be as simple as:
import numpy as np
import pandas as pd
import xray
variables = {'y': ('y', ['a', 'b', 'c']),
             't': ('t', pd.date_range('2000-01-01', periods=5)),
             'foo': (('t', 'x', 'y'), np.random.randn(5, 10, 3))}
attributes = {'title': 'nonsense'}
# from scratch
dataset = xray.Dataset(variables, attributes)
# or equivalently:
dataset = xray.Dataset()
for k, v in variables.items():
    dataset[k] = v
dataset.attributes = attributes
[1] This property should probably be renamed from variable to array. Also, perhaps we should add values as an alias of data (to mirror pandas).
[2] Array will need to be updated so it always copies on write if the underlying data is not stored as a numpy ndarray.
I actually don't know if our current objects support pickle. But we should add tests and make sure this works, for a dead simple form of inter-process communication.
Todo:
Currently:
- np.datetime64 arrays or objects are converted to ns precision.
Arguably, we should convert everything to 'datetime64[ns]', if possible. This is the approach we now take for decoding NetCDF time variables (#126).
Reference discussion: #134. From @akleeman:
In #125 I went the route of forcing datetimes to be datetime64[ns]. This is probably part of a broader conversation, but doing so might save some future headaches. Of course ... it would also restrict us to nanosecond precision. Basically I feel like we should either force datetimes to be datetime64[ns] or make sure that operations on datetime objects preserve their type.
Probably worth getting this in and picking that conversation back up if needed. In which case, could you add tests which make sure variables with datetime objects are still datetime objects after concatenation? If those start getting cast to datetime64[ns] it'll start getting confusing for users.
Also worth considering: how should datetime64[us] datetimes be handled? Currently they get cast to [ns], which, since datetime objects do not, could get confusing.
It would be nice to allow DataArray objects without named dimensions (#116). But it doesn't make much sense to put arrays without named dimensions into a Dataset.
This suggests that we should change the current model for the internals of DataArray, which currently works by applying operations to an internal Dataset and keeping track of the name of the array of interest.
An alternate representation would use a fixed size list-like attribute coordinates to keep track of coordinates. Putting a DataArray without named dimensions into a Dataset will raise an error.
Positives:
Negatives:
- We could no longer write ds['foo'].groupby('bar') if "bar" is not a dimension in ds['foo'], unless we keep around some sort of reference to the dataset in the array. Perhaps this tradeoff is worth it: ds['foo'].groupby(ds['bar']) isn't so terrible.
CC @mrocklin, I brought this up briefly in the context of #116 during PyData.
Like pandas, we should wrap bottleneck to create fast moving window operations and missing value operations that can be applied to xray data arrays.
As xray is designed to make it straightforward to work with high dimensional arrays, it would be particularly convenient if bottleneck had fast functions for N > 3 dimensions (see pydata/bottleneck/issues/84) but we should wrap bottleneck regardless for functions like rolling_mean, rolling_sum, rolling_min, etc.
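A minimal sketch of the kind of wrapper this suggests; bn.move_mean is a real bottleneck function, while the dimension-name-to-axis mapping an xray wrapper would add is only hinted at here:

import bottleneck as bn
import numpy as np

def rolling_mean(values, window, axis=-1):
    # bottleneck operates on plain ndarrays; an xray wrapper would first
    # translate a dimension name into the corresponding axis number.
    return bn.move_mean(values, window=window, axis=axis)

data = np.random.randn(4, 10)
smoothed = rolling_mean(data, window=3)  # same shape, NaN-padded start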
It is actually possible to create variable-length strings in netCDF4-python if you set dtype=str (yes, the Python type) and fill the values as an object ndarray. We should support this sort of serialization in addition to the serialization as higher dimensional arrays of character strings.
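A minimal sketch of the netCDF4-python feature being described, using only documented netCDF4 API (the file and variable names are arbitrary):

import netCDF4
import numpy as np

nc = netCDF4.Dataset('strings.nc', 'w')
nc.createDimension('x', 3)
# dtype=str (the Python type) creates a variable-length string variable.
v = nc.createVariable('labels', str, ('x',))
v[:] = np.array(['short', 'a bit longer', 'the longest string'], dtype=object)
nc.close()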
Add summary methods to the Dataset object. For example, it would be great if you could summarize an entire dataset in a single line.
(1) Mean of all variables in dataset.
mean_ds = ds.mean()
(2) Mean of all variables in dataset along a dimension:
time_mean_ds = ds.mean(dim='time')
In the case where a dimension is specified and there are variables that don't use that dimension, I'd imagine you would just pass that variable through unchanged.
Related to #122.
For users of pandas, the xray interface would be more obvious if we referred to what we currently call "coordinates" as "indices."
This would entail renaming the coordinates property to indices, xray.Coordinate to xray.Index, and the xray.Coordinate.as_index property to as_pandas_index (all with deprecation warnings).
Possible downsides: the potential for confusion between xray.Index and pandas.Index:
- An xray.Index would be an xray.Variable object, and thus is dimension aware and has attributes.
- pandas will not accept xray.Index objects as indices, nor will it properly convert an xray.Index into a pandas.Index.
It would be nice to create in-memory netCDF4 objects. This is difficult with the netCDF4 library, which requires a filename (possibly one that it can mmap, but probably not, based on its opendap documentation).
One solution is to call os.mkfifo (on *nix) or its Windows equivalent (if the library is available), using tempfile.mktemp as the path. Pass this to the netCDF4 object. dumps() is equivalent to calling sync, close, reading from the pipe, then deleting the result.
We may actually be able to use the same functionality in reverse for creating a netCDF4 object from a StringIO.
It shouldn't be necessary to put arrays in a Dataset to make a DataArray.
See also: #85 (comment)
Apparently CDAT has a number of useful modules for working with weather and climate data, especially for things like computing climatologies (related: #112). There's no point in duplicating that work in xray, of course (also, climatologies may be too domain specific for xray), so we should make it possible to use both xray and CDAT interchangeably.
Unfortunately, I haven't used CDAT, so it's not obvious to me what the right interface is. Also, CDAT seems to be somewhat difficult (impossible?) to install as a Python library, so it may be hard to set up automated testing.
It should be possible to easily convert back and forth between Datasets and DataArrays that consist of each variable in a Dataset stacked along a "variable" dimension.
The implementation for to_array should be mostly equivalent to DataArray.from_series(ds.to_dataframe().stack()), although for performance reasons we probably don't actually want to convert everything into intermediate representations as pandas objects.
Similarly, to_dataset(dimension) would be roughly equivalent to Dataset.from_dataframe(array.to_series().unstack(dimension)).
One of the beautiful things about the netCDF data model is that the variables can be read individually. I'm suggesting adding a variables keyword (or something along those lines) to the open_dataset function to support selecting one or more or all variables in a file. This will allow for faster reads and smaller memory usage when the full set of variables is not needed.
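A sketch of the proposed usage (the variables keyword is the suggestion being made here, not an existing argument of open_dataset):

# Hypothetical: read only the listed variables from the file.
ds = xray.open_dataset('forecast.nc', variables=['t2m', 'precip'])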
For example, even after decoding a character array or indexing a variable, the chunksize is not updated. This means that netCDF4 reports an error when trying to save such a file.
Perhaps we should add some sort of sanity check to chunksize when writing a dataset? Possibly issuing a warning?
Thanks @ToddSmall for reporting this issue.
Should match the pandas function: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmax.html
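For reference, pandas' idxmax returns the index label of the maximum rather than its integer position; a small sketch of the analogous computation the xray method would need to perform:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0], index=pd.Index(['a', 'b', 'c'], name='x'))
print(s.idxmax())  # 'b' -- the label, not the position 1

# The xray analogue: map argmax positions back to coordinate labels.
values = np.array([1.0, 3.0, 2.0])
labels = np.array(['a', 'b', 'c'])
print(labels[values.argmax()])  # 'b'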
It is very convenient to be able to write array.groupby(some_var).apply(some_func). It would be nice if apply also worked for GroupBy objects created by dataset.groupby(some_var) in the same way.
Now that we have a working Dataset.concat method, this should be easy to implement. Basically, it's just syntactic sugar for:
Dataset.concat(some_func(group) for _, group in dataset.groupby(some_var))
For arrays, we also have some logic to guarantee that array dimensions are all ordered in the same way that they were ordered in the original arrays, even if we did groupby with squeeze=True.
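Spelled out as a helper function, a minimal sketch of that sugar (assuming the Dataset.concat and groupby behavior described above; error handling and dimension reordering are omitted):

def dataset_groupby_apply(dataset, some_var, some_func):
    # Hypothetical: apply some_func to each group, then glue the
    # results back together along the grouped dimension.
    return type(dataset).concat(
        some_func(group) for _, group in dataset.groupby(some_var))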
Trying to re-align slocum with xray... it seems that deleting a variable from a Dataset object erases the dimensions dictionary:
In [47]: fcst.dimensions
Out[47]: Frozen(OrderedDict([(u'lat', 9), (u'lon', 9), (u'height_above_ground4', 1), (u'time', 65)]))
In [48]: del fcst['height_above_ground4']
In [49]: fcst.dimensions
Out[49]: Frozen(OrderedDict())
Also, the removed variable still appears as a dimension in the remaining variables' coordinate systems:
In [57]: fcst.variables
Out[57]: Frozen(_VariablesDict([(u'lat', <xray.XArray (lat: 9): object>), (u'lon', <xray.XArray (lon: 9): object>), (u'time', <xray.XArray (time: 65): datetime64[ns]>), (u'u-component_of_wind_height_above_ground', <xray.XArray (time: 65, height_above_ground4: 1, lat: 9, lon: 9): float32>), (u'v-component_of_wind_height_above_ground', <xray.XArray (time: 65, height_above_ground4: 1, lat: 9, lon: 9): float32>)]))
Is this intentional? I'm using delitem as a replacement for the old polyglot's Dataset.squeeze() - perhaps that's abuse?
My analysis:
Positives:
- dict is more "pythonic":
  - dict is faster.
  - dict comes with syntax built into the language: {}.
  - repr() on a dict is more readable: {'x': 0, 'y': 1} vs OrderedDict([('x', 0), ('y', 1)]).
- dict would make it simpler to implement new features in xray, because we don't need to spend time thinking about the order of items. It would also let us simplify some existing features.
- dict would better align our internal data model with this fact.
Neutral:
Negatives:
- Unless we add an encoding attribute for datasets (which we'd rather avoid unless absolutely necessary), this is the only sane way to keep track of variable/attribute order.
Your thoughts?
I'd like to rename the Dataset/DataArray methods labeled and indexed, so that they are more obviously variants on a theme, similar to how pandas distinguishes between the methods .loc and .iloc (and .at/.iat, etc.). Some options include:
1. Rename indexed to ilabeled.
2. Rename indexed/labeled to isel/sel.
I like option 2 (particularly because it's shorter), but to avoid confusion with the select method, we would need to also rename select/unselect to something else. I would suggest select_vars and drop_vars.
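To illustrate option 2, a sketch of the renamed methods in use (hypothetical names at the time of this proposal):

first = ds.isel(time=0)           # select by integer position (today: indexed)
jan1 = ds.sel(time='2000-01-01')  # select by coordinate label (today: labeled)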
Right now, it tries to convert a string argument into a list of characters, which is almost certainly not the right behavior.
In your main README file you ask the questions "Why not Pandas?" and "Why not Iris?" Another question you might want to ask is "Why not CDAT?" It was written quite a long time ago now but is still used extensively by UV-CDAT and thus by the ESGF. In particular, the cdms2, cdutil, genutil and MV2 libraries within CDAT do some of what xray does.
Example from @akleeman:
>>> ahps['time'].values[0]
numpy.datetime64('2013-07-01T05:00:00.000000000-0700')
>>> ahps['time'][0].values
array(1372680000000000000L, dtype='datetime64[ns]')
This may be something broken in numpy but we should be able to fix it on our end, too.
This is unlikely to be difficult since all of our dependencies support Python 3, but it will definitely take some work.
Thanks @takluyver for pointing this out in #113. We really should have resolved this some time ago.
Perhaps ds.load_into_memory() or ds.cache()?
The obvious libraries to wrap are pytables or h5py:
http://www.pytables.org
http://h5py.org/
Both provide at least some support for in-memory operations (though I'm not sure if they can pass around HDF5 file objects without dumping them to disk).
From a cursory look at the documentation for both projects, h5py appears to offer a simpler API that would be easier to map to our existing data model.
The tutorial provides an example of how to use xray's virtual_variables. The same functionality is not available from a Dataset object created by open_dataset.
Tutorial:
In [135]:
foo_values = np.random.RandomState(0).rand(3, 4)
times = pd.date_range('2000-01-01', periods=3)
ds = xray.Dataset({'time': ('time', times),
                   'foo': (['time', 'space'], foo_values)})
ds['time.dayofyear']
Out[135]:
<xray.DataArray 'time.dayofyear' (time: 3)>
array([1, 2, 3], dtype=int32)
Attributes:
Empty
However, reading a time coordinate/variable from a netCDF4 file and applying the same logic raises an error:
In [136]:
ds = xray.open_dataset('sample_for_xray.nc')
ds['time']
Out[136]:
<xray.DataArray 'time' (time: 4)>
array([1979-09-16 12:00:00, 1979-10-17 00:00:00, 1979-11-16 12:00:00,
1979-12-17 00:00:00], dtype=object)
Attributes:
dimensions: 1
long_name: time
type_preferred: int
In [137]:
ds['time.dayofyear']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-137-bfe1ae778782> in <module>()
----> 1 ds['time.dayofyear']
/Users/jhamman/anaconda/lib/python2.7/site-packages/xray-0.2.0.dev_cc5e1b2-py2.7.egg/xray/dataset.pyc in __getitem__(self, key)
408 """Access the given variable name in this dataset as a `DataArray`.
409 """
--> 410 return data_array.DataArray._constructor(self, key)
411
412 def __setitem__(self, key, value):
/Users/jhamman/anaconda/lib/python2.7/site-packages/xray-0.2.0.dev_cc5e1b2-py2.7.egg/xray/data_array.pyc in _constructor(cls, dataset, name)
95 if name not in dataset and name not in dataset.virtual_variables:
96 raise ValueError('name %r must be a variable in dataset %r'
---> 97 % (name, dataset))
98 obj._dataset = dataset
99 obj._name = name
ValueError: name 'time.dayofyear' must be a variable in dataset <xray.Dataset>
Dimensions: (time: 4, x: 275, y: 205)
Coordinates:
time X
x X
y X
Noncoordinates:
Wind 0 2 1
Attributes:
sample data for xray from RASM project
Is there a reason that the virtual time variables are only available if the dataset is created from a pandas date_range? Lastly, this could be related to #118.
One option: add a batch_apply method.
This would be a shortcut for split-apply-combine with groupby/apply if the grouping over a dimension is only being done for efficiency reasons.
This function should take several parameters:
- a dimension to group over.
- a batchsize to group over on this dimension (defaulting to 1).
- a func to apply to each group.
At first, this function would be useful just to avoid memory issues. Eventually, it would be nice to add an n_jobs parameter which would automatically dispatch to multiprocessing/joblib. We would need to get pickling (issue #24) working first to be able to do this.
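A rough sketch of such a function, assuming the parameters listed above plus the positional indexed method discussed elsewhere in this document (its keyword-argument signature here is a guess):

def batch_apply(dataset, dimension, func, batchsize=1):
    # Hypothetical: slice `dimension` into slabs of `batchsize`,
    # apply `func` to each slab, and concatenate the results.
    n = dataset.dimensions[dimension]
    slabs = (dataset.indexed(**{dimension: slice(i, i + batchsize)})
             for i in range(0, n, batchsize))
    return type(dataset).concat(func(slab) for slab in slabs)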
It should look in each array for the dimension values if it is provided as a string. Currently, it does not.
At PyData SV, @mrocklin suggested that by default, array broadcasting should fall back on numpy's shape based broadcasting. This would also simplify directly constructing DataArray objects (#115).
The trick will be to make this work with xray's internals, which currently assume that dimensions are always named by strings.
I'm thinking now that instead of trying to preserve attributes on variables, it would be preferable to drop all attributes (not just conflicting ones) when doing mathematical operations. This would keep things a bit simpler/easier to understand. Thoughts?
This may just be a documentation issue, but the summary apply and combine methods for the Dataset.GroupBy object seem to be missing.
In [146]:
foo_values = np.random.RandomState(0).rand(3, 4)
times = pd.date_range('2000-01-01', periods=3)
ds = xray.Dataset({'time': ('time', times),
                   'foo': (['time', 'space'], foo_values)})
ds.groupby('time').mean()  # replace time with time.month after #121 is addressed
# ds.groupby('time').apply(np.mean)  # also errors here
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-146-eec1e73cff23> in <module>()
3 ds = xray.Dataset({'time': ('time', times),
4 'foo': (['time', 'space'], foo_values)})
----> 5 ds.groupby('time').mean()
6 ds.groupby('time').apply(np.mean)
AttributeError: 'DatasetGroupBy' object has no attribute 'mean'
Adding this functionality, if not already present, seems like a really nice addition to the package.
This will make np.asscalar work.
CC @ToddSmall
Currently, variable attributes are checked for equality before allowing for a merge, via a call to xarray_equal. It should be possible to merge datasets even if some of the variable metadata disagrees (conflicting attributes should be dropped). This is already the behavior for global attributes.
The right design of this feature should probably include some optional argument to Dataset.merge indicating how strict we want the merge to be. I can see at least three versions that could be useful:
We can argue about which of these should be the default option. My inclination is to be as flexible as possible by using 1 or 2 in most cases.
Reduction operations currently drop all Variable and Dataset attrs. I'm proposing adding a keyword to these methods to allow copying of the original Variable or Dataset attrs.
The default value of the keep_attrs keyword would be False.
For example:
new = ds.mean(keep_attrs=True)
returns new with all the Variable and Dataset attrs that ds contained.
see this related issue: pandas-dev/pandas#5487
this is actually not hard to do, and might allow you to push some of your backends to pandas.
Both issues https://github.com/akleeman/xray/pull/20 and https://github.com/akleeman/xray/pull/21 are dealing with similar conceptual issues. Namely, sometimes the user may want fine control over how a dataset is stored (integer packing, time units and calendars, ...). Taking time as an example, the current model interprets the units and calendar in order to create a DatetimeIndex, but then throws out those attributes, so that if the dataset were re-serialized the units may not be preserved.
One proposed solution to this issue is to include a distinct set of encoding attributes that would hold things like 'scale_factor' and 'add_offset', allowing something like this:
ds['time'] = ('time', pd.date_range('1999-01-05', periods=10))
ds['time'].encoding['units'] = 'days since 1989-08-19'
ds.dump('netcdf.nc')
> ncdump -h
...
int time(time) ;
time:units = "days since 1989-08-19" ;
...
The encoding attributes could also handle masking, scaling, compression etc ...
This would make it less ambiguous that this is the preferred way to access and manipulate data in xray.
On a related note, I would like to make XArray more of an internal implementation detail that we only expose to advanced users.
I'm noticing a problem parsing the time variable for at least the noleap calendar for a properly formatted time dimension. Any thoughts on why this is?
ncdump -c -t sample_for_xray.nc
netcdf sample_for_xray {
dimensions:
time = UNLIMITED ; // (4 currently)
y = 205 ;
x = 275 ;
variables:
double Wind(time, y, x) ;
Wind:units = "m/s" ;
Wind:long_name = "Wind speed" ;
Wind:coordinates = "latitude longitude" ;
Wind:dimensions = "2" ;
Wind:type_preferred = "double" ;
Wind:time_rep = "instantaneous" ;
Wind:_FillValue = 9.96920996838687e+36 ;
double time(time) ;
time:calendar = "noleap" ;
time:dimensions = "1" ;
time:long_name = "time" ;
time:type_preferred = "int" ;
time:units = "days since 0001-01-01 0:0:0" ;
// global attributes:
...
data:
time = "1979-09-16 12", "1979-10-17", "1979-11-16 12", "1979-12-17" ;
ds = xray.open_dataset('sample_for_xray.nc')
print ds['time']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-46-65c280e7a283> in <module>()
1 ds = xray.open_dataset('sample_for_xray.nc')
----> 2 print ds['time']
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/common.pyc in __repr__(self)
40
41 def __repr__(self):
---> 42 return array_repr(self)
43
44 def _iter(self):
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/common.pyc in array_repr(arr)
122 summary = ['<xray.%s %s(%s)>'% (type(arr).__name__, name_str, dim_summary)]
123 if arr.size < 1e5 or arr._in_memory():
--> 124 summary.append(repr(arr.values))
125 else:
126 summary.append('[%s values with dtype=%s]' % (arr.size, arr.dtype))
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/data_array.pyc in values(self)
147 def values(self):
148 """The variables's data as a numpy.ndarray"""
--> 149 return self.variable.values
150
151 @values.setter
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/variable.pyc in values(self)
217 def values(self):
218 """The variable's data as a numpy.ndarray"""
--> 219 return utils.as_array_or_item(self._data_cached())
220
221 @values.setter
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/utils.pyc in as_array_or_item(values, dtype)
56 # converted into an integer instead :(
57 return values
---> 58 values = as_safe_array(values, dtype=dtype)
59 if values.ndim == 0 and values.dtype.kind == 'O':
60 # unpack 0d object arrays to be consistent with numpy
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/utils.pyc in as_safe_array(values, dtype)
40 """Like np.asarray, but convert all datetime64 arrays to ns precision
41 """
---> 42 values = np.asarray(values, dtype=dtype)
43 if values.dtype.kind == 'M':
44 # np.datetime64
/home/jhamman/anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
458
459 """
--> 460 return array(a, dtype, copy=False, order=order)
461
462 def asanyarray(a, dtype=None, order=None):
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/variable.pyc in __array__(self, dtype)
121 if dtype is None:
122 dtype = self.dtype
--> 123 return self.array.values.astype(dtype)
124
125 def __getitem__(self, key):
TypeError: Cannot cast datetime.date object from metadata [D] to [ns] according to the rule 'same_kind'
This file is available here: ftp://ftp.hydro.washington.edu/pub/jhamman/sample_for_xray.nc
I noticed that pytables, h5py and blz all use the shorter attrs instead of attributes as the name of the attributes mapping. What do you guys think about switching over?