pydata / xarray
N-D labeled arrays and datasets in Python
Home Page: https://xarray.dev
License: Apache License 2.0
This would allow you to have coordinates which are offsets from a time coordinate, which comes in handy when dealing with forecast data: the 'time' coordinate might be the forecast run time, and you then want a 'lead' coordinate which is an offset from the run time.
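A minimal sketch of the data layout this implies, using plain numpy/pandas (the names and values here are illustrative; the offset arithmetic between coordinates is the feature being requested):

import numpy as np
import pandas as pd

# Hypothetical forecast run times and lead-time offsets.
run_times = pd.date_range('2014-03-03', periods=4, freq='6H')
leads = pd.to_timedelta(np.arange(0, 49, 6), unit='h')

# The requested behavior: valid_time = run_time + lead, via broadcasting.
valid_times = run_times.values[:, None] + leads.values[None, :]
print(valid_times.shape)  # (4, 9)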
Fixed-width arrays are not terribly useful, so pandas's choice seems perfectly reasonable to me. This would let us use np.nan for missing values and borrow functions like pandas.isnull.
ds = xray.open_dataset('http://motherlode.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_0p5deg/files/GFS_Global_0p5deg_20140303_0000.grib2', decode_cf=False)
In [4]: ds['lat'].dtype
Out[4]: dtype('O')
This makes serialization fail.
Relevant discussion: #153
Dataset reduce methods (#131) suggested to me that it would be nice to support applying functions which map over all data arrays in a dataset. The signature of Dataset.apply could be modeled after GroupBy.apply and the implementation would be similar to #137 (but simpler).
For example, I should be able to write ds.apply(np.mean).
Note: It's still worth having #137 as a separate implementation because it can do some additional validation for dimensions and skip variables where the aggregation doesn't make sense.
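A minimal sketch of the idea as a free function, assuming a noncoordinates mapping of data arrays (the name that appears in xray's Dataset repr); the real method would presumably return a new Dataset rather than a plain dict:

def dataset_apply(ds, func, **kwargs):
    # Hypothetical sketch of Dataset.apply: map func over every
    # noncoordinate array in the dataset and collect the results.
    return {name: func(array, **kwargs)
            for name, array in ds.noncoordinates.items()}

Usage would then be dataset_apply(ds, np.mean), mirroring the proposed ds.apply(np.mean).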
This would move the xray interface closer to pandas, which is probably a good thing.
To promote separation of concerns, the indexing_mode logic from Variable (formerly XArray) should be moved into a separate array wrapper class. This will make things easier to test and also make it easier to write backends which support different types of array indexing.
This should be patterned off of pandas.DataFrame.dropna.
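A sketch of the usage this might imply, by analogy with pandas (the dim and how arguments are assumptions mirroring DataFrame.dropna, not a settled API):

# Hypothetical: drop each 'time' slice containing any missing value.
cleaned = ds.dropna(dim='time', how='any')
# Or drop a slice only if every value in it is missing.
cleaned = ds.dropna(dim='time', how='all')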
Note: this issue is mostly a TODO list for myself, but input would also be welcome.
In my latest PR #12, Dataset is a bit of a mess. It has redundant properties (variables, dimensions, indices), which make it unnecessarily complex to work with. Furthermore, the "datastore" abstraction is leaky, because not all datastores support all the operations (e.g., deleting or renaming variables).
I would like to simplify Dataset by restricting its contents to (ordered?) dictionaries of variables and attributes:
- The dimensions attribute still exists, but is read-only (in the public API) and always equivalent to the map from all variable dimensions to their shapes. It will no longer be possible to create a dimension without a variable (a default index like np.arange(data.shape[axis]) can be created instead).
- Coordinate variables are converted into pandas.Index objects when stored in a dataset. These are numpy.ndarray subclasses (or at least API compatible) but are immutable.
- variables will be read-only. To modify a dataset's variables, use item syntax on the dataset (i.e., dataset['foo'] = foo or del dataset['foo']), which will validate and perform appropriate conversions:
  - Coordinate variables are converted into pandas.Index objects (as noted above).
  - If a DatasetArray is assigned, its contents are merged into the dataset, with an exception raised if the merge cannot be done safely. (To assign only the Array object, use the appropriate property of the dataset array [1].)
  - If the value is not an Array or DatasetArray, it can be an iterable of the form (dimensions, data) or (dimensions, data, attributes), which is unpacked as the arguments to create a new Array.
- Remove the now-redundant methods set_dimension, set_variable, create_variable, etc.
- Open question: whether the attributes mapping should be called attributes or metadata.
Creating a new dataset should be as simple as:
import numpy as np
import pandas as pd
import xray
variables = {'y': ('y', ['a', 'b', 'c']),
             't': ('t', pd.date_range('2000-01-01', periods=5)),
             'foo': (('t', 'x', 'y'), np.random.randn(5, 10, 3))}
attributes = {'title': 'nonsense'}
# from scratch
dataset = xray.Dataset(variables, attributes)
# or equivalently:
dataset = xray.Dataset()
for k, v in variables.items():
    dataset[k] = v
dataset.attributes = attributes
[1] This property should probably be renamed from variable to array. Also, perhaps we should add values as an alias of data (to mirror pandas).
[2] Array will need to be updated so it always copies on write if the underlying data is not stored as a numpy ndarray.
I actually don't know if our current objects support pickle. But we should add tests and make sure this works, for a dead simple form of inter-process communication.
Todo:
Currently:
- np.datetime64 arrays or objects are converted to ns precision.
Arguably, we should convert everything to 'datetime64[ns]', if possible. This is the approach we now take for decoding NetCDF time variables (#126).
Reference discussion: #134. From @akleeman:
In #125 I went the route of forcing datetimes to be datetime64[ns]. This is probably part of a broader conversation, but doing so might save some future headaches. Of course ... it would also restrict us to nanosecond precision. Basically I feel like we should either force datetimes to be datetime64[ns] or make sure that operations on datetime objects preserve their type.
Probably worth getting this in and picking that conversation back up if needed. In which case, could you add tests which make sure variables with datetime objects are still datetime objects after concatenation? If those start getting cast to datetime64[ns] it'll start getting confusing for users.
Also worth considering: how should datetime64[us] datetimes be handled? Currently they get cast to [ns], which, since datetime objects do not, could get confusing.
It would be nice to allow DataArray objects without named dimensions (#116). But it doesn't make much sense to put arrays without named dimensions into a Dataset.
This suggests that we should change the current model for the internals of DataArray, which currently works by applying operations to an internal Dataset and keeping track of the name of the array of interest.
An alternate representation would use a fixed size list-like attribute coordinates to keep track of coordinates. Putting a DataArray without named dimensions into a Dataset will raise an error.
Positives:
Negatives:
- We could no longer write ds['foo'].groupby('bar') if "bar" is not a dimension in ds['foo'], unless we keep around some sort of reference to the dataset in the array. Perhaps this tradeoff is worth it: ds['foo'].groupby(ds['bar']) isn't so terrible.
CC @mrocklin, I brought this up briefly in the context of #116 during PyData.
Like pandas, we should wrap bottleneck to create fast moving window operations and missing value operations that can be applied to xray data arrays.
As xray is designed to make it straightforward to work with high dimensional arrays, it would be particularly convenient if bottleneck had fast functions for N > 3 dimensions (see pydata/bottleneck/issues/84) but we should wrap bottleneck regardless for functions like rolling_mean, rolling_sum, rolling_min, etc.
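A minimal sketch of the kind of wrapper this suggests; bn.move_mean is a real bottleneck function, while the dimension-name-to-axis mapping an xray wrapper would add is only hinted at here:

import bottleneck as bn
import numpy as np

def rolling_mean(values, window, axis=-1):
    # bottleneck operates on plain ndarrays; an xray wrapper would first
    # translate a dimension name into the corresponding axis number.
    return bn.move_mean(values, window=window, axis=axis)

data = np.random.randn(4, 10)
smoothed = rolling_mean(data, window=3)  # same shape, NaN-padded start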
It is actually possible to create variable-length strings in netCDF4-python if you set dtype=str (yes, the Python type) and fill the values as an object ndarray. We should support this sort of serialization in addition to the serialization as higher dimensional arrays of character strings.
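A minimal sketch of the netCDF4-python feature being described, using only documented netCDF4 API (the file and variable names are arbitrary):

import netCDF4
import numpy as np

nc = netCDF4.Dataset('strings.nc', 'w')
nc.createDimension('x', 3)
# dtype=str (the Python type) creates a variable-length string variable.
v = nc.createVariable('labels', str, ('x',))
v[:] = np.array(['short', 'a bit longer', 'the longest string'], dtype=object)
nc.close()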
Add summary methods to the Dataset object. For example, it would be great if you could summarize an entire dataset in a single line.
(1) Mean of all variables in dataset.
mean_ds = ds.mean()
(2) Mean of all variables in dataset along a dimension:
time_mean_ds = ds.mean(dim='time')
In the case where a dimension is specified and there are variables that don't use that dimension, I'd imagine you would just pass that variable through unchanged.
Related to #122.
For users of pandas, the xray interface would be more obvious if we referred to what we currently call "coordinates" as "indices."
This would entail renaming the coordinates property to indices, xray.Coordinate to xray.Index, and the xray.Coordinate.as_index property to as_pandas_index (all with deprecation warnings).
Possible downsides: the potential for confusion between xray.Index and pandas.Index:
- An xray.Index would be an xray.Variable object, and thus is dimension aware and has attributes.
- pandas will not accept xray.Index objects as indices, nor will it properly convert an xray.Index into a pandas.Index.
It would be nice to create in-memory netCDF4 objects. This is difficult with the netCDF4 library, which requires a filename (possibly one that it can mmap, but probably not, based on its opendap documentation).
One solution is to call os.mkfifo (on *nix) or its Windows equivalent (if the library is available), using tempfile.mktemp as the path. Pass this to the netCDF4 object. dumps() is equivalent to calling sync, close, reading from the pipe, then deleting the result.
We may actually be able to use the same functionality in reverse for creating a netCDF4 object from a StringIO.
It shouldn't be necessary to put arrays in a Dataset to make a DataArray.
See also: #85 (comment)
Apparently CDAT has a number of useful modules for working with weather and climate data, especially for things like computing climatologies (related: #112). There's no point in duplicating that work in xray, of course (also, climatologies may be too domain specific for xray), so we should make it possible to use both xray and CDAT interchangeably.
Unfortunately, I haven't used CDAT, so it's not obvious to me what the right interface is. Also, CDAT seems to be somewhat difficult (impossible?) to install as a Python library, so it may be hard to set up automated testing.
It should be possible to easily convert back and forth between Datasets and DataArrays that consist of each variable in a Dataset stacked along a "variable" dimension.
The implementation for to_array should be mostly equivalent to DataArray.from_series(ds.to_dataframe().stack()), although for performance reasons we probably don't actually want to convert everything into intermediate representations as pandas objects.
Similarly, to_dataset(dimension) would be roughly equivalent to Dataset.from_dataframe(array.to_series().unstack(dimension)).
One of the beautiful things about the netCDF data model is that the variables can be read individually. I'm suggesting adding a variables keyword (or something along those lines) to the open_dataset function to support selecting one or more or all variables in a file. This will allow for faster reads and smaller memory usage when the full set of variables is not needed.
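A sketch of the proposed usage (the variables keyword is the suggestion being made here, not an existing argument of open_dataset):

# Hypothetical: read only the listed variables from the file.
ds = xray.open_dataset('forecast.nc', variables=['t2m', 'precip'])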
For example, even after decoding a character array or indexing a variable, the chunksize is not updated. This means that netCDF4 reports an error when trying to save such a file.
Perhaps we should add some sort of sanity check to chunksize when writing a dataset? Possibly issuing a warning?
Thanks @ToddSmall for reporting this issue.
Should match the pandas function: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmax.html
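For reference, pandas' idxmax returns the index label of the maximum rather than its integer position; a small sketch of the analogous computation the xray method would need to perform:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0], index=pd.Index(['a', 'b', 'c'], name='x'))
print(s.idxmax())  # 'b' -- the label, not the position 1

# The xray analogue: map argmax positions back to coordinate labels.
values = np.array([1.0, 3.0, 2.0])
labels = np.array(['a', 'b', 'c'])
print(labels[values.argmax()])  # 'b'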
It is very convenient to be able to write array.groupby(some_var).apply(some_func). It would be nice if apply also worked for GroupBy objects created by dataset.groupby(some_var) in the same way.
Now that we have a working Dataset.concat method, this should be easy to implement. Basically, it's just syntactic sugar for:
Dataset.concat(some_func(group) for _, group in dataset.groupby(some_var))
For arrays, we also have some logic to guarantee that array dimensions are all ordered in the same way that they were ordered in the original arrays, even if we did groupby with squeeze=True.
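Spelled out as a helper function, a minimal sketch of that sugar (assuming the Dataset.concat and groupby behavior described above; error handling and dimension reordering are omitted):

def dataset_groupby_apply(dataset, some_var, some_func):
    # Hypothetical: apply some_func to each group, then glue the
    # results back together along the grouped dimension.
    return type(dataset).concat(
        some_func(group) for _, group in dataset.groupby(some_var))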
Trying to re-align slocum with xray... it seems that deleting a variable from a Dataset object erases the dimensions dictionary:
In [47]: fcst.dimensions
Out[47]: Frozen(OrderedDict([(u'lat', 9), (u'lon', 9), (u'height_above_ground4', 1), (u'time', 65)]))
In [48]: del fcst['height_above_ground4']
In [49]: fcst.dimensions
Out[49]: Frozen(OrderedDict())
Also, the removed variable still appears as a dimension in the remaining variables' coordinate systems:
In [57]: fcst.variables
Out[57]: Frozen(_VariablesDict([(u'lat', <xray.XArray (lat: 9): object>), (u'lon', <xray.XArray (lon: 9): object>), (u'time', <xray.XArray (time: 65): datetime64[ns]>), (u'u-component_of_wind_height_above_ground', <xray.XArray (time: 65, height_above_ground4: 1, lat: 9, lon: 9): float32>), (u'v-component_of_wind_height_above_ground', <xray.XArray (time: 65, height_above_ground4: 1, lat: 9, lon: 9): float32>)]))
Is this intentional? I'm using delitem as a replacement for the old polyglot's Dataset.squeeze() - perhaps that's abuse?
My analysis:
Positives:
- dict is more "pythonic":
  - dict is faster.
  - dict comes with syntax built into the language: {}.
  - repr() on a dict is more readable: {'x': 0, 'y': 1} vs OrderedDict([('x', 0), ('y', 1)]).
- dict would make it simpler to implement new features in xray, because we don't need to spend time thinking about the order of items. It would also let us simplify some existing features.
- dict would better align our internal data model with this fact.
Neutral:
Negatives:
- Unless we add an encoding attribute for datasets (which we'd rather avoid unless absolutely necessary), this is the only sane way to keep track of variable/attribute order.
Your thoughts?
I'd like to rename the Dataset/DataArray methods labeled and indexed, so that they are more obviously variants on a theme, similar to how pandas distinguishes between the methods .loc and .iloc (and .at/.iat, etc.). Some options include:
1. Rename indexed to ilabeled.
2. Rename indexed/labeled to isel/sel.
I like option 2 (particularly because it's shorter), but to avoid confusion with the select method, we would need to also rename select/unselect to something else. I would suggest select_vars and drop_vars.
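To illustrate option 2, a sketch of the renamed methods in use (hypothetical names at the time of this proposal):

first = ds.isel(time=0)           # select by integer position (today: indexed)
jan1 = ds.sel(time='2000-01-01')  # select by coordinate label (today: labeled)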
Right now, it tries to convert a string argument into a list of characters, which is almost certainly not the right behavior.
In your main README file you ask the questions "Why not Pandas?" and "Why not Iris?" Another question you might want to ask is "Why not CDAT?" It was written quite a long time ago now but is still used extensively by UV-CDAT and thus by the ESGF. In particular, the cdms2, cdutil, genutil and MV2 libraries within CDAT do some of what xray does.
Example from @akleeman:
>>> ahps['time'].values[0]
numpy.datetime64('2013-07-01T05:00:00.000000000-0700')
>>> ahps['time'][0].values
array(1372680000000000000L, dtype='datetime64[ns]')
This may be something broken in numpy but we should be able to fix it on our end, too.
This is unlikely to be difficult since all of our dependencies support Python 3, but it will definitely take some work.
Thanks @takluyver for pointing this out in #113. We really should have resolved this some time ago.
Perhaps ds.load_into_memory() or ds.cache()?
The obvious libraries to wrap are pytables or h5py:
http://www.pytables.org
http://h5py.org/
Both provide at least some support for in-memory operations (though I'm not sure if they can pass around HDF5 file objects without dumping them to disk).
From a cursory look at the documentation for both projects, h5py appears to offer a simpler API that would be easier to map to our existing data model.
The tutorial provides an example of how to use xray's virtual_variables. The same functionality is not available from a Dataset object created by open_dataset.
Tutorial:
In [135]:
foo_values = np.random.RandomState(0).rand(3, 4)
times = pd.date_range('2000-01-01', periods=3)
ds = xray.Dataset({'time': ('time', times),
                   'foo': (['time', 'space'], foo_values)})
ds['time.dayofyear']
Out[135]:
<xray.DataArray 'time.dayofyear' (time: 3)>
array([1, 2, 3], dtype=int32)
Attributes:
Empty
However, reading a time coordinate/variable from a netCDF4 file and applying the same logic raises an error:
In [136]:
ds = xray.open_dataset('sample_for_xray.nc')
ds['time']
Out[136]:
<xray.DataArray 'time' (time: 4)>
array([1979-09-16 12:00:00, 1979-10-17 00:00:00, 1979-11-16 12:00:00,
1979-12-17 00:00:00], dtype=object)
Attributes:
dimensions: 1
long_name: time
type_preferred: int
In [137]:
ds['time.dayofyear']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-137-bfe1ae778782> in <module>()
----> 1 ds['time.dayofyear']
/Users/jhamman/anaconda/lib/python2.7/site-packages/xray-0.2.0.dev_cc5e1b2-py2.7.egg/xray/dataset.pyc in __getitem__(self, key)
408 """Access the given variable name in this dataset as a `DataArray`.
409 """
--> 410 return data_array.DataArray._constructor(self, key)
411
412 def __setitem__(self, key, value):
/Users/jhamman/anaconda/lib/python2.7/site-packages/xray-0.2.0.dev_cc5e1b2-py2.7.egg/xray/data_array.pyc in _constructor(cls, dataset, name)
95 if name not in dataset and name not in dataset.virtual_variables:
96 raise ValueError('name %r must be a variable in dataset %r'
---> 97 % (name, dataset))
98 obj._dataset = dataset
99 obj._name = name
ValueError: name 'time.dayofyear' must be a variable in dataset <xray.Dataset>
Dimensions: (time: 4, x: 275, y: 205)
Coordinates:
time X
x X
y X
Noncoordinates:
Wind 0 2 1
Attributes:
sample data for xray from RASM project
Is there a reason that the virtual time variables are only available if the dataset is created from a pandas date_range? Lastly, this could be related to #118.
One option: add a batch_apply method.
This would be a shortcut for split-apply-combine with groupby/apply if the grouping over a dimension is only being done for efficiency reasons.
This function should take several parameters:
- a dimension to group over.
- a batchsize to group over on this dimension (defaulting to 1).
- a func to apply to each group.
At first, this function would be useful just to avoid memory issues. Eventually, it would be nice to add an n_jobs parameter which would automatically dispatch to multiprocessing/joblib. We would need to get pickling (issue #24) working first to be able to do this.
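A rough sketch of such a function, assuming the parameters listed above plus the positional indexed method discussed elsewhere in this document (its keyword-argument signature here is a guess):

def batch_apply(dataset, dimension, func, batchsize=1):
    # Hypothetical: slice `dimension` into slabs of `batchsize`,
    # apply `func` to each slab, and concatenate the results.
    n = dataset.dimensions[dimension]
    slabs = (dataset.indexed(**{dimension: slice(i, i + batchsize)})
             for i in range(0, n, batchsize))
    return type(dataset).concat(func(slab) for slab in slabs)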
It should look in each array for the dimension values if it is provided as a string. Currently, it does not.
At PyData SV, @mrocklin suggested that by default, array broadcasting should fall back on numpy's shape based broadcasting. This would also simplify directly constructing DataArray objects (#115).
The trick will be to make this work with xray's internals, which currently assume that dimensions are always named by strings.
I'm thinking now that instead of trying to preserve attributes on variables, it would be preferable to drop all attributes (not just conflicting ones) when doing mathematical operations. This would keep things a bit simpler/easier to understand. Thoughts?
This may just be a documentation issue, but the summary apply and combine methods for the Dataset.GroupBy object seem to be missing.
In [146]:
foo_values = np.random.RandomState(0).rand(3, 4)
times = pd.date_range('2000-01-01', periods=3)
ds = xray.Dataset({'time': ('time', times),
                   'foo': (['time', 'space'], foo_values)})
ds.groupby('time').mean()  # replace time with time.month after #121 is addressed
# ds.groupby('time').apply(np.mean)  # also errors here
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-146-eec1e73cff23> in <module>()
3 ds = xray.Dataset({'time': ('time', times),
4 'foo': (['time', 'space'], foo_values)})
----> 5 ds.groupby('time').mean()
6 ds.groupby('time').apply(np.mean)
AttributeError: 'DatasetGroupBy' object has no attribute 'mean'
Adding this functionality, if not already present, seems like a really nice addition to the package.
This will make np.asscalar work.
CC @ToddSmall
Currently, variable attributes are checked for equality before allowing for a merge, via a call to xarray_equal. It should be possible to merge datasets even if some of the variable metadata disagrees (conflicting attributes should be dropped). This is already the behavior for global attributes.
The right design of this feature should probably include some optional argument to Dataset.merge indicating how strict we want the merge to be. I can see at least three versions that could be useful:
We can argue about which of these should be the default option. My inclination is to be as flexible as possible by using 1 or 2 in most cases.
Reduction operations currently drop all Variable and Dataset attrs. I'm proposing adding a keyword to these methods to allow copying of the original Variable or Dataset attrs.
The default value of the keep_attrs keyword would be False.
For example:
new = ds.mean(keep_attrs=True)
returns new with all the Variable and Dataset attrs that ds contained.
see this related issue: pandas-dev/pandas#5487
this is actually not hard to do, and might allow you to push some of your backends to pandas.
Both issues https://github.com/akleeman/xray/pull/20 and https://github.com/akleeman/xray/pull/21 are dealing with similar conceptual issues. Namely, sometimes the user may want fine control over how a dataset is stored (integer packing, time units and calendars, ...). Taking time as an example, the current model interprets the units and calendar in order to create a DatetimeIndex, but then throws out those attributes, so that if the dataset were re-serialized the units may not be preserved.
One proposed solution to this issue is to include a distinct set of encoding attributes that would hold things like 'scale_factor' and 'add_offset', allowing something like this:
ds['time'] = ('time', pd.date_range('1999-01-05', periods=10))
ds['time'].encoding['units'] = 'days since 1989-08-19'
ds.dump('netcdf.nc')
> ncdump -h
...
int time(time) ;
time:units = "days since 1989-08-19" ;
...
The encoding attributes could also handle masking, scaling, compression etc ...
This would make it less ambiguous that this is the preferred way to access and manipulate data in xray.
On a related note, I would like to make XArray more of an internal implementation detail that we only expose to advanced users.
I'm noticing a problem parsing the time variable for at least the noleap calendar for a properly formatted time dimension. Any thoughts on why this is?
ncdump -c -t sample_for_xray.nc
netcdf sample_for_xray {
dimensions:
time = UNLIMITED ; // (4 currently)
y = 205 ;
x = 275 ;
variables:
double Wind(time, y, x) ;
Wind:units = "m/s" ;
Wind:long_name = "Wind speed" ;
Wind:coordinates = "latitude longitude" ;
Wind:dimensions = "2" ;
Wind:type_preferred = "double" ;
Wind:time_rep = "instantaneous" ;
Wind:_FillValue = 9.96920996838687e+36 ;
double time(time) ;
time:calendar = "noleap" ;
time:dimensions = "1" ;
time:long_name = "time" ;
time:type_preferred = "int" ;
time:units = "days since 0001-01-01 0:0:0" ;
// global attributes:
...
data:
time = "1979-09-16 12", "1979-10-17", "1979-11-16 12", "1979-12-17" ;
ds = xray.open_dataset('sample_for_xray.nc')
print ds['time']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-46-65c280e7a283> in <module>()
1 ds = xray.open_dataset('sample_for_xray.nc')
----> 2 print ds['time']
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/common.pyc in __repr__(self)
40
41 def __repr__(self):
---> 42 return array_repr(self)
43
44 def _iter(self):
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/common.pyc in array_repr(arr)
122 summary = ['<xray.%s %s(%s)>'% (type(arr).__name__, name_str, dim_summary)]
123 if arr.size < 1e5 or arr._in_memory():
--> 124 summary.append(repr(arr.values))
125 else:
126 summary.append('[%s values with dtype=%s]' % (arr.size, arr.dtype))
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/data_array.pyc in values(self)
147 def values(self):
148 """The variables's data as a numpy.ndarray"""
--> 149 return self.variable.values
150
151 @values.setter
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/variable.pyc in values(self)
217 def values(self):
218 """The variable's data as a numpy.ndarray"""
--> 219 return utils.as_array_or_item(self._data_cached())
220
221 @values.setter
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/utils.pyc in as_array_or_item(values, dtype)
56 # converted into an integer instead :(
57 return values
---> 58 values = as_safe_array(values, dtype=dtype)
59 if values.ndim == 0 and values.dtype.kind == 'O':
60 # unpack 0d object arrays to be consistent with numpy
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/utils.pyc in as_safe_array(values, dtype)
40 """Like np.asarray, but convert all datetime64 arrays to ns precision
41 """
---> 42 values = np.asarray(values, dtype=dtype)
43 if values.dtype.kind == 'M':
44 # np.datetime64
/home/jhamman/anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
458
459 """
--> 460 return array(a, dtype, copy=False, order=order)
461
462 def asanyarray(a, dtype=None, order=None):
/home/jhamman/anaconda/lib/python2.7/site-packages/xray/variable.pyc in __array__(self, dtype)
121 if dtype is None:
122 dtype = self.dtype
--> 123 return self.array.values.astype(dtype)
124
125 def __getitem__(self, key):
TypeError: Cannot cast datetime.date object from metadata [D] to [ns] according to the rule 'same_kind'
This file is available here: ftp://ftp.hydro.washington.edu/pub/jhamman/sample_for_xray.nc
I noticed that pytables, h5py and blz all use the shorter attrs instead of attributes as the name of the attributes mapping. What do you guys think about switching over?