Comments (26)
---
My inclination is to cache processed dataframes.

---

Likewise. GCP is pretty resilient, but there's a major bottleneck when it comes to processing the NetCDF files. I'd like to see if we can optimize that aspect first, then explore caching options.
Okay, I will look into optimizing that before moving on. Can you speculate on the bottleneck you mentioned?
---
Alright, that's helpful. I'll fool around with it and see whether I get any improvement in performance.
---
I took a look at the performance issues on Friday and found three places where we can make substantial improvements. For an initial baseline, I timed retrieving an analysis and assimilation cycle without filtering the results (all 2.7 million reaches included). That took ~48 seconds. Not horrible, but also not great. To get a sense of what the bottleneck might be, I did the same test but filtered to just USGS gauges, and that took substantially less time, ~24 seconds.

Before I elaborate on the findings, I was able to get the time down to ~20 seconds for pulling and processing all 2.7 million reaches for a single AnA cycle. To me, that is a success, especially if we are to cache those results. Enough of the good news and on to the findings...
I found three places where we can substantially improve performance:

1. Dropping variables that we are not concerned about at `xarray` dataset creation time.
2. Not decoding the time fields at `xarray` dataset creation time.
3. Post `xarray` dataset creation and creation of the `pandas` `df` object, not casting the `nwm_feature_id` field to a string. This had the largest effect on performance. It cut wall time pretty much in half.
```python
import pandas as pd
import xarray as xr


def NWM_bytes_to_DataFrame(
    path,
) -> pd.DataFrame:
    vars_to_drop = [
        "crs",
        "nudge",
        "velocity",
        "qSfcLatRunoff",
        "qBucket",
        "qBtmVertRunoff",
    ]
    # Load data as xarray Dataset
    with xr.load_dataset(
        path,
        engine="h5netcdf",
        mask_and_scale=False,
        drop_variables=vars_to_drop,  # 1
        decode_cf=False,  # 2
    ) as ds:
        # Extract streamflow data to pandas DataFrame
        df = pd.DataFrame(
            {
                "nwm_feature_id": ds["streamflow"].feature_id.values,
                "value": ds["streamflow"].values,
            }
        )
        # Scale data
        scale_factor = ds["streamflow"].scale_factor[0]
        value_date = pd.to_datetime(ds.time.values[0], unit="m")
        start_date = pd.to_datetime(ds.reference_time.values[0], unit="m")
        df.loc[:, "value"] = df["value"].mul(scale_factor)
        # 3. This had the biggest impact on wall time. With casting, a full AnA cycle goes to ~50 seconds
        # Convert feature IDs to strings
        # df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)
        # Extract valid datetime
        df["value_date"] = value_date
        # Extract reference datetime
        df["start_date"] = start_date
    return df
```
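For reference, a minimal timing harness like the sketch below reproduces the kind of wall-time numbers quoted above. The file name is a hypothetical example of an AnA channel-route NetCDF file, not a path from this thread.

```python
import time

# Hypothetical NWM analysis-and-assimilation file name, for illustration only
path = "nwm.t00z.analysis_assim.channel_rt.tm00.conus.nc"

start = time.perf_counter()
df = NWM_bytes_to_DataFrame(path)
print(f"{len(df):,} rows in {time.perf_counter() - start:.1f} s")
```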
Now with these findings presented, my only question is: given that the cast from `int` to `str` of the `nwm_feature_id` field has the biggest impact on performance, what are your thoughts on leaving the field as an `int`? This would break with the canonical dataframe format.
---
At the model level, it makes sense to store `nwm_feature_id` as an `int`. However, for this use case my preference is that all location identifiers be `str`, then `pandas.Category`.

Having said that, I believe the transformation from `int` to `str` can be more efficiently executed outside the `NWM_bytes_to_DataFrame` method (after calling `pandas.concat` elsewhere).

Additionally, I wonder if there is any performance to be gained from using the `xarray.Dataset.to_dataframe` method (see the docs). Hand-crafting a `pandas.DataFrame` apparently comes with significant overhead.
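For illustration, a minimal sketch of the suggested alternative using a small synthetic dataset (the variable and coordinate names mirror the NWM files, but the data is made up):

```python
import numpy as np
import xarray as xr

# Tiny stand-in for an NWM channel output file
ds = xr.Dataset(
    {"streamflow": ("feature_id", np.random.rand(5).astype("float32"))},
    coords={"feature_id": np.arange(101, 106)},
)

# Let xarray build the DataFrame: one column per variable, indexed by feature_id
df = ds.to_dataframe().reset_index()
print(df)
```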
---
I agree that for the end user, storing them as strings makes more sense, and it is just the standard that we have in place. Personally, I think it makes sense to cache results as `int`s. I would imagine that `int` comparison speeds would be faster than `str` comparison; then again, I guess the question is whether `int` comparison plus casting to `str` is faster than just a `str` comparison. I may be changing my mind, but then again, the overall size of the cache would be much higher if we are to store `nwm_feature_id` as a string.
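A rough way to check that intuition (a sketch with synthetic IDs; the timings are machine-dependent and purely illustrative):

```python
import timeit

import numpy as np
import pandas as pd

ids = pd.Series(np.random.randint(0, 2_700_000, size=1_000_000))
as_str = ids.astype(str)

# Compare filtering by an int ID vs. the equivalent str ID
print("int ==:", timeit.timeit(lambda: ids == 1234, number=10))
print("str ==:", timeit.timeit(lambda: as_str == "1234", number=10))
```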
> Having said that, I believe the transformation from `int` to `str` can be more efficiently executed outside the `NWM_bytes_to_DataFrame` method (after calling `pandas.concat` elsewhere).
Potentially, but I am not so sure of that. Given that a copy of the data will be made in memory when cast from `int` to `str`, I would think (in most cases) it would actually be slower to cast after concatenating the results. As of now, we have a kind of divide and conquer strategy that should be slightly faster.
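Schematically, the two orders being compared look like this (illustrative only; the real code builds one frame per NetCDF file):

```python
import numpy as np
import pandas as pd

chunks = [
    pd.DataFrame({"nwm_feature_id": np.arange(i, i + 1000)})
    for i in range(0, 10_000, 1000)
]

# Divide and conquer: cast each small frame, then concatenate
pre = pd.concat(
    [c.astype({"nwm_feature_id": str}) for c in chunks], ignore_index=True
)

# Alternative: concatenate first, then cast the full column in one pass
post = pd.concat(chunks, ignore_index=True).astype({"nwm_feature_id": str})
```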
> Additionally, I wonder if there is any performance to be gained from using the `xarray.Dataset.to_dataframe` method (see the docs). Hand-crafting a `pandas.DataFrame` apparently comes with significant overhead.
I've tested that hypothesis, though I would not say robustly, and found that is actually not the case. My thinking is that passing the underlying numpy array to the `pandas.DataFrame` constructor does not make a copy of the data in memory until you change it. But I could be wrong about that. I'll see what I can find in the source for my own curiosity's sake. But to your point, this is definitely worth more investigation than I've done so far.
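One quick way to probe that (a sketch; the copy behavior may vary across pandas versions):

```python
import numpy as np
import pandas as pd

arr = np.arange(1_000_000)
df = pd.DataFrame({"value": arr})

# True would mean the column still views the original array (no copy yet)
print(np.shares_memory(arr, df["value"].values))
```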
---
Having looked into the source, I don't think that for this problem we will see any performance improvements using `to_dataframe()`. I've included the method xarray uses to create the dataframe below. Our code should be minutely faster since we also use the underlying `numpy` array to construct the `df`, just in less code.
```python
# from xarray.core.dataset.Dataset
# https://github.com/pydata/xarray/blob/a0c71c1508f34345ad7eef244cdbbe224e031c1b/xarray/core/dataset.py#L4970
def _to_dataframe(self, ordered_dims: Mapping[Hashable, int]):
    columns = [k for k in self.variables if k not in self.dims]
    data = [
        self._variables[k].set_dims(ordered_dims).values.reshape(-1)
        for k in columns
    ]
    index = self.coords.to_index([*ordered_dims])
    return pd.DataFrame(dict(zip(columns, data)), index=index)
```
---
> Potentially, but I am not so sure of that. Given that a copy of the data will be made in memory when cast from `int` to `str`, I would think (in most cases) it would actually be slower to cast after concatenating the results. As of now, we have a kind of divide and conquer strategy that should be slightly faster.
As an experiment, I ran a test on my personal laptop that compared creating and concatenating 10,000 dataframes, each with 1000 rows and two columns (`nwm_feature_id` and `data`), using multiprocessing (5 processes). I found casting `nwm_feature_id` to `str` after concatenation was 50% slower than before concatenation. At least on my laptop, it seems your intuition is correct.
I understand the efficiency arguments for caching `nwm_feature_id` as `int`. My reservations are related to `usgs_site_code`, which are also numbers, but should not be stored as `int` (due to the leading zero problem). I'm concerned about maintaining a special separate standard at different parts of the pipeline for each location identifier (and any potential future identifiers that may also be number-based, but not actual integers).
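For anyone following along, the leading zero problem in one line (the site code below is a made-up example):

```python
site_code = "02450250"  # hypothetical 8-digit USGS site code
print(int(site_code))                    # 2450250 -- leading zero silently dropped
print(str(int(site_code)) == site_code)  # False: a round trip loses the identifier
```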
---
I also had notes on memory usage for the resulting `DataFrame`:

- `str`: 647.4 MB
- `int`: 152.6 MB
- `str` as `Category`: 95.5 MB
- `int` as `Category`: 95.4 MB
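Numbers like these can be reproduced with pandas' own accounting, roughly like this (synthetic IDs; the exact figures depend on the data):

```python
import numpy as np
import pandas as pd

ids = pd.Series(np.random.randint(0, 2_700_000, size=2_700_000))
variants = {
    "str": ids.astype(str),
    "int": ids,
    "str as Category": ids.astype(str).astype("category"),
    "int as Category": ids.astype("category"),
}
for name, s in variants.items():
    print(f"{name}: {s.memory_usage(deep=True) / 1e6:.1f} MB")
```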
---
> I understand the efficiency arguments for caching `nwm_feature_id` as `int`. My reservations are related to `usgs_site_code` which are also numbers, but should not be stored as `int` (due to the leading zero problem). I'm concerned about maintaining a special separate standard at different parts of the pipeline for each location identifier (and any potential future identifiers that may also be number-based, but not actual integers).
Right, I agree with you. My plan was never to include the `usgs_site_code` in a cached version, since crosswalking and merging is so cheap computationally. I share your concern about splitting from the canonical format; just trying to figure out what makes the most sense moving forward.
---
I was able to get the time down to ~34 seconds doing the following:

```python
# Cast at the numpy level to fixed-width byte strings sized to the largest ID
max_feature_id = df["nwm_feature_id"].max()
max_digit_length = int(np.log10(max_feature_id)) + 1
df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")
```
---
> I was able to get the time down to ~34 seconds doing the following:
>
> ```python
> max_feature_id = df["nwm_feature_id"].max()
> max_digit_length = int(np.log10(max_feature_id)) + 1
> df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")
> ```

Yuck! But also, clever! 😆 Nice!
---
Only cause for concern is that `nwm_feature_id` is now a `bytes` type, not a decoded string. Going back to the drawing board. I found a really thorough and well-explained stackoverflow post comparing `astype` and `apply(<T>)` that may be useful in the future or just good to know about.
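The bytes-versus-str difference in miniature (numpy's `|S` dtype is a fixed-width byte string; `U` is fixed-width unicode):

```python
import numpy as np

ids = np.array([1234567, 89])
print(ids.astype("|S7"))  # [b'1234567' b'89'] -- bytes, needs decoding
print(ids.astype("U7"))   # ['1234567' '89']   -- Python-friendly str
```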
---
Might try rewriting this as a vectorized copy and see if it helps performance:

```python
# 3. This had the biggest impact on wall time. With casting, for a full AnA cycle, goes to ~50 seconds
# Convert feature IDs to strings
# df["nwm_feature_id"] = df["nwm_feature_id"].astype(str)
```

To something like:

```python
df["nwm_feature_id_str"] = df["nwm_feature_id"].map(str)
df = df.drop(columns="nwm_feature_id")
df = df.rename(columns={"nwm_feature_id_str": "nwm_feature_id"})
```

I'm curious if the type reassignment from one column type to another is causing some significant slowdown on data that large. Essentially it has to create a tmp buffer, copy the new transformed data to it, and then reassign that index. Maybe something like the above can speed that up a bit, but YMMV.
---
I tried something similar to that, but instead of casting with the built-in `str`, I used pandas `astype`. Still hovering at 50 seconds.

Edit: Remember English
---
I've spent some more time working on this and I think I may have come to a fair compromise solution. Just to recap a few of the things I've tried since I last commented:

1. Using `cython` to cast from an int to a string. I found that `df.apply(str)` was still slightly faster with a small array, and it became the clear winner as the array size increased.
2. Using `numpy.char.mod("%s", ds["streamflow"].feature_id.values)`, which showed comparable speed to

   ```python
   max_feature_id = df["nwm_feature_id"].max()
   max_digit_length = int(np.log10(max_feature_id)) + 1
   df["nwm_feature_id"] = df["nwm_feature_id"].values.astype(f"|S{max_digit_length}")
   ```

   but also returned encoded bytes rather than unicode strings.
3. Next, I tried `np.char.array(ds["nwm_feature_id"].feature_id.values, copy=False, unicode=True)`; this was faster than `df.apply(str)` on small arrays, but as array size grew and the string length increased, it became slower than the `df.apply(str)` approach, ~39 seconds when retrieving an EAnA cycle.
4. Lastly, I tried `pd.Categorical(ds["streamflow"].feature_id.values)`. This implementation took ~33 seconds for an entire EAnA cycle. I think this implementation is preferable to 1, 2, and 3 because `nwm_feature_id` values end up in a type that is recognizable and does not require further casting by the user. Also, it's pretty fast. As noted in #25, there are known issues with categorical data; however, I think these issues are outweighed by the functionality of the categorical datatype. Namely, you can compare integers to categorical types without issue. To that same point, though, what do we gain from casting these fields to categorical values? If our only two main advantages are comparison to integer types and that it is not a byte string, I don't see what the benefit is over leaving the values as `int`s in the first place. I think the best argument is that by using the categorical datatype, we minimize wall time and avoid altering the canonical format.
---
> I don't see what the benefit is over leaving the values as `int`s in the first place. I think the best argument is that by using the categorical datatype, we minimize wall time and avoid altering the canonical format.
The categorical dtype flexes its strength more on large dataframes. I found categories beat both `str` and `int` in terms of memory efficiency (and storage as HDF) by a considerable margin on large dataframes. I also found that `str` categories are comparable to `int` categories. The added benefit of `str` categories is that we can use a consistent dtype for all location IDs.

From a previous comment:

> - `str`: 647.4 MB
> - `int`: 152.6 MB
> - `str` as `Category`: 95.5 MB
> - `int` as `Category`: 95.4 MB
---
Those memory savings are hard to argue against. Do you know if you can specify the category type (`str` or `int`) prior to the `astype` cast? That is, can you get a `str` as `Category` from an `int` without having to cast to `str` first? From my reading of `pandas.Categorical`, I am not seeing a straight path forward.
---
It appears that you can be explicit about the `CategoricalDtype`. However, I don't think there's a way around performing the initial cast to `str` (via `astype`, `apply`, etc.). There may be differences between chaining `apply` or `astype` or using a `lambda` expression.

Another alternative might be to use `xarray.DataArray.astype` or a similar function on the original dataset.
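For concreteness, a sketch of the explicit-dtype route (hypothetical categories; the initial `str` cast still happens somewhere):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the categories up front as strings
cat_dtype = CategoricalDtype(categories=["101", "102", "103"])

# The values still have to be str before they can match the categories
s = pd.Series([101, 102, 103]).astype(str).astype(cat_dtype)
print(s.dtype)  # category
```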
---
> It appears that you can be explicit about the `CategoricalDtype`. However, I don't think there's a way around performing the initial cast to `str` (via `astype`, `apply`, etc.). There may be differences between chaining `apply` or `astype` or using a `lambda` expression.
Right, that was my thinking as well.

I am giving your second suggestion a run at the moment.
---
So I tried the following:

```python
df = pd.DataFrame(
    {
        # "nwm_feature_id": pd.Categorical(
        #     ds["streamflow"].feature_id.values
        # ),
        "nwm_feature_id": ds["streamflow"].feature_id.astype(str),
        "value": ds["streamflow"].values,
    }
)
df["nwm_feature_id"] = df["nwm_feature_id"].astype("category")
```

It took ~183 seconds. I did not report it earlier, but I found early on that casting the values while they are still in an `xarray.DataArray` is slow.
---
What about merging your above solution with the underlying numpy method? Something like:

```python
"nwm_feature_id": pd.Categorical(ds["streamflow"].feature_id.values.astype(f"|S{max_digit_length}"))
```
---
I'll give that a go and see if that improves anything. I just tested this:

```python
df = pd.DataFrame(
    {
        "nwm_feature_id": ds["streamflow"].feature_id.values,
        "value": ds["streamflow"].values,
    }
)
df["nwm_feature_id"] = df["nwm_feature_id"].astype("category")
df["nwm_feature_id"].cat.rename_categories(
    df["nwm_feature_id"].cat.codes.apply(str), inplace=True
)
```

It was the slowest of the bunch so far, ~212 seconds lol.
---
Unfortunately,

```python
"nwm_feature_id": pd.Categorical(ds["streamflow"].feature_id.values.astype(f"|S{max_digit_length}"))
```

took ~148 seconds.
---
Basic caching implemented for this subpackage with #66. Closing this issue here. We can discuss the future of dataframe caching on another ticket.