Comments (10)
Hi Noelia. Can you try setting out_vars = ["geopotential_500"] (but leave in_vars the same)? This is because, when preprocessing the data, we distinguish different pressure levels of the same variable by appending the level to the variable name.
To avoid confusion, we'll modify this to allow users to specify pressure levels for both input and output variables. But you can use this hotfix for now.
from climate-learn.
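The naming convention described above can be illustrated with a small sketch. This is not the library's preprocessing code: the helper `expand_levels` and the two-level list are hypothetical, and only the `f"{var}_{level}"` key pattern comes from this thread.

```python
# Hypothetical helper: illustrates the "<variable>_<level>" naming convention only.
def expand_levels(var, levels):
    """Return one key per pressure level, e.g. 'geopotential_500'."""
    return [f"{var}_{level}" for level in levels]


# Illustrative levels; the library's actual default list may differ.
keys = expand_levels("geopotential", [500, 850])
print(keys)  # ['geopotential_500', 'geopotential_850']

# A request for the bare name does not match any level-suffixed key.
print("geopotential" in keys)      # False
print("geopotential_500" in keys)  # True
```

This is why asking for "geopotential_500" can succeed where the bare "geopotential" does not, provided the keys were built this way.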
Hi Tung. I tried the setting:
data_module = DataModule(
    dataset="ERA5",
    task="forecasting",
    root_dir=DATADIR,
    in_vars=["geopotential"],
    out_vars=["geopotential_500"],
    train_start_year=Year(2015),
    val_start_year=Year(2016),
    test_start_year=Year(2017),
    end_year=Year(2018),
    pred_range=Days(3),
    subsample=Hours(1),
    batch_size=128,
    num_workers=1
)
I also tried other settings, as I wasn't sure (both in_vars and out_vars set to "geopotential_500"), but I still get the same error.
67 if len(xr_data.shape) == 3: # 8760, 32, 64
68 xr_data = xr_data.expand_dims(dim="level", axis=1)
---> 69 data_dict[var].append(xr_data)
In my data folder (also coming from WeatherBench) the geopotential only has 1 level, geopotential_500 (similarly for temperature, temperature_850). I think the problem might be because it's not going through the else (line 70 in era5_module.py), and then, it complains because it doesn't find the level (as it does when using data_dict[f"{var}_{level}"])... I might be wrong though, but just in case you might want to check.
Thanks again!
from climate-learn.
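The failure mode suspected above can be reproduced with plain dictionaries. This is a sketch under assumptions, not the code in era5_module.py: it assumes data_dict is pre-populated with level-suffixed keys for pressure-level variables, and it uses a shape tuple as a stand-in for the xarray data. The names `load`, `LEVELS`, and the contents of `PRESSURE_LEVEL_VARS` are illustrative.

```python
LEVELS = [500, 850]                                    # illustrative subset
PRESSURE_LEVEL_VARS = {"geopotential", "temperature"}  # assumed contents


def load(var, shape):
    """Mimic the branch on array rank seen in the traceback."""
    # Assumption: pressure-level variables get one key per level up front.
    if var in PRESSURE_LEVEL_VARS:
        data_dict = {f"{var}_{level}": [] for level in LEVELS}
    else:
        data_dict = {var: []}

    if len(shape) == 3:               # (time, lat, lon): treated as single-level
        data_dict[var].append(shape)  # fails if only level-suffixed keys exist
    else:                             # (time, level, lat, lon)
        for level in LEVELS:
            data_dict[f"{var}_{level}"].append(shape)
    return data_dict


load("2m_temperature", (8760, 32, 64))    # surface variable: works
load("geopotential", (8760, 13, 32, 64))  # multi-level file: works
try:
    load("geopotential", (8760, 32, 64))  # single-level file on disk
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'geopotential'
```

Under these assumptions, a three-dimensional file for a pressure-level variable takes the surface-variable branch and looks up a key that was never created.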
Oh yes, that was the problem. We were assuming that if you wanted to use geopotential or temperature, you would use the multi-level data provided by WeatherBench, not the geopotential_500 and temperature_850 directories. We'll document this to avoid future confusion. In the meantime, can you try downloading the multi-level geopotential and temperature data from WeatherBench and see if that solves the problem?
from climate-learn.
OK, I could fix that by adding an additional check, but out_vars must still be set to ["geopotential_500"]. However, I realised that there might be another problem, if I understood the logic of DataModule correctly:
When in_vars and out_vars are different, the DataModule fails. I'm assuming that out_vars is the target variable, so in principle in_vars and out_vars might be different. In that case, the class ERA5 only uses "variables", which works fine if in_vars and out_vars are the same, but otherwise it fails... or am I wrong here? Are in_vars and out_vars supposed to be the same?
from climate-learn.
@noeliaof, you are correct in your understanding of in_vars and out_vars. As for whether they should be the same... that was an assumption we made when first writing this code. In hindsight, probably not the best decision.
In any case, this problem has been noticed before and brought to our attention in issue #50. PR #51 proposes changing the code so that out_vars would not need to be a subset of in_vars. It is currently in review.
from climate-learn.
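One way to drop the subset requirement is to load the union of both lists and then look up each output variable by position. The sketch below illustrates that idea only and is not taken from PR #51; `merge_vars` is a hypothetical helper.

```python
def merge_vars(in_vars, out_vars):
    """Order-preserving union of the two variable lists."""
    # dict.fromkeys keeps first-seen order, unlike set(), whose order is arbitrary.
    return list(dict.fromkeys([*in_vars, *out_vars]))


in_vars = ["2m_temperature"]
out_vars = ["geopotential_500"]

all_vars = merge_vars(in_vars, out_vars)
# Position of each target variable within the loaded variables.
out_idx = [all_vars.index(v) for v in out_vars]

print(all_vars)  # ['2m_temperature', 'geopotential_500']
print(out_idx)   # [1]
```

Keeping the order deterministic means the position of each variable stays the same from run to run.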
@jasonjewik should we keep the issue open, given that #51 is merged now?
from climate-learn.
@prakhar6sharma thanks for reminding me. @noeliaof is the bug resolved on the latest commit?
from climate-learn.
Hi, I just checked this with the latest commit. It now works when in_vars and out_vars are different, but I'm afraid the error still occurs for a variable with a single level (e.g., only geopotential_500 downloaded):
KeyError Traceback (most recent call last)
Cell In[8], line 1
----> 1 data_module = DataModule(
2 dataset = "ERA5",
3 task = "forecasting",
4 root_dir = DATADIR,
5 in_vars = ["2m_temperature"],
6 out_vars = ["geopotential"],
7 train_start_year = Year(2015),
8 val_start_year = Year(2016),
9 test_start_year = Year(2017),
10 end_year = Year(2018),
11 pred_range = Days(3),
12 subsample = Hours(1),
13 batch_size = 128,
14 num_workers = 1
15 )
File ~/climate-learn/src/climate_learn/data/module.py:112, in DataModule.__init__(self, dataset, task, root_dir, in_vars, out_vars, train_start_year, val_start_year, test_start_year, end_year, root_highres_dir, history, window, pred_range, subsample, batch_size, num_workers, pin_memory)
109 caller = eval(f"{dataset.upper()}{task_string}")
111 train_years = range(train_start_year, val_start_year)
--> 112 self.train_dataset = caller(
113 root_dir,
114 root_highres_dir,
115 in_vars,
116 out_vars,
117 history,
118 window,
119 pred_range.hours(),
120 train_years,
121 subsample.hours(),
122 "train",
123 )
125 val_years = range(val_start_year, test_start_year)
126 self.val_dataset = caller(
127 root_dir,
128 root_highres_dir,
(...)
136 "val",
137 )
File ~/climate-learn/src/climate_learn/data/modules/era5_module.py:114, in ERA5Forecasting.__init__(self, root_dir, root_highres_dir, in_vars, out_vars, history, window, pred_range, years, subsample, split)
112 print(f"Creating {split} dataset")
113 unique_vars = list(set(in_vars) | set(out_vars))
--> 114 super().__init__(root_dir, root_highres_dir, unique_vars, years, split)
116 self.in_vars = list(self.data_dict.keys())
117 self.out_vars = out_vars
File ~/climate-learn/src/climate_learn/data/modules/era5_module.py:28, in ERA5.__init__(self, root_dir, root_highres_dir, variables, years, split)
25 self.years = years
26 self.split = split
---> 28 self.data_dict = self.load_from_nc(self.root_dir)
29 if self.root_highres_dir is not None:
30 self.data_highres_dict = self.load_from_nc(self.root_highres_dir)
File ~/climate-learn/src/climate_learn/data/modules/era5_module.py:69, in ERA5.load_from_nc(self, data_dir)
67 if len(xr_data.shape) == 3: # 8760, 32, 64
68 xr_data = xr_data.expand_dims(dim="level", axis=1)
---> 69 data_dict[var].append(xr_data)
70 else: # pressure level
71 for level in DEFAULT_PRESSURE_LEVELS:
KeyError: 'geopotential'
The problem comes from the way data_dict is built: it should check whether the variable is in PRESSURE_LEVEL_VARS, and the same check is then needed in the class ERA5Forecasting. I made a couple of changes myself to make it work, but the solution is not ideal...
from climate-learn.
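The check suggested above could look roughly like the following. It is a sketch under stated assumptions, not the maintainers' fix: `add_to_data_dict` and its `file_level` argument are hypothetical, array shapes stand in for the xarray data, and the contents of `PRESSURE_LEVEL_VARS` are assumed.

```python
PRESSURE_LEVEL_VARS = {"geopotential", "temperature"}  # assumed contents
LEVELS = [500, 850]                                    # illustrative subset


def add_to_data_dict(data_dict, var, shape, file_level=None):
    """Store data under the key that matches how the file is laid out.

    shape      -- stand-in for xr_data.shape
    file_level -- level of a single-level file such as geopotential_500
                  (hypothetical argument; None for surface variables)
    """
    if len(shape) == 3:
        if var in PRESSURE_LEVEL_VARS:
            if file_level is None:
                raise ValueError(f"{var} is 3-D but no pressure level was given")
            key = f"{var}_{file_level}"  # single-level pressure file
        else:
            key = var                    # surface variable
        data_dict.setdefault(key, []).append(shape)
    else:
        for level in LEVELS:             # multi-level file: one key per level
            data_dict.setdefault(f"{var}_{level}", []).append(shape)
    return data_dict


data_dict = {}
add_to_data_dict(data_dict, "2m_temperature", (8760, 32, 64))
add_to_data_dict(data_dict, "geopotential", (8760, 32, 64), file_level=500)
print(sorted(data_dict))  # ['2m_temperature', 'geopotential_500']
```

Using `setdefault` means no key has to exist in advance, so a single-level download and a multi-level download end up under the same naming scheme.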
We are working on refactoring the data loading part, which will resolve this problem. We will update you when it's done.
from climate-learn.
@tung-nd #68 is merged, but it still doesn't resolve the way data_dict is built. Can you please create a separate issue describing in more detail exactly how data_dict should be built?
from climate-learn.