nannau / climatexml

License: GNU General Public License v3.0

Python 0.75% Jupyter Notebook 99.20% Dockerfile 0.01% Shell 0.04%

climatexml's Introduction

⛵️ About me

🤓 I’m Nic (@nannau) and I'm a physical scientist who is interested in data science, high performance computing, and software development.
👀 Interested in computer vision to solve problems in climate science, atmospheric physics, climate services, and statistical downscaling.

climatexml's People

Contributors

Stargazers

Watchers

Forkers

kdaust

climatexml's Issues

Implement Fully Stochastic Generator

Option for Generator with noise injection
Stochastic training option (create batch of realisations for each field)
Option for switching MAE for CRPS metric in content loss
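
As a rough illustration of the CRPS option, here is a minimal sketch of the standard ensemble (empirical) CRPS estimator for a single scalar value, written in plain Python so it runs without PyTorch. The function name `ensemble_crps` is hypothetical and not part of ClimatExML; a real content loss would apply the same formula per grid point over a batch of generator realisations.

```python
from itertools import product
from statistics import fmean


def ensemble_crps(members, observation):
    """Empirical CRPS for one scalar observation and a list of ensemble members.

    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|
    """
    members = list(members)
    # Accuracy term: mean absolute error of each member against the observation.
    accuracy = fmean(abs(m - observation) for m in members)
    # Spread term: mean absolute difference over all ordered member pairs.
    spread = fmean(abs(a - b) for a, b in product(members, repeat=2))
    return accuracy - 0.5 * spread


# With a single realisation the spread term vanishes and CRPS equals the
# absolute error, which is why it is a natural stochastic counterpart to MAE.
print(ensemble_crps([1.0], 3.0))       # 2.0
print(ensemble_crps([0.0, 2.0], 1.0))  # 0.5
```

The pairwise spread term costs O(m^2) in the number of realisations, so the ensemble size per field would need to stay small during training.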

Choose one msssim package

There are multiple msssim metrics in use: one in losses.py and one from torch metrics. Which one should we use? Window size was an issue for smaller fields.
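
To make the window-size problem concrete, here is a small sketch that computes the largest odd window that fits a given field. It assumes a constraint of the form "smaller side > (win_size - 1) * 2^(scales - 1)", which is the check the pytorch_msssim package applies with its default five scales. The implementations referred to in this issue may enforce something different, so treat the helper `max_odd_window` (a hypothetical name) as illustrative only.

```python
def max_odd_window(height: int, width: int, n_scales: int = 5) -> int:
    """Largest odd window size satisfying
    min(height, width) > (win_size - 1) * 2 ** (n_scales - 1).

    Assumption: this mirrors the size check in pytorch_msssim; other
    MS-SSIM implementations may use a different rule.
    """
    smaller_side = min(height, width)
    if smaller_side < 1:
        raise ValueError("field must have at least one pixel per side")
    factor = 2 ** (n_scales - 1)
    # Largest integer win with (win - 1) * factor < smaller_side.
    largest = (smaller_side - 1) // factor + 1
    # Gaussian windows are conventionally odd, so round down to odd.
    if largest % 2 == 0:
        largest -= 1
    return largest


print(max_odd_window(64, 64))    # 3
print(max_odd_window(161, 200))  # 11
```

Under this assumption, a 64 x 64 field cannot use the common default window of 11, which would explain failures on smaller fields.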

Glob seems to be disk format dependent for order

Apply sorted(glob... in train.py to "manually" sort the glob outputs alphabetically. This was a major headache for @sbeale007.
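
A minimal sketch of the fix, using only the standard library. `glob.glob` returns matches in whatever order the operating system lists the directory, so wrapping it in `sorted` makes the order reproducible. The helper name `list_files_sorted` and the file names are made up for the demonstration.

```python
import glob
import os
import tempfile


def list_files_sorted(pattern):
    """Return glob matches in alphabetical order, independent of the filesystem."""
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # Create files in a deliberately non-alphabetical order.
    for name in ("c.pt", "a.pt", "b.pt"):
        open(os.path.join(tmp, name), "w").close()
    files = list_files_sorted(os.path.join(tmp, "*.pt"))
    names = [os.path.basename(f) for f in files]

print(names)  # ['a.pt', 'b.pt', 'c.pt']
```

Note that the sort is lexicographic, so "file_10.pt" comes before "file_2.pt" unless the numbers in the file names are zero-padded.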

ClimatEx domains longitudes and latitudes

We need lons and lats for the created ClimatEx domains, in addition to rlon and rlat. Most pre-processed ClimatEx domains do not seem to have corresponding lon/lat values. For each grid of the ClimatEx fields created by nc2pt, an additional file containing all lon, lat, rlon, and rlat values would be really useful for analysis and plotting, similar to the hr and lr topography fields we have in ClimatEx. This would let us better compare the models to other fields such as observations. It may be something we need to add to nc2pt, or it could just be an additional file that holds the lon/lats of hr_ref.nc.

ERA5 to WRF Inference Module

Build an inference portion of the code base to download ERA5 data, preprocess it, and perform downscaling for a time slice. A script would be nice, or potentially an Emulator object that performs this operation.

Define sprint objectives and discuss workflow

The first thing to do is describe what our objectives are for this first sprint.

We can discuss them and document them here for future reference.

Then we should discuss the workflow using GitHub and how we should implement changes.

Create new branch
Implement some changes
Open pull request
Tag issues in PR
Obtain code review by me and Kiri
Merge

Implement Testing Loop and Dataloader

Make sure that PyTorch Lightning runs an online test loop. This might require modifying nc2pt to preprocess a separate validation set.

Complete Virtualization Docs

Update Apptainer Run Docs on DRAC
Finalize Docker docs
Clarify binds

More Generalized Data Handling Capabilities

Each of us in the group will have different data needs that the GAN needs to handle and load. Right now it's specific to super resolution.

One way of helping us handle different data types is to factor out the data config and instantiate "translator" objects that determine how PyTorch Lightning/PyTorch Dataloaders load the data. This is related to #2.

This is also closely related to stochastic methods that we want to eventually add.
We want the ability to customize the dataloaders for each person's problem.
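
One possible shape for the "translator" idea, sketched in plain Python. Everything here is hypothetical (the registry, the translator names, and the sample keys `lr`, `hr`, `hr_topography`); the point is only that a config string could select how a raw sample becomes an (input, target) pair, while the dataset wrapper keeps the `__len__`/`__getitem__` interface that map-style PyTorch datasets use.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

Sample = Dict[str, Any]
Translator = Callable[[Sample], Tuple[Any, Any]]

# Registry so a config file can refer to a translator by name.
TRANSLATORS: Dict[str, Translator] = {}


def register(name: str) -> Callable[[Translator], Translator]:
    def decorator(fn: Translator) -> Translator:
        TRANSLATORS[name] = fn
        return fn
    return decorator


@register("super_resolution")
def super_resolution(sample: Sample) -> Tuple[Any, Any]:
    # Low-resolution field in, high-resolution field out.
    return sample["lr"], sample["hr"]


@register("super_resolution_with_topography")
def super_resolution_with_topography(sample: Sample) -> Tuple[Any, Any]:
    # Same target, but the input also carries the HR topography.
    return (sample["lr"], sample["hr_topography"]), sample["hr"]


class TranslatedDataset:
    """Map-style dataset that delegates sample layout to a named translator."""

    def __init__(self, samples: Sequence[Sample], translator_name: str):
        self.samples = samples
        self.translate = TRANSLATORS[translator_name]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> Tuple[Any, Any]:
        return self.translate(self.samples[idx])


raw = [{"lr": "lr0", "hr": "hr0", "hr_topography": "topo"}]
ds = TranslatedDataset(raw, "super_resolution_with_topography")
print(ds[0])  # (('lr0', 'topo'), 'hr0')
```

Adding a new data layout then means registering one more function rather than editing the dataset or the Lightning module.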

Improve Setup Documentation and README

After working through the install with @sbeale007, it's painfully clear that the now empty README requires some installation instructions!

I'll take lead on this, but @kdaust and @sbeale007 your input would be very helpful because you're working with slightly different systems and might encounter things I don't.

Restructure Config with Hydra and Instantiate

Currently, the file wgan-gp is a bit crazy, specifically SuperResolutionWGANGP, our PyTorch Lightning class. Throughout the code, there is a complex nested hierarchy that comes from the hydra config dictionary, which is not ideal. I propose that we use a hierarchical structure based on dataclasses and inheritance to separate out some of this configuration information, making the code a bit cleaner and a bit more explicit in how to access certain data related to training. Ultimately, my hope is that it will lead to more configurable code for scientific experimentation, where we track changes to configurations rather than changes to the underlying code.

One way of helping reduce the number of lines of code, and factor out the config in a more logical sense, is to use hydra's instantiate feature, where you can instantiate Python objects based on a _target_: ClimatExML.object.class style header in the yaml file. https://hydra.cc/docs/advanced/instantiate_objects/overview/

Basically what I'm imagining is to factor out some of the initialization code in wgan-gp.py to inherit logical groups of parameters like:

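from dataclasses import dataclass
from typing import Optional

@dataclass
class HyperParams:
    batch_size: Optional[int] = None  # None until set from the config
    beta: float = 0.1
    alpha: float = 10.0
    gamma: float = 5.0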

which is inherited by

class SuperResolutionWGANGP(pl.LightningModule, HyperParams):
    ...
    instantiate(hyperparams)

Or something. I'm not sure if that syntax will work and where exactly to instantiate the object, but something like this should work.
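
For discussion, here is one variant that does run, using composition instead of inheritance: the model receives a HyperParams object rather than inheriting from it, which avoids having to reconcile the dataclass-generated `__init__` with the Lightning one. This is a sketch under assumptions, not the project's implementation. `SuperResolutionModel` is a plain stand-in for the `pl.LightningModule` subclass, and the `cfg` dict stands in for a parsed YAML config group. With Hydra, the same object could be built by `hydra.utils.instantiate` from a config node carrying a `_target_` key.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class HyperParams:
    batch_size: Optional[int] = None
    beta: float = 0.1
    alpha: float = 10.0
    gamma: float = 5.0


class SuperResolutionModel:
    """Stand-in for the Lightning module; holds the config by composition."""

    def __init__(self, hyperparams: HyperParams):
        self.hyperparams = hyperparams


# Stand-in for the values Hydra would read from the yaml file.
cfg = {"batch_size": 16, "alpha": 5.0}
model = SuperResolutionModel(HyperParams(**cfg))

print(asdict(model.hyperparams))
# {'batch_size': 16, 'beta': 0.1, 'alpha': 5.0, 'gamma': 5.0}
```

Training code would then read `self.hyperparams.alpha` instead of digging through the nested config dictionary, and a misspelled key in the config fails immediately with a TypeError at construction time.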

Implement an HR topography input in ClimatExML

@kdaust has made a ton of progress using HR topography in the model architecture. It would be really nice to implement this in ClimatExML. We should use this issue to track/discuss design ideas for how exactly to implement it.

At a high level, I'm thinking we can specify some HR topography file that has been processed as a .pt file like what you would get from ClimatExPrep. We can then specify the location of that file in the config.yaml file and load somewhere in the pipeline.

It would make sense to load this file in the loader.py portion of the pipeline, although we should be careful that we aren't loading the same file over and over again. At first glance, this might be the right function to add it to:

ClimatExML/ClimatExML/loader.py

Line 28 in c6e4c74

def setup(self, stage: str):

Trigger validation steps before training begins

Currently MLFlow logs all runs, even the failed ones. This is useful if we are partway through training and something fails, but not useful for all of the bugs and user errors that come with doing scientific research.

I therefore recommend that we do some checks before starting training to try and catch as many bugs as possible, possibly using pydantic validators to automatically perform checks on the inputs: https://docs.pydantic.dev/latest/concepts/validators/
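
A sketch of what such pre-flight checks could look like. To keep it dependency-free it uses a standard-library dataclass with `__post_init__` rather than pydantic validators, so it only illustrates the idea. The class `TrainingInputs` and its fields are hypothetical and would need to match whatever the real config contains.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingInputs:
    """Hypothetical bundle of inputs to check before any run is logged."""

    lr_files: List[str]
    hr_files: List[str]
    batch_size: int

    def __post_init__(self):
        # Collect every problem so the user sees them all at once.
        errors = []
        if self.batch_size <= 0:
            errors.append(f"batch_size must be positive, got {self.batch_size}")
        if not self.lr_files:
            errors.append("no low-resolution files found")
        if len(self.lr_files) != len(self.hr_files):
            errors.append(
                f"{len(self.lr_files)} lr files but {len(self.hr_files)} hr files"
            )
        if errors:
            raise ValueError("invalid training inputs: " + "; ".join(errors))


# Valid inputs construct normally.
ok = TrainingInputs(lr_files=["a.pt"], hr_files=["a.pt"], batch_size=8)

# Invalid inputs fail before training (and before a run would be started).
try:
    TrainingInputs(lr_files=[], hr_files=["a.pt"], batch_size=0)
except ValueError as err:
    message = str(err)
    print(message)
```

Constructing this object before the tracking run is opened would mean that a bad config raises early and never shows up as a failed run.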

Documentation Sprint!

Write your suggestions or notes for things you want to cover during our next sprint. I'll start:

Work out kinks in docs
Complete docs for each existing empty category