
Contributors: geet-george, hdorff94

halodrops's Issues

Let the user decide whether QC-failed sondes are filtered out or carried to L2

The user can decide how QC will work for them, i.e. either:

  1. QC-failed sondes are filtered out of L1 and don't make it to L2, OR
  2. the L2 sondes carry QC flags with them, which also go into L3.

This could be achieved by simply running the qc_check_ methods when the user opts for option 1; if the user opts for option 2, the attributes are instead passed on to the L2 sondes.

This will be part of the QC aspect of the pipeline (#49).
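As a hypothetical sketch (the Sonde representation, flag names and function name here are assumptions, not the actual halodrops API), the two options could look like:

```python
# Hypothetical sketch of the two QC modes; the Sonde representation and
# flag names are assumptions, not the actual halodrops API.

def to_l2(sondes, qc_results, mode="filter"):
    """Return L2 sondes under either QC mode.

    mode="filter": drop sondes that failed any QC check (option 1).
    mode="flag":   keep all sondes, carrying their QC flags along (option 2).
    """
    l2 = []
    for sonde in sondes:
        flags = qc_results[sonde["id"]]  # e.g. {"near_surface_coverage": True, ...}
        if mode == "filter":
            if all(flags.values()):  # only fully QC-passed sondes reach L2
                l2.append(sonde)
        else:  # mode == "flag": everything passes, flags propagate to L2 (and L3)
            l2.append({**sonde, "qc_flags": flags})
    return l2
```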

Contributing to code

Add section in CONTRIBUTING docs about how one can help with contributing to code:

  • Adding source code
  • Adding tests
  • Adding how-to notebooks
  • How to log

And others that one can think of.

batch processing

Data directory structure

The following is taken from the current documentation:

The Data_Directory is a directory that includes all data from a single campaign. Therein, data from individual flights are stored in their respective folders with their name in the format of YYYYMMDD, indicating the flight-date. In case of flying through midnight, consider the date of take-off. In case there are multiple flights in a day, consider adding alphabetical suffixes to distinguish chronologically between flights, e.g. 20200202-A and 20200202-B would be two flights on the same day in the same order that they were flown.

This system excludes the possibility of having multiple platforms in a single campaign. Batching by campaign, however, can take one of two forms: for a single-platform campaign, batching across all flights in the campaign; for a multi-platform campaign, batching across all flights of each platform and then again across all platforms of the campaign. Currently, batching is only possible for all sondes in a single flight, done by providing a mandatory flight_id in the config file.

Suggested changes:

The Data_Directory will have a mandatory structure: each directory in it stands for a platform, and the directories within a platform's directory are the individual flight directories. The package will then auto-infer platform names (the platforms attribute) from the platform directories' names. This value will go into the dataset attributes (e.g. platform_id) and, if the user wishes, also into the filenames of the dataset.

If the user wishes to provide custom platforms values, these can be provided as an attribute under the MANDATORY section of the config file. In that case, a separate platform_directory_names attribute must also be provided, listing the platforms' data directory names in the same sequence as the names in platforms. If there are multiple platforms in the campaign, the platforms value must be comma-separated, e.g. halo,wp3d (preceding and succeeding spaces will become part of the platform name, e.g. when setting the platform_id name). If there is only one platform, provide a name with no commas.

Now, the only way to batch process will be to process all sondes of a campaign, i.e. all sondes from all flights of all platforms in a campaign. If the user wants a subset of the batching, they can choose to include only a limited set of directories in the data_directory they provide in the config file. However, considering that the processing is not compute-heavy, no use-cases come to mind that warrant a separate mode of batch processing.

Now how to go about doing this?

The function create_and_populate_flight_object in the pipeline module processes all sondes of a flight.

A new function in the pipeline module get_platforms will get platforms value/s based on the directory names in data_directory or the user-provided platforms values corresponding to directory names (platform_directory_names). For each platform, a Platform object will be created with its platform_id attribute coming from the platforms attribute.

For each Platform object, another function in the pipeline module will get all corresponding flight_id values by looping over all directory names in a platform's directory and process all sondes in flight-wise batches.

After the flight-wise batch processing is done, all L2 files in the corresponding flight_id directories will be populated with L2 datasets that contain the corresponding platform_id and flight_id attributes. For creating L3 and onwards, the script will just look for all L2 files in the data_directory and get the flight and platform information from the platform_id and flight_id attributes of the L2 files.
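The platform-discovery step described above could be sketched roughly as follows; the function signature, config keys and return type are assumptions based on this issue, not the actual implementation:

```python
# Sketch of the proposed get_platforms logic; signature and config keys
# are assumptions based on the issue text, not the actual implementation.
from pathlib import Path


def get_platforms(data_directory, platforms=None, platform_directory_names=None):
    """Map platform names to their data directories.

    By default, platform names are inferred from the directory names in
    data_directory. If the user provides comma-separated `platforms`,
    `platform_directory_names` must give the matching directory names in
    the same order.
    """
    data_directory = Path(data_directory)
    if platforms is None:
        # Auto-infer: every directory in data_directory is a platform
        return {d.name: d for d in sorted(data_directory.iterdir()) if d.is_dir()}
    names = platforms.split(",")
    dirs = platform_directory_names.split(",")
    if len(names) != len(dirs):
        raise ValueError("platforms and platform_directory_names differ in length")
    return {name: data_directory / d for name, d in zip(names, dirs)}
```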

Add QC workflow

The QC workflow should consist of different steps:

  1. The ASPEN software has its own QC process, which smooths, removes unrealistic values and filters out any sondes that didn't detect a launch. This is already fulfilled by ASPEN while going from Level-0 to Level-1.
  2. After this, halodrops can do further QC, including (but not limited to) profile coverage (checking how much of the vertical profile has measurements), checking if values at the surface are available, checking if values at the surface are within expected bounds, etc.
  3. Once these QC steps are done, Level-2 data is created from Level-1.

The QC methods in step 2 can be included as optional steps for the user to choose from: either as strict filters from Level-1 to Level-2, or as flags that carry on with the soundings up to Level-3. For Level-4, the user again chooses which sondes make it from Level-3 based on the flags.

To achieve the above:

  • add qc functions to the Sonde class (each Sonde object can run a qc check on itself then)
  • #47
  • add a function in the Sonde class which takes an input of the list of qc functions to run and then runs only the qc functions provided
  • #48
  • #50

Once this workflow is set up, the individual qc functions can be set up.

complete list of global attributes

Currently the global attributes for L2 (and subsequent levels) have many commented-out lines, because these require different changes (e.g. versioning, author configurations, campaign_id, instrument_id, etc.). This is an overarching issue to solve those other problems in order to have the full set of global attributes.

Write a contributing guide

Include a CONTRIBUTING.md which explains the following:

  • Where to start?
  • Submitting issues - bugs, enhancements, features
  • Development workflow
    • Creating environment
    • Installing halodrops
    • Branch
    • Editing workflow
    • Submit PR
  • Contributing to documentation
  • #13 (made into a separate non-urgent issue)

There are many good sources of reference where this is done quite well, e.g. xarray or pandas.

provide default value for `filter_flags`

filter_flags should have a default value where it filters against all QC attributes. This default value should be of a type so that a user can provide either:

  • individual flag values, e.g. near_surface_coverage_tdry, profile_fullness_tdry, OR
  • all flag values of the same type, e.g. all flags starting with profile_fullness_ OR
  • all except values of one type, e.g. all flags except those starting with profile_fullness_
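One way this selection logic could be resolved is sketched below; the spec syntax ("all", a trailing-underscore prefix, "all_except_") is an assumption for illustration, not a decided interface:

```python
# Hedged sketch of resolving a filter_flags specification against the
# available QC attribute names; the spec syntax is an assumption.

def resolve_filter_flags(spec, available):
    """Expand a filter_flags spec into concrete QC attribute names."""
    if spec == "all":  # proposed default: filter against all QC attributes
        return sorted(available)
    if spec.startswith("all_except_"):
        prefix = spec[len("all_except_"):]  # exclude one family of flags
        return sorted(f for f in available if not f.startswith(prefix))
    if spec.endswith("_"):  # all flags sharing a prefix, e.g. "profile_fullness_"
        return sorted(f for f in available if f.startswith(spec))
    return [f.strip() for f in spec.split(",")]  # individual flag names
```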

Add API to access workflows

Would it be helpful to separate the source code (classes, functions, etc.) and the scripts, so that the developers and users can have their own domain?

Standard workflows can be stored as scripts, which will be for users and they wouldn't need to bother about the source code too much.

Set up logging

Most functions / scripts currently do not log info/warning/errors. Set this up please :)
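A minimal setup with the standard library's logging module could look like this; the logger name and format are assumptions, not the package's actual configuration:

```python
# Minimal logging setup sketch; logger name and format are assumptions.
import logging

logger = logging.getLogger("halodrops")


def setup_logging(level=logging.INFO):
    """Attach a console handler and set the level on the package logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
```

Individual modules would then call logging.getLogger(__name__) and emit logger.info/warning/error as appropriate.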

add variable attributes

  • rename variables
  • create dict for attributes
  • add sonde_id as trajectory_id variable
  • add attributes for variables & coords
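The attribute dict and rename map could be kept together along these lines; the variable names, CF attributes and units below are purely illustrative assumptions:

```python
# Hedged sketch of a variable-attributes dict and rename map; the names,
# CF attributes and units are illustrative assumptions.

variable_attrs = {
    "ta": {"standard_name": "air_temperature", "units": "K"},
    "rh": {"standard_name": "relative_humidity", "units": "1"},
    "sonde_id": {"cf_role": "trajectory_id", "long_name": "sonde identifier"},
}

rename_map = {"tdry": "ta"}  # old ASPEN name -> dataset name


def attrs_for(var):
    """Look up the attribute dict for a (possibly renamed) variable."""
    return variable_attrs[rename_map.get(var, var)]
```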

qc-related attributes

To keep all qc-related attributes together, we could think of creating a class attribute called qc, which would essentially be an empty object instance like type('', (), {})(). Next, any qc attributes can be assigned as object.__setattr__(self.qc,"whatever_name",qc_flag_value). Then, the new function (tentatively named qc_filter) can check only these self.qc attributes against the filter_flags list.
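A minimal illustration of the empty-container idea (the class and attribute names are placeholders):

```python
# Minimal illustration of the empty qc container; names are placeholders.

class Sonde:
    def __init__(self, serial_id):
        self.serial_id = serial_id
        # Empty object instance holding all qc-related attributes together:
        self.qc = type("", (), {})()


sonde = Sonde("S1")
# Any qc attribute can then be assigned onto the container:
object.__setattr__(sonde.qc, "profile_fullness_tdry", True)
object.__setattr__(sonde.qc, "near_surface_coverage_rh", False)

# qc_filter can later inspect only these attributes:
qc_flags = vars(sonde.qc)
```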

`qc_filter` function

The current setup excludes the possibility for the user to flexibly select qc flags for different variables and then use these to create L2. To make it more customizable, all qc-related attributes should be kept available for the user to choose from. A new function (maybe called qc_filter or something) can let the user select sondes based on a list of filter_flags. This list will include elements that are qc-related attribute names, and if any of those are False, the sonde will be filtered out from creating L2. The rest of the attributes will be ignored and propagated into L2 with their flag values. So, the most conservative QC filter would include all qc-related attributes in filter_flags and not let any False value pass into L2. The most lenient QC filter would have an empty filter_flags list and allow all sondes to pass into L2 (except sondes with no launch detected, see commit 4c4085d) with their qc flag values propagated as variable values.

  • create qc_filter function
  • remove apply_qc_checks and functions with the prefix qc_check_
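The filtering logic described above could be sketched as follows; the flag-dict representation of a sonde's qc attributes is an assumption for illustration:

```python
# Hedged sketch of the proposed qc_filter; the flag-dict layout is an
# assumption, not the actual Sonde interface.

def qc_filter(sonde_qc_flags, filter_flags):
    """Return True if the sonde passes into L2.

    Only the attributes named in filter_flags are checked; any False value
    among them filters the sonde out. All other flags are ignored here and
    simply propagate to L2 as variable values.
    """
    return all(sonde_qc_flags.get(flag, True) for flag in filter_flags)
```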

`skip` argument for QC methods

The qc_filter function will not allow the user to select a subset of the available qc checks: they can only select which attributes will be used for filtering or for propagating, and there is no option to skip a QC check/attribute altogether. To make this possible, every function that assigns an attribute to self.qc will come with an optional argument skip that defaults to False. If skip is set to True, the function will simply pass and not do anything. Therefore, the default is that all QC checks run, but if need be, the user can set some of them to skip=True and run only a limited set of QC checks, which are then accounted for neither in filtering nor in propagating further to L2.
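In a QC method, the skip argument could look like this (method and attribute names are assumptions, and the flag value is a placeholder):

```python
# Illustration of the optional `skip` argument; method and attribute
# names are assumptions, and the flag value is a placeholder.

class Sonde:
    def __init__(self):
        self.qc = type("", (), {})()

    def profile_fullness(self, skip=False):
        """Assign a profile-fullness flag unless the check is skipped."""
        if skip:
            return self  # check neither filters nor propagates to L2
        object.__setattr__(self.qc, "profile_fullness", True)  # placeholder
        return self
```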

Set up precommit

  • Install precommit hooks
  • When done, document it in CONTRIBUTING

Improve pipeline

  • Add correct QC functions to pipeline
  • add placeholders for L2 to L4
  • add placeholder function to go from individual sondes to gridded dataset in pipeline
  • comments in the steps to clarify intention of step

Add dependencies of the package to TOML.

Currently, packages are listed only in environment.yml, which is useful for developers. For users, it is much easier if they don't have to care about the dependencies themselves; listing them in the TOML file lets them be installed automatically with the package.

create a data-processing pipeline

What do I mean by pipeline?

A "pipeline" simply means a sequence of data-processing steps executed in series, where the output of one element is the input of the next one.

Each step in the pipeline corresponds to a level of the data-product (L1, L2, L3, L4) and is associated with a set of substeps, each of which involves a set of functions that are executed to process the data and move it to the next step.
The default pipeline is defined in a separate Python module (pipeline.py), which maps each level to a list of functions (or substeps) that should be executed to reach that level from the previous one.

This default pipeline can be modified by the user through the configuration file, which allows the user to specify functions for each substep in the pipeline. If a substep is not included in the configuration file, the default functions for that substep will be executed; so, the smallest unit a user can change in the pipeline is a substep. If a substep is defined in the config file, it completely replaces the default substep. The argument values for the functions in the pipeline can also be configured by the user through the configuration file (this part is explained by commit 0612bc1).

The pipeline is executed by iterating over the levels in order, and for each level, executing the associated functions with the provided arguments. The arguments for each function are retrieved from the configuration file. This pipeline is thus the flow of processing data into levels for the final dataset.
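As a sketch of what the level-to-substep mapping and the config override could look like (level, substep and function names below are illustrative, not halodrops' actual ones):

```python
# Sketch of a default pipeline mapping and config overrides; all level,
# substep and function names here are illustrative assumptions.

default_pipeline = {
    "L1": {"run_aspen": ["run_aspen"]},
    "L2": {"qc": ["profile_fullness", "near_surface_coverage"],
           "create_l2": ["create_l2_dataset"]},
    "L3": {"grid": ["interpolate_to_common_grid"]},
}


def apply_config(pipeline, config_substeps):
    """A substep defined in the config completely replaces the default one."""
    merged = {level: dict(substeps) for level, substeps in pipeline.items()}
    for level, substeps in config_substeps.items():
        merged[level].update(substeps)
    return merged
```

Executing the pipeline would then iterate over levels in order, running each substep's functions with arguments retrieved from the config.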


The main() function currently does the following:
✅ gets default values of all args in all functions in the package
✅ checks for any user-provided non-default values to the args in the functions from the config file
✅ gets mandatory arg values from the config file

What the main() function should do next is:

  • access the default pipeline
  • get non-default substeps for the pipeline from the configuration file
  • execute pipeline after accounting for the non-default parts of the pipeline and fn args

For the above, we must:

Combining `iterate_Sonde_method_over_dict_of_Sondes_objects` and `iterate_method_over_dataset`

Currently the two functions iterate_Sonde_method_over_dict_of_Sondes_objects and iterate_method_over_dataset (let's call them f1 and f2, respectively) are essentially doing the same thing, except:

  1. f1 does it explicitly for the Sonde object -> This can be changed by making the function detect the object type and then running methods over this object.
  2. f1 loops over every dictionary item and then runs methods over the Sonde object. f2 does not need to do this iteration, and simply goes to running methods on the object (an xr.Dataset in its case). -> This can be changed by making the function check first if the provided argument is a dictionary or a dataset and then, respectively loop over items to apply methods or apply method directly.

Combining the two would make it easier to provide the "apply" input to the pipeline dictionary and would also reduce points of failure when applying methods in the pipeline.
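A combined function that dispatches on input type, as proposed above, could be sketched like this (the dict-of-Sondes vs. dataset distinction follows the issue text; everything else is illustrative):

```python
# Sketch of a combined iterator that dispatches on input type; the
# dict-of-Sondes vs. dataset distinction follows the issue, the rest
# is illustrative.

def iterate_method_over(obj, method_name, *args, **kwargs):
    """Apply a method: loop over dict values, else apply directly."""
    if isinstance(obj, dict):  # dict of Sonde objects (f1's case)
        return {key: getattr(item, method_name)(*args, **kwargs)
                for key, item in obj.items()}
    # single object, e.g. an xr.Dataset (f2's case)
    return getattr(obj, method_name)(*args, **kwargs)
```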

Add a central config file

It would help to have a .CONFIG file from which scripts can read information like the data directory, output directory, and other things that might be of use to change/customize during processing or plotting.

Of course, this is really only helpful if the scripts folder is created separately (Issue #14) because I don't think reading from config into source code is a great idea.
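Such a file could be read with the standard library's configparser; the section and key names below are assumptions for illustration:

```python
# Hypothetical example of reading a central config file; section and key
# names are assumptions for illustration.
import configparser

config = configparser.ConfigParser()
config.read_string("""
[MANDATORY]
data_directory = /data/campaign
flight_id = 20200202-A
""")

data_dir = config["MANDATORY"]["data_directory"]
```

In practice, config.read(path) would be used on the actual file instead of read_string.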

Convert numeric attributes from QC checks to bool

Currently, the QC functions profile_fullness and near_surface_coverage only set the attribute value as a number, because this was originally a pre-step before the test is carried out and a bool is assigned.

Modify the functions / create a new function to test these numeric values against threshold values or other args and assign a boolean. This should then be used by the filter_qc_fail method.
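The conversion step could be as simple as the sketch below; attribute names, thresholds and the pass-if-greater-or-equal convention are assumptions:

```python
# Hedged sketch of converting numeric QC attributes to booleans via
# thresholds; names, values and the >= convention are assumptions.

def numeric_qc_to_bool(qc_values, thresholds):
    """Compare each numeric QC value against its threshold (pass if >=)."""
    return {name: qc_values[name] >= thresholds[name] for name in thresholds}
```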

`apply_qc_checks` function excludes possibility to control qc flags individually

Currently, the apply_qc_checks function applies another function to different qc flags and compiles them into a single flag. For example, if it runs the qc_check_profile_fullness function, then all the profile_fullness_{variable} attributes are looked at together, and a composite value of True (if all True) or False (if any False) is assigned to the attribute profile_fullness.

The proposed solution (the qc_filter function, a dedicated qc attribute container, a skip argument for QC methods, and a default value for filter_flags) is described in the dedicated issues above.

Inspecting functions

Inspect functions for their arguments. The resulting argument information is then used to check whether new entries for the function are specified in the config file.
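This can be done with the standard library's inspect module; the config-merging step shown here is a simplified assumption:

```python
# Argument inspection via the standard library; the config merging is a
# simplified assumption of how halodrops might use it.
import inspect


def get_function_defaults(func):
    """Return {arg_name: default} for all arguments that have defaults."""
    sig = inspect.signature(func)
    return {name: p.default for name, p in sig.parameters.items()
            if p.default is not inspect.Parameter.empty}


def merge_with_config(func, config_section):
    """Override defaults with any matching entries from the config file."""
    args = get_function_defaults(func)
    args.update({k: v for k, v in config_section.items() if k in args})
    return args
```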

No ASPEN dataset if launch is not detected

If a launch is not detected, a sonde will not be processed by ASPEN, and therefore, will not have an NC file (Level-1).

Currently, if the method add_postaspenfile is run for Sondes that do not have an NC file, it simply prints that the ASPEN file was not found, and the attribute postaspenfile is not set. This silently lets a Sonde without a launch detect pass, and the pipeline fails later when add_aspen_ds is called, which raises a ValueError when postaspenfile is not found.

Instead, when add_postaspenfile is called, it should raise a ValueError and, ideally, also check the launch-detect status to notify the user about the cause of the failure.
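The proposed failure behaviour could be sketched as below; the Sonde attributes and the launch-detect field are assumptions based on this issue, not the actual class:

```python
# Sketch of the proposed failure behaviour; Sonde attributes and the
# launch-detect field are assumptions, not the actual class.
from pathlib import Path


class Sonde:
    def __init__(self, serial_id, postaspenfile=None, launch_detect=True):
        self.serial_id = serial_id
        self._postaspenfile = postaspenfile
        self.launch_detect = launch_detect

    def add_postaspenfile(self):
        path = self._postaspenfile
        if path is None or not Path(path).exists():
            if not self.launch_detect:
                # Explain the cause instead of silently passing:
                raise ValueError(
                    f"No ASPEN file for sonde {self.serial_id}: "
                    "no launch was detected, so ASPEN did not process it.")
            raise ValueError(f"ASPEN file not found for sonde {self.serial_id}.")
        self.postaspenfile = path
        return self
```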

Add how-tos in documentation

With so many changes occurring over the last few weeks, the documentation is outdated. A page each can be made for the following How-Tos and Explanations. The "with demo" part indicates that examples of changing the config file should be shown to demonstrate the different possibilities of customization.

Set up CI for pre-commit

Handing developers the responsibility of running pre-commit before pushing, without the repo checking it, could cause many mismatches with pre-commit checks later.

Therefore,

  • Set up a CI to ensure pre-commit checks on the Github repo

Include info about assimilation in reanalysis

Apparently, it is possible to get information via MARS about which dropsondes made it into the ERA5 reanalysis products. Get in touch with Andreas Schäfler and Konstantin (?) from DLR, who have done this before for NAWDEX.
