geet-george / halodrops

HALO Dropsondes' Protocol & Software

Home Page: https://halodrops.readthedocs.io/
The user can decide how the QC will work for them, i.e. either:

1. apply the QC checks as strict filters when creating L2, or
2. carry the QC attributes forward as flags on the L2 sondes.

This could be achieved by simply running the `qc_check_` methods when the user opts for option 1; otherwise, the attributes are passed on to L2 sondes if the user opts for option 2.
This will be part of the QC aspect of the pipeline(#49).
Add a section in the CONTRIBUTING docs about how one can help with contributing to code:
And others that one can think of.
Here you can download ASPEN: https://www.eol.ucar.edu/software/aspen
Here are the relevant docs: https://ncar.github.io/aspendocs/man_aspenqc.html
The following is taken from the current documentation:
The Data_Directory is a directory that includes all data from a single campaign. Therein, data from individual flights are stored in their respective folders, with names in the format YYYYMMDD, indicating the flight date. In case of flying through midnight, consider the date of take-off. In case there are multiple flights in a day, consider adding alphabetical suffixes to distinguish chronologically between flights, e.g. 20200202-A and 20200202-B would be two flights on the same day in the same order that they were flown.
This system excludes the possibility of having multiple platforms in a single campaign. However, batching by campaign can work in one of two ways: for a single-platform campaign, batching across all flights in the campaign; for a multi-platform campaign, batching across all flights of a platform and then, again, across all platforms of the campaign. Currently, batching is only possible for all sondes in a single flight. This is done by providing a mandatory `flight_id` in the config file.
The Data_Directory should be of the structure where each directory in it stands for a platform, and directories within a platform's directory are individual flight directories. This will be made mandatory. The package will then auto-infer platform names (the `platforms` attribute) based on the platform directories' names. This value will go into the dataset attributes (e.g. `platform_id`) and, if the user wishes, also into the filenames of the dataset.

If the user wishes to provide custom `platforms` values, they can be provided as an attribute under the MANDATORY section of the config file, but then a separate `platform_directory_names` must be provided, listing the platforms' data directory names in the same sequence as the platform names in `platforms`. If there are multiple platforms in the campaign, the `platforms` values provided by the user must be comma-separated, e.g. halo,wp3d (preceding and succeeding spaces will be part of the platform name, e.g. when setting the `platform_id` name). If there is only one platform, provide a name with no commas.
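A sketch of how the custom-platform options described above might look in the config file; the MANDATORY section name comes from the text, while the paths, platform names, and directory names here are made-up examples:

```ini
[MANDATORY]
data_directory = /path/to/campaign_data
; comma-separated, no surrounding spaces (they become part of the name)
platforms = halo,wp3d
; same order as platforms
platform_directory_names = HALO-dir,WP3D-dir
```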
Now, the only way to batch process will be to process all sondes of a campaign, i.e. all sondes from all flights of all platforms in a campaign. If the user wants a subset of the batching, they can choose to include only a limited set of directories in the `data_directory` they provide in the config file. However, considering that the processing is not compute-heavy, no use-cases come to mind which warrant a separate mode of batch processing.
The function `create_and_populate_flight_object` in the `pipeline` module processes all sondes of a flight.

A new function in the pipeline module, `get_platforms`, will get `platforms` value/s based on the directory names in `data_directory`, or on the user-provided `platforms` values corresponding to directory names (`platform_directory_names`). For each platform, a `Platform` object will be created with its `platform_id` attribute coming from the `platforms` attribute.

For each `Platform` object, another function in the pipeline module will get all corresponding `flight_id` values by looping over all directory names in a platform's directory and process all sondes in flight-wise batches.

After the flight-wise batch processing is done, all L2 files in the corresponding `flight_id` directories will be populated with L2 datasets that contain the corresponding `platform_id` and `flight_id` attributes. For creating L3 and onwards, the script will just look for all L2 files in the `data_directory` and get the flight and platform information from the `platform_id` and `flight_id` attributes of the L2 files.
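The `get_platforms` behaviour described above could be sketched as follows; the function and class names follow the text, but the signature and implementation are assumptions, not the actual halodrops code:

```python
import os
from dataclasses import dataclass, field


@dataclass
class Platform:
    """Minimal stand-in for the Platform object described above (illustrative)."""
    platform_id: str
    directory_name: str
    flight_ids: list = field(default_factory=list)


def get_platforms(data_directory, platforms=None, platform_directory_names=None):
    """Infer platforms from sub-directory names, or map user-provided names."""
    if platforms is None:
        # Auto-infer: every sub-directory of data_directory is a platform
        names = sorted(
            d for d in os.listdir(data_directory)
            if os.path.isdir(os.path.join(data_directory, d))
        )
        return [Platform(platform_id=n, directory_name=n) for n in names]

    ids = platforms.split(",")  # surrounding spaces stay part of the name
    dirs = platform_directory_names.split(",") if platform_directory_names else ids
    if len(ids) != len(dirs):
        raise ValueError("platforms and platform_directory_names differ in length")
    return [Platform(platform_id=i, directory_name=d) for i, d in zip(ids, dirs)]
```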
The QC workflow should consist of different steps:
The QC methods in step-2 can be included as optional steps for the user to choose from. Either as strict filters from Level-1 to Level-2 or as flags that will carry on with the soundings up to Level-3. For Level-4, the user again will choose which sondes will make it from Level-3 based on the flags.
To achieve the above:
Once this workflow is set up, the individual qc functions can be set up.
Currently, the global attributes for L2 (and subsequent levels) have many commented-out lines, because these require different changes (e.g. versioning, author configurations, campaign_id, instrument_id, etc.). This is an overarching issue to solve those other problems in order to have the full set of global attributes.
Check if there is indeed a discrepancy -> Martin Wirth (DLR) showed this at some point -> recheck
`ast.literal_eval` seems like a safer and more straightforward strategy for getting functions to take strings from the config file and use them as dicts, bools, lists, etc.
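A short illustration of the approach; the config values below are made-up examples:

```python
import ast

# Config files store values as strings; ast.literal_eval safely evaluates
# Python literals (dicts, bools, lists, numbers, ...) without executing code.
flag = ast.literal_eval("True")
flags = ast.literal_eval("['profile_fullness', 'near_surface_coverage']")
params = ast.literal_eval("{'threshold': 0.8}")

# Unlike eval(), non-literal expressions are rejected with a ValueError:
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
```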
Hooks like pydocstyle and interrogate are available to do this.
Include a CONTRIBUTING.md which explains the following:
There are many good sources of reference where this is done quite well, e.g. xarray or pandas.
Should a default config file be part of the package? What are the advantages & pitfalls?
`filter_flags` should have a default value where it filters against all QC attributes. This default value should be of a type so that a user can provide either:

- individual flag values, e.g. `near_surface_coverage_tdry`, `profile_fullness_tdry` OR
- all flag values of the same type, e.g. all flags starting with `profile_fullness_` OR
- all except values of one type, e.g. all flags except those starting with `profile_fullness_`
Would it be helpful to separate the source code (classes, functions, etc.) and the scripts, so that the developers and users can have their own domain?
Standard workflows can be stored as scripts, which will be for users and they wouldn't need to bother about the source code too much.
Most functions / scripts currently do not log info/warning/errors. Set this up please :)
`trajectory_id` variable

To keep all qc-related attributes together, we could think of creating a class attribute called `qc`, which would essentially be an empty object instance like `type('', (), {})()`. Next, any qc attributes can be assigned as `object.__setattr__(self.qc, "whatever_name", qc_flag_value)`. Then, the new function (tentatively named `qc_filter`) can check only these `self.qc` attributes against the `filter_flags` list.
The current setup excludes the possibility for the user to flexibly select qc flags for different variables and then use these to create L2. To make it more customizable, all qc-related attributes should be kept available for the user to choose from. A new function (maybe called `qc_filter` or something) can let the user select sondes based on a list of `filter_flags`. This list will include elements that are qc-related attribute names, and if any of those are False, the sonde will be filtered out from creating L2. The rest of the attributes will be ignored and will be propagated into L2 with their flag values. So, the most conservative QC filter would include all qc-related attributes in the filter_flag and not let any False value pass into L2. The most lenient QC filter would have an empty filter_flag list and allow all sondes to pass into L2 (except sondes with no launch detected, see commit 4c4085d) with their qc flag values propagated as variable values.
`qc_filter` function

`apply_qc_checks` and functions with the prefix `qc_check_`
To convert all variables into SI units, the corresponding functions will be added to the `helper` module.
The `qc_filter` function will not allow the user to select a subset of the available qc checks. They can only select which attributes will be used for filtering or for propagating. There is no option to skip a QC check/attribute. To make this possible, every function that assigns an attribute to `self.qc` will come with an optional argument `skip` that has the default value `False`. If the value of `skip` is set to `True`, the function will simply `pass` and not do anything. Therefore, the default option is that all QC checks run, but if need be, the user can set some of these QC checks to `skip=True` and allow only for limited QC checks, which are neither accounted for in filtering nor propagated further to L2.
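The pattern described above can be sketched as follows: a `qc` namespace object on the Sonde, a `skip` argument on a check, and a `qc_filter` that only consults `filter_flags`. The class and method names follow the text, but the bodies are illustrative assumptions:

```python
class Sonde:
    def __init__(self, serial_id):
        self.serial_id = serial_id
        # Empty namespace object to hold all qc-related flags together
        self.qc = type("", (), {})()

    def qc_check_profile_fullness(self, fraction, threshold=0.8, skip=False):
        # Optional skip: if skipped, this flag is neither filtered nor propagated
        if skip:
            return
        object.__setattr__(self.qc, "profile_fullness", fraction >= threshold)

    def qc_filter(self, filter_flags):
        """Return True if the sonde passes all flags listed in filter_flags.

        Flags not listed are ignored here and would propagate to L2.
        """
        return all(getattr(self.qc, flag) for flag in filter_flags)
```

An empty `filter_flags` list reproduces the most lenient behaviour (every sonde passes), while listing every qc attribute reproduces the most conservative one.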
Currently, packages are only listed in the environment.yml, which is useful for developers. For users, it would be much easier not to have to care about the dependencies themselves.
A "pipeline" simply means a sequence of data-processing steps executed in series, where the output of one element is the input of the next one.
Each step in the pipeline corresponds to a level of the data-product (L1, L2, L3, L4) and is associated with a set of substeps, each of which involve a set of functions that are executed to process the data and move it to the next step.
The default pipeline is defined in a separate Python module (`pipeline.py`), which maps each level to a list of functions (or substeps) that should be executed to reach that level from the previous one.
This default pipeline can be modified by the user through the configuration file. The configuration file allows the user to specify functions for each substep in the pipeline. If a substep is not included in the configuration file, the default functions for that substep will be executed. So, the smallest unit that a user changes in the pipeline is a sub-step. If a substep is defined in the config file, it completely replaces the default substep. The argument values for the functions in the pipeline can also be configured by the user through a configuration file (this part is explained by commit 0612bc1).
The pipeline is executed by iterating over the levels in order, and for each level, executing the associated functions with the provided arguments. The arguments for each function are retrieved from the configuration file. This pipeline is thus the flow of processing data into levels for the final dataset.
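The mapping and execution described above could be sketched like this; the level names, substep names, and placeholder functions are made up for illustration and are not the actual halodrops pipeline:

```python
# Each level maps substep names to a list of (function, kwargs) pairs
# that are executed in order to reach that level from the previous one.
def load_sondes(source):
    return [f"sonde from {source}"]


def apply_qc(sondes, strict=True):
    # placeholder QC step: this toy version drops nothing
    return list(sondes)


default_pipeline = {
    "L1": {"read_raw": [(load_sondes, {"source": "data_dir"})]},
    "L2": {"quality_control": [(apply_qc, {"strict": True})]},
}


def run_pipeline(pipeline, config_overrides=None):
    """Execute levels in order; a substep found in the config completely
    replaces the default substep, otherwise the default runs."""
    config_overrides = config_overrides or {}
    data = None
    for level, substeps in pipeline.items():
        for name, funcs in substeps.items():
            funcs = config_overrides.get(level, {}).get(name, funcs)
            for func, kwargs in funcs:
                data = func(data, **kwargs) if data is not None else func(**kwargs)
    return data
```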
The `main()` function currently does the following:
✅ gets default values of all args in all functions in the package
✅ checks for any user-provided non-default values to the args in the functions from the config file
✅ gets mandatory arg values from the config file
What the `main()` function should do next is:
For the above, we must:
Currently, the two functions `iterate_Sonde_method_over_dict_of_Sondes_objects` and `iterate_method_over_dataset` (let's call them f1 and f2, respectively) are essentially doing the same thing, except:

Combining the two would make it easier to provide the "apply" input to the pipeline dictionary and would also reduce points of failure for applying methods in the pipeline.
Sondes that pass through QC algorithms

For the user to be able to provide a list of qc functions, this will have to be incorporated in a pipeline which calls the `apply_qc_checks` function for each Sonde instance as part of the QC.
It would help to have a .CONFIG from which scripts can read out information like data directory, output directory, and other things that might be of use to change/customize during processing or plotting.
Of course, this is really only helpful if the scripts folder is created separately (Issue #14) because I don't think reading from config into source code is a great idea.
Currently, the QC functions `profile_fullness` and `near_surface_coverage` only set the attribute value as a number, because this was originally a pre-step before the test is carried out and a bool is assigned.

Modify the functions / create a new function to test these numeric values and assign a boolean, based on threshold values or other args. This should then be used by the `filter_qc_fail` method.
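The numeric-to-boolean step could be sketched as below; the helper name and flag-naming convention here are illustrative assumptions:

```python
def assign_qc_bool(sonde_qc, attribute, threshold):
    """Turn a numeric QC attribute into a boolean flag (illustrative helper).

    E.g. a profile_fullness fraction becomes True when it meets the
    threshold, so that filtering can test plain booleans.
    """
    value = getattr(sonde_qc, attribute)
    setattr(sonde_qc, f"{attribute}_flag", value >= threshold)
    return getattr(sonde_qc, f"{attribute}_flag")
```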
Currently, the `apply_qc_function` applies another function to different qc flags and compiles them into a single flag. For example, if it runs the `qc_check_profile_fullness` function, then all the attributes of `profile_fullness_{variable}` are looked at together, and a composite value of `True` (if all True) or `False` (if any False) is provided to the attribute `profile_fullness`.
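The composite behaviour can be sketched as follows; the attribute names follow the text, while the helper itself is an assumption:

```python
def compile_composite_flag(qc, prefix):
    """Combine all per-variable flags sharing a prefix into one composite flag.

    True only if every flag such as profile_fullness_tdry, profile_fullness_rh,
    ... is True; False if any of them is False.
    """
    per_variable = [
        value
        for name, value in vars(qc).items()
        if name.startswith(f"{prefix}_")
    ]
    setattr(qc, prefix, all(per_variable))
    return getattr(qc, prefix)
```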
Inspect functions for their arguments. The resulting information about arguments is then used to check if new entries for the function are specified in the config file.
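This can be done with the standard-library `inspect` module; the example function and config entries below are made up for illustration:

```python
import inspect


def example_qc(threshold=0.8, skip=False):
    return threshold, skip


# Collect the default argument values of a function...
defaults = {
    name: param.default
    for name, param in inspect.signature(example_qc).parameters.items()
    if param.default is not inspect.Parameter.empty
}

# ...then override them with any entries found in the (parsed) config file
config_entries = {"example_qc": {"threshold": 0.95}}
kwargs = {**defaults, **config_entries.get("example_qc", {})}
```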
For example: fsspec, kerchunk, h5py, dask, and s3fs are dependencies of GoGoesGone but are not in the halodrops requirements.
If a launch is not detected, a sonde will not be processed by ASPEN and, therefore, will not have an NC file (Level-1).

Currently, if the method `add_postaspenfile` is run for Sondes that do not have an NC file, it simply prints that the ASPEN file was not found, and the attribute `postaspenfile` is not set. This silently lets a Sonde without launch detect pass, and it will fail at a later point when `add_aspen_ds` is called, which raises a ValueError when the `postaspenfile` is not found.

Instead, when `add_postaspenfile` is called, it should raise a ValueError and, additionally, check the launch-detect status to notify the user about the cause of the failure.
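A sketch of the proposed fail-early behaviour; the Sonde attributes and signature here are assumptions based on the text, not the actual halodrops implementation:

```python
class Sonde:
    def __init__(self, serial_id, launch_detect=True, postaspenfile=None):
        self.serial_id = serial_id
        self.launch_detect = launch_detect
        self.postaspenfile = postaspenfile

    def add_postaspenfile(self, path=None):
        # Fail loudly instead of silently leaving the attribute unset
        if path is None:
            if not self.launch_detect:
                raise ValueError(
                    f"No ASPEN file for sonde {self.serial_id}: "
                    "no launch was detected, so ASPEN did not process it."
                )
            raise ValueError(f"ASPEN file not found for sonde {self.serial_id}.")
        self.postaspenfile = path
```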
The functions currently in the qc module will go to the Sonde class as qc methods.
With so many changes occurring over the last few weeks, the documentation is outdated. A page each can be made for the following How-Tos and Explanations. The "with demo" part indicates that examples for changing stuff in the config file should be shown to demonstrate different possibilities of customization.
Handing developers the responsibility of running `pre-commit` before pushing, without the repo checking it here, could be the cause of many mismatches with `pre-commit` checks later.

Therefore,
Apparently, it is possible to get information via MARS about which dropsondes made it into the ERA5 reanalysis products. Get in touch with Andreas Schäfler and Konstantin (?) from DLR, who have done this before for NAWDEX.