opensenseaction / pypwsqc Goto Github PK

View Code? Open in Web Editor NEW

0.0 0.0 3.0 246 KB

Python package for quality control (QC) of data from personal weather stations (PWS)

Home Page: https://pypwsqc.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

pypwsqc's People

Contributors

Watchers

Forkers

cchwala lepetersson jochenseidel

pypwsqc's Issues

Add test and linting for notebooks

Similar to OpenSenseAction/poligrain#11 we have to add linting and tests for notebooks.

Add ruff linting for docstring

Currently docstring are not linted. I did some test and the default behavior for ruff with including all "D" rules is a bit too strict IMO. Also tests should be excluded from docstring checks since they typically do not have a docstring and also do not need one.

Setting a line length of 88, like black does, also makes sense IMO.

I found a config here that seems to be a good compromise:

[tool.ruff]
line-length = 88
target-version = "py38"
# https://beta.ruff.rs/docs/rules/
extend-select = [
    "E",    # style errors
    "W",    # style warnings
    "F",    # flakes
    "D",    # pydocstyle
    "I",    # isort
    "U",    # pyupgrade
    # "S",    # bandit
    "C",    # flake8-comprehensions
    "B",    # flake8-bugbear
    "A001", # flake8-builtins
    "RUF",  # ruff-specific rules
]
# I do this to get numpy-style docstrings AND retain
# D417 (Missing argument descriptions in the docstring)
# otherwise, see:
# https://beta.ruff.rs/docs/faq/#does-ruff-support-numpy-or-google-style-docstrings
# https://github.com/charliermarsh/ruff/issues/2606
extend-ignore = [
    "D100", # Missing docstring in public module
    "D107", # Missing docstring in __init__
    "D203", # 1 blank line required before class docstring
    "D212", # Multi-line docstring summary should start at the first line
    "D213", # Multi-line docstring summary should start at the second line
    "D401", # First line should be in imperative mood
    "D413", # Missing blank line after last section
    "D416", # Section name should end with a colon
]

[tool.ruff.per-file-ignores]
"tests/*.py" = ["D", "S"]
"setup.py" = ["D"]

Implementation of "station outlier filter"

Draft code in Python

We have a first very preliminary draft of the code for the "station outlier filter" here at the bottom of the notebook.

Some notes from discussion with Lottte:

Basic reasoning of the filter: The filter looks at min. two weeks of data (for 5-minute data) at evaluates the correlation with neighboring stations
Two weeks are chosen because short enough to also apply on a "event basis", because first two weeks cannot be evaluated
R tricks to speed up things:
- original R code is here
- use apply in combination with which in R, e.g. like in line 83 in the linked code
```
SOflag[which(apply(cortable, 1, function(x) median(x, na.rm=T)) < corthres)] <- 1	
```
doing with a fixed window length is also reasonable, but maybe limits applicability when studying short events

Improve FZ filter code

Some ideas for how to improve the FZ filter code:

remove the for-loop

discuss/understand this code

pypwsqc/src/pypwsqc/flagging.py

Lines 57 to 60 in 12bdd89

    
           # TODO: check why `< nint + 1` is used here. should `nint`` be scaled with a selectable factor? 
        
           elif (np.sum(sensor_array[i - nint : i + 1]) > 0) or ( 
        
               np.sum(ref_array[i - nint : i + 1]) < nint + 1 
        
           ):

add some more test cases or make sure the code is understood well enough so that we know that there are no edge cases that are not yet covered

FZ filter incorrectly assigns -1

The current implementation of the FZ filter initializes an array with -1 and overwrites the entries with 0 and 1 where applicable, without considering number of stations reporting rainfall (should be above specified threshold n_stat as is done here).

Implementation details of indicator correlation functions from pws-pyqc

For the FZ filter and the HI filter the functions are pretty simple, just taking two time series as input and providing one time series as output. For the indicator correlation the current functions (see here) have more complex (but not complicated) input. Since we are not yet sure if and how we fully want to rely on xarray.Dataset as data model, i.e. to always supply the Dataset as function input so that metadata is automatically available based in naming conventions of WG1, is is not yet clear how we want to or have to reshape these functions.

My current preference would be to have functions that take two main inputs, one time series of the PWS data to flag, and one or multiple timeseries from the neighboring sensors or from the reference. This is easier to experiment with and, if implemented correctly, can easily scale for a large number of PWSs. However, if multiple time series (from the reference or neighbors) are to be supplied as input, we should require the usage of a xarray.Dataarray or a pandas.Dataframe since there is the danger of having the wrong order of dimension when using a 2D numpy.array. IMO we should wait till we have an example dataset and example workflow implemented. Then we can try different approaches in a hands-in manner.

Add small example PWS dataset

We need a small example dataset.

There is a NetCDF of the Amsterdam PWS dataset here. Maybe we add a shortend verison of it directly to this repo.

Implement FZ (faulty zero) filter from PWSQC

A Python implementation of the FZ (faulty zero) filter was provided for the OPENSENSE training school here.

As first step this function should be added here, but with a reasonable docstring. A test should be added. If this works, we can later, in a separate issue and separate PR, potentially work on speeding up the implementation by avoiding the for-loop.

Decisions on filter output conventions

Some decision regarding convection of the output of the filter functions, based on discussion with @lepetersson and @JochenSeidel.

Meaning of flag values:

filter output is 1 for detecting an anomaly
filter output for functions stemming from the PWSQC R code is -1 in case the filter cannot be applied because not enough (non-NaN) data from neighbors
filter output is 0 otherwise

Data handling:

we write the flag time series to the xarray.Dataset as new variable
raw data is not changed
this way the user can decided, after generating/computing the flags, if all flags should be applied and how

Pandoc package requires explicit installation

I assume we have the problem documented in OpenSenseAction/poligrain#30 also here.

If that is true, we should apply the changes from OpenSenseAction/poligrain#31 also here.

Discussion of additional funktionalties and QC methods

Here is a summary of functionality from intense-qc and its extension for sub-hourly data as a basis for discussion of what could be added to pypwsqc:

Edit:

strikethrough bullet points are rejected by Louise and Max after discussion in Prague
--> refers to simplifications

1. intense-qc
check individual gauges

check if high percentiles (.99 and .95) of the rainfall time series are zero code --> simple test if single stations do not record rainfall could be implemented
~~return the largest n rainfall values code~~
check how rainfall is distributed evenly over each day of the week code --> could be implemented for history analysis
check how rainfall is didistributed over hours of day code --> could be implemented for history analysis
~~searching for data with more than 5 data gaps longer than 2 consecutive steps are missing (on an hourly basis) code~~ --> to strict, not really a qc filter, it always depend n the purpose. Could be used for online/offline analysis
~~Pettitt check (change point test) code~~ --> too sophisticated for a too unsophisticated method and we assume it requires long time series
~~check if world records (from an internally used dataset) are broken code}~~
flag too long dry or wet spells (compared to the same internal dataset) code --> with a good reference this could be usefull
check if the amount of one time step is accumulated from more time steps before because of its intensity code --> something interesting, @JochenSeidel is already working on this?!
check if high values are shown repeatedly which is unlikely code --> could make sense
check if the minimal rainfall resolution changes over time code --> could be interesting to see when PWS are calibrated

check neighboring gauges

get neighbors code
compare hourly, daily and monthly neighors

2. SubHourlyQC to use on top of intense-qc

checks if data is sub-hourly
uses thresholds to reject unplausible rainfall
various thresholds are suggested for different accumulation times (1min, 15min, 60min) and each month of the year

Discuss structure of modules, functions and data model

This issue should discuss the general structure of our Python modules, the functions and our internal data model. This will be entangled with the discussion and decision of the implementation details of the indicator correlation functions (see #6), but the general layout of the modules can already be discussed now.

It is not yet clear to me how we logically split up the code from PWSQC and pws-pyqc. Since both have a part which does flagging of faulty periods and a part which does bias correction, it seems logical to also have separate modules for flagging and bias correction. But it is also logical to keep code from PWSQC and pws-pyqc closely together.

First draft of module structure

flagging.py <-- not sure if this is the correct English term for adding the flags to the PWS timeseries
|── fz_filter()
|── hi_filter() 
|── station_outlier_filter() <-- maybe needs a more descriptive name that indicates what method is used
|── indicator_correlation_filter() <-- not sure about this and the next one
|── calc_indicator_correlation()
|── ...

bias_correct.py
|── quantile_mapping() <-- use what is in pws-pyqc, but not yet sure about details
|── ... <-- something from PWSQC bias correction that is done in station-outlier-filter?
|── ...

Note that we can reference the papers and original codebase for each function in their docstring. Hence, we do not have to hint the origin of the methods in their function name.

Also not that finding neighboring sensors shall be done with the implementation from poligrain, see OpenSenseAction/poligrain#15

Data model

Since we fully embrace xarray and xarray.Dataset in poligrain, it seems logical to also rely on it here. I would, however, first do some experiments when example data and an example workflow is ready. If we can write simple functions that work with 1D time series, we could just pass np.arrays and would have much more generic code. We can still use xarray.Datasets for loading and handling the data, but when passing to function we do not have to rely on it and just use the underlying numpy.arrays. But, let's do some experiments first.

Add HI (high influx) filter from PWSQC

In the training school notebook for PWSQC we do not have the implementation of the HI (high influx) filter. But Lotte provided a snippet of the R code:

HIflag <- rep(0, times=length(Nint))
   HIflag[which(((Nint > HIthresB) & (Med < HIthresA)) | ((Med >= HIthresA) & (Nint > (HIthresB*Med/HIthresA))))] <- 1  # if thresholds are exceeded, the HI flag becomes 1
   HIflag[which(Number_of_measurements < nstat)] <- -1	# if less than 'nstat' neighbours supply observations, the HI flag becomes equal to -1

   if(exists("HI_flags")==F){ HI_flags <- HIflag
	}else{ HI_flags <- cbind(HI_flags, HIflag) }

Note that this is copy-paste from the Zoom chat, hence, I am not sure about indentation and completeness of this code snippet.

	# TODO: check why `< nint + 1` is used here. should `nint`` be scaled with a selectable factor?
	elif (np.sum(sensor_array[i - nint : i + 1]) > 0) or (
	np.sum(ref_array[i - nint : i + 1]) < nint + 1
	):