paipr's Issues

Improvements to distance metric

Reviewer 3 states

Pg 6 Eq (2): I really don't understand why this distance metric makes sense. x and z are physical distances, so it might make sense to add squared differences of those together. But why does it make sense to add these physical distances to differences between peak magnitudes? The units are completely different, right? It's like comparing two cars by adding together the difference in their horsepowers plus the difference in their weights -- it just doesn't make sense. Is there a reason this happens to work? Wouldn't it be better to have a scaling constant on the m's, similar to the one on z?

The reviewer makes another good point worth investigating. It might make sense to scale all three terms and end up with a unitless distance metric. It is not obvious what to scale the magnitude by; we could scale the horizontal distance by the local window length (although that length is somewhat arbitrary to begin with...).

Perhaps it would be better to reintroduce the Mahalanobis distance metric to address this? That might help with the sensitivity-to-noise issue as well. Either way, this is a good point that should be investigated. Note too that this will change the correct value to use as a threshold, though that could be fixed if we find a good way to address Issue #16.
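
As a sketch of how a unitless metric could look, the Mahalanobis distance scales each dimension by the local covariance of the peak features. All names below are hypothetical, and the [x, z, m] feature layout is an assumption about Eq (2):

  % Minimal MATLAB sketch (names hypothetical): unitless Mahalanobis-style
  % distance between a layer endpoint and a candidate peak, each described
  % by [x, z, m] = horizontal position, depth, and peak magnitude.
  S = cov(peaks);             % 3x3 covariance of local [x z m] peak rows
  d = candidate - endpoint;   % 1x3 difference vector [dx dz dm]
  D = sqrt(d / S * d');       % d * inv(S) * d', a unitless distance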

Spatially weight density parameters rather than values

I suggest that when spatially weighting nearby core densities (in rho_spW.m) we weight the fitted model parameters rather than the raw densities. This would fix the current issue of density profiles being truncated by a single short core record.

It would require some additional thought about how to best estimate the variance in density with depth, and especially how best to spatially weight it.
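
A minimal sketch of the idea, assuming a simple log-linear density model (the actual model in rho_spW.m may differ, and all names here are hypothetical):

  % Fit a density model to each core, then inverse-distance weight the
  % fitted parameters rather than the raw density values.
  p = zeros(n_cores, 2);
  for i = 1:n_cores
    p(i,:) = polyfit(core(i).depth, log(core(i).rho), 1);  % log-linear fit
  end
  w = 1 ./ dist.^2;          % inverse-distance weights to the target trace
  w = w / sum(w);
  p_w = w' * p;              % [1 x 2] weighted model parameters
  rho_est = exp(polyval(p_w, z_grid));  % density over the full depth grid

Because the weighted result is a full parametric profile, a short core no longer truncates the combined density estimate.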

Repo directory structure

The directory structure of the root repo needs to be improved. It should follow some standard conventions, for example as outlined here. The repo should also include the current version of the paper describing and validating the method. Finally, it should include an open-source license and be ready to present as a public-facing repository.
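
One common layout (illustrative only; the exact structure is a choice, not a requirement) might look like:

  paipr/
    src/        core MATLAB functions (radar_stack.m, radar_RT.m, etc.)
    scripts/    one-off analysis and processing scripts
    docs/       paper draft and method documentation
    tests/      validation tests and small example data
    LICENSE
    README.md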

Decrease RT window size

A smaller RT window size may improve the layer-tracing ability. I will want to experiment with changing both the vertical and horizontal window dimensions.

Improved regression coefficients

We currently have some issues with regression coefficients not matching well between the 3 validation sites. This was highlighted while investigating divergences in estimated means and trends in the greater WAIS interior between v1.0 and v1.1 of PAIPR. It's possible that better regression coefficients in the likelihood estimation will fix this.

QC output

The creation of a QC image was introduced in #36 but I still need to determine how best to use this image.

MC simulations on layer group assignments

Reviewer 3 notes

Lns 31-40 (2nd column): This "hard assignment" of peaks to layers seems to be a key reason for failures of the overall algorithm, due to things like estimated layer boundaries jumping between actual layers. I wonder if you considered a probabilistic approach here as well? In other words, why not use Monte Carlo simulations to create the layer boundaries, before the round of MC simulations already proposed?

This is an interesting idea that I would like to explore. I worry, though, that such a method would not be computationally tractable in an appropriate timeframe. The layer group assignment is by far the most computationally intensive portion of the method, so repeating it many times over may not be feasible. Still, it is an interesting idea worth investigating.
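
If it were attempted, the structure might look like the sketch below, where assign_layer_groups stands in (hypothetically) for the existing grouping step and mag_err is an assumed peak-magnitude uncertainty:

  % Perturb peak magnitudes and repeat the grouping to build a distribution
  % of layer-group assignments (cost scales linearly with n_sim).
  n_sim = 25;                       % kept small; grouping is expensive
  groups = cell(n_sim, 1);
  for k = 1:n_sim
    pk = peaks;
    pk.mag = pk.mag + randn(size(pk.mag)) .* pk.mag_err;  % perturb peaks
    groups{k} = assign_layer_groups(pk);  % hypothetical grouping routine
  end
  % Membership probabilities follow from assignment frequencies across groups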

Memory use could use optimization

Right now the processing of results (particularly in parallel) is quite memory-intensive. It would be good to find ways to reduce the memory load where possible so that on large-RAM machines (96+ GB) we can process more chunks at the same time.

MC on logistic coefficients

I should incorporate uncertainty in the logistic coefficients into the Monte Carlo simulations. I can do this by drawing from the combined distributions (mean and std) of the optimized parameters across the 4 validation sites.
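
A minimal sketch of the drawing step (coeff_mu and coeff_sd are hypothetical [1 x p] summaries of the site fits; implicit expansion requires MATLAB R2016b+):

  % Draw one coefficient set per simulation from normal distributions
  % summarizing the validation-site fits.
  n_sim = 500;
  B = coeff_mu + randn(n_sim, numel(coeff_mu)) .* coeff_sd;
  % Each draw then feeds the logistic likelihood, e.g. for a 2-term model
  % and a single peak of given brightness:
  P = 1 ./ (1 + exp(-(B(:,1) + B(:,2) .* brightness)));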

Output distribution parameters instead of MC values

Modify the final saved output of PAIPR so that we save fitted gamma distribution parameters rather than the full set of raw Monte Carlo values. This change will significantly decrease storage size (particularly when we increase the number of simulations, possibly to n=10,000). Gamma parameters should be fitted to each year of data for every trace position. This change should also eliminate the need for cell arrays or lists in the data output, leading to simpler transfers/exports to other programs.
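
A minimal sketch of the fitting step (requires the Statistics and Machine Learning Toolbox; accum_MC is a hypothetical [n_years x n_sim] array of simulated accumulation values for one trace):

  % Replace n_sim raw values per year with two fitted gamma parameters.
  gam_params = zeros(n_years, 2);         % [shape, scale] per year
  for yr = 1:n_years
    gam_params(yr, :) = gamfit(accum_MC(yr, :));
  end

At n=10,000 simulations this would reduce per-year storage from 10,000 values to 2.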

Improved stacking of data

It may be advantageous to improve the stacking routine in PAIPR by taking into account the local layer slope and stacking along this trendline rather than directly horizontally (a sketch of the slope-following stacking appears after the list). In this case, the processing order would likely be

  1. Conversion to depth
  2. Stationarize the data (remove trend and set constant variance)
  3. Set depth cutoff, interpolate to core resolution, and clip to given cutoff
  4. Radon transform the data (possibly with a larger window?) and estimate the layer gradient field
  5. Data stacking, using the RT data to determine stacking (increase the window to 50 m?)
  6. Vertical data smoothing
  7. Peak finding
  8. Layer picking (etc.)
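
A sketch of step 5 under assumptions: grad_field is a hypothetical [n_depth x n_trace] array of local layer slopes (vertical bins per trace) from the RT step, i indexes the central trace, and edge handling is omitted:

  % Stack traces along the local layer trendline rather than horizontally.
  half_w = 25;                            % half stacking window, in traces
  stacked = zeros(n_depth, 1);
  for j = -half_w:half_w
    shift = round(grad_field(:, i) * j);  % vertical offset at each depth
    idx = (1:n_depth)' + shift;
    idx = min(max(idx, 1), n_depth);      % clamp to valid depth bins
    col = echogram(:, i + j);             % neighboring raw trace
    stacked = stacked + col(idx);         % sample along the trendline
  end
  stacked = stacked / (2*half_w + 1);     % mean over the stacking window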

Include radar collection time in PAIPR output

Include in the PAIPR results the radar collection time (gps_time) for each stacked trace. This will facilitate unique trace identification later in R tidyverse processing. This will require modifications (and propagation of those changes) starting in radar_stack.m, and will necessitate the reprocessing of PAIPR results.

Some aspects of this issue are related to Issue #2.
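
A hypothetical sketch of the radar_stack.m change (stack_n and the struct fields are assumptions), carrying one collection time per stacked trace:

  % Average gps_time over each group of raw traces forming a stacked trace.
  for k = 1:n_stacked
    cols = (k-1)*stack_n + (1:stack_n);        % raw traces in stack k
    out.gps_time(k) = mean(raw.gps_time(cols));
  end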

High frequency signal filtering

Reviewer 1 noted

A moving average filter does a better job of reducing high-frequency noise in a given signal.
State why you chose a 3rd order Savitzky-Golay filter, and state the length of the filter (two parameters unambiguously specify a Savitzky-Golay filter). Note that when applying a filter to your data, you degrade the range resolution of the radar. Acknowledge this, and state the approximate new range resolution (i.e. the distance between depth layers that can now be distinguished in further processing).

I should therefore experiment with a few different filters (SG filter, moving average, etc.) to compare performance, and calculate the resulting change in range resolution.
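
A minimal comparison sketch (sgolayfilt requires the Signal Processing Toolbox; the window length here is an assumption, not the value used in PAIPR):

  % Compare a 3rd-order Savitzky-Golay filter against a moving average of
  % the same length on a single trace.
  win = 9;                             % filter length in range bins (assumed)
  sig_sg = sgolayfilt(trace, 3, win);  % 3rd-order SG filter
  sig_ma = movmean(trace, win);        % moving-average filter

Reporting both parameters (order and length) unambiguously specifies the SG filter, and the window length gives a first estimate of the degraded range resolution.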

Increase final layer smoothing

Tests of precision, recall, and F-measure are currently artificially deflated due to high variance in the PAIPR-picked layers. I suggest increasing the smoothing performed on the final picked layers so that these assessment statistics better reflect actual performance.

Clean branches and rebase/merge to master

There are a significant number of dead branches and failed experiments littered about the repo (I am still streamlining the workflow standards and figuring out an optimal flow). As such, the code requires some merging and rebasing to get back to a clean network of master and active feature branches.

Increase MC simulations

We should increase the number of MC computations (perhaps to 1000) to obtain more complete final estimate distributions. Some of this could also be solved (at least in later portions of analysis) by aggregating to a coarser resolution (say from 25 m to 1 km).

Modular functions

Some general cleaning is needed (particularly of the radar_RT.m function) so that the steps it calls are broken out into separate, more modular functions (e.g. the RT section, the peak finding, etc.).

Determine error in layer tracing

Per comments from reviewers, we must provide quantitative error estimates for the layer picking. I propose taking each manually picked layer, calculating the RMSE (or something similar) against each PAIPR layer, taking the nearest one (lowest RMSE), and using that value as the error on the matched PAIPR layer, repeating for each manually picked layer.

I'll need to think about how best to then estimate the error on the remaining additional (mostly inter-annual) picked layers.
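
A sketch of the proposed matching, assuming each layer is stored as a depth vector sampled at common trace positions (all names hypothetical):

  % For each manual layer, find the PAIPR layer with minimum RMSE and
  % assign that RMSE as the error on the matched PAIPR layer.
  n_man = numel(manual_layers);
  err = zeros(n_man, 1);  match = zeros(n_man, 1);
  for m = 1:n_man
    rmse = zeros(numel(paipr_layers), 1);
    for p = 1:numel(paipr_layers)
      rmse(p) = sqrt(mean((manual_layers{m} - paipr_layers{p}).^2));
    end
    [err(m), match(m)] = min(rmse);   % nearest PAIPR layer and its error
  end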

Automated data QC

Ideally we want a mechanism that can automatically screen echograms for bad data sections, skip the processing of those sections, and flag them for later review. This would likely be most effectively accomplished during the Radon transforms. In this step, we could determine how non-parallel the various portions of the echogram are and, if a sufficient number of image sections sufficiently violate the assumption of parallel layers, reject the image and flag it for later review.
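
A sketch of the screening logic (all names are assumptions; slope_std would be a per-window spread of the RT-estimated layer slopes):

  % Flag an echogram when too many RT windows show non-parallel layers.
  frac_bad = mean(slope_std > spread_tol);   % fraction of windows failing
  flag_for_review = frac_bad > frac_tol;     % both thresholds need tuning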

Ensure generated data are spatially adjacent

There appears to be an issue where radar data are either imported or generated out of order (more likely the latter?). This may have to do with the parallel processing of echograms, where an echogram farther along the flightline might complete processing before an earlier one. The best place to address this is likely in the radar_wrapper.m file.

I think the most straightforward way to address this would be to order the final results based on time of data collection. This would involve saving another variable (and propagating it through various helper functions like radar_stack.m).
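
A minimal sketch of the reordering step in radar_wrapper.m once all chunks finish (field names hypothetical):

  % Order the combined output by time of data collection.
  [~, order] = sort(results.gps_time);
  results.gps_time = results.gps_time(order);
  results.Easting  = results.Easting(order);
  results.Northing = results.Northing(order);
  results.accum    = results.accum(:, order);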

Improved selection of starting layer trace

In v1.1 and prior, the layer tracing picks where to start the first layer (and then subsequent layers) based on the brightest individual peak without a layer group. It would be better to base this selection on integrated streamline brightnesses using the layer gradient field. We could convolve a Gaussian filter with the streamlines at regular depth intervals (every 4 cm? every 10 cm?) to create a smoothed potential layer. We would then sum the brightness values for each streamline and pick the brightest remaining pixel in that streamline (or perhaps simply the midpoint of the streamline?) as the starting point of the next layer grouping.

We could write a subroutine that would calculate integrated brightnesses for each streamline based on the remaining ungrouped peaks. The subroutine would then run within the loop of radar_trace prior to the start of each successive layer.
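
A sketch of such a subroutine (gausswin requires the Signal Processing Toolbox; streams, img, and the kernel width are assumptions):

  % Integrate smoothed brightness along each remaining streamline and seed
  % the next layer at the brightest one.
  g = gausswin(5) / sum(gausswin(5));        % smoothing kernel (width assumed)
  bright = zeros(numel(streams), 1);
  for s = 1:numel(streams)
    idx = sub2ind(size(img), streams{s}(:,1), streams{s}(:,2));
    bright(s) = sum(conv(img(idx), g, 'same'));  % integrated brightness
  end
  [~, s_next] = max(bright);                 % brightest remaining streamline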

Robustness testing of results

I want to more fully test the robustness of the results and their sensitivity to the input data (both different echograms and radar frequencies). I already have code for radar frequency sensitivity, and would add sensitivity tests for adjacent echograms. Since I already generate overlapping data between echograms, I can use these overlapping sets to directly assess algorithmic sensitivity to different echograms.

The final output would be a table or plot showing sensitivity tests between adjacent echograms for both OIB and SEAT radar and sensitivity tests between OIB and SEAT radar traces.

I would assume 1:1 plots of estimated age would be best for this, as these will likely show increased error/variance for earlier times.

Improved QC flags

I should improve the QC flag generation to detail more information about what caused the flag. I'm thinking of having 4 separate flag designations:

  • 0: Everything checks out; data should be usable
  • 1: Data are consistently good near the surface, but beyond a certain depth there may be issues
  • 2: The initial image passed to PAIPR has no obvious issues, but the results are questionable throughout
  • 3: Initial raw echogram deemed unusable
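
An illustrative encoding of the four levels (all condition names are hypothetical placeholders for the actual checks):

  % Assign the most severe applicable flag.
  if ~img_usable
    qc_flag = 3;                  % raw echogram unusable
  elseif results_questionable
    qc_flag = 2;                  % input looks fine, results suspect throughout
  elseif issue_depth < max_depth
    qc_flag = 1;                  % good near surface, issues below issue_depth
  else
    qc_flag = 0;                  % everything checks out
  end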

Improved optimization script

The current ad hoc script for optimizing logistic regression is not very reproducible. Rework this script so that it is dynamic and more generally applicable.

Reorganize excessive scripts

There are a number of excess scripts littering the master branch. For now, I suggest moving them into an archive directory for a cleaner look.
