
deerlab-matlab's People

Contributors

luisfabib, spribitzer, stestoll


Forkers

spribitzer, nwili

deerlab-matlab's Issues

Memory and performance issues of sensitivan

Running a sensitivity analysis with a large multi-level full-factorial setup generates so many evaluation vectors that memory becomes an issue.

For a run with 6 factors (each with 5 levels), at the end of the analysis 65536 evaluations are generated and memory consumption reaches 5-6 GB.

This can seriously slow down the calculation or even freeze some machines.

An improved memory allocation system should be implemented.

Distribution models and r<0

All distance distribution models (rd_*) currently tolerate negative distances in their r input, and also return density values for negative distances. It is probably better to issue an error if any(r<0).
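A minimal guard, assuming it is added near the top of each rd_* model (the error identifier and message text are illustrative):

```matlab
% Reject negative distances early with a clear message
if any(r < 0)
    error('rd_model:negativeDistances', ...
        'All elements of the distance axis r must be nonnegative.');
end
```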

refinement of correctscale

The correctscale function fits/estimates the vertical scale of the signal, to normalize it correctly even in the presence of noise. This is done by fitting a single Gaussian with an exponential background model. While rather simplistic, the fit of the scale is very precise, as the remaining parameters (despite being wrong) induce less error.

However, there are many cases which can be constructed to show the limitation of the function. For example a very broad distribution:

t = linspace(-0.3,3,200);
r = linspace(1,6,100);
P = rd_randcoil(r,[50 0.602]);

V = dipolarsignal(t,r,P,'scale',1e6,'noiselevel',0.01);
[V,scale] = correctscale(V,t);


does not fit the scale correctly. Improved behaviour can be achieved if only e.g. the first quarter of the signal is passed:

t = linspace(-0.3,3,200);
r = linspace(1,6,100);
P = rd_randcoil(r,[50 0.602]);

V = dipolarsignal(t,r,P,'scale',1e6,'noiselevel',0.01);
[~,scale] = correctscale(V(1:50),t(1:50));
V = V./scale;


While the correct result can be found by trial and error, the function should be refined to do this automatically and spare users the effort.

A first step would be to limit the fit of this simplistic model to a narrow range around the zero-time of the signal.
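A sketch of such a restriction, assuming a fixed window around the zero time (the window half-width and the reuse of correctscale are illustrative, not the actual implementation):

```matlab
% Fit the scale only in a narrow window around the zero time
[~,idx0] = min(abs(t));                 % index closest to t = 0
win = idx0 + (-10:10);                  % illustrative half-width of 10 points
win = win(win >= 1 & win <= numel(t));  % clip to valid indices
[~,scale] = correctscale(V(win),t(win));
V = V/scale;                            % normalize the full signal
```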

making the default branch 'development' in the repository settings

As the branch where new features are developed is development, maybe it would make sense to set development as the default branch in the repository settings?

I can see three benefits:

  • It would help outside contributors with navigation (PRs would point to development by default)
  • Forks would automatically use development as the starting branch (contributors would not accidentally branch off an outdated branch)
  • Maybe most importantly, the branch presented on the front page would be development instead of master, making it much easier to grasp what has been going on in the repository and to see the newest commits

Misleading error message when Optimization Toolbox is not installed

I tried to follow the "Analyzing Datasets" example in the tutorial section, but whenever I ran

[B,lam] = fitbackground(V,t,@td_exp);

I received the following error message:

Invalid solver specified. Provide a solver name or handle (such as
'fmincon' or @fminunc).

After some digging around it turned out that I did not have the Optimization Toolbox installed. However, the error message insinuated that I had not set the paths correctly.

Suggestion: Test for availability of Optimization Toolbox with

license('test','optimization_toolbox')

which returns 1 if the toolbox is installed.
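A sketch of how such a check could look before any solver is called (the error message is a suggestion):

```matlab
% Fail early with an informative message if the toolbox is missing
if ~license('test','optimization_toolbox')
    error(['This function requires the Optimization Toolbox, ' ...
           'which does not appear to be installed or licensed.']);
end
```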

Possible bug in sensitivan

sensitivan crashes (in my case) when I analyze a single factor (only the noise vector):

Index exceeds array bounds.

Error in sensitivan (line 235)
main(jj) = abs(evalmean(1) - evalmean(2));

It works after commenting out the factor interaction analysis.

Geometric basis of Rice distribution model is unclear

Take two spin labels A and B with 3D Gaussian spatial distributions. The distribution of the distance between the two labels is not quite equal to rd_onerice, as shown in the example below.

Is rd_onerice representing a different geometric situation, or does rd_onerice need to be adjusted?

clear, clc, clf

% Parameters for two 3D Gaussian distributions
rA0 = [0;0;0];
sigA = 1;
rB0 = [0;0;1];
sigB = 1;

% Monte Carlo simulation of norm of distance between A and B
N = 10e6;
rA = rA0 + randn(3,N)*sigA;
rB = rB0 + randn(3,N)*sigB;
rAB = vecnorm(rA-rB);

% Calculate PDF
r = linspace(0,8,201).';
dr = r(2)-r(1);
P0 = histcounts(rAB,[r; r(end)+dr]).';
P0 = P0/sum(P0)/dr;

% Fit rd_onerice to the obtained PDF
fun = @(p,r)rd_onerice(r,p);
p0 = [1.92 1.15];  % initial guess [center width]
pfit = lsqcurvefit(fun,p0,r,P0);
P1 = rd_onerice(r,pfit);
hold on

% Plotting
plot(r,P0,r,P1);
legend('3D Monte Carlo','rd_onerice','Interpreter','none');
box on
grid on
xlabel('r (nm)');
ylabel('P (nm^{-1})');

correctscale crashes if time is in nanoseconds

Calling correctscale during pre-processing without first converting the time axis from nanoseconds leads to a crash with a cryptic error cascade.

The error arises in the call to fitparamodel inside of correctscale.

built-in boundaries of Gaussian models

As discussed in #29, the upper bounds of the built-in Gaussian parametric models are set to 20 nm. This appears to be too large: in some cases, the fitparamodel solvers will try to remove a Gaussian from the model by setting it to a large distance (>10 nm) instead of by reducing its amplitude.

For comparison, the boundaries used by other software (notation [lower upper]):

Parameter  DeerLab    DeerAnalysis2018  DeerAnalysis2019  DD
rmean      [1.5 20]   [1.5 20]          [1.5 10]          [1.5 5]
fwhm       [0.2 5]    [0.12 11.8]       [0.12 11.8]       [0.12 5.89]
amp        [0 1]      [0 1]             [0 1]             [0 1]

One can see that the DeerLab Gaussian functions were built on the DeerAnalysis2018 bounds. These appear to have been tightened in the newer DeerAnalysis2019 release. DD uses a perhaps too small upper boundary.

  • DeerLab should adopt the same upper boundary as DeerAnalysis2019.

For the width of the Gaussian, DeerLab has more constrained boundaries than DD & DeerAnalysis.

  • These boundaries could maybe be relaxed in DeerLab.

GND & SND parametric models

The following parametric models are employed in DD and should be added to DeerLab for the sake of reproducibility and coverage. Here is a list of the missing models (all equations taken from the DD manual):

Generalized Normal Distribution (GND)

Skew Normal Distribution (SND)

Generalized Skew Normal Distribution (GSND)
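For reference, the standard textbook forms of the first two densities are given below; the DD manual may use a different parametrization, so these should be checked against it before implementation:

```latex
% Generalized normal distribution (location mu, scale alpha, shape beta)
P_{\mathrm{GND}}(r) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}
    \exp\!\left[-\left(\frac{|r-\mu|}{\alpha}\right)^{\!\beta}\right]

% Skew normal distribution (location xi, scale omega, skewness alpha),
% with phi and Phi the standard normal pdf and cdf
P_{\mathrm{SND}}(r) = \frac{2}{\omega}\,
    \phi\!\left(\frac{r-\xi}{\omega}\right)
    \Phi\!\left(\alpha\,\frac{r-\xi}{\omega}\right)
```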

unstable fitting of Gaussian models

Running the following script:

t = linspace(0,5,300);
r1 = linspace(2,6,200);
InputParam = [3 0.3 4 0.3 5 0.3 0.3 0.3];
P = rd_threegaussian(r1,InputParam);
K = dipolarkernel(t,r1);
rng(1)
S = K*P + whitegaussnoise(t,0.01);
[~,Pfit1] = fitparamodel(S,@rd_threegaussian,r1,K,'tolfun',1e-5);
r2 = linspace(2,6,210);
K = dipolarkernel(t,r2);
[~,Pfit2] = fitparamodel(S,@rd_threegaussian,r2,K,'tolfun',1e-5);

shows that fitting the same signal with the same three-Gaussian model, with a difference of just 10 points in the distance domain, leads to completely different solutions.


I checked, by manipulating the TolFun option, whether the fit was simply not finding the solution.

This needs to be investigated and fixed.

Remove remaining dependencies on MATLAB toolboxes

In order for DeerLab to be usable with the core MATLAB product, without any additional (purchasable) toolbox, the following dependencies must be eliminated:

  • Optimization Toolbox - good open-source alternatives to fmincon, lsqnonlin, ... need to be found

  • Statistics and Machine Learning Toolbox - the sensitivity analysis function currently requires the pdist function; it must be replaced with an open-source alternative

The functionality does not need to be removed; toolbox-dependent functions can still be used optionally when available.

Random-coil model: segment length

Currently, the random-coil distribution model rd_randcoil (here) has a hard-wired segment length R0 = 0.198 that cannot be changed.

The segment length should be a model parameter: rd_randcoil(r,[N R0 nu]) or rd_randcoil(r,[N nu R0]).
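With either ordering, a call would then look like this (values are illustrative; 0.198 is the currently hard-wired default):

```matlab
% Proposed call with the segment length as a free model parameter
P = rd_randcoil(r,[50 0.198 0.602]);   % [N R0 nu] ordering
```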

Regularization parameter selection degraded since 0.8.beta

In two (of two) cases tested, the current commit returns significantly larger regularization parameters in 'lr' mode of regularization parameter selection than version 0.8.beta did. The old parameters were reasonable, the new ones correspond to strongly visible oversmoothing.

Bug in global fit system

The changes made recently lead to a dimensionality bug in the global fit system.

fitparamodel(...,'GlobalWeights',rand(5,1))

and

fitparamodel(...,'GlobalWeights',rand(1,5))

do not behave identically; one of them leads to a crash.

sensitivan's memory checks are WinOS specific

MATLAB's memory function, used in sensitivan to check whether the analysis will consume too much memory, is specific to Windows and does not work on Linux or macOS.

Needs to be fixed for full compatibility.
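A sketch of a platform-dependent fallback (the field name for memory() is from its documented output; the Linux parsing is illustrative, and macOS is handled conservatively):

```matlab
% Query available memory in a platform-aware way
if ispc
    meminfo = memory;                            % Windows only
    freeBytes = meminfo.MemAvailableAllArrays;
elseif isunix && ~ismac
    [~,out] = system('grep MemAvailable /proc/meminfo');
    freeBytes = sscanf(out,'MemAvailable: %f kB')*1024;
else
    freeBytes = Inf;  % macOS: skip the check rather than crash
end
```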

Normalization of model distributions

Currently, distribution models are manually re-normalized over the distance range provided in the input vector r.

This is identical to normalization over r=(0,inf) as long as the entire mass of the distribution is contained within r. If not, then the normalization will differ. Here is an example:

param = [3 2];
r1 = linspace(2,6);
r2 = linspace(0,6);
P1 = rd_onegaussian(r1,param);
P2 = rd_onegaussian(r2,param);
plot(r1,P1,r2,P2);
grid on

P1 and P2 have different normalization, even though they are the same model.

This means that a P evaluated over a r range is implicitly treated as a truncated P with P=0 outside the provided r range, instead of the P defined according to the analytical expression of the model, just evaluated over a limited r range.

There are two situations where the model functions are used:

  • The user wants to take a look at a distribution.
  • The distribution model function is called by a fit function. In this case, the distance range should always be chosen such that truncation does not happen.

In either case, re-normalization is not needed.

I think it is therefore better to remove re-normalization from all rd_* models.

Are there other situations where removing re-normalization could lead to issues?

user-defined global weights affected by round-off errors

When the weights for global fitting are passed manually, the program checks whether they sum up to unity. The comparison is made via a simple sum(weights)~=1 check, so if the sum is not exactly 1 due to round-off errors, the program throws an error.

Observed when passing the option:

fitparamodel(...,'GlobalWeights',ones(20,1)/20); 

which of course should sum up to one.

Two possible remedies:

  • Introducing round functions in all DeerLab functions which accept global weights.

  • Removing the unity sum check and allow any values for the weights. The function would then normalize internally.
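The two remedies could be sketched as follows (the tolerance in the first variant is illustrative):

```matlab
% Variant 1: tolerance-based check instead of exact comparison
if abs(sum(weights) - 1) > 1e-10
    error('GlobalWeights must sum to unity.');
end

% Variant 2: drop the check and normalize internally
weights = weights/sum(weights);
```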

Inconsistent parameter ordering in multi-component models

The parameter ordering in multi-component models is inconsistent between the built-in dd_* models and the mixture models generated by mixmodels.

Example of a two-Gauss model:
dd_gauss2: [r01 w1 r02 w2 a1]
mixmodels(@dd_gauss,@dd_gauss): [a1 r01 w1 r02 w2]

This should be made consistent.

Two main possibilities:

  1. Put the amplitudes at the end, as in the built-in models: [r01 w1 r02 w2 a1]
  2. Put each amplitude with the parameters of the corresponding model: [r01 w1 a1 r02 w2]

The second has the aesthetic advantage that all parameters for each component are grouped together, but the first would only require changes in mixmodels.

validation of output variable length fails

When running validation of a variable x via

       x = validate(@myprocess,varpar)

if one of the varpar entries is the length of x, the function returns an error because the sizes of the different x outputs are inconsistent.

longpass fails when including negative times

While checking the longpass_shortdistmass unit test, I realized that the longpass function works all right as long as only positive times are passed to the function.

The display of the longpass_shortdistmass test (screenshot omitted) shows the expected behaviour of the function. However, if I add 0.1 us of negative time, the returned output (screenshot omitted) shows that negative times completely break the functionality of longpass.

Needs diagnostics and a fix.

too complex MExceptions

Due to DeerLab being highly function-handle based, when an error occurs in the definition of the models or functionals, the large error-message cascade makes it hard for users to understand the error, even if it actually is simple in nature. A simple example, where the kernel in the model is multiplied along the wrong dimension:

r = linspace(1,6,200);
t = linspace(0,3,100);
Pmodel = rd_onegaussian(r,[3 0.3]);
K = dipolarkernel(t,r);
V = K*Pmodel;
model = @(t,p)K.'*rd_onegaussian(r,p);
fitparamodel(V,model,t,[3 0.3]);

leads to the following MException message:

Error using  * 
Incorrect dimensions for matrix multiplication. Check that the number of
columns in the first matrix matches the number of rows in the second
matrix. To perform elementwise multiplication, use '.*'.

Error in Untitled4>@(t,p)K.'*rd_onegaussian(r,p)

Error in paramodel/myparametricmodel (line 75)
                Output = handle(ax,param);

Error in
fitparamodel>@(Parameters)(sqrt(0.5)*(K{1}*model(ax{1},Parameters,1)-V{1}))
(line 373)
        ModelCost = @(Parameters)
        (sqrt(0.5)*(K{1}*model(ax{1},Parameters,1) - V{1}));

Error in lsqnonlin (line 205)
            initVals.F = feval(funfcn{3},xCurrent,varargin{:});

Error in fitparamodel (line 374)
        [FitParameters,~,~,exitflag]  =
        lsqnonlin(ModelCost,StartParameters,LowerBounds,UpperBounds,solverOpts);

Error in Untitled4 (line 13)
fitparamodel(V,model,t,[3 0.3]);

Caused by:
    Failure in initial objective function evaluation. LSQNONLIN cannot
    continue.

If the structure of the program is clear to the user, then it is of course simple to find the source of the error. But for users who have never worked with the source code, this can be overwhelming.

conversion mlx -> m of tutorials

The tutorials are written as mlx files. These are binary files and are hard to maintain (no diff views, no full-text search, each commit stores a new copy of the entire file, etc.).

Use the publish function on m-files to generate HTML, and adapt the functions and code required to incorporate it into the documentation.

Return of regularization parameter requested

Sensitivity analysis may benefit from keeping the regularization parameter constant, but fitregmodel does not return it. One may also want to publish this value in some cases.

Can we have: [P,alpha] = fitregmodel(S,K,r,RegType,alpha,varargin) ?

multispin effect's future

In the supressghost function there is no re-scaling as described in the original von Hagens paper.

Remove docs function

I don't think this is needed. A few points:

  • The function name is so generic that it might name clash with other toolboxes.
  • It will be confusing for users to have to use doc for MATLAB functions (incl. toolboxes) but docs for DeerLab.

I suggest removing docs. Users can bookmark the DeerLab doc folder in their web browser.

Also, could access to the DeerLab documentation instead be incorporated using MATLAB's capabilities for custom documentation? It doesn't look like any of this (info.xml, helptoc.xml, etc.) is currently implemented for DeerLab.

fitregmodel and alpha-search method

In fitregmodel the search method for the regularization parameter defaults to golden and cannot be changed to grid without calling selregparam separately. Maybe add this as an option to fitregmodel.

Same number of Gauss and Rice models

Currently, there are Gauss models up to 5 components, but 3D-Rice models only up to 3 components. The 3D-Rice models should be extended up to 5 components.

The main reason is that the Rice models are actually more rigorous: (a) They are limited to r>=0. (b) They are based on a physical model (norm of distance between Gaussian 3D spatial distributions of two spin labels).

fitmultirice etc

fitmultigauss is a convenient interface for using Gauss functions as basis functions. It would be good to have a similar function fitmultirice for 3D-Rice basis functions, and maybe others. The challenge is not to repeat code.

WLC model can return negative values

  1. Calling the WLC model with

     P = rd_wormchain(r,[3.7 100]);

     returns a distribution with negative values. Must be fixed.

  2. The model returns a row vector instead of a column vector. Must be fixed too.

normalization in multi-pathway kernels

@stestoll I see two issues with the current multi-pathway system in dipolarkernel:

Normalization of the pathway amplitudes

At the moment there is a check during input parsing which enforces the sum over all dipolar pathway amplitudes to be smaller than or equal to one.

% Assert that amplitude of non-modulated pathways is not larger than 1
% (this must hold even if the unmodulated amplitude is negative)
if round(sum(lambda)+Lambda0,5)>1
    error('In pathinfo, the sum of all lambdas cannot be larger than 1.');
end

While this is accurate and follows the theory, in practice it leads to two problems:

  • While it is easy to pass inputs respecting this property, during a fit it represents a non-linear inequality constraint, which is incompatible with most of the fast/accurate solvers in DeerLab (except fmincon, which requires a toolbox).

  • It makes sense to enforce this constraint if these values are probabilities (or cosine-modulated probabilities) and thus limited to the range 0 to 1. However, if we think of them as amplitudes (and we even call them that), it would make sense for them to be able to take any values, since only their relative values make a difference. The values could be normalized to a relative scale internally.
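Such an internal normalization could be sketched as follows (variable names follow the code snippet above; a positive sum is assumed):

```matlab
% Let the solver fit unconstrained amplitudes, then rescale them
% internally to relative weights
amps = [Lambda0 lambda(:).'];
amps = amps/sum(amps);
```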

Differences in absolute scales of the pathway amplitudes only lead to differences in the absolute scaling of the whole dipolar signal. Which leads me to the next issue...

Normalization of the multi-pathway kernels

According to multi-pathway theory, dipolar signals do not necessarily follow V(t=0) = 1 and can take values smaller than 1. Again, DeerLab's dipolarkernel simulates this accurately, but in practice it introduces some problems as well:

  • Experimental signals are usually re-scaled during pre-processing such that V(t=0) = 1 holds. If a multipulse DEER experiment is pre-processed this way, it would be impossible to fit with the current dipolarkernel since V(t=0) = 1 is not possible.

  • The current system requires a scale parameter to be introduced (instead of re-scaling in pre-processing), which introduces another fit parameter making processing (especially in regularization) more complex.

  • If the kernel were normalized such that K(t=0,r) = 1, then all these issues would be avoided, and it would also allow unbounded pathway amplitudes as described above.

I would vote for a normalization of the kernel such as we had in earlier iterations of dipolarkernel.

too complex dipolar pathway input scheme

In dipolarkernel the dipolar pathway information (lambdas, timings, and harmonics) is passed as a single matrix with the following structure:

pathwayinfo = [Lambda0 lambda1 lambda2 
                NaN    T1      T2
                NaN    n1      n2];
K = dipolarkernel(t,r,pathwayinfo)

There are some issues with this current system:

  • Having to pass NaN values as input is counter-intuitive and has huge potential for errors from the user's perspective.
  • Due to the row/column differentiation in MATLAB, we require the Lambdas, Ts and ns to be passed as columns, concatenated as rows. The most common way for users to define vectors is as rows (space or comma instead of semicolon as separator).

While writing a simple script I found myself stuck at the second issue every time I defined a new kernel.

My proposal for an alternative input scheme would use cell arrays instead of numerical ones:

  • No need for NaN values
  • No issues with dimensionality of the input, everything can be handled internally
  • Extremely straightforward concatenation of the three variables.

pathwayinfo = {Lambdas Ts ns};
K = dipolarkernel(t,r,pathwayinfo)

@stestoll what do you think? I could implement the changes myself if we agree on the design.

dipolarkernel fails for long evolution times

The following script produces a plot that is correct for the shorter trace, but wrong for the (minimally) longer one.

clear, clc, clf

r = 3; % nm
tmax = 30; % us
dt = 0.05; % us

t1 = 0:dt:tmax; % us
t2 = 0:dt:tmax+dt; % us

S1 = dipolarkernel(t1,r);
S2 = dipolarkernel(t2,r);

plot(t1,S1,t2,S2);

Golden-section search for selregparam

Use a simple golden-section search algorithm for the selection of the regularization parameter instead of an exhaustive grid search.

Preliminary tests using different noise levels and array sizes show a potential 2x-5x speedup, while finding (approximately) the same answers.
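A generic golden-section search, for illustration only (the placeholder objective stands in for the selection functional, which selregparam would minimize over log10(alpha)):

```matlab
% Golden-section minimization of a unimodal 1D objective
phi = (sqrt(5)-1)/2;           % inverse golden ratio
a = -4; b = 4;                 % illustrative bracket in log10(alpha)
fcn = @(x) (x-1).^2;           % placeholder objective, minimum at x = 1
c = b - phi*(b-a);
d = a + phi*(b-a);
while (b-a) > 1e-6
    if fcn(c) < fcn(d)
        b = d;                 % minimum lies in [a,d]
    else
        a = c;                 % minimum lies in [c,b]
    end
    c = b - phi*(b-a);
    d = a + phi*(b-a);
end
alphaopt = (a+b)/2;            % converges to 1 for the placeholder
```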

pitfall of correctzerotime

The correctzerotime function aims to find the "true" zero time of a dipolar signal. It uses an improved version of the moment-analysis algorithm used in the old DeerAnalysis software.

While it works really well for cases where the zero-time has been recorded, it fails to identify cases where the first measured point is past the zero-time.

A possible solution would be to proceed via an approach similar to correctscale and optimize the zero time by fitting a simple Gaussian model. This would, however, lead to functionality overlap with correctscale.

There is a possible design issue in here.

selregparam does not always return residual

When calling selregparam with a request for all output parameters listed in documentation, I received an error

Output argument "Residuals" (and maybe others) not assigned
during call to "selregparam".

Error in process_by_DeerLab_legacy>mymodel (line 268)
[alphaopt1,~,alphas1,res1,pen1] =
selregparam(V,KB,r,'tikh','lr','Search','grid');

Physical units heuristics system

Physical units are automatically detected based on numerical and physical heuristics. While these cover all experimental and everyday situations, the units system may not be optimal for method developers or trial cases.

backgroundstart: simplify interface

Currently, backgroundstart takes the start of the time search range as a fraction of the trace and the end time as a number of points. That's not intuitive. It would be simpler to just use times; the user can read these directly off a plot of the data.

Uniform dimensionality for arrays

DeerLab behaves unintuitively when processing arrays. The preferred orientation for arrays (horizontal or vertical) should be uniform across all functions. Especially the output should be uniform.

Functions in DeerLab seem to prefer 1 x m (horizontal) arrays. There are some exceptions (just the ones I have found so far):

  • correctzerotime is sensitive to input array dimensions and yields output with the same dimensions.
  • td_strexp (and maybe other td_ models) yields an m x 1 array and is insensitive to the input. The dipolarkernel function is insensitive to the input dimensions of the background.
  • fitregmodel yields an m x 1 (vertical) array. This is reasonable, as K * Pfit should yield Vfit; matrix multiplication is not possible with a 1 x m array.

Possible solution:

  • I would prefer a uniform dimension for output arrays.

  • It is useful if functions accept both types of arrays as input (especially for 'input' functions).

  • If 1 x m dimensionality is chosen (my preference), a utility function converting distance probabilities Pfit and the kernel K into a dipolar signal Vfit could be handy. This function handles the dimensionality of Pfit. (mockup code)

    Vfit = utility_function(K,Pfit);
    function Vh = utility_function(K,P)
       Pv = reshape(P,[length(P),1]); % make array vertical for matrix multiplication
       Vv = K*Pv; % matrix multiplication.
       Vh = reshape(Vv,[1,length(Vv)]); % make output array horizontal
    end

Maybe the current design was intentional and I did not get the reason. In that case, please close this issue.

Rename model functions rd->dd, td->bg

All current time-domain model functions (td_*) are background model functions. Also, there are no distance-domain background models. Therefore, it would be more appropriate to rename td_* to bg_*. This would be more specific, and it would allow us to introduce other types of time-domain models.

Similar reasoning applies to rd_*. These are all models for distance distributions. Renaming to dd_* would therefore make sense.

sensitivan memory check fails for some MacOS

The memory checks used by sensitivan appear to crash under macOS High Sierra (10.13.6). Since I lack a macOS development environment, this issue cannot be explored in depth.

A solution for now would be a try/catch statement for all macOS systems where the current code fails. This results in limited functionality for some macOS systems but at least does not crash.

Inconsistent parametric model names

Gauss and Rice functions are named inconsistently, e.g. rd_onegaussian and rd_onerice.

To be consistent, it should be

  • either rd_onegauss and rd_onerice
  • or rd_onegaussian and rd_onerician

The first is preferable, since it's shorter.

multipathway backgrounds are inaccessible

Currently, the background function for the multipathway models is computed internally in dipolarkernel, given a basis background function bg_model. There is, however, no way to access the multipathway background afterwards.

@stestoll The question is whether dipolarkernel should return this as a second output or to have another function/model which will generate this given a background model and pathway info.
