
iprior's Introduction

R/iprior: An R package for I-prior regression

R-CMD-check · codecov · CRAN status · CRAN downloads

Based on the manuscript entitled "Regression and Classification with I-priors" by Wicher Bergsma (2018, arXiv:1707.00274). In a general regression setting, priors can be assigned to the regression function in a vector space framework, and the posterior estimate of the regression function is then obtained. An I-prior is defined as a Gaussian distribution with some prior mean (usually zero) and covariance kernel proportional to the Fisher information for the regression function.

This package performs regression modelling using I-priors in R. It is designed to feel similar to lm, making use of familiar syntactical conventions and S3 methods, with both formula and non-formula based input. The package estimates the model hyperparameters using direct log-likelihood maximisation, the expectation-maximisation (EM) algorithm, or a combination of both. While the main interest of I-prior modelling is prediction, inference is also possible, e.g. via log-likelihood ratio tests.

For installation instructions and some examples of I-prior modelling, continue reading below. The package is documented with help files, and the vignette provides an introduction to the concept of I-priors and also to using the package.

Installation

Install R/iprior either from the latest CRAN release

install.packages("iprior")
library(iprior)

or from the developmental version in this GitHub repository. R/iprior makes use of C++ code, so as a prerequisite you must have a working C++ compiler. To get one:

  • On Windows, install Rtools.
  • On Mac, install Xcode from the App Store.
  • On Linux, sudo apt-get install r-base-dev or similar.

The easiest way to then install from this repo is by using the devtools package. Install this first.

install.packages("devtools")

Then, run the following code to install and attach the iprior package.

devtools::install_github("haziqj/iprior", build_vignettes = TRUE)
library(iprior)

Syntax

To fit an I-prior model mod regressing y against x, where both are contained in the data frame dat, the following two calls are equivalent.

mod <- iprior(y ~ x, data = dat)     # formula based input
mod <- iprior(y = dat$y, x = dat$x)  # non-formula based input

The call to iprior() can be accompanied by several other model specification arguments, including the choice of RKHS, which hyperparameters to estimate, the estimation method, and others. Control options for the chosen estimation method are passed through the control = list() argument. Find the full list of options by typing ?iprior in R.
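
For illustration, a call with explicit specification arguments might look like the sketch below; the kernel and method values and the maxit control option are illustrative, so consult ?iprior for the authoritative names and choices.

# A hedged example; kernel, method, and control values are illustrative
mod <- iprior(y ~ x, data = dat, kernel = "fbm", method = "em",
              control = list(maxit = 200))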

Resources

View the package vignette by typing browseVignettes("iprior") in R or visiting this link. This package is part of the PhD project entitled "Regression Modelling using priors depending on Fisher information covariance kernels (I-priors)" by Haziq Jamil [link].

iprior's Issues

Add function to update ipriorKernel

After fitting with iprobit(), it is necessary to fix psi = 1. Write a function kernL.update() to update which theta the ipriorKernel object should estimate.

Fitting all two-way interactions when some scale parameters are shared

It is impossible to fit the following model

y ~ (x1 + x2) + x3 + x1:x2 + x1:x3 + x2:x3

We can only fit y ~ (x1 + x2) + x3 via the non-formula input, or assign separate scale parameters to each of the interactions. The reason is that (x1 + x2) is treated as H = H1 + H2, and the individual kernel matrices H1 and H2 are not saved, so the Hadamard products needed for the interactions cannot be formed (see the sketch below).
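
To illustrate with stand-in kernel matrices, the summed matrix alone is not enough:

# Stand-in kernel matrices for illustration
H1 <- tcrossprod(rnorm(5))
H2 <- tcrossprod(rnorm(5))
H <- H1 + H2    # what is currently stored for (x1 + x2)
H12 <- H1 * H2  # the Hadamard product needed for x1:x2; not recoverable from H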

It is possible to code this to make it work, but I need to think about how the user would specify this in the iprior call. Perhaps the formula parsing needs work before this can be implemented.

On the bright side, it is unlikely that a model would have shared scale parameters and interactions at the same time... not that I've encountered one yet, anyway.

The EM algorithm feedback interface

At times, when the dataset is large enough, the EM algorithm seems to be frozen, when in fact it is just taking its time to compute. Can we implement some feedback so that the user can be assured it is not frozen, something like a progress bar?

On another note, maybe look at ways to improve the EM algorithm itself.

Print of `ipriorKernel` object

Sometimes the output of str(H_mat) is too long, and this is a bit ugly when compiling with knitr. We could add options to print.ipriorKernel(), such as digits or deparse, and pass them on to str().
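
One possible shape for this, assuming the kernel matrices are stored in a component of the object (the component name Hl below is hypothetical):

# Hypothetical sketch: forward display options through to str()
print.ipriorKernel <- function(x, ...) {
  str(x$Hl, ...)  # e.g. digits.d or max.level passed on to str()
  invisible(x)
}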

Installation error on `r-oldrel-windows-ix86+x86_64`

On CRAN submission, the build failed on this flavour with the log below. The likely cause is that stats::sigma() was only introduced in R 3.3.0, so it cannot be imported under R 3.2.5.

  • using R version 3.2.5 (2016-04-14)
  • using platform: x86_64-w64-mingw32 (64-bit)
* installing *source* package 'iprior' ...
** package 'iprior' successfully unpacked and MD5 sums checked
** libs

*** arch - i386
make[1]: Entering directory `/cygdrive/d/temp/RtmpSQBrwH/R.INSTALLb8fc55c44dd9/iprior/src-i386'
g++  -I"D:/RCompile/recent/R-3.2.5/include" -DNDEBUG    -I"d:/RCompile/CRANpkg/lib/3.2/Rcpp/include" -I"d:/RCompile/CRANpkg/lib/3.2/RcppEigen/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c RcppExports.cpp -o RcppExports.o
g++  -I"D:/RCompile/recent/R-3.2.5/include" -DNDEBUG    -I"d:/RCompile/CRANpkg/lib/3.2/Rcpp/include" -I"d:/RCompile/CRANpkg/lib/3.2/RcppEigen/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c eigenCpp.cpp -o eigenCpp.o
g++  -I"D:/RCompile/recent/R-3.2.5/include" -DNDEBUG    -I"d:/RCompile/CRANpkg/lib/3.2/Rcpp/include" -I"d:/RCompile/CRANpkg/lib/3.2/RcppEigen/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c fastSquare.cpp -o fastSquare.o
g++  -I"D:/RCompile/recent/R-3.2.5/include" -DNDEBUG    -I"d:/RCompile/CRANpkg/lib/3.2/Rcpp/include" -I"d:/RCompile/CRANpkg/lib/3.2/RcppEigen/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c fastVDiag.cpp -o fastVDiag.o
g++ -shared -s -static-libgcc -o iprior.dll tmp.def RcppExports.o eigenCpp.o fastSquare.o fastVDiag.o -Ld:/RCompile/r-compiling/local/local323/lib/i386 -Ld:/RCompile/r-compiling/local/local323/lib -LD:/RCompile/recent/R-3.2.5/bin/i386 -lR
make[1]: Leaving directory `/cygdrive/d/temp/RtmpSQBrwH/R.INSTALLb8fc55c44dd9/iprior/src-i386'
make[1]: Entering directory `/cygdrive/d/temp/RtmpSQBrwH/R.INSTALLb8fc55c44dd9/iprior/src-i386'
make[1]: Leaving directory `/cygdrive/d/temp/RtmpSQBrwH/R.INSTALLb8fc55c44dd9/iprior/src-i386'
installing to d:/Rcompile/CRANpkg/lib/3.2/iprior/libs/i386

*** arch - x64
make[1]: Entering directory `/cygdrive/d/temp/RtmpSQBrwH/R.INSTALLb8fc55c44dd9/iprior/src-x64'
g++ -m64 -I"D:/RCompile/recent/R-3.2.5/include" -DNDEBUG    -I"d:/RCompile/CRANpkg/lib/3.2/Rcpp/include" -I"d:/RCompile/CRANpkg/lib/3.2/RcppEigen/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c RcppExports.cpp -o RcppExports.o
g++ -m64 -I"D:/RCompile/recent/R-3.2.5/include" -DNDEBUG    -I"d:/RCompile/CRANpkg/lib/3.2/Rcpp/include" -I"d:/RCompile/CRANpkg/lib/3.2/RcppEigen/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c eigenCpp.cpp -o eigenCpp.o
g++ -m64 -I"D:/RCompile/recent/R-3.2.5/include" -DNDEBUG    -I"d:/RCompile/CRANpkg/lib/3.2/Rcpp/include" -I"d:/RCompile/CRANpkg/lib/3.2/RcppEigen/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c fastSquare.cpp -o fastSquare.o
g++ -m64 -I"D:/RCompile/recent/R-3.2.5/include" -DNDEBUG    -I"d:/RCompile/CRANpkg/lib/3.2/Rcpp/include" -I"d:/RCompile/CRANpkg/lib/3.2/RcppEigen/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c fastVDiag.cpp -o fastVDiag.o
g++ -m64 -shared -s -static-libgcc -o iprior.dll tmp.def RcppExports.o eigenCpp.o fastSquare.o fastVDiag.o -Ld:/RCompile/r-compiling/local/local323/lib/x64 -Ld:/RCompile/r-compiling/local/local323/lib -LD:/RCompile/recent/R-3.2.5/bin/x64 -lR
make[1]: Leaving directory `/cygdrive/d/temp/RtmpSQBrwH/R.INSTALLb8fc55c44dd9/iprior/src-x64'
make[1]: Entering directory `/cygdrive/d/temp/RtmpSQBrwH/R.INSTALLb8fc55c44dd9/iprior/src-x64'
make[1]: Leaving directory `/cygdrive/d/temp/RtmpSQBrwH/R.INSTALLb8fc55c44dd9/iprior/src-x64'
installing to d:/Rcompile/CRANpkg/lib/3.2/iprior/libs/x64
** R
** data
** inst
** preparing package for lazy loading
Error : object 'sigma' is not exported by 'namespace:stats'
ERROR: lazy loading failed for package 'iprior'
* removing 'd:/Rcompile/CRANpkg/lib/3.2/iprior'

Large installed package size

I still don't know how to solve this issue. This NOTE appears when running R CMD check on Linux systems. I think it has something to do with the C++ code and the .so file resulting from the use of RcppEigen.

Check: installed package size 
Result: NOTE 
     installed size is 7.4Mb
     sub-directories of 1Mb or more:
     libs 6.7Mb 
Flavors: r-devel-linux-x86_64-fedora-clang, r-devel-linux-x86_64-fedora-gcc

Bug in `predict()` function

If fitting using the formula method and the new data frame does not contain the response variable, then the predict function fails. This stems from the fact that it is unable to generate a comparable model frame for the new data based on the formula.
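
A minimal reproduction of the reported behaviour might look like this (data and variable names are illustrative):

# Hypothetical reproduction; dat contains columns y and x
mod <- iprior(y ~ x, data = dat)
newdat <- data.frame(x = rnorm(5))  # no y column
predict(mod, newdata = newdat)      # reportedly fails here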

Code cleanup in `indxFn()`

Not so much an issue as code that could be cleaned up. Replace the use of

zb <- which((ind1 %in% grid.PR[,1] & ind2 %in% grid.PR[,2]) | 
            (ind2 %in% grid.PR[,1] & ind1 %in% grid.PR[,2]))

with the already written helper function findH2()

zb <- apply(grid.PR, 1, findH2, ind1 = ind1, ind2 = ind2)

Fix rounding error in summary()

The value used to calculate the standard deviation of the errors is the rounded value of psi. This needs to be fixed, i.e. round the value only after calculating sigma = 1/sqrt(psi).
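
In code, the fix amounts to something like the following (the psi and digits values are illustrative):

# Round for display only, after computing sigma at full precision
psi <- 0.477409        # example estimated precision
sigma <- 1 / sqrt(psi)
round(sigma, digits = 4)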

Add space into the summary view of iprior

> summary(mod.iprior)
[need space here]
Call:
iprior.formula(formula = Hwt ~ Bwt, data = cats)

        Estimate      S.E.       z P[|Z>z|]    
alpha  10.630556  0.120607 88.1419   <2e-16 ***
lambda  0.979524  0.672822  1.4558   0.1454    
psi     0.477409  0.056445  8.4580   <2e-16 ***

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[need space here]

Also, maybe move psi to the bottom row, since a Wald test is not quite appropriate for variances.

On including the intercept in the EM update

In previous versions, alpha was included as part of the EM update, with the generalised least squares formula

alpha = (1^T Var.Y.inv 1)^{-1} (1^T Var.Y.inv y)

where 1 is a vector of ones, and just to remind ourselves that Var.Y.inv depends on lambda and psi.
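
As an R sketch, wrapped in a self-contained function with Var.Y.inv standing for the precomputed inverse of Var(Y):

# GLS estimate of the intercept (Var.Y.inv: precomputed inverse of Var(Y))
gls_intercept <- function(y, Var.Y.inv) {
  one <- rep(1, length(y))
  as.numeric(crossprod(one, Var.Y.inv %*% y) /
             crossprod(one, Var.Y.inv %*% one))
}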

As it turns out, if a non-centred version of the Canonical or FBM kernel is used, then alpha never reaches the MLE of mean(y). The estimation of the lambda parameters is affected, and in turn this affects alpha too. In a sense, the non-centring of the kernel is absorbed into the alpha parameter.

Here is an illustration from the stackloss dataset. We ran three different models.

> summary(mod1) #non-centred, updating alpha

Call:
iprior(formula = stack.loss ~ Air.Flow, data = stackloss)

RKHS used:
Canonical (Air.Flow) 
with a single scale parameter.

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-12.20000  -1.11200  -0.06797   1.02100   8.88800 

            Estimate     S.E.      z P[|Z>z|]    
(Intercept) -43.5768   5.9296 -7.349   <2e-16 ***
lambda        0.0146   0.0106  1.380    0.168    

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

EM converged to within 1e-07 tolerance. No. of iterations: 19813
Standard deviation of errors: 3.997 with S.E.: 0.6321
T2 statistic: 0.9774 on ??? degrees of freedom.
Log-likelihood value: -63.14624 
> summary(mod2) #non-centred, alpha=mean(y)

Call:
iprior(formula = stack.loss ~ Air.Flow, data = stackloss)

RKHS used:
Canonical (Air.Flow) 
with a single scale parameter.

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-10.520  -6.524  -2.524   1.476  24.480 

            Estimate    S.E.     z P[|Z>z|]    
(Intercept)  17.5238  2.1662 8.089   <2e-16 ***
lambda        0.0000 18.0729 0.000        1     #this looks wrong, lambda is 0    
---                                             #regression curve is a flat line
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

EM converged to within 1e-07 tolerance. No. of iterations: 17
Standard deviation of errors: 9.95 with S.E.: 1.5763
T2 statistic: 6.67e-09 on ??? degrees of freedom.
Log-likelihood value: -77.99705 
> summary(mod3) #centred, alpha=mean(y)

Call:
iprior(formula = stack.loss ~ Air.Flow, data = stackloss)

RKHS used:
Canonical (Air.Flow) 
with a single scale parameter.

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-12.20000  -1.11300  -0.06851   1.02000   8.88700 

            Estimate    S.E.      z P[|Z>z|]    
(Intercept)  17.5238  0.8717 20.102   <2e-16 ***
lambda        0.0991  0.0725  1.368    0.171    

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

EM converged to within 1e-07 tolerance. No. of iterations: 239
Standard deviation of errors: 3.994 with S.E.: 0.6306
T2 statistic: 0.9876 on ??? degrees of freedom.
Log-likelihood value: -61.22966 

In summary, it looks like it is 'wrong' to use a non-centred version of the kernel while setting alpha = mean(y). However, it also looks like it is 'OK' to use the non-centred version of the kernel while having alpha update in the EM iterations. From this simple illustration, using the centred kernel with alpha = mean(y) yields the highest maximised log-likelihood.

Predicted log-likelihood value

We can implement a feature which gives the predicted log-likelihood value based on the previous 3 iterations of the EM. This helps us check whether the EM is progressing well or not.

Find a way to display this information. Suggest

  • Incorporate into the EM reporting, and/or
  • Separate function which calls the stored mod.iprior which can fully display the table

The formula, translated into R:

dloglik <- loglik - loglikold
# a estimates the rate of change of the log-likelihood increments
a <- if (0 < dloglik && dloglik < dloglikold) dloglik / dloglikold else 0
predloglik <- loglik - dloglik + dloglik / (1 - a)

Change `X1` to `X` in `xnames`

...when there's only one covariate. Looks neater in summary(). The relevant code is in fix_call_default() and kernL().

EM decreasing in the first few iterations in some instances

For the stackloss dataset, I noticed that the log-likelihood decreases over the first few EM iterations. There seems to be some bug in the EM code. Perhaps some parameter values are not updated in the right sequence? Check the lambda parameter; maybe the starting value for lambda could be zero.

Use familiar terms for the options

Change:

  • delt to stop.crit

Decided to keep report.int instead of show.prog. The reason is that report.int is an option determining how often the EM steps are reported, whereas show.prog would behave like the already available option silent. However, I've found that some R functions use quietly or quiet instead of silent. I'll leave this for now...

But I will look at whether other options can be renamed to something more meaningful.

Although I think once the I-prior package is properly documented, there won't be any confusion.

Bug with the `fastVDiag()` function

Somehow, re-running the function on the same arguments returns different answers each time. I suspect that it keeps some results in memory and reuses them.

> H <- fnH3(rnorm(5))
> fastVDiag(H, 1:5)
            [,1]       [,2]        [,3]       [,4]       [,5]
[1,]  0.26970789 -0.1506874 -0.08902361 -0.3817465  0.3517496
[2,] -0.15068739  4.6687736 -2.03456593 -3.1172434  0.6337231
[3,] -0.08902361 -2.0345659  1.09680087  1.5941589 -0.5673703
[4,] -0.38174646 -3.1172434  1.59415895  3.0434338 -1.1386029
[5,]  0.35174956  0.6337231 -0.56737028 -1.1386029  0.7205005
> fastVDiag(H, 1:5)
          [,1]       [,2]       [,3]       [,4]       [,5]
[1,]  3.345030   3.128946  -3.800166  -9.258419   6.584607
[2,]  3.128946  36.077689 -16.682934 -32.035877   9.512176
[3,] -3.800166 -16.682934  11.084884  19.345309  -9.947093
[4,] -9.258419 -32.035877  19.345309  42.646019 -20.697033
[5,]  6.584607   9.512176  -9.947093 -20.697033  14.547343
> fastVDiag(H, 1:5)
           [,1]       [,2]      [,3]      [,4]      [,5]
[1,]   65.34335   67.02975  -84.8986 -189.4826  142.0081
[2,]   67.02975  404.40627 -184.4152 -424.4495  137.4287
[3,]  -84.89860 -184.41523  155.6321  307.8244 -194.1427
[4,] -189.48262 -424.44951  307.8244  711.1323 -405.0246
[5,]  142.00813  137.42871 -194.1427 -405.0246  319.7304

Specifying a different RKHS for each continuous variable

This needs a bit more thought on how the user inputs the different RKHSs into the iprior function. Things can get messy when there are a lot of variables involved.

Note: By default, the centred Canonical kernel is used for continuous variables, and Pearson for categorical. But sometimes, we would like to mix Canonical with FBM, say.

The summary screen and the use of Wald's test for parameters

Perhaps a better hypothesis test for the parameters is a log-likelihood ratio test, which is much simpler to implement. Also, what test do we want for psi?
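
A sketch of such a comparison for two nested fits, assuming a logLik() method is available for fitted iprior objects:

# Hypothetical likelihood ratio test between nested I-prior models
lrt.stat <- as.numeric(2 * (logLik(mod.full) - logLik(mod.restricted)))
pchisq(lrt.stat, df = 1, lower.tail = FALSE)  # df = no. of restricted parameters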

On an aesthetic note, change the alpha parameter name to (Intercept), as some users might not be familiar with what alpha is.

Wrong scale for predictions from "predict(...,newdata=...)"

Great package!

However, I'm getting odd predictions that appear to be on a different scale than the training data when specifying an "se" kernel and predicting with "newdata". I've attached a small R script with an example. (I changed the file extension to .txt to stop GitHub from complaining when I drag & drop.)

I hope this has an easy work-around. If you have one, please let me know.
Thank you.
iprior_bug.txt

The T squared statistic

...which is yet to be named. Will add this to the summary screen; useful for model comparison.

T2 = crossprod(w.hat, w.hat) / psi

Speeding up the programme

After some timing tests, these are the bottlenecks in the programme:

  1. The many matrix multiplications of the kernel matrix, H.mat %*% H.mat. We need some clever way to reduce these occurrences. This is doable, I think: the plan is to do all possible matrix multiplications once, store them, and reuse them as needed, so they are not evaluated again and again.
  2. Evaluating the inverse of Var.Y. The default R method is solve(Var.Y). A faster, albeit less stable, way is to use the Cholesky decomposition, chol2inv(chol(Var.Y)), which is about 5 times faster than the regular method (see the timing sketch after this list).
  3. Figure out an alternative method to estimate the intercept, because the current generalised least squares method heavily relies on the inverse of Var.Y. If I can find another method which does not depend on this inverse, then I can speed up the programme. For example, sometimes it is as simple as intercept = mean(Y).
  4. Calculation of the log-likelihood via the dmvnorm() function also takes a substantial amount of time, though I'm not sure how much can be saved by writing my own function. For reference, dmvnorm() takes about 1/8th of the time of a solve(). Will look into this.
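
Regarding point 2, a quick hedged timing sketch (matrix size illustrative):

# Compare solve() with a Cholesky-based inverse on an SPD matrix
n <- 1000
A <- crossprod(matrix(rnorm(n * n), n)) + diag(n)  # symmetric positive-definite
system.time(inv1 <- solve(A))
system.time(inv2 <- chol2inv(chol(A)))
all.equal(inv1, inv2)  # same result; chol2inv(chol()) is typically faster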

@Wicher2 can you comment on the above points, specifically 2-4? Thanks.

Prediction of new data points with the Pearson kernel.

Suppose we fit an I-prior model mod <- iprior(y ~ x, data), and we use the predict function to obtain the predicted value at x[1]. Why doesn't this value match fitted(mod)[1]?

I suspect there is a wrong implementation in the Pearson kernel code. The Pearson kernel is h(x_i, x_j) = I(x_i = x_j) / p_i - 1, where p_i is the proportion of the data which has the same value as x_i. When a new data point x_k is introduced which has the same value as x_i, does p_i change as well, or does it remain the same?
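
For reference, a minimal sketch of the Pearson kernel as described above, with the proportions estimated from the training sample only, so the open question is whether they should be re-estimated when x_k arrives:

# Pearson kernel sketch: h(a, b) = I(a == b) / p(b) - 1,
# with proportions p taken from the training data only
pearson_kernel <- function(x.new, x.train) {
  p <- table(x.train) / length(x.train)
  outer(as.character(x.new), as.character(x.train),
        function(a, b) (a == b) / p[b] - 1)
}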

New feature: Function to re-run EM from previous stopped values.

The intended feature would behave as follows:

mod.iprior <- iprior(y ~ x, data)
update(mod.iprior)

There already exists a generic update() function in R. In the lm setting, update() is used to update the model formula. Perhaps we could also exploit this to simultaneously solve issue #13.
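
A first pass at such a method might look like the sketch below; the stored component names (object$ipriorKernel, object$theta) and the theta0 control option are hypothetical:

# Hypothetical sketch: resume estimation from the last stored parameter values
update.ipriorMod <- function(object, ...) {
  iprior(object$ipriorKernel, control = list(theta0 = object$theta), ...)
}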
