itsrainingdata / sparsebn

Software for learning sparse Bayesian networks

R 100.00%

bayesian-networks covariance-matrices experimental-data graphical-models regularization machine-learning statistics r

sparsebn's Introduction

sparsebn

Introducing sparsebn: A new R package for learning sparse Bayesian networks and other graphical models from high-dimensional data via sparse regularization. Designed from the ground up to handle:

Experimental data with interventions
Mixed observational / experimental data
High-dimensional data with p >> n
Datasets with thousands of variables (tested up to p=8000)
Continuous and discrete data

The emphasis of this package is scalability and statistical consistency on high-dimensional datasets. Compared to existing algorithms, sparsebn scales much better and is under active development. For more details on this package, including worked examples and the methodological background, please see our new preprint [1].

Overview

The main methods for learning graphical models are:

estimate.dag for directed acyclic graphs (Bayesian networks).
estimate.precision for undirected graphs (Markov random fields).
estimate.covariance for covariance matrices.

Currently, estimation of precision and covariance matrices is limited to Gaussian data.

The workhorse behind sparsebn is the sparsebnUtils package, which provides various S3 classes and methods for representing and manipulating graphs. The basic algorithms are implemented in ccdrAlgorithm and discretecdAlgorithm.

Installation

You can install:

the latest CRAN version with
```
install.packages("sparsebn")
```

the latest development version from GitHub with

devtools::install_github(c("itsrainingdata/sparsebn/", "itsrainingdata/sparsebnUtils/dev", "itsrainingdata/ccdrAlgorithm/dev", "gujyjean/discretecdAlgorithm"))

References

[1] Aragam, B., Gu, J., and Zhou, Q. (2017). Learning large-scale Bayesian networks with the sparsebn package. arXiv: 1703.04025.

[2] Aragam, B. and Zhou, Q. (2015). Concave penalized estimation of sparse Gaussian Bayesian networks. Journal of Machine Learning Research, 16(Nov): 2273-2328.

[3] Fu, F., Gu, J., and Zhou, Q. (2014). Adaptive penalized estimation of directed acyclic graphs from categorical data. arXiv: 1403.2310.

[4] Aragam, B., Amini, A. A., and Zhou, Q. (2015). Learning directed acyclic graphs with penalized neighbourhood regression. arXiv: 1511.08963.

[5] Fu, F. and Zhou, Q. (2013). Learning sparse causal Gaussian networks with experimental intervention: Regularization and coordinate descent. Journal of the American Statistical Association, 108: 288-300.


sparsebn's Issues

Add call argument to output of estimate.dag

Similar to default R modeling functions such as lm and glm. See ?match.call. This will require modifying sparsebnFit in sparsebnUtils to accommodate the new component (see related issue).
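As a rough illustration of the pattern this issue refers to, the sketch below stores the result of match.call() in the returned object, the way lm and glm do. The function name fit_sketch and its fields are hypothetical and are not part of sparsebn or sparsebnUtils.

```r
# Hypothetical estimator used only to illustrate the pattern; it is not
# part of sparsebn or sparsebnUtils.
fit_sketch <- function(data, lambdas.length = 20) {
  cl <- match.call()          # capture the call as the user wrote it
  out <- list(
    nedge = 0L,               # placeholder for real estimation output
    call  = cl                # same component name that lm and glm use
  )
  class(out) <- "fit_sketch"
  out
}

# Print method that echoes the stored call, mirroring print.lm
print.fit_sketch <- function(x, ...) {
  cat("Call:\n")
  print(x$call)
  invisible(x)
}

fit <- fit_sketch(data = mtcars, lambdas.length = 10)
print(fit)
```

Storing the call also makes update()-style workflows possible, since the arguments can be recovered from fit$call.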

Add whitelists and blacklists

Correct error message for mixed data

current <- cbind(c(0, 1, 1, 0),
                 c(2, 1, 0, 1),
                 c(0, 0, 3, 0),
                 c(0.35, 5, 10, 7),
                 c(4, 1, 0, 7))
current.data <- sparsebnData(current, type = "mixed")

# Error in sparsebnData.data.frame(as.data.frame(x), type, levels, ivn) :
# Invalid 'type' entered: Must match one of 'continuous', 'discrete', 'mixed'.

In RStudio, this command terminates the R session.
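One way the message could be made consistent is to build it from the list of types that are actually accepted. The helper below is a hypothetical sketch, not sparsebn code, and it assumes that only "continuous" and "discrete" are currently supported, which is what this report suggests.

```r
# Hypothetical helper, not sparsebn code. Assumes only "continuous" and
# "discrete" are supported, as implied by the error reported above.
check_type_sketch <- function(type) {
  supported <- c("continuous", "discrete")
  # pmatch() allows abbreviations such as "c" or "d"
  idx <- pmatch(type, supported)
  if (length(type) != 1L || is.na(idx)) {
    stop("Invalid 'type' entered: must match one of ",
         paste0("'", supported, "'", collapse = ", "), ".",
         call. = FALSE)
  }
  supported[idx]
}

check_type_sketch("c")
# check_type_sketch("mixed") stops with a message listing only the
# supported types
```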

Incorrect argument in documentation for estimate.dag

Link to offending code

#' dat <- sparsebnData(cytometryContinuous$data, type = "d", ivn = cytometryContinuous$ivn)

should be

#' dat <- sparsebnData(cytometryContinuous$data, type = "c", ivn = cytometryContinuous$ivn)

(Note the value of type.)

least squares loss

Request for the least squares loss to handle non-Gaussian continuous data with linear relations as described in Equation 1.3 of this preprint: https://arxiv.org/abs/1511.08963

Function to_igraph not converting network to graph?


> library(sparsebn)
> library(igraph)
> cyto.data <- sparsebnData(cytometryContinuous[["data"]],
+                           type = "continuous",
+                           ivn = cytometryContinuous[["ivn"]])
> cyto.learn <- estimate.dag(data = cyto.data)
> cyto.param <- estimate.parameters(cyto.learn, data = cyto.data)
> param=select.parameter(cyto.learn, cyto.data)
> 
> 
> cyto.learn.igraph=to_igraph(cyto.learn[[param ]])
> 
> get.edges(cyto.learn.igraph)
Error in ends(graph, es, names = FALSE) : Not a graph object
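One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: cyto.learn[[param]] is a fitted object rather than a bare graph, so the value passed to get.edges may not be an igraph object. In addition, igraph's get.edges expects edge ids as a second argument. The sketch below converts the edges component of the selected fit directly (assuming the fit exposes it as $edges) and lists the edges with as_edgelist.

```r
library(sparsebn)
library(igraph)

cyto.data <- sparsebnData(cytometryContinuous[["data"]],
                          type = "continuous",
                          ivn = cytometryContinuous[["ivn"]])
cyto.learn <- estimate.dag(data = cyto.data)
param <- select.parameter(cyto.learn, cyto.data)

# Assumption: the selected fit stores its graph in the `edges` component,
# and to_igraph() on that component returns an igraph object.
cyto.graph <- to_igraph(cyto.learn[[param]]$edges)

is_igraph(cyto.graph)     # confirm the class before calling igraph functions
as_edgelist(cyto.graph)   # two-column matrix with one row per edge
```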

sparsebnData does not work well if data is integer.

sparsebnData(data, type = "continuous") does not work well if data is integer.
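Until this is addressed, a possible workaround is to convert integer columns to double before building the data object. This is a sketch on toy data and assumes the problem is caused by the integer storage type; the column names are illustrative only.

```r
# Toy integer-valued data; column names are illustrative only.
dat <- data.frame(x = c(1L, 4L, 2L, 7L), y = c(3L, 3L, 5L, 1L))

# Convert every integer column to double, leaving other columns untouched.
int_cols <- vapply(dat, is.integer, logical(1))
dat[int_cols] <- lapply(dat[int_cols], as.numeric)

str(dat)
# The converted data frame could then be passed on, e.g.
# sparsebnData(dat, type = "continuous")
```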

Get error with select.parameter with discrete data

Error:

> param=select.parameter(current.learn, current.data)
 Error in eval(expr, envir, enclos) : object 'Anti' not found

This error only appears when the input dataset of sparsebnData() is a matrix. I observed it with discrete data. The error does not occur when the input dataset of sparsebnData() is a data.frame.

Maybe issue a warning when the input to sparsebnData() is not a data.frame?
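The suggested warning could look roughly like the following. The wrapper name as_df_with_warning is hypothetical; it only shows the coercion step and does not call sparsebnData() itself.

```r
# Toy discrete data stored as a matrix
mat <- cbind(c(0, 1, 1, 0), c(2, 1, 0, 1), c(0, 0, 3, 0))

# Hypothetical wrapper: warn and coerce up front instead of failing later.
as_df_with_warning <- function(x) {
  if (!is.data.frame(x)) {
    warning("Input is not a data.frame; coercing with as.data.frame().",
            call. = FALSE)
    x <- as.data.frame(x)
  }
  x
}

df <- as_df_with_warning(mat)   # emits the warning once
names(df)                       # default column names assigned by as.data.frame
```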

Add link to cytoscape

Allow users to visualize and explore the output of sparsebn methods in Cytoscape, e.g. using RCytoscape.

Get error with estimate.dag with continuous data

Error:

cyto.learn <- estimate.dag(cyto.data, whitelist = whitelist,lambdas.length = 10)
Error in weights < -1 || weights > 1 :
'length = 6922161' in coercion to 'logical(1)'

I get this error on only one workstation; the same code runs fine on another workstation.
Also, I am working with 9897 nodes and 6 observational samples. The job has been running for the last 6 days on a single CPU core. How long should it take to estimate the DAGs? Can the sparsebn R package run on a multicore processor or on a GPU?
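The wording of this message matches the error that R 4.3.0 and later raise when `||` receives an operand of length greater than one; earlier versions used only the first element, at most with a warning. One possible explanation, which is an assumption and not confirmed here, is that the two workstations run different R versions. A range check that works for a whole matrix on any version can be sketched as follows, using a toy weight matrix:

```r
weights <- matrix(c(0.2, -0.5, 1.5, 0), nrow = 2)  # toy weight matrix

# Elementwise comparison with `|`, reduced by any(), yields a single
# TRUE/FALSE regardless of how many entries `weights` has.
out_of_range <- any(weights < -1 | weights > 1)
out_of_range
```

Comparing the output of R.version.string on both machines would confirm or rule out this explanation.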

Pathfinder data is missing correct node names

Should be "Fault", "F1", "F2", etc. See BN repository.

Improve methods for exploring graphs

Extend and improve the existing show.parents method to filter by richer criteria such as minimum number of parents, number of children, v-structures, etc.

Write a filtering method, filter.nodes, that returns a list of filtered nodes
Write a wrapper that calls show.parents on the output of filter.nodes
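A minimal sketch of what filter.nodes might do is given below. It assumes, purely for illustration, that the graph is available as a named list mapping each node to its parents; the function name filter_nodes_sketch and its arguments are hypothetical and do not reflect the sparsebnUtils classes.

```r
# Hypothetical representation: a named list mapping each node to its parents.
parents <- list(A = character(0),
                B = "A",
                C = c("A", "B"),
                D = c("A", "B", "C"))

# Sketch of filter.nodes: return the names of nodes meeting all criteria.
filter_nodes_sketch <- function(parents, min.parents = 0, min.children = 0) {
  n_parents <- lengths(parents)
  # A node's children are the nodes that list it as a parent.
  n_children <- vapply(names(parents), function(v) {
    sum(vapply(parents, function(p) v %in% p, logical(1)))
  }, integer(1))
  names(parents)[n_parents >= min.parents & n_children >= min.children]
}

filter_nodes_sketch(parents, min.parents = 2)
filter_nodes_sketch(parents, min.children = 2)
```

The returned character vector is the kind of node list that the proposed wrapper could hand to show.parents.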

Improve output when estimate.parameters is singular

Instead of returning an error when estimate.parameters encounters a singular Gram matrix, return valid estimates for those nodes that are nonsingular (< n parents). Output a warning, and return NA for nodes with > n parents.

Also, improve the error message, which is a bit cryptic:

  Error in fit_glm_dag(edges, data$data, call = "lm.fit", ...) :
        Node 465 has too many parents! <27 > 22>
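The requested behaviour could be sketched as follows: fit each node separately, and for nodes with more than n parents emit a warning and return NA instead of stopping. The helper fit_nodes_sketch is hypothetical and is not the sparsebn implementation; it uses plain least squares via lm.fit on toy data.

```r
# Hypothetical helper, not the sparsebn implementation: regress each node on
# its parents, returning NA (with a warning) when a node has more parents
# than there are observations.
fit_nodes_sketch <- function(data, parents) {
  n <- nrow(data)
  nodes <- stats::setNames(names(parents), names(parents))
  lapply(nodes, function(node) {
    pa <- parents[[node]]
    if (length(pa) > n) {
      warning("Node ", node, " has ", length(pa), " parents but only ", n,
              " observations; returning NA.", call. = FALSE)
      return(stats::setNames(rep(NA_real_, length(pa)), pa))
    }
    if (length(pa) == 0) return(numeric(0))   # root node: nothing to estimate
    X <- as.matrix(data[, pa, drop = FALSE])
    stats::lm.fit(X, data[[node]])$coefficients
  })
}

set.seed(1)
toy <- as.data.frame(matrix(rnorm(15), nrow = 3))   # n = 3, columns V1..V5
toy.parents <- list(V1 = character(0),
                    V2 = "V1",
                    V5 = c("V1", "V2", "V3", "V4"))  # 4 parents > 3 rows
est <- fit_nodes_sketch(toy, toy.parents)            # warns about V5 only
est
```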

Need to add an argument in estimate.dag function

In the cd.run() function, I added an argument: adaptive.
It used to be:
cd.run <- function(indata, weights=NULL, lambdas=NULL, lambdas.length=30, error.tol=0.0001, convLb=0.01, weight.scale=1.0, upperbound = 100.0) {...}

I have now updated it to:
cd.run <- function(indata, weights=NULL, lambdas=NULL, lambdas.length=30, error.tol=0.0001, convLb=0.01, weight.scale=1.0, upperbound = 100.0, adaptive = FALSE) {...}

Jiaying