nutterb / hydenet Goto Github PK

View Code? Open in Web Editor NEW

23.0 23.0 3.0 7.7 MB

Hybrid Decision Networks in R

License: Other

R 76.06% HTML 23.64% TeX 0.30%

hydenet's People

Contributors

Stargazers

Watchers

Forkers

rpietro ryscott5 sethkurtmason

hydenet's Issues

Value of Perfect Information analyses

I made an issue for VPI analysis in the original pgm package.

I don't know if it's really necessary to make a function to do this. The reason is that VPI analysis involves constructing two (compiled) decision networks - For the first, the "information" edge is included (i.e., edge going from a variable to a decision node) while for the second it is excluded. Then, we use MEU to get a maximum expected utility for each of the two networks. The Value of Perfect information is then the difference in MEU between the two networks.

So I think it is no dedicated function but perhaps an example within a vignette called something like "Econometric analysis with Hyde"

Mapping needed between factor labels and levels.

When passing observed data, jags wants the level number, which makes it difficult to know when exactly needs to be passed. Some kind of mapping schemed needs to be included in compileJagsModel to navigate the difference. Another function will be needed for the output of coda.samples to return it to the labels.

Deterministic node type declaration requires running setNode()

Currently, if I'm not mistaken, you have to do the work of populating all the node distributions in the network before you can take advantage of all the fancy plot.HydeNetwork() defaults for visualization.

Sometime, we should think about giving users the capacity to declare node types for all nodes in the network up front, so that they can use the plot method without needing to use setNode() to say which nodes are deterministic in nature.

Error on default call of HydeNetwork()

I can't remember if this was a known issue or not, but I thought I'd post just in case it wasn't:

When HydeNetwork() is run with the PE training dataset, it produces an error:

data(PE, package='Hyde')
autoNet <- HydeNetwork(~ wells
+ pe | wells
+ d.dimer | pregnant_pe
+ angio | pe
+ treat | d.dimer_angio
+ death | pe_treat,
data = PE)
Error in [[<-(_tmp*, which(sapply(nodeFitter, function(x) x == "glm")), :
no such index at level 2

I think the error has to do with indexing within the nodeFitterArgs object at about line 103, where it tries to populate the family='binomial' argument for glm().

The weird thing is that when I step through HydeNetwork.formula(), the error doesn't show up.

Efficiency of compileDecisionModel

It just occurred to me that compileDecisionModel might be running writeNetworkModel on each iteration of the policy matrix. This isn't really necessary because the model definition isn't changing, just the observed nodes. And when writeNetworkModel runs, it is rebuilding models if they aren't already built. That's a lot of time we don't need to use.

setNodeModels() strict requirement of parent nodes in model

Is there a downside to loosening the requirement that a model object must contain all parent nodes present in the graph?

net <- HydeNetwork(~ card1.ace | card1
                   + card2.ace | card2
                   + initialPoints | card1*card2*card1.ace*card2.ace
                   + hit1 | initialPoints*dealerUpcard
                   + card3 | hit1
                   + card3.ace | card3
                   + pointsAfterCard3 | initialPoints*card3*card3.ace
                   + hit2 | pointsAfterCard3*dealerUpcard
                   + card4 | hit2
                   + card4.ace | card4
                   + pointsAfterCard4 | pointsAfterCard3*card4*card4.ace
                   + hit3 | pointsAfterCard4*dealerUpcard
                   + card5 | hit3
                   + card5.ace | card5
                   + pointsAfterCard5 | pointsAfterCard4*card5*card5.ace
                   + playerFinalPoints | initialPoints*hit1*pointsAfterCard3
                                         *hit2*pointsAfterCard4*hit3*pointsAfterCard5
                   + dealerFinalPoints | dealerUpcard
                   + payoff | playerFinalPoints*dealerFinalPoints)

data(BlackJackTrain)

# glm.hit1 <- glm(hit1 ~ initialPoints + I(dealerUpcard %in% c("9","10-K","A")),
#                 data = BlackJackTrain, family="binomial")

#this model is missing 'dealerUpcard' even though it's in the graph:
glm.hit1 <- glm(hit1 ~ initialPoints, data = BlackJackTrain, family="binomial")

net <- setNodeModels(net, glm.hit1)#, glm.hit2, glm.hit3)

Error in setNodeModels(net, glm.hit1) : 
1: The following model independent variables are not identical to the node parent list: hit1

htmlwidgets dependency

R is barking at me because it wants htmlwidgets.

It seems this package needs to be installed from GitHub, which is a barrier for the common user. Do we really need it?

> devtools::install_github("nutterb/HydeNet", ref="development")
Downloading github repo nutterb/HydeNet@development
Installing HydeNet
"C:/PROGRA~1/R/R-31~1.3/bin/x64/R" --vanilla CMD INSTALL  \
  "C:/Users/daltonj/AppData/Local/Temp/Rtmp0GtC7C/devtoolse5015101222/nutterb-HydeNet-9269616" --library="C:/Program  \
  Files/R/R-3.1.3/library" --install-tests 

* installing *source* package 'HydeNet' ...
** R
** data
*** moving datasets to lazyload DB
Warning: namespace 'HydeNet' is not available and has been replaced
by .GlobalEnv when processing object 'BlackJack'
Warning: namespace 'HydeNet' is not available and has been replaced
by .GlobalEnv when processing object 'BlackJack'
** preparing package for lazy loading
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  there is no package called 'htmlwidgets'
ERROR: lazy loading failed for package 'HydeNet'
* removing 'C:/Program Files/R/R-3.1.3/library/HydeNet'
* restoring previous 'C:/Program Files/R/R-3.1.3/library/HydeNet'
Error: Command failed (1)

summary() methods for `HydeNetwork` objects

For (uncompiled) HydeNetwork objects, some sort of printout listing nodes within each node type:

Decision Nodes:
  hit1 | initialPoints*dealerUpcard       some_useful_stuff
  hit2 | pointsAfterCard3*dealerUpcard    some_useful_stuff

Utility Nodes
  payoff | playerFinalPoints*dealerFinalPoints      some_more_useful_stuff

Random Variable Nodes:
  card1           dcat    some_other_useful_stuff
  card2           dcat    some_other_useful_stuff
  card3 | hit1    dcat    some_other_useful_stuff
  card4 | hit2    dcat    some_other_useful_stuff
  card5 | hit3    dcat    some_other_useful_stuff

Deterministic Nodes
  etc
  etc
  etc

For compiled.HydeNetwork objects, perhaps this could be annotated with some node type(/node istribution)-specific summary statistics. Or perhaps not...

blackjack network does not compile

Have a look at the present decision nodes vignette. I can't get the blackjack network to compile. The issue seems complicated enough that I can't figure out if it is user error (probably the case) or a bug.

The error message alludes to useful information on line 28 of the jags script:

Error in rjags::jags.model(textConnection(writeNetworkModel(network)),  : 
  RUNTIME ERROR:
Compilation error on line 28.
Unknown parameter Unspecified

But how is this script accessed?

setNodeModels() for multiple nodes using a single GLM

See the Decision Networks vignette, under section 'Decision Nodes'. What if we wished to use a single GLM to populate the node distributions for hit1, hit2 and hit3?

I don't think we can do that without renaming stuff and/or re-estimating the model?

computeUtility() and MEU()

computeUtility(decisionNet, policies)
MEU(decisionNet, policies)

decisionNet must be "compiled", with coda.samples matrix of posterior samples populated

parameter policies is a data structure containing user-specified combinations of values of the decision nodes (i.e., "policies") in the network. If someone wants to only look at 1 of 3 decision nodes, they just leave the column of the other two decision nodes blank.

Example:

policies = data.frame(
goToCollege = c("","","",""),
haveKids = c("No","No","Yes","Yes"),
moveToAbuDhabi=c("No","Yes","No","Yes") # that used to be a joke
)

The column names of the policies data frame must be equal to (or some subset of?) the names of the decision nodes in decisionNet. Maybe we can get away with not needing to specify blank columns though, with a warning message like: "NOTE: no values for the following decision nodes were set in the policies argument to computeUtility(): goToCollege. These decision nodes will be marginalized over in the computation of utilities."?

The computeUtility() function will then repeatedly run JAGS (have a parallel=TRUE parameter, as they did in random.forest?) with the decision nodes set to values of given rows in the policies data frame, outputting (only?) the coda.samples() matrix of nodes that are parents of utility nodes in the network. After JAGS returns, columns represented computed values of the utility nodes can be added to the coda.samples() matrix. Also, a node called totalUtility can then be added which is simply the sum of the utility node values.

MEU just does the same, but does the additional work of returning an appended policies data frame with a column representing expectedUtility (average of the totalUtility column). I guess returning the data frame sorted by expectedUtility would be useful.

Conditional Logic Nodes

(This issue originated in #24 )

Just a thought - and maybe you're thinking along these lines already. Would it simplify things to do the following? Remember this is merely idea generation.

Use rToJags() for the general case of specifying deterministic nodes with an equation
*define a series of "helper" functions that may be frequently used. Other packages have defined things like and(), or(), and xor(), etc., as an example. The way these might work could be like this: and() would verify logical nature of its arguments: f(parent1), f(parent2), f(parent3), ..., and then return 0 or 1 appropriately (or 1 or 2 if jags). So something like and(di_1 < 2, di_2 ==6, di_3 >= 4) for a dice network.
we'd need to be careful about these names and whether or not they overlap with base pkg function names or other popular pkgs.
*Not sure what other functions come to mind but these set-theoretic functions might be a useful base set.

Get Node Distributions from Model Object

Within setNode(), we could have a new parameter called modelObject.

Error Checking:

Logic/error checking to ensure class(modelObject) %in% c("lm","glm","multinom",<any others?>)

Logic/error checking to ensure all variables showing up in model equation actually are parents of the node in the graph. Something like:

length(setdiff(setdiff(names(modelObject$model), node), parents("d.dimer", net$dag))) == 0

But need to be careful that modelObject$model exists for all types of allowable model classes.

A good test would be to try to do a spline model (e.g., y ~ bs(x1, df=5))

Behavior (pseudocode):

modelType <- class(modelObject)

if(modelType)=="lm" {
nodeType <- "dnorm";
nodeFormula <- [ extract fitted regression equation from modelObject ]
} else if(modelType == "glm") {
nodeType <- [logic to see which family was used for the glm]
[behavior slightly varies depending on nodeType - may or may not need to specify error distribution]
} else {
nodeType <- "dcat";
[ set multinomial logistic regression equations using the multinom object ]
}

Question on Policy Matrices

Would it be fair to characterize policy matrices as a potential subset of all possible combinations of decisions from the set of decision networks?

I'm wondering if I can use the same infrastructure of the decision networks in the policy matrices.

Bug in vectorProbs()

When you use setNode() in conjunction with vectorProbs, you can inherit the node parameter:

net <- setNode(net, wells,
nodeType = "dcat",
pi = vectorProbs(p = c(.3, .6, .1)) )

But when you do this, it doesn't construct the nodeParams string correctly:

net$nodeParams$wells

$pi

[1] "pi.[1] <- 0.3; pi.[2] <- 0.6; pi.[3] <- 0.1"

It should be:

$pi

[1] "pi.wells[1] <- 0.3; pi.wells[2] <- 0.6; pi.wells[3] <- 0.1"

Possible directions to consider in the more distant future

Make a customized Metropolis-Hastings algorithm that accepts a particle matrix (e.g., 50000 sampled realizations from coda.samples()) for specification of the prior distributions
Make another customized M-H algorithm that explores the

Add nodes= argument to print.HydeNetwork

This is neither important nor urgent.

What about optionally specifying a vector of node names to print.HydeNetwork so as to restrict the output?

univariate distributions now getting warning

Not sure if this is related to argument processing in setNode after implementing setNodeModels, but I now get a warning message:

net <- HydeNetwork(~ wells

pe | wells
d.dimer | pregnant*pe
angio | pe
treat | d.dimer*angio
death | pe*treat)

net <- setNode(network = net, node = pregnant, nodeType = "dbern", p=.4)
Error in if (wrn.flag) warning(paste(warn.msg, collapse = "\n")) :
argument is not interpretable as logical

pretty=TRUE in writeNetworkModel()

The numbers aren't showing up that prettily. Just a thought: should we round coefficient estimates in the pretty output?

If this is a large amount of effort, then let's do this after we've done other stuff.

Add-on: Construct hydeNetwork objects by simply passing a list of model objects

It would read the regression equations from the model objects and make the network using the predictors in the model equations.

We would still need to input the prior distributions for the predictor variables.

JAGS parse error with compileJagsModel

g1 <- lm(wells ~ 1, data=PE)
g2 <- glm(pe ~ wells, data=PE, family="binomial")
g3 <- lm(d.dimer ~ pe + pregnant, data=PE)
g4 <- xtabs(~ pregnant, data=PE)
g5 <- glm(angio ~ pe, data=PE, family="binomial")
g6 <- glm(treat ~ d.dimer + angio, data=PE, family="binomial")
g7 <- glm(death ~ pe + treat, data=PE, family="binomial")

bagOfModels <- list(g1,g2,g3,g4,g5,g6,g7)

bagNet <- HydeNetwork(bagOfModels)

compiled.bagNet <- compileJagsModel(bagNet, n.chains=3)
Error in rjags::jags.model(textConnection(writeNetworkModel(network)), :

Error parsing model file:
syntax error on line 2 near "~"

writeJagsModel(); writeJagsFormula.XXXX()

These functions seem to be buggy.

g <- glm(treat ~ d.dimer*angio, data=PE, family="binomial")
writeJagsFormula(g)
Error: could not find function "writeJagsFormula"
Hyde::writeJagsFormula(g)
Error: 'writeJagsFormula' is not an exported object from 'namespace:Hyde'
Hyde::writeJagsFormula.glm(g)
[1] "treat ~ 1/(1 + exp(-6.28559524027548))"

I think just a little bit of work on the methods is what is needed? The function works when you explicitly tell it you have a glm object.

Making setNode() and modelToNode() play together

HydeNetwork.list() is awesome.

It would be equally beneficial to be able to apply modelToNode() for individual nodes. I think this is as simple as making a network object argument to the modelToNode methods (set to NULL). By default the function will work as-is, i.e., as called from HydeNetwork.list(). But if the user feeds a network object to the method, it will:

check that indeed the argument is of class 'HydeNetwork'
check that the name of the response variable in the 'model' argument is actually the name of a node in the network
check all.equal(sort(unique(names_of_variables_in_regression_equation))), sort(unique(names_of_parents_in_network_object)))
set all the list elements for that node within the HydeNetwork object and return an updated object.

Thoughts on its usefulness?

plot.HydeNetwork() showing up in browser

Is there an option in DiagrammeR or something that tells R to open a browser window for plot output?

Default behavior of HydeNetwork() w/ Training Dataset

See the dataset PE I inserted in the package.

I'm getting a rather cryptic error message:

data(PE, package='Hyde')
net <- HydeNetwork(~ wells
               + pe | wells
               + d.dimer | pregnant*pe
               + angio | pe
               + treat | d.dimer*angio
               + death | pe*treat,
               data = PE)
Error in [[<-(*tmp*, which(sapply(nodeFitter, function(x) x == "glm")), :
no such index at level 2
head(PE)
wells pregnant pe angio d.dimer treat death
1 3 No No Negative 206 No No
2 2 No No Positive 180 Yes No
3 3 No No Negative 197 No No
4 3 No No Negative 192 Yes No
5 2 No No Negative 281 Yes No
6 6 No Yes Positive 318 Yes No

Complex effects (interactions, polynomials, etc.) in writeJagsModel.xxx()

This might be a toughie.

The interactions don't seem to be getting converted to JAGS code from inputted R model objects. See code below for illustration. I think the problem is within writeJagsModel.lm(), writeJagsModel.glm(), and writeJagsModel.multinom(), specifically in the way the model objects are used to parse out terms --line 10 in writeJagsModel.lm() seems to be agnostic to interactions.

A natural extension of this issue is with polynomial terms, interactions between continuous/polynomial terms and factors, etc., etc., etc. However, why don't we constrain the issue for the beta release to merely making sure that a) interactions and b) polynomial terms are supported, unless a general solution is relatively easily identified.

I will add a little section warning the user on the limited scope of writeJagsModel.xxx, and that setNode can be used in place when complex models need to be specified.

bagOfModels <- list(g1,g2,g3,g4,g5,g6,g7)

bagNet <- HydeNetwork(bagOfModels)
writeNetworkModel(bagNet, pretty=TRUE)

model{
wells ~ dnorm(wells ~ 3.7941999999999, 0.630504251834359)
pe ~ dbern(pe ~ ilogit(-3.90355 + 0.5757_wells))
d.dimer ~ dnorm(d.dimer ~ 210.24251 + 68.37938_(pe==2) + 29.29496_(pregnant==2), 0.0334725036145318)
pi.pregnant[1] <- 0.9014; pi.pregnant[2] <- 0.0986
pregnant ~ dcat(pi.pregnant)
angio ~ dbern(angio ~ ilogit(-2.22585 + 3.28411_(pe==2)))
treat ~ dbern(treat ~ ilogit(-5.89316 + 0.01994_d.dimer + 1.73354_(angio==2)))
death ~ dbern(death ~ ilogit(-4.18763 + 5.48082_(pe==2) + -1.93576_(treat==2)))
}

new.DDimer.Model <- lm(d.dimer ~ pe * pregnant, data=PE)
bagNet <- setNodeModels(bagNet, new.DDimer.Model)

writeNetworkModel(bagNet, pretty=TRUE)

model{
wells ~ dnorm(wells ~ 3.7941999999999, 0.630504251834359)
pe ~ dbern(pe ~ ilogit(-3.90355 + 0.5757_wells))
d.dimer ~ dnorm(d.dimer ~ 210.033541553503, 0.0335041033258058) ##no interaction!
pi.pregnant[1] <- 0.9014; pi.pregnant[2] <- 0.0986
pregnant ~ dcat(pi.pregnant)
angio ~ dbern(angio ~ ilogit(-2.22585 + 3.28411_(pe==2)))
treat ~ dbern(treat ~ ilogit(-5.89316 + 0.01994_d.dimer + 1.73354_(angio==2)))
death ~ dbern(death ~ ilogit(-4.18763 + 5.48082_(pe==2) + -1.93576_(treat==2)))
}

update.HydeNetwork method & parent nodes

What should be the default behavior for update() when we are trying to remove parent nodes?

One option would be to make the user specify the removal of BOTH the parent node and the link to all child nodes (e.g., ". ~ . - pregnant - d.dimer | pregnant"). This seems cumbersome.

Another option would be to only have the user specify minimal syntax but give warning messages:

net2 <- update(net, . ~ . - pregnant)

warning in update.HydeNetwork():
Node 'pregnant' had the following child node(s): 'd.dimer'.
Conditional probability distributions have been removed from these nodes.
Use 'setNode()' or 'setNodeModels()' to repopulate these distributions.

HydeNet Model Gallery

We should start thinking about how we advertise and interact with the user community.
What should we put on the GitHub package homepage?

One idea I had was to put, right at the top of the page, an invitation for people to contribute working HydeNet analyses to a HydeNet model gallery that we host. What is the minimal set of things needed to do something like this?

Since I'm trying to learn GitHub, maybe you can walk me through this. What I'm thinking is a subdirectory called modelGallery with its own subdirectories called 'vignettes' and 'data'. We would need to put stuff in the .gitIgnore and .RBuildIgnore files to ignore this stuff. Then, we would manually edit a table on the package homepage with the various analyses people contribute.

However, I'm confused by the fact that this approach wouldn't allow users to download the package and play with contributers' .Rdata objects. That would mean the .RData objects would have to be built with the package, right?

What if the modelGallery subdirectory was actually an R package itself? Still ignored, but downloadable and installable, either as a zip file or from source.

decimals not truncated to 5 in writeNetworkModel(pretty=TRUE)

This is a small issue - somehow the parameters for node 'wells' were not printed with 5 digits. no biggie...

model{

wells ~ dnorm(3.7941999999999, 0.630504251834359)

pe ~ dbern(ilogit(-3.90355 + 0.5757 * wells))

d.dimer ~ dnorm(210.24251 + 68.37938 * (pe == 2) + 29.29496 * (pregnant == 2), 0.0334725036145318)

pi.pregnant[1] <- 0.9014; pi.pregnant[2] <- 0.0986

pregnant ~ dcat(pi.pregnant)

angio ~ dbern(ilogit(-2.22585 + 3.28411 * (pe == 2)))

treat ~ dbern(ilogit(-5.89316 + 0.01994 * d.dimer + 1.73354 * (angio == 2)))

death ~ dbern(ilogit(-4.18763 + 5.48082 * (pe == 2) + -1.93576 * (treat == 2)))

}

Add-on: Implement Cox Proportional Hazards

Sometime, if ever.

modelToNode() and writeJagsFormula() methods for survreg objects

This is a low priority, intended for post-beta release.

Temporal Bayesian Networks

Time 0 model (baseline conditions and time-dependent variables at T=1)
- Time t model (maps time-dependent variables at time t to the same at time t+1) • unroll.markovNetwork(startTime=NULL, stoptime)
This function just creates a pgmNetwork object from the Markov model by repeating the transition structure over a user-specified span of time o if startTime==NULL, unroll the network from t=0 to t=stoptime
- prior distributions defined by parameterized densities for the t=0 nodes o if startime>=1, unroll the network from time=startTime to time=stopTime
- We may need to assume all (root) nodes at time=startTime have either been observed or computed as the posterior mode from a previous run of JAGS

Decision and Utility Nodes

Decision nodes act like regular random variables. Only later functions like MEU() (maximum expected utility) will leverage the fact that they are decision nodes (e.g., creating a policy matrix). So we just need to the following for decision nodes:

Create a list object within the HydeNetwork object that stores which nodes are decision nodes.
Modify the plot method for HydeNetwork objects to make decision nodes square with white background and some color text
Modify the plot method for compiledHydeNetwork objects to make observed decision nodes square with some color background. Could simply invert the colors above in (2).

Utility nodes are a little trickier.

They are part of the graph object, but never get fed into JAGS since they're not random variables.
They are numeric vectors, all of which are the same length and all of which have the same "slots" (e.g., cost in first slot, QALYs in second slot)
They are defined by functions, not models. The functions must map all combinations of parent nodes to a vector utility
We should probably restrict naming of utility nodes to something like U1, U2, U3, ... because of the above restrictions.
To visualize utility nodes, I'd say use gray (to denote they are fixed) with triangle shape.

Deterministic nodes

There needs to be a mechanism to define deterministic nodes.

Factor Conversion problems in compileJagsModel

I'm trying to set observed values for factor nodes in the PE network, and compileJagsModel() seems to be setting them to NA. I tried using numeric (JAGS-esqe) values for the factor levels (1 and 2), and it sets the node to NA. When I try using R-esqe factor labels ("No" and "Yes"), I get an error.

compiled.autoNet <- compileJagsModel(autoNet, data=list(wells=7, treat=1), n.chains=3)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph Size: 53

Initializing model

head(compiled.autoNet$observed, 20)
$wells
[1] 7

$treat
[1] NA

There needs to be a better package description

The package description in the DESCRIPTION file needs to be written.

The text in the Title: field will appear next to the package name on the CRAN listing and in bold at the top of the package summary on CRAN example. It should be no more than one line, and give a good idea of what the package does. Perhaps "Hybrid Decision Graphical Networks Using R and JAGS"

The text in the Description: field is what will appear on CRAN next to the package. It may be multiple lines. It should explain why someone would be interested in using the package, and should not start with the words 'the package' (Professor Ripley will sometimes criticize vague descriptions)

Do we need to export as many functions as we do?

It seems to me that a lot of the functions I've exported aren't really intended to be called by the user. This is especially true in the case of the methods like writeJagsFormula. I need to figure out how to get these to work without exporting them.

PE needs to be documented

The template is laid out in

R/PEData.R

The documentation is produced using the roxygen2 package. Please let me know if you have questions

Redundancy of utility node specification? Or overkill?

This is minor...

Not sure what's going on under the hood, but in the blackjack model I was working on the payoff node. It's a utility node, defined by a deterministic function.

Here is what I had:

net <- setUtilityNodes(net, payoff)
net <- setNode(net, payoff, "determ", define=fromFormula(),
               nodeFormula = payoff ~ 
                  ifelse(playerFinalPoints > 21,
                     -1,
                     ifelse(dealerFinalPoints > 21,
                        1,
                        ifelse(playerFinalPoints > dealerFinalPoints,
                           1,
                           ifelse(playerFinalPoints == dealerFinalPoints, 0, -1)))))

But I ran it another time without the setUtilityNodes call, and just had this:

net <- setNode(net, payoff, "determ", define=fromFormula(),
               nodeFormula = payoff ~ 
                  ifelse(playerFinalPoints > 21,
                     -1,
                     ifelse(dealerFinalPoints > 21,
                        1,
                        ifelse(playerFinalPoints > dealerFinalPoints,
                           1,
                           ifelse(playerFinalPoints == dealerFinalPoints, 0, -1)))),
                utility = TRUE)

That seemed to work, at least with plot.HydeNetwork(). And the network object stored the fact that it was a utility node.

So I guess the problem is that if I run setUtilityNodes(), and then run setNode without explicitly saying utility=TRUE, then the fact that it was a utility node gets overridden by setNode. This may be conservative behavior which avoids trouble down the line; however, it seems to call into question the, er, utility, of the setUtilityNodes() function.

Bug in setNode() - parameter value check for univ distributions

net <- setNode(net, d.dimer, nodeType = "dpois", lambda=-10)

This should return an error message like the following:

paste0("Node '", node, "': parameter '", bad.param,"' outside allowable range in '", nodeDist,"' (",jagsDists$paramLimit[jagsDists$FnName==nodeDist],")")

plot.HydeNetwork() defaults for decision nets

Here is some code that uses separate node shapes/colors depending on the type (random vs. determ vs. decision vs. utility). I didn't do anything with the edges yet. Any ideas on edge coloring? May be good to somehow gray out edges associated with deterministic nodes.

By the way, ignore the actual formulas in the setNode commands - I just did those commands so I could color the deterministic nodes.

net <- HydeNetwork(~ card1.ace | card1
                   + card2.ace | card2
                   + initialPoints | card1*card2*card1.ace*card2.ace
                   + hit1 | initialPoints*dealerUpcard
                   + card3 | hit1
                   + card3.ace | card3
                   + pointsAfterCard3 | initialPoints*card3*card3.ace
                   + hit2 | pointsAfterCard3*dealerUpcard
                   + card4 | hit2
                   + card4.ace | card4
                   + pointsAfterCard4 | pointsAfterCard3*card4*card4.ace
                   + hit3 | pointsAfterCard4*dealerUpcard
                   + card5 | hit3
                   + card5.ace | card5
                   + pointsAfterCard5 | pointsAfterCard4*card5*card5.ace
                   + playerFinalPoints | initialPoints*hit1*pointsAfterCard3
                                         *hit2*pointsAfterCard4*hit3*pointsAfterCard5
                   + dealerFinalPoints | dealerUpcard
                   + payoff | playerFinalPoints*dealerFinalPoints)

net <- setNode(net, card1.ace, "determ", define=fromFormula(), nodeFormula=card1.ace ~ ifelse(card1==1,1,0))
net <- setNode(net, card2.ace, "determ", define=fromFormula(), nodeFormula=card2.ace ~ ifelse(card2==1,1,0))
net <- setNode(net, card3.ace, "determ", define=fromFormula(), nodeFormula=card3.ace ~ ifelse(card3==1,1,0))
net <- setNode(net, card4.ace, "determ", define=fromFormula(), nodeFormula=card4.ace ~ ifelse(card4==1,1,0))
net <- setNode(net, card5.ace, "determ", define=fromFormula(), nodeFormula=card5.ace ~ ifelse(card5==1,1,0))
net <- setNode(net, initialPoints, "determ", define=fromFormula(), nodeFormula=initialPoints~card1+card2)
net <- setNode(net, pointsAfterCard3, "determ", define=fromFormula(),
               nodeFormula=pointsAfterCard3 ~ initialPoints+card3)
net <- setNode(net, pointsAfterCard4, "determ", define=fromFormula(),
               nodeFormula=pointsAfterCard4 ~ pointsAfterCard3+card4)
net <- setNode(net, pointsAfterCard5, "determ", define=fromFormula(),
               nodeFormula=pointsAfterCard5 ~ pointsAfterCard4+card5)
net <- setNode(net, playerFinalPoints, "determ", define=fromFormula(),
               nodeFormula=playerFinalPoints ~ playerFinalPoints+3)


net <- setDecisionNodes(net, hit1, hit2, hit3)
net <- setUtilityNodes(net, payoff)

decisionNodes <- names(which(unlist(net$nodeDecision)))
utilityNodes <- "payoff"  #names(which(unlist(net$nodeDecision)))
determNodes <- names(which(unlist(net$nodeType)=="determ"))

shapes <- rep("box",length(decisionNodes)+length(utilityNodes));
names(shapes) <- c(decisionNodes, utilityNodes)

fills <- c(rep("cyan",length(decisionNodes)), rep("yellow",length(utilityNodes)))
names(fills) <- c(decisionNodes, utilityNodes)

fontcols <- rep("gray70", length(determNodes))
names(fontcols) <- c(determNodes)

cols <- rep("gray70",length(determNodes));
names(cols) <- determNodes

nodeAttribs <- list(shape=shapes, fillcolor=fills, fontcolor=fontcols, color=cols)

attribs <- list(node=list(shape="ellipse", fixedsize=FALSE, fillcolor="white", style="solid"))

plot(net, attrs=attribs, nodeAttrs=nodeAttribs)

Passing multiple model objects to setNode (+ Issue #6)

In Issue #6 (get node distn from model object) I assumed only a single model object.

Maybe the modelObjects parameter should be a list of model objects instead of a single model object.

Here's what I think the user should be able to do.

mod.pe <- glm(pe ~ wells, family="binomial", data=d1)
mod.d.dimer <- lm(d.dimer ~ pregnant_pe, data=d2)
mod.angio <- glm(angio ~ pe, family="binomial", data=d1)
mod.treat <- glm(treat~ d.dimer_angio, family="binomial", data=d3)
mod.death <- glm(death ~ pe*treat, family="binomial", data=d2)

my.models <- list(mod.pe, mod.d.dimer, mod.angio, mod.treat, mod.death)

net <- setNode(net, modelObjects = my.models)
net <- setNode(net, wells, "dnorm", mu=5, tau=1/(1.5^2))
net <- setNode(net, pregnant, "dbern", p=0.10)

**Even better:

my.models <- c(my.models, lm(wells ~ 1, data=d2), glm(pregnant ~ 1, family="binomial", data=d1))
net <- HydeNetwork(modelObjects = my.models)

Efficiency issues with categorical nodes in HydeNetwork.formula()

Currently, our logic for categorical nodes is to use glm() if the node has 2 levels and to use multinom() if the node has 3 or more levels.

Here is the important thing: this logic applies in a manner that is naive to the nature of the node's parents.

Here is an example of what I'm talking about:

n <- 50000
d <- data.frame(
  di1=as.factor(1:6 %*% rmultinom(n,1,prob=c(.4,.3,.15,.10,.03,.02))),
  di2=as.factor(1:6 %*% rmultinom(n,1,prob=rev(c(.4,.3,.15,.10,.03,.02)))),
  di3=as.factor(1:6 %*% rmultinom(n,1,prob=c(.15,.10,.02,.3,.4,.03)))
)

g <- HydeNetwork(~di3|di1*di2, data=d)
system.time(writeNetworkModel(g, pretty=TRUE))

 user  system elapsed 
15.48    0.00   15.74

For situations where the child and all parent nodes are categorical (regardless of the number of levels), something that leverages plyr (or dplyr, but I'm not yet there with my skills) would be much, much faster:

cptFromData <- function(data, y, x, countVar=NULL){
  stopifnot(is.character(y) && length(y)==1)
  stopifnot(y %in% names(data))
  stopifnot(all(x %in% names(d)))
  if(!is.null(countVar)) stopifnot(countVar %in% names(data)) else {
    countVar <- "..cptCount..";
    data[,"..cptCount.."] <- 1
  }

  comboCounts <- daply(data, c(y,x), function(x) sum(x[,countVar]), .drop_i=FALSE)
  cpt <- aaply(comboCounts, seq_along(c(y,x))[-1], function(x) x/sum(x))

  names(dimnames(cpt))[length(c(y,x))] <- y
  return(cpt)
}

system.time(cptFromData(d, "di3", c("di1","di2")))

user  system elapsed 
0.04    0.00    0.05

The function above gives a multidimensional array, where the last dimension represents the child node:

cptFromData(d, "di3", c("di1","di2"))
, , di3 = 1

   di2
di1         1         2         3         4         5         6
  1 0.1550000 0.1669368 0.1544554 0.1464297 0.1524012 0.1455817
  2 0.1578947 0.1629956 0.1390952 0.1385150 0.1533109 0.1494196
  3 0.1103896 0.1405622 0.1403509 0.1540470 0.1385957 0.1670588
  4 0.1130435 0.1568627 0.1512097 0.1608877 0.1527869 0.1578178
  5 0.2580645 0.1739130 0.1708861 0.1644444 0.1371158 0.1537162
  6 0.1666667 0.1428571 0.1397849 0.1127820 0.1373802 0.1882641

, , di3 = 2

   di2
di1          1          2          3          4          5          6
  1 0.10000000 0.08265802 0.10247525 0.09838763 0.09523032 0.09523204
  2 0.12982456 0.10792952 0.09385550 0.10551455 0.10460157 0.09718076
  3 0.12337662 0.08835341 0.11257310 0.07745866 0.10142267 0.08806723
  4 0.11304348 0.09803922 0.10685484 0.08737864 0.09901639 0.09547004
  5 0.09677419 0.08695652 0.12025316 0.12888889 0.10874704 0.10810811
  6 0.16666667 0.05714286 0.09677419 0.12781955 0.08945687 0.09290954

, , di3 = 3

   di2
di1          1           2           3          4          5          6
  1 0.00750000 0.030794165 0.018316832 0.01612372 0.01878471 0.02212333
  2 0.02456140 0.017621145 0.020256583 0.02171081 0.02042649 0.02006633
  3 0.01948052 0.008032129 0.021929825 0.02436902 0.02019275 0.02117647
  4 0.04347826 0.013071895 0.008064516 0.01941748 0.02098361 0.01997077
  5 0.00000000 0.086956522 0.012658228 0.02666667 0.02127660 0.01858108
  6 0.00000000 0.000000000 0.021505376 0.01503759 0.01277955 0.02200489

, , di3 = 4

   di2
di1         1         2         3         4         5         6
  1 0.3025000 0.2901135 0.2861386 0.3132609 0.3008821 0.2953592
  2 0.2947368 0.3215859 0.3112762 0.2952670 0.2947250 0.2995025
  3 0.3376623 0.3534137 0.2865497 0.2863359 0.3074805 0.2931092
  4 0.3304348 0.2679739 0.3004032 0.3106796 0.3121311 0.2990745
  5 0.2903226 0.3260870 0.2721519 0.2711111 0.3238771 0.3006757
  6 0.3333333 0.2285714 0.2580645 0.3082707 0.3258786 0.2518337

, , di3 = 5

   di2
di1         1         2         3         4         5         6
  1 0.4075000 0.3970827 0.4034653 0.3984863 0.4041163 0.4099174
  2 0.3614035 0.3678414 0.4044564 0.4090317 0.3928171 0.4056385
  3 0.3896104 0.3815261 0.4020468 0.4290688 0.4015603 0.3973109
  4 0.3739130 0.4117647 0.4133065 0.4008322 0.3763934 0.3994155
  5 0.3548387 0.3043478 0.4050633 0.3644444 0.3829787 0.3834459
  6 0.3333333 0.5714286 0.4623656 0.3984962 0.3993610 0.4083130

, , di3 = 6

   di2
di1          1          2          3          4          5          6
  1 0.02750000 0.03241491 0.03514851 0.02731162 0.02858543 0.03178640
  2 0.03157895 0.02202643 0.03106009 0.02996092 0.03411897 0.02819237
  3 0.01948052 0.02811245 0.03654971 0.02872063 0.03074805 0.03327731
  4 0.02608696 0.05228758 0.02016129 0.02080444 0.03868852 0.02825134
  5 0.00000000 0.02173913 0.01898734 0.04444444 0.02600473 0.03547297
  6 0.00000000 0.00000000 0.02150538 0.03759398 0.03514377 0.03667482

We can, of course, check the math:

di3.cpt <- cptFromData(d, "di3", c("di1","di2"))
aaply(di3.cpt, 1:2, sum)

   di2
di1 1 2 3 4 5 6
  1 1 1 1 1 1 1
  2 1 1 1 1 1 1
  3 1 1 1 1 1 1
  4 1 1 1 1 1 1
  5 1 1 1 1 1 1
  6 1 1 1 1 1 1

Hide deterministic nodes in plot.HydeNetwork()

Realize that I'm coming up with a bunch of stuff with the expectation that very little of it will be incorporated into Version 1.0. So let's stay on track with the current plan of getting v1.0 out. But I digress...

It occurred to me that users might want to just visualize random variables, decision nodes, and utilities.

Maybe there's already a function that will generate a subnetwork of a DAG. It's actually a little challenging. I came up with the following (naive and slow) algorithm:

for each 'determ' node D_j:
|  make a link between all children of D_j and all parents of D_j
|  delete the node D_j and all links involving D_j (both incoming and outgoing)
end for

I investigated whether or not there are actually functions that do this. I'm sure there are. I found something useful in the igraph library but it doesn't give us exactly what we want. It looks like the induced.subgraph() algorithm just includes edges that existed among included vertices. That totally makes sense. Maybe the igraph package has something that will do it though.

Or maybe a home-cooked algorithm on the adjacency matrix will be best. I don't know.

require(igraph)

g <- dag(~ card1.ace | card1
         + card2.ace | card2
         + initialPoints | card1*card2*card1.ace*card2.ace
         + hit1 | initialPoints*dealerUpcard
         + card3 | hit1
         + card3.ace | card3
         + pointsAfterCard3 | initialPoints*card3*card3.ace
         + hit2 | pointsAfterCard3*dealerUpcard
         + card4 | hit2
         + card4.ace | card4
         + pointsAfterCard4 | pointsAfterCard3*card4*card4.ace
         + hit3 | pointsAfterCard4*dealerUpcard
         + card5 | hit3
         + card5.ace | card5
         + pointsAfterCard5 | pointsAfterCard4*card5*card5.ace
         + playerFinalPoints | initialPoints*hit1*pointsAfterCard3
         *hit2*pointsAfterCard4*hit3*pointsAfterCard5
         + dealerFinalPoints | dealerUpcard
         + payoff | playerFinalPoints*dealerFinalPoints)

#coerce dag to igraph object

h <- gRbase::coerceGraph(g, "matrix")
j <- igraph::graph.adjacency(h, mode="directed")

keepNodes <- c("card1","card2","hit1","card3","hit2","card4","hit3",
               "card5","dealerUpcard","dealerFinalPoints","payoff")
k <- igraph::induced.subgraph(j,vids = which(V(j)$name %in% keepNodes))

#plot(j, layout=layout.sphere)
#plot(k, layout=layout.kamada.kawai)



jj <- as(get.adjacency(j), "graphNEL")
kk <- as(get.adjacency(k), "graphNEL")

plot(jj, attrs=list(node=list(shape="ellipse", fixedsize=FALSE)))
plot(kk, attrs=list(node=list(shape="ellipse", fixedsize=FALSE)))

Tabular Input and Normalization

In the original pass at the package, while we were still in the conceptualization phase, you made some functions which facilitated tabular input of factors.

I think that either you made other functions or such functions already existed which normalized conditional probability tables so that they represented proper conditional probability distributions (specifically, integrate to 1, all probabilities in [0,1] for each combination of unobserved parent nodes). Or maybe you (or others) didn't do this. Can't remember.

setNode(): params argument should now be optional

net <- HydeNetwork(~ x)
net <- setNode(net, x, nodeType="dcat", pi=vectorProbs(c(3, 6, 2), x))
Error in setNode(net, x, nodeType = "dcat", pi = vectorProbs(c(3, 6, 2), :
argument "params" is missing, with no default

Documentation to complete

My to do list of functions to finish documenting. I can't finish the examples until I work out the issue with policy matrices in Issue #34

bindPosterior.R done
compileDecisionModel.R done
HydePosterior.R done
plot.compiledHydeNetwork.R delayed
print.HydePosterior.R done

So close I can taste it.

Using functions in equations creates new nodes

When I use

g1 <- lm(x ~ a + b + c, ...)
g2 <- lm(y ~ poly(x, 2) + d, ...)
HydeNetwork(list(g1, g2))

I get a node named x and a node named poly(x, 2). This will apply to factors and any other function.

I don't know if there's a good way to address this, but I don't want to forget it. For now, I'm thinking about how to strip a function out and leave the variable.

We may need to reconsider the name again

There is a popular website generator written in python called Hyde. Will we create confusion about what this package does if we call it Hyde?

Deterministic decision nodes and random utility nodes

(You might want to find something on which you can direct your frustration before reading on! Or at least a quiet place to lament collaborating with someone who doesn't quite have stuff figured out. In my defense, I'm not sure the rest of the statistical community has worked through these issues!)

The thought just occurred to me that I may have placed an unnecessary restriction on decision nodes. Currently, we restrict decision nodes to be random variables. So, the probability of making a decision of do this vs. do that for node D depends on the values of node D's parents.

The implication of this is that we can look at decisions as exhibiting a distribution across a population of decision makers, and that distribution depends on the parents of D.

What if a high-level decision maker sets policy so that they uniformly occur? These would be decisions like:

always give antibiotics if there is a positive throat culture
(blackjack decision net:) always take another card if the point total is <16

I can wrap my head around the first example. That would require some modifications to some of the downstream functions, but basically allowing for the possibility of a decision node to be deterministic would let us model that situation. But notice the second example represents a strategy that involves a sequence of decision nodes.

I've seen some literature describing sequential decisions in networks. That literature breaks networks down into the following structure:

Net = (X0 -> D1 -> X1 -> D2 -> X2 -> ... -> Dn -> Xn)

where X* and D* are subnetworks including random/deterministic nodes and decision nodes, respectively. (I guess utility nodes just do their thing and hang out as leaf nodes wherever they want. But I'm not 100% on this.) Maybe we can use that construct somehow.

For that matter, I think we may have also placed an unnecessary restriction on utility nodes to be deterministic? What if you wanted to represent utility as a belief distribution instead of a point-mass (single number)? In the below figure, the mean of this distribution is 0.75 (blue line) and the mode is around 0.94 (red line). Neither seem to be as good as using the whole belief density of utility.