Calibrated Posterior Predictive P values
This page will serve as a design document for the implementation of calibrated posterior predictive p-values in NIMBLE. The functionality of the algorithm is detailed below, along with an example of running the algorithm on a model and some open questions about implementation specifics. Anyone is welcome to comment by adding to this page, via email, or by any other preferred method!
Calibrated posterior predictive p-values (CPPPs) were designed to address the fact that the null distribution of posterior predictive p-values (PPPs) is generally not uniform(0, 1), and thus such uncalibrated p-values are frequently misinterpreted. CPPP calculation proceeds by first calculating a PPP according to some discrepancy measure. Then, parameters and data are simulated from the model, and a new PPP is calculated from this simulated data. This simulation and p-value calculation scheme is repeated many times. The simulated PPPs provide an empirical distribution against which the PPP calculated from the real, observed data can be compared; the position of the observed PPP within that empirical distribution gives the CPPP.
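As a rough illustration of this scheme (not the actual `runCPPP()` implementation), the sketch below lays out the calibration loop in plain R. The helpers `calcPPP()` and `simulateParamsAndData()` are hypothetical placeholders standing in for the MCMC-plus-discrepancy machinery that `runCPPP()` would encapsulate, and the final line takes the CPPP to be the proportion of simulated PPPs at or below the observed PPP, which is one natural way to make the comparison described above.

```r
## Schematic sketch of the CPPP calculation; calcPPP() and
## simulateParamsAndData() are hypothetical placeholders, not nimble functions.
calcCPPPSketch <- function(model, observedData, discrepancy, nSim = 100) {
  ## PPP for the real, observed data.
  observedPPP <- calcPPP(model, observedData, discrepancy)

  ## Repeatedly simulate parameters and data from the model, and calculate
  ## a PPP for each simulated data set.
  simulatedPPP <- numeric(nSim)
  for (r in 1:nSim) {
    simData <- simulateParamsAndData(model)
    simulatedPPP[r] <- calcPPP(model, simData, discrepancy)
  }

  ## Compare the observed PPP against the empirical distribution of the
  ## simulated PPPs.
  mean(simulatedPPP <= observedPPP)
}
```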
The proposed name for the function that calculates the CPPP in nimble is `runCPPP()`. The current code for the `runCPPP()` function can be found at: https://github.com/nimble-dev/nimble/blob/add_CPPP_etc/packages/nimble/R/algo_cppp.R
The repeated simulate-then-calculate-a-p-value scheme used to calculate CPPPs raises some design choices:
- (An open question) Do we want to allow users to parallelize the calculation of PPPs from simulated data? Generally, a user would probably want to simulate at least 100 data sets (and likely more), and for each of these simulated data sets calculate a PPP. For each simulated data set, this involves running an MCMC and then using the posterior samples from that MCMC to calculate the PPP. Each simulated data set and PPP calculated in this manner is entirely independent of the other simulated data sets / PPPs, so this process could readily be parallelized across multiple cores (e.g. using the `parallel` R package). Parallelizing does mean, however, that the inputs to the `runCPPP()` function become a bit more awkward:
  - For an un-parallelized version of `runCPPP()`, the `runCPPP()` function could be called using inputs similar to a call to `runMCMC()`. `runCPPP()` could take an MCMC algorithm object created from a call to `buildMCMC()` (either compiled or uncompiled), and use that MCMC object to run an MCMC for each of the simulated data sets.
  - On the other hand, a parallelized version of `runCPPP()` would need a separate MCMC algorithm object for each core being used for parallelization. As such, if a user wanted a specialized (i.e. non-default) MCMC, instead of providing a single MCMC algorithm object as an argument to `runCPPP()`, one idea is to have the user provide a function that takes a model as input and returns an MCMC object. For example, this could be a user-written function such as:
```r
mcmcGeneratorFunc <- function(model) {
  ## Build a custom (slice-sampler-only) MCMC for the supplied model.
  mcmc.spec <- configureMCMC(model, onlySlice = TRUE, print = FALSE)
  mcmc <- buildMCMC(mcmc.spec)
  return(mcmc)
}
```
Note that the parallelized version of `runCPPP()` can default to using nimble's default MCMC configuration, so providing an "MCMC generator function" such as the one above would be optional, and only necessary if the user wanted a custom MCMC for their model. A sketch of how such a generator function might be used with the `parallel` package is given below.
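For concreteness, here is a rough sketch (not the actual `runCPPP()` internals) of how a per-core MCMC generator function could be used with the `parallel` package: each worker builds and compiles its own copy of the model and MCMC, since compiled nimble objects cannot be shared across R processes. The variables `modelCode`, `modelConstants`, and `modelData` are placeholders for a user's model definition, and the per-replicate work is left schematic.

```r
library(parallel)

nCores <- 4
nSim   <- 100
cl <- makeCluster(nCores)

## Split the simulated-data replicates across cores; each worker builds and
## compiles its own model and MCMC once, then loops over its share of the
## replicates.
repsPerCore <- split(1:nSim, rep(1:nCores, length.out = nSim))

simPPPs <- parLapply(cl, repsPerCore,
  function(reps, code, constants, data, mcmcGeneratorFunc) {
    library(nimble)
    model  <- nimbleModel(code, constants = constants, data = data)
    mcmc   <- mcmcGeneratorFunc(model)
    cModel <- compileNimble(model)
    cMcmc  <- compileNimble(mcmc, project = model)
    sapply(reps, function(r) {
      ## ... simulate a data set, run cMcmc on it, and compute the PPP for
      ## this replicate (the part runCPPP() would handle) ...
      NA_real_   ## placeholder return value
    })
  },
  modelCode, modelConstants, modelData, mcmcGeneratorFunc)

stopCluster(cl)
```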
- (Not a question, but a design note open for comments) Calculating a PPP or a CPPP requires a discrepancy function that measures how well the data agree with the current state of the model. Discrepancy functions are, by their nature, highly model-specific. In the current design, the `runCPPP()` function accepts an argument, named `discrepancyFunction`, that is a user-defined `nimbleFunction` with the following stipulations:
  - `discrepancyFunction` must contain `setup` code and `run` code.
  - The `setup` code of `discrepancyFunction` must accept two arguments: `model`, which is the model the CPPP is being calculated for, and `discrepancyFunctionArgs`, which is an R list with any information the user wishes to have available to calculate the discrepancy (e.g. node names, tuning parameters, etc., similar to the `control` list for MCMC samplers).
  - The `run` code of the discrepancy function must take no arguments and return a scalar double, and the `nimbleFunction` must `contain` the `discrepancyFunction_BASE` virtual `nimbleFunction`.
Below is an example of the steps that could be taken to calculate the CPPP for the pump model. Note that the naming of the arguments to `runCPPP()` is not final, and can easily be changed if anyone has suggestions.
```r
pumpCode <- nimbleCode({
  for (i in 1:N) {
    theta[i] ~ dgamma(alpha, beta)
    lambda[i] <- theta[i] * t[i]
    x[i] ~ dpois(lambda[i])
  }
  alpha ~ dexp(1.0)
  beta ~ dgamma(0.1, 1.0)
})

pumpConsts <- list(N = 10,
                   t = c(94.3, 15.7, 62.9, 126, 5.24,
                         31.4, 1.05, 1.05, 2.1, 10.5))

pumpData <- list(x = c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22))

pumpModel <- nimbleModel(code = pumpCode,
                         constants = pumpConsts,
                         data = pumpData)

# Define the discrepancy function.
# Note that the discrepancy function must be a nimbleFunction that takes
# model and discFunctionArgs as its only two setup arguments.
pumpDiscFunction <- nimbleFunction(
  setup = function(model, discFunctionArgs){
    dataNames <- discFunctionArgs[['dataNames']]
  },
  run = function(){
    dataVals <- values(model, dataNames)
    lambdaVals <- values(model, 'lambda')
    disc <- 0
    for(i in 1:dim(dataVals)[1]){
      # Subtract the expected value and divide by the sd.  Note that here the
      # expected value and the variance are both lambda[i], as the data come
      # from a Poisson distribution.
      disc <- disc + (dataVals[i] - lambdaVals[i]) / sqrt(lambdaVals[i])
    }
    returnType(double(0)) # must return a scalar double
    return(disc)
  },
  contains = discrepancyFunction_BASE
)

pumpCPPP <- runCPPP(pumpModel,
                    dataNames = 'x',
                    discrepancyFunction = pumpDiscFunction,
                    discrepancyFunctionArgs = list(dataNames = 'x'),
                    mcmcControl = list(nMCMCIters = 20000))
```
- Does using the R file name `CPPP.R` seem cleaner than the current `algo_cppp.R`?
- I think the parallelized version of `runCPPP()` could perhaps be implemented more simply by having the user supply the MCMC configuration object as an argument (rather than an MCMC-generating function), which might be more straightforward to use.
- It seems to me that clean handling of parallelized MCMC should be considered in conjunction with a discussion of how we want to proceed in terms of parallelizing MCMC chains generally. I think this is waiting on questions of redesign/save/reload. That said, we could simply allow the user to provide either an MCMC configuration object or an MCMC algorithm object, with the parallelized case only working if the configuration is provided. (A sketch of what such a call might look like is given below.)
- I agree with CJP about the parallelization questions. What do you think of renaming `discrepancyFunctionArgs` to something like `control`?
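To make the configuration-object idea concrete, a call under that alternative design might look something like the sketch below; the `mcmcConfiguration` argument name is purely hypothetical, and the other arguments mirror the pump example above.

```r
## Hypothetical sketch of the alternative interface discussed above: the user
## supplies an (uncompiled) MCMC configuration, which each worker could use to
## build and compile its own MCMC, rather than an MCMC-generating function.
## The argument name mcmcConfiguration is a placeholder, not a real argument.
pumpConf <- configureMCMC(pumpModel, onlySlice = TRUE, print = FALSE)

pumpCPPP <- runCPPP(pumpModel,
                    dataNames = 'x',
                    mcmcConfiguration = pumpConf,
                    discrepancyFunction = pumpDiscFunction,
                    discrepancyFunctionArgs = list(dataNames = 'x'),
                    mcmcControl = list(nMCMCIters = 20000))
```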