
Calibrated Posterior Predictive P-values


This page will serve as a design document for the implementation of calibrated posterior predictive p-values in NIMBLE. The functionality of the algorithm is detailed below, along with an example of running the algorithm on a model and some open questions about implementation specifics. Anyone is welcome to comment by adding to this page, or via email or any other preferred method!

CPPP

Calibrated posterior predictive p-values (CPPPs) were designed to address the fact that the null distribution of posterior predictive p-values (PPPs) is not uniform(0, 1), which makes un-calibrated p-values easy to misinterpret. CPPP calculation proceeds by first calculating a PPP for the observed data according to some discrepancy measure. Then, parameters and data are simulated from the model, and a new PPP is calculated using this simulated data. This simulation and p-value calculation scheme is repeated many times. The simulated PPPs provide an empirical distribution against which the PPP calculated from the real, observed data can be compared; the CPPP is the location of the observed PPP within that empirical distribution (for example, the proportion of simulated PPPs at or below the observed PPP).
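As a rough sketch of this scheme (and not the actual runCPPP() code), the calculation could be organized as follows. Here calcPPP() and simulateFromModel() are hypothetical helpers standing in for the MCMC-plus-discrepancy step and the parameter/data simulation step described above:

calcCPPPSketch <- function(model, observedData, nSim = 100) {
  ## PPP for the real, observed data (runs an MCMC and applies the discrepancy)
  obsPPP <- calcPPP(model, observedData)
  simPPPs <- numeric(nSim)
  for (m in 1:nSim) {
    ## Simulate parameters and data from the model, then compute that data set's PPP
    simData <- simulateFromModel(model)
    simPPPs[m] <- calcPPP(model, simData)
  }
  ## Locate the observed PPP within the empirical distribution of simulated PPPs,
  ## e.g. as the proportion of simulated PPPs at or below the observed one
  mean(simPPPs <= obsPPP)
}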

The proposed name for the function that calculates the CPPP in nimble is runCPPP(). The current code for the runCPPP() function can be found at: https://github.com/nimble-dev/nimble/blob/add_CPPP_etc/packages/nimble/R/algo_cppp.R

The repeated simulate-then-calculate-p-value scheme used to compute CPPPs raises some design choices:

  1. (An open question) Do we want to allow users to parallelize the calculation of PPPs from simulated data? Generally, a user would probably want to simulate at least 100 data sets (and likely more), and for each of these simulated data sets calculate a PPP. For each simulated data set, this involves running an MCMC and then using the posterior samples provided from that MCMC to calculate the PPP. Each simulated data set and PPP calculated in this manner is entirely independent of the other simulated data sets / PPPs, and as such, this process could be readily parallelized to multiple cores (e.g. using the parallel R package). Parallelizing does mean, however, that the inputs to the runCPPP() function become a bit more awkward:
    1. For an un-parallelized version of runCPPP(), the runCPPP() function could be called using inputs similar to a call to runMCMC(). runCPPP() could take an MCMC algorithm object created by a call to buildMCMC() (either compiled or uncompiled), and use that MCMC object to run an MCMC for each of the simulated data sets.
    2. On the other hand, a parallelized version of runCPPP() would need a separate MCMC algorithm object for each core used for parallelization. As such, if a user wanted a specialized (i.e. non-default) MCMC, instead of providing a single MCMC algorithm object as an argument to runCPPP(), one idea is to have the user provide a function that takes a model as input and returns an MCMC object. For example, this could be a user-written function such as:
mcmcGeneratorFunc <- function(model){
  mcmc.spec <- configureMCMC(model, onlySlice = TRUE,
                             print = FALSE)
  mcmc <- buildMCMC(mcmc.spec)
  return(mcmc)
}

Note that the parallelized version of runCPPP() can default to using nimble's default MCMC configuration, so providing an "MCMC generator function" such as the one above would be optional, and only necessary if the user wanted a custom MCMC for their model. A rough sketch of how the parallel case might be set up follows below.
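For concreteness, here is a minimal sketch (under stated assumptions, not the proposed implementation) of how the per-data-set PPP calculations might be distributed with the parallel package. modelCode, modelConstants, simulateDataSet(), and calcPPPFromSamples() are hypothetical placeholders for the model definition, the data-simulation step, and the PPP calculation; each worker builds and compiles its own model and MCMC, since compiled nimble objects cannot be shared across processes.

library(parallel)

cl <- makeCluster(4)
clusterExport(cl, c("modelCode", "modelConstants",
                    "simulateDataSet", "calcPPPFromSamples"))
simPPPs <- parLapply(cl, 1:100, function(i) {
  library(nimble)
  ## Each worker builds and compiles its own copies of the model and MCMC
  model  <- nimbleModel(code = modelCode, constants = modelConstants,
                        data = simulateDataSet(i))
  cModel <- compileNimble(model)
  ## A custom MCMC generator function could be used here; the default
  ## configuration is shown for simplicity
  mcmc   <- buildMCMC(configureMCMC(model))
  cMcmc  <- compileNimble(mcmc, project = model)
  samples <- runMCMC(cMcmc, niter = 20000)
  ## PPP for this simulated data set
  calcPPPFromSamples(cModel, samples)
})
stopCluster(cl)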

  2. (Not a question, but a design note open for comments) Calculating a PPP or a CPPP requires a discrepancy function that measures how well the data agree with the current state of the model. Discrepancy functions are, by their nature, highly model-specific. In the current design, the runCPPP() function accepts an argument, named discrepancyFunction, that is a user-defined nimbleFunction with the following stipulations:
    1. discrepancyFunction must contain setup code and run code.
    2. The setup code of discrepancyFunction must accept two arguments: model, the model for which the CPPP is being calculated, and discrepancyFunctionArgs, an R list containing any information the user wishes to have available when calculating the discrepancy (e.g. node names, tuning parameters, etc., similar to the control list for MCMC samplers).
    3. The run code of the discrepancy function must take no arguments and return a scalar double, and the nimbleFunction must declare contains = discrepancyFunction_BASE (a virtual base nimbleFunction).

Example

Below is an example of the steps that could be taken to calculate the CPPP for the pump model. Note that the naming of the arguments to runCPPP() is not final, and can easily be changed if anyone has suggestions.

  pumpCode <- nimbleCode({
   for (i in 1:N){
     theta[i] ~ dgamma(alpha, beta)
     lambda[i] <- theta[i]*t[i]
     x[i] ~ dpois(lambda[i])
   }
   alpha ~ dexp(1.0)
   beta ~ dgamma(0.1,1.0)
 })
 
 pumpConsts <- list(N = 10,
                    t = c(94.3, 15.7, 62.9, 126, 5.24,
                          31.4, 1.05, 1.05, 2.1, 10.5))
 
 pumpData <- list(x = c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22))
 
 pumpModel <- nimbleModel(code=pumpCode,
                          constants=pumpConsts,
                          data=pumpData)
 
 # Define the discrepancy function.
 # Note that the discrepancy function must be a nimbleFunction that takes
 # model and discFunctionArgs as its only two arguments.
 
 pumpDiscFunction <- nimbleFunction(
   setup = function(model, discFunctionArgs){ 
     dataNames <- discFunctionArgs[['dataNames']]
   },
   run = function(){
     dataVals <- values(model, dataNames)
     lambdaVals <- values(model, 'lambda')
     disc <- 0
     for(i in 1:dim(dataVals)[1]){
       # Subtract the expected value and divide by the sd.  For Poisson data the
       # variance equals the expected value, so the sd is sqrt(lambda).
       disc <- disc + ((dataVals[i] - lambdaVals[i])/sqrt(lambdaVals[i]))
     }
     returnType(double(0))  # must return a double(0)
     return(disc)
   },
   contains = discrepancyFunction_BASE
 )
 
 pumpCPPP <- runCPPP(pumpModel,
                     dataNames = 'x',
                     discrepancyFunction = pumpDiscFunction,
                     discrepancyFunctionArgs = list(dataNames = 'x'),
                     mcmcControl = list(nMCMCIters = 20000))

DT Comments

  • Does using the R file name CPPP.R seem cleaner than the current algo_cppp.R?
  • I think the parallelized version of runCPPP() could perhaps be implemented more simply by having the user supply the MCMC configuration object as an argument (rather than an MCMC-generating function), which might be more straightforward to use.

CJP comments

  • It seems to me that clean handling of parallelized MCMC should be considered in conjunction with a discussion of how we want to proceed in terms of parallelizing MCMC chains generally. I think this is waiting on questions of redesign/save/reload. That said, could we simply allow the user to provide either an MCMC configuration object or an MCMC algorithm object, with the parallelized case working only if the configuration is provided?

PV comments

  • I agree with CJP about the parallelization questions. What do you think of renaming discrepancyFunctionArgs to something like control?