My understanding is: subsampling is recommended so that Equation 4.2 of Kong et al., 2003, which is derived for uncorrelated samples, can be used to estimate the variance. However, subsampling increases the variance. It seems unintuitive to increase the variance so that it can be better estimated. Would it not be better to minimise the variance by retaining all samples, and to use a variance estimator that directly accounts for autocorrelation?
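To make the distinction concrete, here is a minimal numpy sketch (my own illustration, not pymbar's implementation) contrasting the naive standard error of a mean, which assumes uncorrelated samples, with one corrected by the statistical inefficiency g:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated data: an AR(1) chain with statistical inefficiency
# g = (1 + phi) / (1 - phi) ~ 39 for phi = 0.95.
phi, n = 0.95, 100_000
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

def statistical_inefficiency(a):
    """g = 1 + 2 * sum of normalized autocovariances, truncated at the
    first non-positive term (a simple cutoff; pymbar's differs in detail)."""
    d = a - a.mean()
    n = len(d)
    c0 = np.dot(d, d) / n
    g = 1.0
    for t in range(1, n // 2):
        ct = np.dot(d[: n - t], d[t:]) / n
        if ct <= 0:
            break
        g += 2.0 * ct / c0
    return g

g = statistical_inefficiency(x)
se_naive = x.std(ddof=1) / np.sqrt(n)   # assumes uncorrelated samples
se_corrected = se_naive * np.sqrt(g)    # effective sample size is n / g

print(f"g ~ {g:.0f}; naive SE = {se_naive:.4f}; corrected SE = {se_corrected:.4f}")
```

For this chain the naive estimate understates the standard error by a factor of roughly sqrt(g) ≈ 6.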
## The Issue: Subsampling Increases the Variance
Geyer, 1992 (Section 3.6) discusses subsampling. He points out:
- Subsampling decreases the statistical inefficiency in units of samples, but increases the statistical inefficiency in units of sampling time (Theorem 3.3)
- "If the cost of using samples is negligible, any subsampling is wrong. One doesn't get a better answer by throwing away data."
I'm assuming that the cost of using samples is generally negligible compared to the cost of generating them.
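To make this concrete, here is a toy numerical check (my own construction, not Geyer's) that, for a fixed-length AR(1) chain, the mean computed from every sample has lower variance than the mean computed from a subsample, even when the stride roughly matches the correlation length:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n, stride, reps = 0.9, 4_000, 20, 1_000

# Simulate `reps` independent AR(1) chains of length n (vectorized over chains).
eps = rng.normal(size=(reps, n))
x = np.empty((reps, n))
x[:, 0] = eps[:, 0]
for t in range(1, n):
    x[:, t] = phi * x[:, t - 1] + eps[:, t]

mean_full = x.mean(axis=1)              # use every sample
mean_sub = x[:, ::stride].mean(axis=1)  # keep only every stride-th sample

print(f"var of mean, all samples: {mean_full.var():.4f}")
print(f"var of mean, subsampled : {mean_sub.var():.4f}")  # noticeably larger
```

With these parameters the subsampled estimator's variance comes out roughly 30% higher, consistent with "one doesn't get a better answer by throwing away data".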
The increase in variance caused by subsampling seems to be shown, for example, in Table III of Tan, 2012, where the variances of the MBAR/UWHAM estimates increase after subsampling (the variances without subsampling are calculated using block-bootstrapping).
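For reference, a moving-block bootstrap along these lines can be sketched in a few lines. This is a generic version for the mean of a scalar time series; Tan, 2012's exact procedure may differ, and for MBAR the statistic would be the free energy estimate, requiring a re-solve per resample:

```python
import numpy as np

def moving_block_bootstrap_se(x, block_len, n_boot=200, stat=np.mean, seed=0):
    """Bootstrap standard error of `stat(x)` by resampling overlapping blocks."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_blocks = -(-n // block_len)     # ceil(n / block_len)
    max_start = n - block_len + 1     # overlapping block start positions
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, max_start, size=n_blocks)
        resample = np.concatenate([x[s:s + block_len] for s in starts])[:n]
        estimates[b] = stat(resample)
    return estimates.std(ddof=1)

# Toy usage on an AR(1) chain (the true SE of the mean is ~0.07 here):
rng = np.random.default_rng(2)
phi, n = 0.9, 20_000
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

print(moving_block_bootstrap_se(x, block_len=100))
```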
## Possible Solutions: Directly Accounting for Autocorrelation in the Variance Estimates
To account for autocorrelation in the variance estimates without subsampling, block bootstrapping could be used, with the block size selected according to the procedure of Politis and White, 2004 (and its correction), for example. However, I understand that fast analytical estimates may be preferred, to avoid repeated MBAR evaluations. Could the analytical estimators from Geyer, 1994 / Li et al., 2023 be used?
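I can't vouch for the exact estimators in those references, but as a point of comparison, here is a sketch of the initial positive sequence estimator from Geyer, 1992 for the variance of a scalar sample mean (the function names are mine; extending this to MBAR's free energies is the part I'm asking about):

```python
import numpy as np

def autocov(x, lag):
    """Biased (1/n) autocovariance estimate at the given lag."""
    d = x - x.mean()
    n = len(d)
    return np.dot(d[: n - lag], d[lag:]) / n

def ips_var_of_mean(x):
    """Initial positive sequence (IPS) estimate of var(mean(x)) (Geyer, 1992):
    sigma2 = -gamma_0 + 2 * sum_m (gamma_{2m} + gamma_{2m+1}),
    truncated at the first non-positive pair sum (for reversible chains)."""
    n = len(x)
    sigma2 = -autocov(x, 0)
    m = 0
    while 2 * m + 1 < n:
        pair = autocov(x, 2 * m) + autocov(x, 2 * m + 1)
        if pair <= 0:
            break
        sigma2 += 2.0 * pair
        m += 1
    return sigma2 / n

# Sanity check on an AR(1) chain where the true var(mean) is known:
rng = np.random.default_rng(3)
phi, n = 0.9, 50_000
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

true_var = (1 / (1 - phi**2)) * ((1 + phi) / (1 - phi)) / n
print(f"IPS: {ips_var_of_mean(x):.2e}; true (asymptotic): {true_var:.2e}")
```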
## Why This May Be Irrelevant
I'm biased by the fact that I work with ABFE calculations and regularly feed MBAR very highly correlated data, which are aggressively subsampled, sometimes producing unreliable estimates (while the estimates without subsampling are reasonable). I understand that for most applications relatively few samples will be discarded, and any increase in uncertainty may be small.
It would be great to hear some thoughts on this, or to be corrected if I am misunderstanding something.
Thanks!