Supplemental Materials:
An Ethical Approach to Peeking at Data
Brad J. Sagarin, James K. Ambler, and Ellen M. Lee
Northern Illinois University
paugmented
paugmented represents the Type I error rate implied by the augmentation decisions
and observed outcomes (both interim and final) of a study. Consider, for example, a study
in which a researcher runs 100 subjects (N1 = 100), producing a p-value of .15 (p1 = .15).
The researcher evaluates this against the usual .05 criterion for significance (pcrit = .05).
Because the results are non-significant, the researcher augments the dataset with another
50 subjects (N2 = 50), producing a p-value in the combined dataset of .09 (p12 = .09).
Because the results are only marginally significant, the researcher then augments the
dataset a second time with another 50 subjects (N3 = 50), producing a p-value in the
combined dataset of .03 (p123 = .03). The results are now significant, so the researcher
stops data collection.
As discussed in the main article, paugmented represents a range of Type I error rates.
The lower bound of paugmented represents the best-case scenario in which pmax is the
maximum p-value observed on any of the interim steps. Thus, in this case, pmax = .15 for
the lower bound of paugmented. The upper bound of paugmented represents the worst-case
scenario in which pmax = 1.
Because a Type I error rate represents the probability of a significant result, given
that the null hypothesis is true, the distributions of p-values for N1, N2, and N3 used to
calculate paugmented are uniform distributions. For each round of data collection (the initial
round: N1, the first round of augmentation: N2, the second round of augmentation: N3), we
divide the distribution of possible p-values into 10,000 slices of width .0001, with each
slice represented by a p-value from .00005 to .99995 (dividing the distribution into
100,000 slices of width .00001 or 1,000,000 slices of width .000001 does not change the
results appreciably). Then, a particular combination of slices from N1, N2, and N3 would
produce a significant result within the lower bound if (a) the p-value in the initial N1
participants < .05, or (b) the p-value in the initial N1 participants < .15 and the p-value in
the combined N1+N2 participants < .05, or (c) the p-value in the combined N1+N2
participants < .15 and the p-value in the combined N1+N2+N3 participants < .03. As can
be seen in (a) and (b), the criterion for significance is pcrit (.05, in this case) for the initial
round of data collection and for all but the last round of augmentation. As can be seen in
(c), the criterion for significance is the final observed p-value (.03, in this case) for the
last round of augmentation. As can be seen in (b) and (c), for the lower bound, pmax = .15
(the highest interim p-value).
For the upper bound, pmax = 1. Therefore, a particular combination would produce
significant results within the upper bound if (a) the p-value in the initial N1 participants <
.05, or (b) the p-value in the combined N1+N2 participants < .05, or (c) the p-value in the
combined N1+N2+N3 participants < .03.
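To make these decision rules concrete, the short R function below expresses rules (a) through (c) directly. It is an illustrative sketch of ours (the function name and arguments do not come from the original materials); setting p_max = .15 corresponds to the lower-bound rules, and setting p_max = 1 corresponds to the upper-bound rules.

# Illustrative sketch (our naming): returns whether a combination of interim and
# final p-values counts as significant under rules (a) through (c). The comparisons
# are elementwise, so p1, p12, and p123 may be vectors of enumerated or simulated p-values.
is_significant <- function(p1, p12, p123, p_crit = .05, p_final = .03, p_max = .15) {
  (p1 < p_crit) |                      # (a) significant in the initial N1 sample
  (p1 < p_max & p12 < p_crit) |        # (b) significant after the first augmentation
  (p12 < p_max & p123 < p_final)       # (c) significant after the second augmentation
}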
The p-values for the combined subsamples (N1+N2, N1+N2+N3) are calculated
using the equation for the weighted Z-method (Whitlock, 2005):
Z_w = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^2}}
in which Zi is the Z-score associated with the p-value for Ni, and wi = √Ni.
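As a concrete illustration, the weighted Z-method can be computed in a few lines of R. The function below is a sketch of ours (not code from the original materials), assuming one-tailed p-values and weights wi = √Ni:

# Sketch of the weighted Z-method (Whitlock, 2005) for combining one-tailed
# p-values from subsamples of sizes n, with weights sqrt(n); illustrative only.
combine_weighted_z <- function(p, n) {
  z <- qnorm(1 - p)                   # Z-score associated with each subsample's p-value
  w <- sqrt(n)                        # weight each subsample by the square root of its N
  z_w <- sum(w * z) / sqrt(sum(w^2))  # combined Z-score
  1 - pnorm(z_w)                      # combined one-tailed p-value
}
# For example, combine_weighted_z(c(.15, .20), c(100, 50)) combines p-values of
# .15 and .20 observed in subsamples of 100 and 50 participants.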
The most straightforward method to calculate paugmented would be to test all
combinations of possible p-values for each of the rounds of data collection. The
proportion of combinations that produce significant results would represent paugmented.
With 10,000 slices, there are 10,000³ combinations of possible p-values for N1, N2,
and N3. Because the number of combinations is an exponential function of the number of
rounds of data collection, the calculations become extremely lengthy with more than two
rounds of augmentation. To mitigate this, we made two changes to the algorithm to
increase its efficiency: (a) When incorporating a new round of dataset augmentation, we
combined earlier rounds of data collection into a single distribution, thus changing the
number of combinations to be evaluated from an exponential to a linear function of the
number of rounds, and (b) When incorporating
the final round of dataset augmentation, we used the cumulative distribution function of
the normal distribution to avoid the need to explicitly combine slices from the earlier
rounds of data collection with slices from the final round of dataset augmentation.
Instead, for each slice from the earlier rounds of data collection, the proportion of the
distribution from the final round of dataset augmentation that would produce significant
results is calculated directly.
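As an independent check on the slice-based algorithm (and not the authors' implementation), the bounds of paugmented for the example above can also be approximated by Monte Carlo simulation under the null hypothesis, reusing the is_significant() sketch given earlier:

# Monte Carlo approximation of the bounds of p_augmented for the worked example
# (one-tailed tests; N1 = 100, N2 = 50, N3 = 50). Our sketch, for checking purposes only.
set.seed(1)
n1 <- 100; n2 <- 50; n3 <- 50
reps <- 1e6
z1 <- rnorm(reps); z2 <- rnorm(reps); z3 <- rnorm(reps)    # per-round Z-scores under the null
p1   <- 1 - pnorm(z1)
p12  <- 1 - pnorm((sqrt(n1) * z1 + sqrt(n2) * z2) / sqrt(n1 + n2))
p123 <- 1 - pnorm((sqrt(n1) * z1 + sqrt(n2) * z2 + sqrt(n3) * z3) / sqrt(n1 + n2 + n3))
mean(is_significant(p1, p12, p123, p_max = .15))           # approximate lower bound
mean(is_significant(p1, p12, p123, p_max = 1))             # approximate upper bound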
Below, we derive the formulas for making this latter calculation. The formulas are
presented in a general form in which Nprev is the sample prior to the last round of
augmentation, Nnew is the supplemental sample added as the last round of augmentation,
and pcrit is the criterion for significance. From the example above (the study in which the
researcher runs 100 subjects and then augments the dataset twice with 50 additional
subjects each time), these formulas would be applied for the second round of dataset
augmentation. Thus, for that example, Nprev = 150 (because the sample prior to the last
round of augmentation would consist of the combination of the initial 100 subjects plus
the 50 subjects from the first round of dataset augmentation), Nnew = 50 (because the final
round of dataset augmentation contained 50 subjects), and pcrit = .03 (because the
criterion for significance after the final round of dataset augmentation is .03).
This calculation begins with the equation for the weighted Z-method (Whitlock,
2005):
Z_w = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^2}}
Because there are two samples (Nprev and Nnew), k = 2:

Z_w = \frac{w_{prev} Z_{prev} + w_{new} Z_{new}}{\sqrt{w_{prev}^2 + w_{new}^2}}

Here, Zw = Zcrit (the Z-score associated with pcrit), wprev = √Nprev, wnew = √Nnew,
Zprev is the Z-score associated with the significance test of the Nprev participants, and Znew
is the Z-score associated with the significance test of the supplemental Nnew participants.
Substituting these values and solving for Znew yields the Z-score associated with the exact
level of significance that must appear in the supplemental Nnew participants such that the
significance of the combined Nprev+Nnew sample will be equal to pcrit:

Z_{new} = \frac{Z_{crit}\sqrt{N_{prev} + N_{new}} - \sqrt{N_{prev}}\,Z_{prev}}{\sqrt{N_{new}}}
For a one-tailed test with the region of significance appearing in the upper tail of
the normal distribution, Znew represents the minimum Z-score that would produce a
significance level of the combined Nprev+Nnew sample ≤ pcrit. Using the R functions
qnorm(p) and pnorm(z) (which return, respectively, the Z-score associated with a p-value
for a normal distribution with a mean of 0 and standard deviation of 1, and the
cumulative distribution function for a particular Z-score for a normal distribution with a
mean of 0 and a standard deviation of 1), the value assigned for the slice of Nprev with a p-value of p is:
1 - \mathrm{pnorm}\left(\frac{\mathrm{qnorm}(1 - p_{crit})\sqrt{N_{prev} + N_{new}} - \sqrt{N_{prev}}\,\mathrm{qnorm}(1 - p)}{\sqrt{N_{new}}}\right)
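This expression translates directly into an R function. The transcription below is ours (the name slice_value_one_tailed does not come from the original materials):

# Value assigned to a slice of Nprev with p-value p: the proportion of the final
# round's null distribution that yields a combined p-value at or below p_crit
# (one-tailed test, region of significance in the upper tail).
slice_value_one_tailed <- function(p, n_prev, n_new, p_crit) {
  z_prev <- qnorm(1 - p)
  z_needed <- (qnorm(1 - p_crit) * sqrt(n_prev + n_new) - sqrt(n_prev) * z_prev) / sqrt(n_new)
  1 - pnorm(z_needed)
}
# For example, slice_value_one_tailed(.09, 150, 50, .03) gives the proportion for
# the slice with p = .09 after Nprev = 150, with Nnew = 50 and pcrit = .03.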
For a two-tailed test with half of the region of significance appearing in the upper
tail of the normal distribution and half appearing in the lower tail, ±Zcrit represent the
positive and negative Z-scores associated with pcrit/2. For +Zcrit, Znew represents the
minimum Z-score that would produce a significance level in the combined Nprev+Nnew
sample ≤ pcrit. For –Zcrit, Znew represents the maximum Z-score that would produce a
significance level in the combined Nprev+Nnew sample ≤ pcrit. Using the same R functions,
the value assigned for the slice of Nprev with a p-value of p is:
1 - \mathrm{pnorm}\left(\frac{\mathrm{qnorm}(1 - p_{crit}/2)\sqrt{N_{prev} + N_{new}} - \sqrt{N_{prev}}\,\mathrm{qnorm}(1 - p)}{\sqrt{N_{new}}}\right) + \mathrm{pnorm}\left(\frac{-\mathrm{qnorm}(1 - p_{crit}/2)\sqrt{N_{prev} + N_{new}} - \sqrt{N_{prev}}\,\mathrm{qnorm}(1 - p)}{\sqrt{N_{new}}}\right)
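The two-tailed expression can be transcribed in the same way (again, the function name is ours, not from the original materials):

# Value assigned to a slice of Nprev with p-value p for a two-tailed test:
# the proportion of the final round's null distribution falling in either tail
# of the combined-sample region of significance.
slice_value_two_tailed <- function(p, n_prev, n_new, p_crit) {
  z_prev <- qnorm(1 - p)
  upper <- ( qnorm(1 - p_crit / 2) * sqrt(n_prev + n_new) - sqrt(n_prev) * z_prev) / sqrt(n_new)
  lower <- (-qnorm(1 - p_crit / 2) * sqrt(n_prev + n_new) - sqrt(n_prev) * z_prev) / sqrt(n_new)
  (1 - pnorm(upper)) + pnorm(lower)
}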
pactual
The calculation of pactual is the same as the calculation of paugmented with two
exceptions: (a) The value of pmax is specified explicitly, producing a single value for
pactual (rather than a range for paugmented), and (b) The criterion for significance after the
final round of dataset augmentation is pcrit (rather than the observed final p-value for
paugmented).
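Putting these pieces together, the slice-based calculation of pactual for the worked example can be sketched compactly in R. The code below is our illustration (not the authors' implementation); it assumes a one-tailed test, uses 1,000 slices per round rather than 10,000 so that it runs quickly, and relies on the slice_value_one_tailed() function defined earlier:

# Sketch of p_actual for the worked example (N1 = 100, N2 = 50, N3 = 50),
# with p_max specified explicitly and p_crit used as the criterion throughout.
p_actual <- function(p_max, p_crit = .05, n1 = 100, n2 = 50, n3 = 50, slices = 1000) {
  mids <- (seq_len(slices) - 0.5) / slices           # slice midpoints (uniform p-values)
  z1 <- qnorm(1 - mids)
  z2 <- qnorm(1 - mids)
  # combine rounds 1 and 2 over all slice pairs with the weighted Z-method
  z12 <- outer(sqrt(n1) * z1, sqrt(n2) * z2, `+`) / sqrt(n1 + n2)
  p12 <- 1 - pnorm(z12)
  p1  <- matrix(mids, nrow = slices, ncol = slices)  # row i holds the ith p-value slice of N1
  # rules (a) and (b): significant before the final round of augmentation
  sig_early <- (p1 < p_crit) | (p1 < p_max & p12 < p_crit)
  # rule (c): proportion of the final round reaching significance, via the CDF shortcut
  p_sig_final <- ifelse(p12 < p_max,
                        slice_value_one_tailed(p12, n1 + n2, n3, p_crit), 0)
  mean(ifelse(sig_early, 1, p_sig_final))
}
# For example, p_actual(p_max = 1) and p_actual(p_max = .15) evaluate the Type I
# error rate with p_max fixed at 1 and at .15, respectively.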
Adjusted pcrit
The adjusted pcrit necessary to maintain a desired Type I error rate while allowing
for dataset augmentation is calculated through an iterative process: pactual is calculated
repeatedly, with pcrit adjusted up or down until pactual converges on the desired Type I
error rate.
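One convenient way to carry out this iteration (our sketch, not the authors' code) is to treat the adjusted pcrit as the root of pactual(pcrit) − α and apply a standard root-finding routine such as R's uniroot(), together with the p_actual() sketch above:

# Find the adjusted p_crit at which p_actual equals the desired Type I error rate.
adjusted_p_crit <- function(alpha = .05, p_max = 1) {
  f <- function(p_crit) p_actual(p_max = p_max, p_crit = p_crit) - alpha
  uniroot(f, interval = c(.0001, alpha))$root
}
# For example, adjusted_p_crit(alpha = .05, p_max = 1) gives the criterion that
# holds the worst-case Type I error rate at .05 for the worked example.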