
Why Bayesians and frequentists need each other
Jon Williamson
Pluralism in the Foundations of Statistics
Kent, 9 September 2010
Contents
1 The Statistical View
2 The Epistemological View
3 Frequentist Statistics for Calibration
1 The Statistical View
Stats works like this:
1. Conceptualise the problem. Isolate a set M of models for consideration.
◦ Models are probability functions.
2. Gather evidence E.
3. Apply statistical methods to evaluate models in M in the light of E.
◦ Inferences and decisions will be made on the basis of a set ME ⊆ M of models that
are appropriate given E.
Frequentist stats works like this:
1. Conceptualise the problem. Isolate a set M of models for consideration.
◦ M is a set of candidates for physical probability P∗ .
◦ Usually limiting relative frequency or generic propensity.
2. Gather evidence E.
3. Apply statistical methods to evaluate models in M.
◦ Inferences and decisions will be made on the basis of a set ME ⊆ M of models that
render the evidence sufficiently likely, assuming the evidence is gathered in an
appropriate way etc.
Bayesianism works like this:
1. Choose appropriate variables or models, M, and a prior function P over M.
2. Gather evidence, ensuring E is in the domain of P.
3. Take a new stance P′ = P(·|E) over M (conditionalisation).
◦ Typically via Bayes' theorem.
◦ ME is the set of models with sufficiently high P′.
Subjective. P is a matter of personal choice.
Objective. P is sufficiently equivocal.
It looks like the two approaches are just incompatible ways of doing stats.
But what are the questions that the two approaches are trying to answer?
Frequentist. How does evidence impact on the set of candidate physical probability functions?
Bayesian. How does evidence impact on rational degree of belief?
∴ Not so incompatible after all?
2 The Epistemological View
Key Question. How strongly should an agent with evidence E believe the various propositions expressible in her language L?
EG
A finite propositional language L = {A1 , . . . , An }.
◦ E is total evidence, everything that is taken for granted: data, assumptions, theoretical
knowledge . . .
◦ Not necessarily expressible as propositions of L.
The Bayesian Answer. An agent’s belief function PE over L should satisfy certain norms:
Probability. It should be a probability function.
Calibration. It should be compatible with evidence,
◦ in particular, calibrated with known physical probabilities.
Equivocation. It should otherwise equivocate sufficiently between the basic possibilities
expressible in L.
Strict subjectivism: Probability (+ conditionalisation)
Empirically-based subjectivism: Probability + Calibration (+ conditionalisation)
Objectivism: Probability + Calibration + Equivocation
Explicating the Norms
Probability. PE should be a probability function:
P1: PE(ω) ≥ 0 for each ω ∈ Ω = {±A1 ∧ · · · ∧ ±An},
P2: PE(τ) = 1 for some tautology τ ∈ SL, and
P3: PE(θ) = Σω⊨θ PE(ω) for each θ ∈ SL.
Calibration. It should be compatible with evidence,
C1: PE ∈ E = 〈P∗L〉 ∩ S
◦ P∗ set of candidate physical probability functions. ← Frequentist statistics!
◦ P∗L its single-case consequences (n.b. reference class problem).
◦ 〈·〉 convex hull.
◦ S structural constraints.
Equivocation. It should otherwise equivocate sufficiently between the basic possibilities expressible in L.
E1: PE sufficiently close to the equivocator P=(ω) = 1/2ⁿ.
◦ d(P, P=) = Σω∈Ω P(ω) log(P(ω)/P=(ω)).
◦ Proximity to the equivocator = high entropy.
[Williamson, J. (2010). In defence of objective Bayesianism. Oxford University Press, Oxford.]
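As an illustrative sketch (not part of the talk), the divergence d(P, P=) can be computed directly for a toy two-proposition language; the state labels and function name are my own:

```python
import math

def d(P, n):
    """KL divergence from a belief function P (a dict over the 2^n atomic
    states) to the equivocator P=, which gives each state weight 1/2^n."""
    eq = 1 / 2**n
    return sum(p * math.log(p / eq) for p in P.values() if p > 0)

# Two atomic propositions (n = 2), hence four atomic states.
uniform = {s: 0.25 for s in ("A&B", "A&-B", "-A&B", "-A&-B")}
skewed = {"A&B": 0.7, "A&-B": 0.1, "-A&B": 0.1, "-A&-B": 0.1}

print(d(uniform, 2))  # 0.0: the equivocator itself is maximally equivocal
print(d(skewed, 2))   # > 0: further from the equivocator, lower entropy
```

Minimising d(P, P=) subject to the evidential constraints is equivalent to maximising entropy, which is why proximity to the equivocator tracks equivocation.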
3 Frequentist Statistics for Calibration
A new picture of the relation between frequentist statistics and Bayesianism:
1. Be explicit about language L and evidence E.
2. M = P∗ = candidate physical probability functions, given E (frequentist statistics).
3. E = 〈P∗L〉 ∩ S = belief functions compatible with evidence.
4. Choose a belief function PE in E.
◦ Objectivists require that this be sufficiently close to the equivocator.
◦ Use PE as a basis for action.
Extended Example
◦ An agent is sampling 100 vehicles at a road T-junction.
◦ She wants to predict whether the 101st vehicle will turn left (L) or right (¬L).
◦ She observes vehicles 1 , . . . , 100 and finds that 41 of them turn left.
◦ The sample does not indicate any dependence of an outcome on the past sequence of
outcomes, so the agent is prepared to grant that the outcomes are iid.
Step 1. Determine a threshold of acceptance.
◦ Let τ0 be the minimum degree to which the agent would need to believe a statement of the form P∗(L) ∈ ι for her to grant it.
IE
A threshold of acceptance.
Where utilities are available, they can be used to determine the threshold:

                  θ     ¬θ
grant θ           S1    E2
don't grant θ     E1    S2

Grant θ iff the expected utility of granting θ outweighs that of not granting θ, i.e., iff
P(θ)S1 + (1 − P(θ))E2 ≥ P(θ)E1 + (1 − P(θ))S2,
i.e., iff
P(θ) ≥ (S2 − E2) / (S1 + S2 − E1 − E2).
EG
For the utility matrix

                          P∗(L) ∈ ι    P∗(L) ∉ ι
grant P∗(L) ∈ ι               1           −5
don't grant P∗(L) ∈ ι        −1            1

the threshold of acceptance is τ0 = 6/8 = .75.
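The threshold formula can be checked numerically; this sketch (function name mine) simply encodes the expected-utility inequality above:

```python
def acceptance_threshold(S1, E2, E1, S2):
    """Threshold tau0 at which the expected utility of granting theta
    equals that of not granting it:
    P(theta)*S1 + (1 - P(theta))*E2 >= P(theta)*E1 + (1 - P(theta))*S2."""
    return (S2 - E2) / (S1 + S2 - E1 - E2)

# Utility matrix from the example: grant -> (1, -5), don't grant -> (-1, 1).
tau0 = acceptance_threshold(S1=1, E2=-5, E1=-1, S2=1)
print(tau0)  # 0.75
```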
Step 2. Determine a confidence interval.
◦ Given τ0, one can find a confidence interval ι(X̄, τ0) such that P∗(P∗(L) ∈ ι(X̄, τ0)) ≈ τ0:
Classical frequentist estimation methods yield assertions of the form P∗(|X̄ − P∗(L)| ≤ δ) ≈ τ.
IE
In the limit, in roughly 100τ% of samples, the proportion X̄ of vehicles turning left in the sample will be within δ of the physical probability of vehicles at the junction turning left.
EG
Take L to be binomially distributed and X̄ ∼ N(p, p(1 − p)/n):
◦ The Central Limit Theorem implies that the distribution of X̄ is approximately normal with mean p and standard deviation √(p(1 − p)/n), where p = P∗(L) and n is the sample size (100 in this case).
∴ P∗(X̄ ≤ r) ≈ Φ((r − p)/√(p(1 − p)/n)), where Φ is the standard normal distribution function.
EG
If p = .5 then P∗(X̄ ≤ .41) ≈ Φ(−.09/.05) = 0.0359.
Then,
P∗(|X̄ − p| ≤ δ) ≈ Φ(δ/√(p(1 − p)/n)) − Φ(−δ/√(p(1 − p)/n)) = 2Φ(δ/√(p(1 − p)/n)) − 1 = τ,
say.
On the other hand δ can be construed as a function of τ:
◦ Given τ one can choose δ = Φ⁻¹(1/2 + τ/2)√(p(1 − p)/n) so that P∗(|X̄ − p| ≤ δ) ≈ τ.
IE
P∗(p ∈ [X̄ − δ, X̄ + δ]) = τ.
◦ A 100τ% confidence interval for p.
NB
δ depends on p and hence is also unknown.
∴ Need an identifiable confidence interval for p.
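Assuming Python's `statistics.NormalDist` for Φ, the half-width δ as a function of τ can be sketched as follows (`delta` is my name for it):

```python
import math
from statistics import NormalDist

phi = NormalDist()  # standard normal: phi.cdf is Phi, phi.inv_cdf is Phi^{-1}

def delta(tau, p, n):
    """Half-width with P*(|Xbar - p| <= delta) ~ tau under the normal
    approximation: delta = Phi^{-1}(1/2 + tau/2) * sqrt(p(1-p)/n)."""
    return phi.inv_cdf(0.5 + tau / 2) * math.sqrt(p * (1 - p) / n)

# Sanity check against the slide's figure: Phi(-.09/.05) = Phi(-1.8)
print(round(phi.cdf(-1.8), 4))          # 0.0359
# delta for tau = .75, p = .5, n = 100
print(round(delta(0.75, 0.5, 100), 4))  # 0.0575
```

As the slides note, δ computed this way still depends on the unknown p, which is why an identifiable interval is needed next.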
◦ Let k =df Φ⁻¹(1/2 + τ/2).
Now, |X̄ − p| ≤ δ if and only if
|X̄ − p| ≤ k√(p(1 − p)/n).
Squaring both sides,
X̄² − 2X̄p + p² ≤ k²p/n − k²p²/n,
i.e., as a quadratic in p,
(1 + k²/n)p² − 2(X̄ + k²/2n)p + X̄² ≤ 0.
This inequality holds when p is between the two zeros of this quadratic, i.e., when p is in the interval
[ (X̄ + k²/2n − k√(X̄(1 − X̄)/n + k²/4n²)) / (1 + k²/n) ,
  (X̄ + k²/2n + k√(X̄(1 − X̄)/n + k²/4n²)) / (1 + k²/n) ],
an identifiable confidence interval for p.
◦ We shall use ι(X̄, τ) to refer to the intersection of this interval with the unit interval.
∴ We can apply frequentist methods at this step to infer that P∗(P∗(L) ∈ ι(X̄, τ0)) ≈ τ0.
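The interval just derived (in effect the Wilson score interval) can be computed numerically; this sketch, with names of my choosing, recovers the example's [.355, .467]:

```python
import math
from statistics import NormalDist

def identifiable_interval(xbar, tau, n):
    """Zeros of (1 + k^2/n) p^2 - 2(xbar + k^2/2n) p + xbar^2 <= 0,
    where k = Phi^{-1}(1/2 + tau/2), intersected with the unit interval."""
    k = NormalDist().inv_cdf(0.5 + tau / 2)
    denom = 1 + k**2 / n
    centre = (xbar + k**2 / (2 * n)) / denom
    half = k * math.sqrt(xbar * (1 - xbar) / n + k**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

lo, hi = identifiable_interval(0.41, 0.75, 100)
print(round(lo, 3), round(hi, 3))  # 0.355 0.467, matching the example
```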
Step 3. Calibrate.
◦ We have that P∗(P∗(L) ∈ ι(X̄, τ0)) ≈ τ0.
◦ We know that the sample s in question is a sample of the kind for which this holds.
∴ The Calibration norm implies that PE(P∗(L) ∈ ι(X̄s, τ0)) = τ0.
IE
The agent should believe to degree τ0 that the physical probability of turning left lies within the confidence interval induced by this specific sample.
EG
If τ0 = .75 then P∗(P∗(L) ∈ ι(X̄, τ0)) ≈ τ0 says that in about 75% of samples t ∈ S, the interval ι(X̄t, .75) bounds the physical probability of turning left.
◦ All the agent knows is that s ∈ S.
∴ She should believe to degree .75 that the interval ι(X̄s, .75) bounds the physical probability of turning left.
◦ X̄s = .41, so the agent should believe to degree .75 that the interval ι(X̄s, .75) = [.355, .467] bounds the physical probability of turning left.
Step 4. Accept.
◦ So far, PE(P∗(L) ∈ ι(X̄s, τ0)) = τ0.
◦ But τ0 is the acceptance threshold for statements of the form P∗(L) ∈ ι.
∴ The agent should grant that P∗(L) ∈ ι(X̄s, τ0).
◦ Let E′ be her new evidence base after granting this.
Step 5. Recalibrate.
◦ The agent grants that P∗(L) ∈ ι(X̄s, τ0), and that vehicle 101 is an instantiation of this type.
∴ The Calibration norm implies that PE′(L101) ∈ ι(X̄s, τ0).
IE
The agent should believe that the next vehicle will turn left to some degree within the confidence interval ι(X̄s, τ0).
EG
In our example, the agent grants that P∗(L) ∈ [.355, .467], i.e., that the proportion of vehicles at the junction that turn left is in the interval [.355, .467], and knows only that vehicle 101 is a vehicle at the junction, so her degree of belief that vehicle 101 turns left should be within the interval [.355, .467].
While the empirically-based subjective Bayesian would stop here, the objective Bayesian would proceed to the following step.
Step 6. Equivocate.
◦ The Equivocation norm says that the agent should believe that the next vehicle will turn left to some degree within the interval that is sufficiently equivocal:
◦ PE′(L101) should lie within the interval and should be sufficiently close to P=(L101) = 1/2, where, as before, P= signifies the equivocator function.
◦ In our example, since 1/2 > .467, the agent should believe that vehicle 101 turns left to degree .467 or thereabouts.
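For the example, Steps 3–6 amount to clipping the equivocator value 1/2 into the calibrated interval; a minimal sketch (function name mine):

```python
def equivocal_degree(lo, hi, equivocator=0.5):
    """Pick the point of the calibrated interval [lo, hi] closest to the
    equivocator value (1/2 for a single proposition)."""
    return min(max(equivocator, lo), hi)

# Calibrated interval from the example:
print(equivocal_degree(0.355, 0.467))  # 0.467: since 1/2 > .467, take the endpoint
print(equivocal_degree(0.4, 0.6))      # 0.5: the equivocator itself lies inside
```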
Summary
Step 1. Let τ0 be the minimum degree to which the agent would need to believe a statement of the form P∗(L) ∈ ι for her to grant it.
Step 2. Given τ0, find a confidence interval ι(X̄, τ0) such that P∗(P∗(L) ∈ ι(X̄, τ0)) ≈ τ0.
Step 3. The Calibration norm implies that PE(P∗(L) ∈ ι(X̄s, τ0)) = τ0.
Step 4. The agent should grant that P∗(L) ∈ ι(X̄s, τ0).
Step 5. The Calibration norm implies that PE′(L101) ∈ ι(X̄s, τ0).
Step 6. The Equivocation norm implies that PE′(L101) should lie within the interval and should be sufficiently close to P=(L101) = 1/2.
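The six steps can be strung together in a single end-to-end sketch (all names mine, stdlib only, using the normal approximation as above); it reproduces the example's answer:

```python
import math
from statistics import NormalDist

def predict_left(turns_left, n, S1, E2, E1, S2):
    # Step 1: acceptance threshold from the utility matrix.
    tau0 = (S2 - E2) / (S1 + S2 - E1 - E2)
    # Step 2: identifiable confidence interval at level tau0.
    xbar = turns_left / n
    k = NormalDist().inv_cdf(0.5 + tau0 / 2)
    denom = 1 + k**2 / n
    centre = (xbar + k**2 / (2 * n)) / denom
    half = k * math.sqrt(xbar * (1 - xbar) / n + k**2 / (4 * n**2)) / denom
    lo, hi = max(0.0, centre - half), min(1.0, centre + half)
    # Steps 3-5: calibrate, accept, recalibrate -> the degree of belief
    # in L101 must lie within [lo, hi].
    # Step 6: equivocate -> choose the point of [lo, hi] closest to 1/2.
    return min(max(0.5, lo), hi)

# 41 of 100 vehicles turned left; utility matrix as in the example.
print(round(predict_left(41, 100, S1=1, E2=-5, E1=-1, S2=1), 3))  # 0.467
```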
Are the Confidence-Interval Methods Kosher?
NB
The frequentist confidence-interval methods applied at Step 2 are uncontentious.
◦ It is thus not until the (Bayesian) Step 3 that a concrete interval is actually isolated.
Howson and Urbach (1989, pp. 240–241) objected to an analogous Bayesian casting of confidence-interval methods via the Principal Principle.
◦ In our terms: P∗(P∗(L) ∈ ι(X̄, τ)) ≈ τ does not license the inference to PE(P∗(L) ∈ ι(X̄s, τ)) = τ, because ι(X̄, τ) is not an interval of numbers, but rather a function of possible experimental outcomes (a function mapping X̄ to an interval of numbers).
But this objection cannot be right:
◦ Any application of the Calibration norm / Principal Principle in which physical probabilities are construed as generic rather than single-case must draw inferences from a function of possible experimental outcomes.
IE
An inference from P∗(θ(·)) ∈ Y to PE(θ(s)) ∈ Y.
EG
An inference from a physical probability of .7 of surviving 5 years after diagnosis with prostate cancer to a degree of belief of .7 that Bob will survive 5 years after diagnosis with prostate cancer is an inference from the probability of a propositional function (· will survive 5 years after diagnosis with prostate cancer) to the probability of a proposition (Bob will survive 5 years after diagnosis with prostate cancer).
∴ The fact that ι(X̄, τ) is a function cannot be problematic in itself.
Howson and Urbach draw the following analogy:

For example, the physical probability of getting a number of heads greater than 5 in 20 throws of a fair coin is 0.86 . . . That is, P∗(K > 5) = 0.86, where K is the number of heads obtained. According to the Principal Principle, P[(K > 5)t | P∗(K > 5) = 0.86] = 0.86, so 0.86 is also the confidence that you should place in any particular trial of 20 throws of a fair coin producing a number of heads greater than 5.
Suppose a trial is made and 2 heads are found in a series of 20 throws with a coin that is known to be fair. To infer that we should now be 86 per cent confident that 2 is greater than 5 would be absurd and a misapplication of the Principal Principle. If one could substitute numbers for K in the Principle, it would be hard to see why the substitution should be restricted to the term's first occurrence. But no such substitution is allowed. For the Principal Principle does not assert a general rule for each number K from 0 to 20; the K-term is not in fact a number, it is a function which takes different values depending on the outcome of the underlying experiment.
(Howson and Urbach, 1989, p. 240)
This is a misdiagnosis:
◦ Before the trial t takes place it is reasonable to believe that (K > 5)t to degree 0.86.
◦ (Just as at one time it was reasonable to believe that the number of the planets < 8.)
◦ After the fact, it is known that the number of heads is 2, not 5.
◦ This information blocks the previous inference (0 chance / inadmissible information).
A closer analogy:
◦ Our inference is from P∗(p ∈ ι(X̄, .75)) ≈ .75 to PE(p ∈ ι(X̄s, .75)) = .75 (p unknown).
◦ This is like the move from P∗(Hp > H) = .75 to PE(Hp > 20) = .75, where Hs = 20.
◦ But this is a harmless application of the Principal Principle:
◦ Suppose that it is known that P∗(H < Hp) = .75,
◦ and that it is known that Steve has been randomly sampled,
◦ but Steve's height has not yet been obtained.
∴ Then unproblematically PE(Hs < Hp) = .75.
NB
Neither Hs nor Hp is known at this stage.
◦ Then Steve's height is measured and it is learned that Hs = 20.
◦ This new knowledge provides no grounds for moving away from PE(Hs < Hp) = .75.
◦ But it is known that Hs = 20, so PE(20 < Hp) = .75.
◦ Analogously, in our example the agent ought to believe to degree .75 that p ∈ [.355, .467].
Qualifying the Acceptance Assumption
The procedure needs to be qualified in order to avoid another objection:
◦ There are other intervals ι′(X̄, τ) such that P∗(P∗(L) ∈ ι′(X̄, τ)) ≈ τ, and the results of the procedure will depend on the chosen interval.
EG
A one-sided confidence interval.
The problem is at Step 1:
◦ The assumption that there is a threshold degree of belief τ0 above which the agent should grant any statement of the form P∗(L) ∈ ι.
× Given τ0 there may be various sets ι for which the agent believes that P∗(L) ∈ ι to this threshold degree of belief.
◦ This family of sets will normally have empty intersection, and so, if the assumption were true, the agent would be forced to believe that P∗(L) is no number at all.
Standard solution:
◦ Suppose instead that there is a threshold degree of belief τ0 above which the agent should grant P∗(L) ∈ ι, where ι is the narrowest interval which meets this threshold (see, e.g., Kyburg Jr and Teng, 2001, §11.5).
Conclusion
One can view frequentist and Bayesian methods as living in harmonious symbiosis:
✓ The Bayesian needs the frequentist for calibration.
✓ The frequentist needs the Bayesian for applications to the single case.
◦ In particular, in order to justify the application of confidence intervals to single cases.
One integration of the methods goes like this:
Step 1. Let τ0 be the minimum degree to which the agent would need to believe a statement of the form P∗(L) ∈ ι for her to grant it.
Step 2. Given τ0, find a confidence interval ι(X̄, τ0) such that P∗(P∗(L) ∈ ι(X̄, τ0)) ≈ τ0.
Step 3. The Calibration norm implies that PE(P∗(L) ∈ ι(X̄s, τ0)) = τ0.
Step 4. The agent should grant that P∗(L) ∈ ι(X̄s, τ0).
Step 5. The Calibration norm implies that PE′(L101) ∈ ι(X̄s, τ0).
Step 6. The Equivocation norm implies that PE′(L101) should lie within the interval and should be sufficiently close to P=(L101) = 1/2.
See also http://www.kent.ac.uk/secl/philosophy/jw
References
Howson, C. and Urbach, P. (1989). Scientific reasoning: the Bayesian approach. Open Court,
Chicago IL, second (1993) edition.
Kyburg Jr, H. E. and Teng, C. M. (2001). Uncertain inference. Cambridge University Press,
Cambridge.
Williamson, J. (2010). In defence of objective Bayesianism. Oxford University Press, Oxford.