
Statistics 102
Unit 2
Probability
Suggested Reading
Open Intro, Sections 2.1, 2.2, 2.6
27 February 2015
1 / 68
Outline for Unit 2
Introduction
Basic concepts from probability
Conditional probability, general multiplication formula for
probability
Positive Predictive Value of a Diagnostic Test and Bayes’
Rule
2 / 68
Two ‘easy’ probability problems
Probability problems can be both surprisingly easy to
phrase and maddeningly difficult to solve.
The next two slides contain problems of this sort; try to
guess at solutions using your intuition.
We will return to these later to solve them, using the
advantage we get from precise, ‘math-like’ formulations.
4 / 68
Mandatory drug testing
A false positive in a drug screening test occurs when the
test incorrectly indicates that a screened person is a drug
user.
Suppose a mandatory drug test has a false-positive rate of
1.2% (or 0.012), and suppose a company uses the test to
screen employees for drug use.
• Given 150 employees who are drug free, what is the
probability that at least one will (falsely) test positive?
1. < 0.25
2. Between 0.25 and 0.75
3. > 0.75.
5 / 68
Breast cancer and mammograms
The National Cancer Institute estimates that
approximately 3.65% of women in their 60’s get breast
cancer. A mammogram typically identifies a breast cancer
about 85% of the time, and is correct 95% of the time when
a woman does not have breast cancer.
If a woman in her 60’s has a positive mammogram, what is
the likelihood she has breast cancer?
(1) less than 0.5
(2) 0.5 or greater
6 / 68
Announcements: Friday February 20
• P-sets: P-set 3 now posted, due Friday, February 27 at
the usual time.
• Solutions to P-set 2 now posted.
• Clicker questions today, channel 41.
• Quiz next Wednesday, February 25. Topics to be
announced in weekend email.
• Begin today with slide 7 Unit 2.
Some R tidbits. . .
Main Goal for the unit
Use the language of probability and mathematics to help
draw conclusions about populations from datasets.
Material will be a mix of intuition and formal rules about
probability and random phenomena.
Important notions:
• Rules for probability
• Conditional probabilities
Important material will be presented on the board in
lecture.
7 / 68
Progress This Unit
Introduction
Basic concepts from probability
Conditional probability, general multiplication formula for
probability
Positive Predictive Value of a Diagnostic Test and Bayes’
Rule
8 / 68
Main ideas
• Some definitions
• The axioms (rules) of probability
• Combining events in sample spaces
• Venn diagrams
• Independence
9 / 68
Mandatory drug testing
A false positive in a drug screening test occurs when the
test incorrectly indicates that a screened person is a drug
user.
Suppose a mandatory drug test has a false-positive rate of
1.2% (or 0.012), and suppose a company uses the test to
screen employees for drug use.
• Given 150 employees who are drug free, what is the
probability that at least one will (falsely) test positive?
1. < 0.25
2. Between 0.25 and 0.75
3. > 0.75.
Two solutions: one using independence on the board (and
in clickers!); one in R.
10 / 68
Mandatory drug tests
• Example: A mandatory drug test has a false-positive
rate of 1.2% (or 0.012)
• Given 150 employees who are drug free, what is the
probability that at least one will (falsely) test positive?
• Pr(At least 1 "+") = Pr(1 or 2 or 3 ... or 150 "+")
= 1 – Pr (None "+") = 1 - Pr(150 "-")
• Pr(150 "-") = Pr(1 "-")^150 = (0.988)^150 = 0.16
• Pr(At least 1 "+") = 1 - Pr(150 "-") = 0.84
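The calculation above can be reproduced with R as a calculator. This is a minimal sketch that assumes, as the slide does, that the 150 tests are independent; the variable names prob.no.positives and prob.at.least.one are chosen here for illustration.

```r
## exact calculation for the drug testing problem,
## assuming the 150 test results are independent
prob.false.positive = 0.012
num.employees = 150

## probability that all 150 drug-free employees test negative
prob.no.positives = (1 - prob.false.positive)^num.employees

## complement: at least one (false) positive
prob.at.least.one = 1 - prob.no.positives

round(prob.no.positives, 2)
round(prob.at.least.one, 2)
```

The two rounded values should agree with the 0.16 and 0.84 shown on the slide.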
11 / 68
Announcements: Monday February 23
• P-sets: P-set 3 due Friday, February 27 at the usual
time.
• Clicker questions today, channel 41.
• Quiz Wednesday, February 25. Will cover the code
used in the Golub analysis and elementary ideas in
probability.
• Begin today with slide 12 Unit 2.
Solution in R
We use R to ‘model’ the problem, then run a simulation to
estimate the answer.
Steps in the modeling:
• Conduct one set of 150 tests, in one set of 150
employees
• Replicate the 150 employee tests 1,000 times.
• Calculate the proportion of outcomes with at least one
positive test in the 1,000 replicates.
Next slide shows logic of the simulation; always best to
start with this.
12 / 68
Modeling one set of drug tests in 150
employees
Main steps:
• Initialize parameters of the problem, using information
given, including creating an initial vector of test
results, all negative.
• Simulate test results by sampling from the two values
0 (neg. test) and 1 (pos test), according to
probabilities for each outcome.
• Record whether there was at least one positive test.
• Also record proportion of positive tests.
Next slide shows the R code. The comments use as much
space as the actual commands. We will run the code in
lecture.
13 / 68
The R code for one set of tests
## coding for the drug testing problem
## begin with one set of employees
## set parameters and initialize
prob.false.positive = 0.012
num.employees = 150
set.seed(6578)
## initialize population
## default for function vector() sets values to 0
## This call to vector() creates a numeric vector
##   of length num.employees, with all values = 0
test.result = vector("numeric", num.employees)
## now sample from test results
##   using function sample()
## Type help(sample) for a complete explanation of
##   the function
## 0 = neg result, 1 = pos result
test.result = sample(c(0,1), size = num.employees,
                     prob = c(1 - prob.false.positive,
                              prob.false.positive), replace = TRUE)
14 / 68
R for drug tests . . .
## at least one positive test?
## The logical condition (num.pos.tests > 0) sets
##   the variable at.least.one.pos = TRUE if there is
##   at least one positive test
num.pos.tests = sum(test.result)
at.least.one.pos = (num.pos.tests > 0)
at.least.one.pos
## Also look at the number and proportion of positive tests
num.pos.tests = sum(test.result)
prop.pos.tests = num.pos.tests/num.employees
prop.pos.tests
15 / 68
Replicating the 150 tests in R:
Replicating in R uses a simple for() loop.
The syntax is
for(ii in 1:number.of.loops){
  ## code to be repeated
}
The use of ii as the loop counter is arbitrary.
The next slides look more complicated than they are.
16 / 68
The R code for replicating
## now replicate 1,000 times
## initialize again
prob.false.positive = 0.012
num.employees = 150
num.replicates = 1000
set.seed(6578)
## initialize for replicates
at.least.one.pos = vector("numeric", num.replicates)
## Nest earlier simulation in a ‘for’ loop which
##   repeats the 150 tests num.replicates times
## Record in each for() loop whether or not at least one
##   test was positive
17 / 68
Replicating . . .
for(ii in 1:num.replicates){
  test.result = vector("numeric", num.employees)
  test.result = sample(c(0,1), size = num.employees,
                       prob = c(1 - prob.false.positive,
                                prob.false.positive), replace = TRUE)
  num.pos.tests = sum(test.result)
  ## at least one positive test?
  at.least.one.pos[ii] = (num.pos.tests > 0)
}
## Now calculate the proportion of replicates that produced
##   at least one positive test
sum(at.least.one.pos)/num.replicates
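The same simulation can be written without an explicit loop. The sketch below is an alternative, not the version run in lecture: it assumes the number of positive tests in each set of 150 independent tests can be drawn directly with rbinom(), and the name est.at.least.one is introduced here for illustration.

```r
## compact version of the replicated drug test simulation
prob.false.positive = 0.012
num.employees = 150
num.replicates = 1000
set.seed(6578)

## each element is the number of positive tests in one
##   replicate of 150 independent tests
num.pos.tests = rbinom(num.replicates, size = num.employees,
                       prob = prob.false.positive)

## proportion of replicates with at least one positive test
est.at.least.one = mean(num.pos.tests > 0)
est.at.least.one
```

Because the random draws are generated differently, the estimate will not match the loop version digit for digit, but both should be close to the exact answer of about 0.84.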
18 / 68
Recap
The coding is less important than what can be learned.
• Setting up the problem requires understanding the
problem statement.
• Interpreting the result helps reinforce probability
concepts.
• You can check your math solution
• Once the program is written (and running!) it can be
run many times with different parameters.
On a p-set, you will have a chance to run this code, and to
modify it a bit to examine different situations.
19 / 68
Independence and the
Hardy-Weinberg distribution
Genes can be defined in two ways:
• A gene is a determinant, or a co-determinant, of a
character that is inherited according to Mendel’s rules.
• A gene is a functional unit of DNA.
The previous unit used both the first and the second
definition; we look at the first here.
A locus (plural loci) is a unique chromosomal location
defining the position of an individual gene or DNA
sequence. (Example: ABO blood group locus)
20 / 68
The Hardy-Weinberg distribution . . .
Consider two alleles, A1 and A2 at the A locus.
Assume ‘gene frequencies’ in the population are p and q,
respectively, i.e., assume that p is the probability that a
randomly chosen member of a population will have allele
A1 at locus A.
In our language,
• Pr(randomly chosen allele = A1 ) = p
• Pr(randomly chosen allele = A2 ) = q = 1 − p
What happens in inheritance through reproduction?
21 / 68
Hardy-Weinberg . . .
• The chance that both alleles are A1 is p^2.
• The chance that both alleles are A2 is q^2.
• The chance that the first allele was A1 and the second
A2 is pq. The chance that the first was A2 and the
second A1 is qp.
• Overall, the chance of picking one A1 and one A2
allele is 2pq.
The above proportions are called the Hardy-Weinberg
proportions.
What important assumptions are being made here?
22 / 68
Disease inheritance
An autosomal recessive condition affects 1 newborn in
10,000. What is the expected frequency of carriers?
If a parent of a child affected by this condition remarries,
what is the risk of producing an affected child in the new
marriage? Assume that affected individuals do not live long
enough to reproduce.
Solutions in class
23 / 68
Clicker Q: Disease inheritance
An autosomal recessive condition affects 1 newborn in
10,000, so the expected frequency of carriers is
approximately 1/50.
If a parent of a child affected by this condition remarries,
what is the risk of producing an affected child in the new
marriage? Assume that affected individuals do not live long
enough to reproduce.
1. 1/200
2. 1/100
3. 1/1000
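One way to organize the arithmetic behind this question is sketched below. It assumes Hardy-Weinberg proportions, that the parent of an affected child is a carrier, and that the new partner is chosen at random from the population; the variable names are illustrative.

```r
## autosomal recessive condition affecting 1 newborn in 10,000
disease.freq = 1/10000        ## q^2 under Hardy-Weinberg
q = sqrt(disease.freq)        ## frequency of the disease allele
p = 1 - q
carrier.freq = 2 * p * q      ## close to 1/50

## parent of an affected child: carrier (probability 1)
## new partner: carrier with probability carrier.freq
## two carriers: affected child with probability 1/4
risk.new.marriage = 1 * carrier.freq * (1/4)

carrier.freq
risk.new.marriage
```

The risk comes out near 1/200, because the only unknown is whether the new partner carries the allele.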
Checking H-W
In any population or sample from a population, the
genotype frequencies will match the H-W predicted
frequencies only approximately.
Let’s check H-W for the distribution of the SNP
actn3_r577x examined in Unit 1.
Recall from Unit 1 . . .
24 / 68
Genotype and allele distribution
## assumes genetics package has been loaded,
## FAMuSS data set has been loaded, and fms attached.
> r577x.genotype = genotype(actn3_r577x, sep="")
> summary(r577x.genotype)
Number of samples typed: 735 (52.6%)
Allele Frequency: (2 alleles)
   Count Proportion
C    750       0.51
T    720       0.49
NA  1324         NA

Genotype Frequency:
    Count Proportion
C/C   216       0.29
C/T   318       0.43
T/T   201       0.27
NA    662         NA
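The check of H-W against these counts can be sketched as follows. This is an illustration rather than the code used in class: the genotype counts are typed in from the summary above, and the names observed.counts, hw.proportions and expected.counts are assumptions.

```r
## compare observed genotype counts with H-W expected counts
observed.counts = c(CC = 216, CT = 318, TT = 201)
num.typed = sum(observed.counts)

## estimate the frequency of allele C from the genotype counts
p.C = (2 * observed.counts[["CC"]] + observed.counts[["CT"]]) /
      (2 * num.typed)
q.T = 1 - p.C

## H-W proportions p^2, 2pq, q^2 and the implied counts
hw.proportions = c(CC = p.C^2, CT = 2 * p.C * q.T, TT = q.T^2)
expected.counts = num.typed * hw.proportions

round(rbind(observed = observed.counts,
            expected = expected.counts), 1)
```

Comparing the two rows shows how closely the sample follows the H-W proportions; a formal test is a separate step.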
25 / 68
Progress This Unit
Introduction
Basic concepts from probability
Conditional probability, general multiplication formula for
probability
Positive Predictive Value of a Diagnostic Test and Bayes’
Rule
26 / 68
From xkcd . . .
27 / 68
A more serious example of
conditional probabilities
Published in Patel, et al., NEJM (2015) Vol 372, pp 331-340, posted on web site.
28 / 68
Conditional Probability
The notions of conditional probability and conditional
distributions are pervasive in statistics, for both simple and
complex problems.
Needed because not all events are independent
29 / 68
Announcements: Wednesday February 25
• P-sets: P-set 3 due Friday, February 27 at the usual
time.
• P-sets: P-set 4 will be posted Friday, February 27.
P-set 4 will be due Monday, March 9. Last p-set before
the March 11 midterm.
• No Clicker questions today.
• Quiz today.
• Begin today with slide 30 Unit 2.
Main ideas
• Definition of conditional probability
• A general multiplication rule for probabilities
30 / 68
Conditional Probability: Concept
Conceptual Definition: The conditional probability of an
event B, given a second event A, is the probability of B
happening, knowing that the event A has happened.
Notation: conditional probability is denoted by Pr(B|A).
Coin tossing example:
• Toss a fair coin 3 times. B = (exactly two heads), A =
(at least 2 heads).
• Pr(B|A) is the probability of having exactly two heads
among the outcomes that have at least two heads.
• Conditioning on A means we know the outcome is in
the set (HHH , HHT , HTH , THH )
• In this set of outcomes, B consists of the last three, so
Pr(B|A) = 3/4
• Note that Pr(B) = 3/8
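The coin tossing example can be checked by listing all 8 equally likely outcomes in R. This is a small sketch; the data frame tosses and the logical vectors A and B are names chosen here to mirror the events on the slide.

```r
## enumerate the 8 equally likely outcomes of 3 tosses of a fair coin
tosses = expand.grid(toss1 = c("H", "T"),
                     toss2 = c("H", "T"),
                     toss3 = c("H", "T"),
                     stringsAsFactors = FALSE)
num.heads = rowSums(tosses == "H")

A = (num.heads >= 2)    ## at least 2 heads
B = (num.heads == 2)    ## exactly 2 heads

## Pr(B|A): proportion of outcomes in A that are also in B
pr.B.given.A = sum(A & B) / sum(A)
## unconditional Pr(B)
pr.B = mean(B)

pr.B.given.A
pr.B
```

Counting within the restricted set A is exactly what conditioning means here.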
31 / 68
Conditional Probability in life tables
• B = (a randomly chosen person from a population
lives at least 65 years)
• A = (a randomly chosen person lives at least 60 years)
• Then Pr(B|A) is the probability a person lives at least
65 years, given that they have been selected from the
population of people 60 years of age or older.
• Solution using US life tables. . .
32 / 68
US Life Table, 2004 Population
Table B. Number of survivors by age, out of 100,000 born alive, by race and sex: United States, 2004

              All races                    White                      Black
Age      Total     Male   Female    Total     Male   Female    Total     Male   Female
0      100,000  100,000  100,000  100,000  100,000  100,000  100,000  100,000  100,000
1       99,320   99,253   99,391   99,434   99,378   99,493   98,616   98,475   98,763
5       99,202   99,124   99,283   99,327   99,261   99,397   98,441   98,285   98,603
10      99,129   99,043   99,218   99,261   99,187   99,339   98,334   98,171   98,503
15      99,036   98,936   99,142   99,175   99,085   99,268   98,210   98,030   98,396
20      98,709   98,486   98,944   98,856   98,655   99,068   97,809   97,436   98,195
25      98,246   97,809   98,710   98,420   98,020   98,849   97,131   96,415   97,865
30      97,776   97,148   98,442   97,992   97,418   98,608   96,321   95,241   97,402
35      97,250   96,455   98,088   97,512   96,784   98,292   95,404   94,011   96,774
40      96,517   95,527   97,555   96,831   95,915   97,809   94,200   92,504   95,849
45      95,406   94,154   96,709   95,797   94,617   97,047   92,396   90,366   94,347
50      93,735   92,078   95,445   94,249   92,680   95,896   89,614   86,946   92,146
55      91,357   89,089   93,676   92,044   89,894   94,282   85,599   81,898   89,063
60      88,038   85,067   91,058   88,908   86,103   91,810   80,282   75,282   84,923
65      83,114   79,213   87,043   84,145   80,450   87,930   73,268   66,782   79,231
70      76,191   71,168   81,200   77,338   72,531   82,206   64,578   56,723   71,774
75      66,605   60,336   72,748   67,756   61,683   73,794   53,914   44,994   62,028
80      53,925   46,461   61,045   54,953   47,622   62,031   41,332   31,985   49,714
85      38,329   30,619   45,438   39,024   31,324   46,175   28,260   20,021   35,600
90      22,219   15,948   27,782   22,460   16,145   28,082   16,403   10,432   21,627
95       9,419    5,808   12,448    9,330    5,720   12,362    7,554    4,180   10,374
100      2,510    1,261    3,460    2,381    1,175    3,299    2,534    1,178    3,559
33 / 68
US Life Table, ages 55 - 70

All races
Age      Total     Male   Female
55      91,357   89,089   93,676
60      88,038   85,067   91,058
65      83,114   79,213   87,043
70      76,191   71,168   81,200

• 83,114 out of 100,000 live births are expected to live to
age 65.
• The probability that a randomly selected person lives to
age 65 is approximately 0.83.
• Of the 88,038 who live to age 60, 83,114 live to age 65.
• The conditional probability of living to at least age 65
among those who live to at least age 60 is
(83,114/88,038) = 0.94.
34 / 68
Conditional Probability:
Mathematical Definition
As long as Pr(A) > 0,
\[ \Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} \]
In the life table example,
\[
\begin{aligned}
\Pr(A) &= \Pr(\text{person lives at least 60 years}) \\
       &= 88{,}038/100{,}000 \\
\Pr(B \cap A) &= \Pr(\text{lives at least 60 years and at least 65 years}) \\
       &= \Pr(\text{person lives at least 65 years}) \\
       &= 83{,}114/100{,}000
\end{aligned}
\]
So
\[ \Pr(B \mid A) = \frac{83{,}114/100{,}000}{88{,}038/100{,}000} = 0.94 \]
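The same calculation in R, using the survivor counts from the all-races column of the life table. This is a minimal sketch; the variable names are illustrative.

```r
## conditional probability from the 2004 US life table (all races)
cohort.size = 100000
survivors.60 = 88038
survivors.65 = 83114

pr.A = survivors.60 / cohort.size         ## lives at least 60 years
pr.A.and.B = survivors.65 / cohort.size   ## lives at least 65 years
                                          ##   (and so at least 60)
pr.B.given.A = pr.A.and.B / pr.A

round(pr.A.and.B, 2)
round(pr.B.given.A, 2)
```

The cohort size cancels in the ratio, which is why the answer can also be read directly as 83,114/88,038.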
35 / 68
Alcohol dependency
Suppose a study of the US population showed that 10% of
the population have some mental disorder, 8% have an
alcohol related disorder, and 6% have both.
If a person has been diagnosed with an alcohol related
disorder, what is the probability that he/she has a mental
disorder?
If a person has been diagnosed with a mental disorder,
what is the probability that he/she has an alcohol related
disorder?
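Both questions apply the definition of conditional probability to the same joint probability, with a different event in the denominator. A short sketch in R, with variable names chosen here for illustration:

```r
## alcohol dependency example
pr.mental = 0.10     ## some mental disorder
pr.alcohol = 0.08    ## alcohol related disorder
pr.both = 0.06       ## both

## Pr(mental disorder | alcohol related disorder)
pr.mental.given.alcohol = pr.both / pr.alcohol
## Pr(alcohol related disorder | mental disorder)
pr.alcohol.given.mental = pr.both / pr.mental

pr.mental.given.alcohol
pr.alcohol.given.mental
```

The two conditional probabilities differ, which is a reminder that Pr(B|A) and Pr(A|B) are not interchangeable.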
36 / 68
Independence again. . .
A simple consequence of the definition of conditional
probability:
• A and B are independent if Pr(B|A) = Pr(B)
Example
• B = (a randomly chosen person from a population
lives at least 65 years); A = (a randomly chosen person
lives at least 60 years).
Are A and B independent?
37 / 68
Announcements: Friday February 27
• P-sets: P-set 3 due today at 5:00 pm.
• P-sets: P-set 4 will be posted later today. P-set 4 will
be due Monday, March 9. Last p-set before the March
11 midterm.
• Collection of review problems with solutions coming
early next week.
• Exam 1 last year covered slightly different material, so
it will not be posted.
• Clicker questions today.
• Quiz 2 with solutions now posted on the web site. The
quiz will be returned in section next week.
• Begin today with slide 38 Unit 2.
More conditional probabilities
Suppose a disease is caused by a single major gene with two
alleles (a) and (A) with frequencies 0.90 and 0.10,
respectively.
• If we assume independent mating (non-assortative
mating), what are the probabilities of the genotypes
(aa), (aA) and (AA)?
• Suppose the allele A causes a disease but that the gene
is not fully penetrant, so that the probability of
developing the disease is 0.8 for genotype (AA), 0.4 for
genotype (aA), and 0.1 for genotype (aa). Disease in
the presence of genotype (aa) is called sporadic disease.
• What is the overall probability of disease in the
population? This overall probability is called the
prevalence.
38 / 68
Conditional probabilities . . .
• Suppose an individual is known to have the disease.
Now what are the probabilities of the genotypes (aa),
(aA) and (AA)?
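The questions on the last two slides can be worked through in R as below. This is a sketch under the stated assumption of independent mating (Hardy-Weinberg proportions); the vector names genotype.prob, penetrance and genotype.given.disease are illustrative.

```r
## genotype probabilities under independent mating
freq.a = 0.90
freq.A = 0.10
genotype.prob = c(aa = freq.a^2,
                  aA = 2 * freq.a * freq.A,
                  AA = freq.A^2)

## probability of disease given each genotype
penetrance = c(aa = 0.1, aA = 0.4, AA = 0.8)

## prevalence: sum over genotypes of
##   Pr(genotype) * Pr(disease | genotype)
prevalence = sum(genotype.prob * penetrance)

## genotype probabilities among those with the disease
genotype.given.disease = genotype.prob * penetrance / prevalence

genotype.prob
prevalence
round(genotype.given.disease, 3)
```

Conditioning on disease shifts probability away from (aa) and toward the genotypes carrying A, even though (aa) remains common among cases because it is so frequent in the population.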
39 / 68
Main ideas
• Definition of conditional probability
• A general multiplication rule for probabilities
40 / 68
Multiplication Rule of Probability
Pr(A ∩ B) = Pr(B|A) Pr(A)
We can also write
Pr(A ∩ B) = Pr(A|B) Pr(B)
Example
• A bag contains 3 red and 3 white balls. Two balls are
drawn from the bag, the second without replacing the
first.
• A = (first ball is red), B = (second ball is white).
• Pr(A ∩ B) = Pr(B|A) Pr(A) = (3/5) × (1/2) = 3/10
41 / 68
Examples
First with independence:
• Toss a coin and let A = (observe heads in the toss)
• Pr(A) = 1/2
• What is the probability of getting 5 heads in a row when
flipping a coin 5 times?
• 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = 1/32
Now without independence:
• Draw 5 balls (without replacement) from an urn with
10 balls, 5 red, 5 green.
• The probability of getting 5 green balls is
5/10 × 4/9 × 3/8 × 2/7 × 1/6 = 120/30240 = 1/252
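Both products can be checked in R. In the sketch below the second calculation multiplies the conditional probabilities of drawing a green ball on each of 5 draws without replacement; the names pr.five.heads and pr.five.green are illustrative.

```r
## with independence: 5 heads in 5 tosses of a fair coin
pr.five.heads = (1/2)^5

## without independence: 5 green balls in 5 draws, without
##   replacement, from an urn with 5 red and 5 green balls
green.remaining = 5:1     ## green balls left before each draw
balls.remaining = 10:6    ## total balls left before each draw
pr.five.green = prod(green.remaining / balls.remaining)

pr.five.heads
pr.five.green
## same answer by counting: one favorable set of 5 balls
##   out of choose(10, 5) equally likely sets
1 / choose(10, 5)
```

Each factor in the second product is a conditional probability given the earlier draws, which is the general multiplication rule applied repeatedly.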
42 / 68
Conditional distributions of heights
In the US population, approximately 20% of men and 3%
of women are taller than 6 feet (72 in)
Let F = the event that someone is female and T = the
event the person is taller than 6 feet
1. What is Pr(T|F)?
2. What is Pr(T|F^c)?
3. What is the probability that the next person walking
through the door is a woman and taller than 6 feet?
4. What is the probability that the next person walking
through the door is taller than 6 feet tall?
43 / 68
Tree diagrams (IPS p320) can help to organize your thinking...
Person is...
• Female: Yes (0.5)
    • Taller than 6’ (0.03): 0.5 × 0.03 = 0.015
    • Not taller than 6’ (0.97): 0.5 × 0.97 = 0.485
• Female: No (0.5)
    • Taller than 6’ (0.2): 0.5 × 0.2 = 0.10
    • Not taller than 6’ (0.8): 0.5 × 0.8 = 0.40
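The branches of the tree correspond to a few lines of arithmetic. The sketch below assumes, as the tree diagram does, that the next person through the door is female with probability 0.5; the variable names are illustrative.

```r
## heights example: multiply along branches, add across branches
pr.F = 0.5               ## assumption taken from the tree diagram
pr.T.given.F = 0.03      ## women taller than 6 feet
pr.T.given.notF = 0.20   ## men taller than 6 feet

## a woman and taller than 6 feet (multiplication rule)
pr.F.and.T = pr.F * pr.T.given.F

## taller than 6 feet (sum over the two "taller" branches)
pr.T = pr.F * pr.T.given.F + (1 - pr.F) * pr.T.given.notF

pr.F.and.T
pr.T
```

The last line is the same decomposition that later supplies the denominator in Bayes' rule.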
44 / 68
Progress This Unit
Introduction
Basic concepts from probability
Conditional probability, general multiplication formula for
probability
Positive Predictive Value of a Diagnostic Test and Bayes’
Rule
45 / 68
Main ideas this section
Diagnostic (or screening) tests and measures of test
accuracy
Calculating the positive predictive value of a test using
• Tables
• Bayes’ rule
46 / 68
Pre-natal testing for trisomy 21, 13,
and 18
Some congenital disorders are caused by an additional copy
of a chromosome being attached (translocated) to another
chromosome during reproduction.
• Trisomy 21: Down syndrome, approximately 1 in 800
births
• Trisomy 13: Patau’s syndrome, physical and mental
disabilities, approximately 1 in 16,000 newborns
• Trisomy 18: Edward’s syndrome, nearly always fatal,
either in stillbirth or infant mortality. Occurs in about
1 in 6,000 births
Until recently, testing for these conditions consisted of
screening the mother’s blood for serum markers, followed
by amniocentesis in women who test positive.
47 / 68
Cell-free fetal DNA (cfDNA)
cfDNA consists of copies of embryo DNA present in
maternal blood.
Recent advances in sequencing DNA provide possibility of
non-invasive testing for these disorders, using only a blood
sample.
Initial testing of the technology was done using archived
samples of genetic material from children whose trisomy
status was known.
The results are variable, but generally very good.
48 / 68
• Of 1000 children with one of the disorders,
approximately 980 have cfDNA that tests positive.
The test has high sensitivity.
• Of 1000 children without the disorders, approximately
995 test negative. The test has high specificity.
49 / 68
What do the parents of the unborn
child care about?
The designers of a test want a test to have high sensitivity
and specificity. That makes it a good test.
But a family undergoing testing wants to know the
likelihood of the condition being present, if the test is
positive.
Suppose a child has tested positive for trisomy 21. What is
the probability that the child in fact does have the trisomy
21 condition?
50 / 68
Trisomy 21
We will show 3 solutions (!) to this problem.
• Intuitive solution that requires only common sense
(and a bit of clear thinking. . . ). Will also illustrate use
of R as a calculator
• Algebraic solution using Bayes’ rule.
• Simulation based solution, similar to that used in the
drug testing problem.
Each solution provides a different way to think about the
problem.
Intuitive solution based on a simple two-way table on the
board.
51 / 68
Using R as a ‘programmable’
calculator
## calculations for trisomy 21 example, unit 2
## parameters of the problem
tri21.prevalence = 1/800
cfdna.sens = 0.980
cfdna.spec = 0.995
pop.size = 10000
## expected number of healthy children and children with
##   the disorder
expected.cases = pop.size * tri21.prevalence
expected.noncases = pop.size - expected.cases
## Number of children testing positive will consist
##   of both true and false positives
expected.true.pos.tests = expected.cases * cfdna.sens
expected.false.pos.tests = expected.noncases * (1 - cfdna.spec)
52 / 68
Trisomy 21 . . .
## now calculate expected number of positive tests
##   in population
expected.pos.tests = expected.true.pos.tests +
                     expected.false.pos.tests
## Among all positive tests, calculate the fraction of
##   positive tests correctly identifying trisomy 21
cfdna.ppv = expected.true.pos.tests / expected.pos.tests
cfdna.ppv
53 / 68
Diagnostic Tests
Events of interest, where () denotes an event:
• D = (person has disease)
• D^C = (person does not have disease)
• T+ = (positive screening result)
• T− = (negative screening result). Could use T and
T^C, but T+, T− are consistent with notation in
medical and public health literature.
54 / 68
Measures of accuracy for diagnostic
tests
• Sensitivity = Pr(T+|D) (want very high!)
• Specificity = Pr(T−|D^C) (want high!)
• False negative rate = Pr(T−|D) = 1 - sensitivity
• False positive rate = Pr(T+|D^C) = 1 - specificity
These measures are all characteristics of a diagnostic test.
55 / 68
Positive predictive value of a test
Suppose an individual tests positive for a disease D.
Positive Predictive Value (PPV): The PPV of a diagnostic
test is the probability that a person has a disease D, given
that he/she has tested positive.
• PPV = Pr(D|T+)
• The conditioning here is in the reverse order from the
test characteristics
The characteristics of the test give us Pr(T+|D) (among
other things) but not Pr(D|T+).
56 / 68
Bayes’ Theorem, aka Bayes’ Rule
Simple form first:
\[ \Pr(A \mid B) = \frac{\Pr(A)\Pr(B \mid A)}{\Pr(B)} \]
Follows directly by noting that
\[ \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(A)\Pr(B \mid A)}{\Pr(B)} \]
Back to pre-natal screening (on the board, again).
57 / 68
The denominator Pr(B) in Bayes’
Theorem
Seldom stated as simply as in earlier slides, because in
many problems, Pr(B) is not given directly, but is
calculated using the general multiplication formula for
probabilities.
Suppose A and B are events. Then
\[
\begin{aligned}
\Pr(B) &= \Pr(B \cap A) + \Pr(B \cap A^C) \\
       &= \Pr(A)\Pr(B \mid A) + \Pr(A^C)\Pr(B \mid A^C)
\end{aligned}
\]
(Venn diagram of the events A and B, with the overlap A ∩ B.)
58 / 68
More complicated form of Bayes’
Theorem
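\[
\begin{aligned}
\Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)} \\
              &= \frac{\Pr(A)\Pr(B \mid A)}{\Pr(B)} \\
              &= \frac{\Pr(A)\Pr(B \mid A)}{\Pr(A)\Pr(B \mid A) + \Pr(A^C)\Pr(B \mid A^C)}
\end{aligned}
\]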
Next slide converts this to notation of diagnostic testing.
59 / 68
Bayes’ Rule for diagnostic tests
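Take A = D, B = T+
Recall PPV is Positive Predictive Value
\[
\begin{aligned}
\text{PPV} &= \Pr(D \mid T^+) \\
           &= \frac{\Pr(D)\Pr(T^+ \mid D)}{\Pr(D)\Pr(T^+ \mid D) + (1 - \Pr(D))\Pr(T^+ \mid D^C)} \\
           &= \frac{\text{prevalence} \times \text{sensitivity}}{[\text{prev} \times \text{sensitivity}] + [(1 - \text{prev}) \times (1 - \text{specificity})]}
\end{aligned}
\]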
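The formula translates directly into a small R function. This is a sketch rather than code from the lecture: the function name ppv() and its argument names are chosen here, and the example values are the cfDNA figures quoted earlier in this unit (prevalence 1 in 800, sensitivity 980 in 1000, specificity 995 in 1000).

```r
## positive predictive value from Bayes' rule
ppv = function(prevalence, sensitivity, specificity) {
  true.pos = prevalence * sensitivity
  false.pos = (1 - prevalence) * (1 - specificity)
  true.pos / (true.pos + false.pos)
}

## cfDNA example
cfdna.ppv = ppv(prevalence = 1/800,
                sensitivity = 0.980,
                specificity = 0.995)
cfdna.ppv

## PPV depends strongly on prevalence, even for a very good test
ppv(prevalence = c(1/16000, 1/6000, 1/800),
    sensitivity = 0.980, specificity = 0.995)
```

Because the function is vectorized, it can be used to see how the PPV changes across the prevalences of the three trisomies, under the simplifying assumption that the same sensitivity and specificity apply to each.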
60 / 68
Venn diagram of Bayes’ Theorem
(Venn diagram showing the events D, D^C and T+.)
Breast cancer and mammograms
The National Cancer Institute estimates that
approximately 3.65% of women in their 60’s get breast
cancer. A mammogram typically identifies a breast cancer
about 85% of the time, and is correct 95% of the time when
a woman does not have breast cancer.
If a woman in her 60’s has a positive mammogram, what is
the likelihood she has breast cancer?
62 / 68
Two solutions to breast cancer
example
• Algebraic solution in the p-set, using the Bayes’ rule
formula
• Simulation of a single large population shown on next
slide
The R code for the simulation is spread over 3 slides, but
much of the code is just comment.
Examining and using the code provides an opportunity to
match probabilistic concepts with algorithmic thinking.
63 / 68
Using R to construct and examine a
large population
Code is a bit longer than drug testing problem, so split over
several slides.
Instead of simulating 150 drug tests with many replicates,
we use R to construct a large population of individuals.
Logic of the simulation:
• Initialize
• Loop through the population, simulating test outcome
conditional on disease status by using an if()
statement.
• Calculate PPV empirically.
64 / 68
The R code
## Simulation of setting for diagnostic testing
## Based on breast cancer example discussed in class
## step 1, initialize population with no disease, no test
population.size = 100000
disease.prevalence = 0.0365
test.sens = 0.85
test.spec = 0.95    ## correct 95% of the time when no cancer
set.seed(6579)
## Initialize the population, create the disease marker
## This code starts by establishing a population whose
##   members have the disease with probability equal to
##   the prevalence of the disease
disease.presence = vector("numeric", population.size)
disease.presence = sample(c(0,1), size = population.size,
                          prob = c(1 - disease.prevalence,
                                   disease.prevalence), replace = TRUE)
## Now initialize the diagnostic test outcome vector
diag.test.outcome = vector("numeric", population.size)
65 / 68
The R code . . .
## The code now loops through the population, sampling outcomes
## for the diagnostic test using sensitivity and specificity,
## conditional on the disease status of each member.
## The if() statements sample outcomes of tests conditional on
## disease status
for (ii in 1:population.size) {
  if(disease.presence[ii] == 1) {
    diag.test.outcome[ii] = sample(c(0,1), size=1,
                                   prob = c(1 - test.sens, test.sens)) }
  if(disease.presence[ii] == 0) {
    diag.test.outcome[ii] = sample(c(0,1), size=1,
                                   prob = c(test.spec, 1 - test.spec)) }
} ## end for loop
## Create a matrix where each row contains the disease status and
## test outcome for each population member
disease.pres.and.diag.test = cbind(disease.presence,
                                   diag.test.outcome)
66 / 68
R code for mammogram example
## As in the trisomy example, calculate ppv by finding the
## proportion of true positive tests among all positive tests.
## Since disease and test outcome are labeled 0 and 1, the sum of
## the vector disease.presence is the number of members of the
## population with the disease. The same reasoning is used to
## calculate the number of positive tests
num.disease = sum(disease.presence)
num.pos.test = sum(diag.test.outcome)
## Number of true positives is the number of positive
##   tests among members with the disease
d = (disease.presence == 1)
num.true.pos = sum(diag.test.outcome[d])
diag.test.ppv = num.true.pos/num.pos.test
diag.test.ppv
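For comparison, the same population can be simulated without a loop and checked against the Bayes' rule value. This is an alternative sketch, not the code above: it uses rbinom() and ifelse() instead of sample() inside a for() loop, takes the specificity of 0.95 from the problem statement, and introduces the names prob.pos.test, sim.ppv and bayes.ppv for illustration.

```r
## vectorized simulation of the mammogram example
population.size = 100000
disease.prevalence = 0.0365
test.sens = 0.85
test.spec = 0.95    ## from the problem statement
set.seed(6579)

## disease status: 1 = disease, 0 = no disease
disease.presence = rbinom(population.size, size = 1,
                          prob = disease.prevalence)

## probability of a positive test depends on disease status
prob.pos.test = ifelse(disease.presence == 1,
                       test.sens, 1 - test.spec)
diag.test.outcome = rbinom(population.size, size = 1,
                           prob = prob.pos.test)

## empirical PPV: true positives among all positive tests
sim.ppv = sum(diag.test.outcome == 1 & disease.presence == 1) /
          sum(diag.test.outcome)

## PPV from Bayes' rule, for comparison
bayes.ppv = (disease.prevalence * test.sens) /
            (disease.prevalence * test.sens +
             (1 - disease.prevalence) * (1 - test.spec))

sim.ppv
bayes.ppv
```

With a population this large, the simulated and algebraic values should agree to about two decimal places; the draws differ from the loop version, so the simulated numbers will not be identical.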
67 / 68
Summary, and a look ahead
The notions of probability, independence and conditional
probability provide ways to make probabilistic statements
(assessments of uncertainty) about events in populations.
The problems that arise (especially word problems) are
often easy to state, but difficult to solve.
Problems can be solved with algebraic calculations or using
R.
• Algebraic calculations are more familiar (and seem
easier), but getting to the right calculation can be
difficult.
• Using R requires dealing with R syntax, but is often
conceptually easier because simulating a population or
replicates of an experiment follows the problem
statement directly.
68 / 68