Familial searching
Maarten Kruijver
[email protected]
VU University, Amsterdam
23 April, 2015
DNA databases often identify suspects, but
what if there is no match?
POLICYFORUM
HUMAN GENETICS
Finding Criminals Through DNA
of Their Relatives
•  If there is no match, one can look for a relative of the donor of a crime scene profile through Familial Searching
•  ‘Finding Criminals Through DNA of Their Relatives’ – Bieber, Brenner, Lazer (Science, 2006)
•  Bieber et al. showed that familial searching is feasible using a simulation study
Analyses of the DNA databases maintained by
criminal justice systems might enable criminals
to be caught by recognizing their kin, but this
raises civil liberties issues.
DNA methods are now widely
used for many forensic purposes, including routine investigation of serious crimes and
for identification of persons killed
in mass disasters or wars (1–4).
DNA databases of convicted offenders are maintained by every U.S.
state and nearly every industrialized
country, allowing comparison of
crime scene DNA profiles to one
another and to known offenders (5).
The policy in the United Kingdom
stipulates that almost any collision
with law enforcement results in the
collection of DNA (6). Following
the U.K. lead, the United States has
shifted steadily toward inclusion of
all felons, and federal and six U.S.
state laws now include some provision for those arrested or indicted.
At present, there are over 3 million
samples in the U.S. offender/arrestee state and
federal DNA databases (7). Statutes governing
the use of such samples and protection against
misuse vary from state to state (8).
Although direct comparisons of DNA profiles of known individuals and unknown biological evidence are most common, indirect
genetic kinship analyses, using the DNA of
biological relatives, are often necessary for
humanitarian mass disaster and missing person
identifications (1, 2, 9). Such methods could
potentially be applied to searches of the convicted offender/arrestee DNA databases. When
crime scene samples do not match anyone in a
search of forensic databases, the application of
indirect methods could identify individuals in
the database who are close relatives of the
potential suspects. This raises compelling policy questions about the balance between collective security and individual privacy (10).
CREDIT: BARBARA D. CUMMINGS
1 Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA. 2 DNAVIEW, Oakland, CA 94611, and School of Public Health, University of California, Berkeley, CA 94720, USA. 3 John F. Kennedy School of Government, Harvard University, Cambridge, MA 02138, USA.
Authors are alphabetized to reflect equal contributions. Comments and ideas expressed herein are their own.
*Author for correspondence. E-mail: [email protected]
To date, searching DNA databases to identify close relatives of potential suspects has been used in only a small number of cases, if sometimes to dramatic effect. For example, the brutal
1988 murder of 16-year-old Lynette White, in
Cardiff, Wales, was finally solved in 2003. A
search of the U.K. National DNA Database for
individuals with a specific single rare allele
found in the crime scene evidence identified a 14-year-old boy with a similar overall
DNA profile. This led police to his paternal
uncle, Jeffrey Gafoor (11). Investigation of the
1984 murder of Deborah Sykes revealed a
close, but not perfect, match to a man in the
North Carolina DNA offender database, which
led investigators to his brother, Willard Brown
(12). Both Gafoor and Brown matched the
DNA from the respective crime scenes, confessed, and were convicted.
Although all individuals have some genetic
similarity, close relatives have very similar
DNA profiles because of shared ancestry. We
demonstrate the potential value of kinship
analysis for identifying promising leads in
forensic investigations on a much wider scale
than has been used to date.
Let us assume that a sample from a crime
scene has been obtained that is not an exact
match to the profile of anyone in current DNA
databases. Using Monte Carlo simulations
(13, 14), we investigated the chances of successfully identifying a biological relative of
someone whose profile is in the DNA database as a possible source of crime scene evidence (15). Each Monte Carlo trial simulates a
database of known offenders, a sample found
at a crime scene, and a search. The
search compares the crime sample
with each catalogued offender in
turn by computing likelihood
ratios (LRs) that assess the likelihood of parent-child or of sibling
relationships (1, 16). We used published data on allele frequencies
of the 13 short tandem repeat
(STR) loci on which U.S. offender
databases are based and basic
genetic principles (17–19). A high
LR is characteristic of related
individuals and is an unusual but
possible coincidence for unrelated
individuals. The analysis of each
simulation therefore assumes that
investigators would follow these
leads in priority order, starting
with those in the offender database with the highest LR for being
closely related to the owner of the
crime scene DNA sample.
Our simulations demonstrate that kinship
analysis would be valuable now for detecting
potential suspects who are the parents, children, or siblings of those whose profiles are in
forensic databases. For example, assume that
the unknown sample is from the biological
child of one of the 50,000 offenders in a typical-sized state database. Of the 50,000 LRs
comparing the “unknown” sample to each registered offender in the database, the child corresponds to the largest LR about half the time,
and has a 99% chance of appearing among the
100 largest LRs (see chart). An analysis of
potential sibling relationships produced a similar curve (13).
These results could be refined by additional
data, for example large numbers of single-nucleotide polymorphisms (SNPs). Better and
immediately practical, a seven-locus Y-STR
haplotype analysis on the crime scene and the
list of database leads would eliminate 99% of
those not related by male lineage (20). Data mining (vital records, genealogical and geographical data) for the existence of suitable
suspects related to the leads can also help to
refine the list.
The potential for improving effectiveness
of DNA database searches is large. Consider a
hypothetical state in which the “cold-hit”
rate—the chance of finding a match between
a crime scene sample and someone in the
offender database—is 10%. Suppose that
www.sciencemag.org SCIENCE VOL 312 2 JUNE 2006
Downloaded from www.sciencemag.org on June 2, 2013
Frederick R. Bieber,1* Charles H. Brenner,2 David Lazer3
Published by AAAS
Overview
Familial searching is the process of looking for close
relatives of an offender in a DNA database
•  Basic principles
•  The search process
•  Strategies for generating the candidate list
•  Simulation studies of search strategies
•  Wrap-up
•  Computational aspects
•  Exercises
Familial searching works by conducting
kinship tests on a large scale
•  For each database profile, compute a Kinship Index (KI)
with the case profile
•  When searching for full siblings, use the Sibling Index (SI); for parent/offspring searches, use the Parent Index (PI).
•  For crime scene profile x and database profile y,
$$\mathrm{SI} = \frac{P(x, y \mid \text{full siblings})}{P(x, y \mid \text{unrelated})}, \qquad \mathrm{PI} = \frac{P(x, y \mid \text{parent/offspring})}{P(x, y \mid \text{unrelated})}.$$
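The SI and PI defined above can be sketched in code for small examples. The snippet below is a minimal illustration, not the implementation behind these slides: it assumes Hardy-Weinberg proportions, independent loci, no mutation and no subpopulation correction, and it expresses both indices through the identical-by-descent probabilities (k0, k1, k2), with (1/4, 1/2, 1/4) for full siblings and (0, 1, 0) for parent/offspring. All function names and the toy allele frequencies are hypothetical.

```python
from math import prod

# IBD probabilities (k0, k1, k2); names are illustrative, not from the slides.
FULL_SIBLINGS = (0.25, 0.5, 0.25)
PARENT_OFFSPRING = (0.0, 1.0, 0.0)


def genotype_prob(g, freqs):
    """Population probability of an unordered genotype (Hardy-Weinberg assumed)."""
    a, b = g
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def prob_given_one_ibd(g1, g2, freqs):
    """P(g2 | g1) when exactly one allele is shared identical by descent."""
    total = 0.0
    for shared in g1:            # either allele of g1 is the shared one, each with prob. 1/2
        rest = list(g2)
        if shared in rest:
            rest.remove(shared)  # the other allele of g2 is a random population draw
            total += 0.5 * freqs[rest[0]]
    return total


def locus_ki(g1, g2, freqs, k):
    """Single-locus kinship index P(g2 | g1, related) / P(g2 | unrelated)."""
    k0, k1, k2 = k
    p_unrelated = genotype_prob(g2, freqs)
    same = sorted(g1) == sorted(g2)
    p_related = k0 * p_unrelated + k1 * prob_given_one_ibd(g1, g2, freqs) + k2 * same
    return p_related / p_unrelated


def kinship_index(profile_x, profile_y, loci_freqs, k):
    """Multi-locus kinship index: product over loci (independence assumed)."""
    return prod(locus_ki(gx, gy, f, k)
                for gx, gy, f in zip(profile_x, profile_y, loci_freqs))


# Toy single-locus example with made-up allele frequencies.
freqs = {"a": 0.1, "b": 0.2, "z": 0.7}
x, y = [("a", "a")], [("a", "a")]
print("SI:", kinship_index(x, y, [freqs], FULL_SIBLINGS))
print("PI:", kinship_index(x, y, [freqs], PARENT_OFFSPRING))
```

A parent/offspring pair must share an allele at every locus, so under these no-mutation assumptions the PI is zero as soon as one locus shows no shared allele, whereas the SI stays positive.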
The familiar interpretation and caveats of likelihood
ratios apply to kinship indices
Interpretation
•  Kinship Index is a likelihood ratio
•  Larger LR means stronger evidence
•  Bayes rule: posterior odds = prior odds × likelihood ratio
•  LR is sufficient: it completely conveys the strength of the evidence with regard to the two hypotheses. It does not matter, for example, whether 10 or 15 loci were typed if LRs are equal
Caveats
•  Large LR does not imply large posterior odds, since prior odds may be very small
•  Large posterior odds do not imply large posterior probability, since the two competing hypotheses need not be exhaustive. For instance, a parent/offspring pair often has a large SI
•  ‘Law of truly large numbers’: any outrageous thing is likely to happen at some point. Large LRs may very well be false positives
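The first caveat can be made concrete with a few lines of arithmetic. The sketch below applies Bayes rule in odds form; the prior of 1 in 100,000 and the LR of 1,000 are made-up numbers for illustration, and the conversion from odds to probability assumes the two hypotheses are exhaustive, which the second caveat warns is not always the case.

```python
def posterior_odds(prior_probability, likelihood_ratio):
    """Bayes rule in odds form: posterior odds = prior odds x likelihood ratio."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    return prior_odds * likelihood_ratio


def odds_to_probability(odds):
    """Only valid when the two competing hypotheses are exhaustive."""
    return odds / (1.0 + odds)


# Hypothetical numbers: one specific database member out of 100,000, LR of 1,000.
prior = 1 / 100_000
odds = posterior_odds(prior, 1000.0)
print("posterior odds:", odds)
print("posterior probability:", odds_to_probability(odds))
```

With these numbers the posterior odds stay near 1 to 100 despite the large LR, which is why a high kinship index in a large database is a lead to investigate rather than an identification.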
Bieber et al. find that a true relative often has the largest kinship index of all database members
•  A number of test searches are simulated
•  For a number of case profiles, a search is simulated
•  62% of the parent/offspring searches had the true relative ranked first
•  Searching is feasible. How do we proceed?
[...] base themselves, even 5% of them have a close (parent/child or sibling) relative who is. From our projections that up to 80% (counting the 10 best leads) of those 5% could be indirectly identified, it follows that the kinship analyses we describe could increase a 10% cold-hit rate to 14%, that is, by 40%. There have been 30,000 cold hits in the United States up to now (5). Kinship searching has the potential for thousands more.
Success of kinship searching depends most saliently on a close relative of the perpetrator actually being in the offender database. Studies clearly indicate a strong probabilistic dependency between the chances of conviction of parents and their children, as well as among siblings (21). Consistent with these studies, in a U.S. Department of Justice survey, 46% of jail inmates indicated that they had at least one close relative who had been incarcerated (22). Such observations do not define or delineate [...]
[...] considerable controversy, especially in light of recent trends to expand collections to arrestees and those convicted of minor crimes and misdemeanors (25). Although use of retained samples for other purposes is prohibited by federal and several state laws, sample retention also has been a controversial practice.
Debates on the expansion of the scope of DNA collections for offender and arrestee databases, as well as collections of volunteer samples, e.g., through DNA dragnets, have concentrated on the balance between society’s interests in security and privacy interests of individuals who might be included in the database and on the fairness and equity of including some in the databases but not others (26, 27). Privacy interests include genetic privacy [as DNA samples can yield medical and other information (28)] and locational privacy (where the contributor has been and left DNA). As with any investigative technique, these DNA matching strategies will lead to investigation of the innocent.
Existing state and federal statutes do not specifically address familial searches, and it is unlikely such search strategies were even considered at the time original statutes were written. Use of familial searching methods described herein could raise new legal challenges, as a new category of people effectively would be placed under lifetime genetic surveillance. Its composition would reflect existing demographic disparities in the criminal justice system, in which arrests and convictions differ widely based on race, ethnicity, geographic location, and social class. Familial searching potentially amplifies these existing disparities. These issues need to be confronted, [...] ical, and legal implications, in addition to their valuable investigatory potential.
[Figure: probability to find this relative if one exists (y-axis, 0 to 1) against number of leads investigated (k) (x-axis, 0 to 1000)]
Finding the genetic needle in a large haystack. The probability of identifying a close relative (i.e., parent/child) of a known offender by kinship searching is shown. Crime scene evidence would be searched against each profile in a simulated offender DNA database. A parent/child would be identified 62% of the time as the very first lead, and 99% of the time among the first 100 leads. Although these familial searching methods do not invariably distinguish parent/child from siblings, they have a high chance of identifying close relatives, if they exist, among the database samples with the highest LRs.
References and Notes
1. C. H. Brenner, B. S. Weir, Theor. Popul. Biol. 63, 173 (2003).
2. C. H. Brenner, Forensic Sci. Int. 157, 172 (2006).
3. F. R. Bieber, in DNA and the Criminal Justice System, D. Lazer, Ed. (MIT Press, Cambridge, MA, 2004), pp. 23–72.
4. L. Biesecker et al., Science 310, 1122 (2005).
5. F. Bieber, J. Law Med. Ethics 34, 222 (2006).
6. U.K. Criminal Justice Act, 2003 (www.opsi.gov.uk/acts/en2003/2003en44.htm).
7. See (www.fbi.gov/hq/lab/codis/index1.htm).
8. For a summary of DNA database legislation in the United States, see (www.aslme.org).
9. B. Budowle, F. R. Bieber, A. Eisenberg, Legal Med. 7, 230 (2005).
10. D. Lazer, M. Meyer, in DNA and the Criminal Justice System, D. Lazer, Ed. (MIT Press, Cambridge, MA, 2004), pp. 357–390.
11. BBC News, “How police found Gafoor,” 4 July 2003 (http://news.bbc.co.uk/1/hi/wales/3038138.stm).
12. R. Willing, USA Today, 7 June 2005, p. 1 (www.usatoday.com/news/nation/2005-06-07-dna-cover_x.htm).
13. Materials and methods are available as supporting material on Science Online.
14. N. Metropolis, S. Ulam, J. Am. Stat. Assoc. 44, 335 (1949).
15. In the simulations, we made a variety of simplifying assumptions (e.g., regarding random mating, mutation rates). These results are thus, of course, approximations that will need experimental validation.
16. C. C. Li, L. Sacks, Biometrics 10, 347 (1954).
17. B. Budowle et al., J. Forensic Sci. 44, 1277 (1999).
18. J. Butler, Forensic DNA Typing (Elsevier Academic Press, Burlington, MA, ed. 2, 2005).
19. A. J. F. Griffiths et al., An Introduction to Genetic Analysis (Freeman, New York, ed. 2, 2004).
20. Data from Y-Chromosome Haplotype Reference STR database (YHRD), see (www.yhrd.org).
21. C. Smith, D. Farrington, J. Child Psychol. Psychiatr. 45, 230 (2004).
22. U.S. Bureau of Justice Statistics, Correctional Populations in the United States, 1996 (NJC 170013, U.S. Department of Justice, Washington, DC, April 1999), p. 62 (www.ojp.usdoj.gov/bjs/pub/pdf/cpius964.pdf).
23. See United States v. Kincade, 379 F. 3d 813 (9th Cir. 2004) (en banc).
24. State v. Raines, 875 A. 2d 19 (Md. 2004) (collecting cases).
25. D. Cardwell, New York Times, 4 May 2006.
26. A. Etzioni, in DNA and the Criminal Justice System, D. Lazer, Ed. (MIT Press, Cambridge, MA, 2004), pp. 197–224.
The familial searching process
Case profile → Database LRs → Candidate list → Eliminate false positives → Final candidate list
•  Case profile: does not match any database profile
•  Database LRs: compute LR for kinship (full siblings or parent/offspring) with all database members
•  Candidate list:
o  Top k: select the top k LRs, for some k
o  Fixed LR: select DB members with LR > t, for some threshold t
o  Profile-dependent: select DB members with LR > t_α, s.t. P(LR > t_α) = α for true sibs
o  Conditional: select DB members such that posterior probability ≥ α
•  Eliminate false positives:
o  Lineage markers (Y-STR typing) can eliminate many false positives
o  Typing additional loci can also eliminate false positives
•  Final candidate list: investigated tactically by the authorities
There is a trade-off between length of the
candidate list and power of detection
Generating the candidate list:
•  Select a number of database profiles with the highest LRs
•  Trade-off between workload (eliminating false positives) and probability of
detection (PoD)
Workload and PoD per case are driven by:
•  Case profile (rare alleles or common alleles?);
•  Search strategy and tuning parameters (k, t, α);
•  Database size (N).
Likelihood ratio distributions depend on the
case profile
•  For 1,000 simulated SGMplus profiles, the SI-distribution is obtained with respect to a true full sibling and an unrelated person
[Figure: SI distributions per case profile; panels: unrelated, sibling]
•  Distribution differs a lot between case profiles. Large effect on TPR and FPR.
•  Variation is caused by rarity of the profile. Profiles with rare alleles are especially amenable to familial searching.
•  Effect on search strategies is discussed next.
We discuss four search strategies
•  Top k:
o  select the top k LRs for some fixed k;
o  workload is fixed, PoD depends on case profile.
•  Fixed LR threshold:
o  select DB members with LR > t, for some fixed threshold t;
o  both workload and PoD depend on case profile;
o  optimal in the long run.
•  Profile-dependent:
o  select DB members with LR > t_α, s.t. P(LR > t_α) = α for true sibs;
o  PoD is fixed, workload depends on case profile.
•  Conditional:
o  select DB members such that posterior probability that relative is on candidate list, conditional on a relative being present and the observed LRs, is at least α;
o  both workload and PoD depend on case profile, but average PoD ≥ α.
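The first two strategies in the list above are simple enough to sketch directly. The snippet is a minimal illustration that assumes the kinship indices for all database members have already been computed and are held in a plain list; the function names and the example LR values are hypothetical.

```python
def select_top_k(lrs, k):
    """Top k: indices of the k database members with the largest LRs (fixed workload)."""
    order = sorted(range(len(lrs)), key=lambda i: lrs[i], reverse=True)
    return order[:k]


def select_fixed_threshold(lrs, t):
    """Fixed LR: indices of all database members with LR > t (variable workload)."""
    return [i for i, lr in enumerate(lrs) if lr > t]


# Made-up LRs for a six-member database.
lrs = [0.01, 350.0, 2.5, 1200.0, 0.0, 75.0]
print(select_top_k(lrs, 2))
print(select_fixed_threshold(lrs, 50.0))
```

The top-k list always has exactly k entries regardless of how informative the case profile is, while the length of the threshold list varies from case to case, which is the workload versus PoD trade-off the slide describes.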
Conditional
•  Suppose the database contains at most one relative
•  Prior probability that database contains a relative: π_D
•  Prior probability that database member i is a relative: π_i
•  Likelihood ratio for database member i: r_i
•  Posterior probability that database member i is the relative:
$$\frac{r_i \pi_i}{\sum_{k=1}^{N} r_k \pi_k + 1 - \pi_D}$$
•  Strategy: select database members such that the sum of the posterior probabilities is at least α
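The posterior formula and the selection rule above can be sketched as follows. This is a minimal illustration rather than the implementation used for the simulations: it adds members in decreasing order of posterior probability until the running sum reaches α, and it returns the whole database if α is never reached, which is an assumption made here for the sketch. Function names and the example numbers are hypothetical.

```python
def posterior_probabilities(lrs, priors, prior_db):
    """r_i * pi_i / (sum_k r_k * pi_k + 1 - pi_D) for every database member i."""
    weighted = [r * p for r, p in zip(lrs, priors)]
    denominator = sum(weighted) + 1.0 - prior_db
    return [w / denominator for w in weighted]


def select_conditional(lrs, priors, prior_db, alpha):
    """Add members by decreasing posterior until the summed posterior reaches alpha."""
    post = posterior_probabilities(lrs, priors, prior_db)
    order = sorted(range(len(post)), key=lambda i: post[i], reverse=True)
    selected, cumulative = [], 0.0
    for i in order:
        if cumulative >= alpha:
            break
        selected.append(i)
        cumulative += post[i]
    return selected, cumulative


# Made-up example: four members, uniform prior pi_i = pi_D / N with pi_D = 0.5.
lrs = [1000.0, 10.0, 1.0, 0.1]
prior_db = 0.5
priors = [prior_db / len(lrs)] * len(lrs)
print(select_conditional(lrs, priors, prior_db, alpha=0.95))
print(select_conditional(lrs, priors, prior_db, alpha=0.99))
```

When one LR dominates, a single candidate already carries most of the posterior mass; when all LRs are small, many members are needed before the sum reaches α, so the candidate list grows.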
Simulation: top k strategy (1,000 profiles)
•  Large variation in PoD for fixed k
•  Increasing k gives quickly diminishing returns in terms of PoD
•  Using 15 instead of 10 loci makes it possible to increase DB size ~10 times, while retaining the PoD
Simulation: fixed LR strategy (1,000 profiles)
•  Large variation in PoD for fixed t
•  Variance of # candidates is larger for smaller thresholds
•  Number of false positives increases linearly with database size
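The linear growth of false positives with database size can be written out explicitly. As a sketch, assume a fixed case profile and a database of N members who are all unrelated to the donor, each exceeding the threshold t with probability FPR(t) = P(KI > t | H_d); by linearity of expectation,

$$\mathbb{E}[\#\,\text{false positives}] = N \cdot \mathrm{FPR}(t) = N \cdot P(\mathrm{KI} > t \mid H_d).$$

For illustration, with hypothetical values N = 100,000 and FPR(t) = 10^{-4}, about 10 false positives are expected per search, and doubling N doubles that number.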
Simulation: conditional (1,000 profiles)
•  Uniform prior on database members
•  For each case profile, simulate 1,000 times with a full sibling added to the database
•  Number of candidates has high variance
•  When all LRs are small, many false leads have to be excluded
The fixed-LR strategy is optimal in the long run
•  Fixed-LR strategy is most efficient in the long run: lowest FPR for given TPR.
•  How many more false positives with top k or profile-dependent threshold?
•  Take top 168 strategy as point of reference in fixed DB (N=100,000). Tune parameters (t, α) such that the average PoD coincides with top 168.
•  Fixing workload is cheap; fixing PoD is not.
[Figure: densities of PoD and of # candidates selected for the Top k, SI-threshold, Profile-centered and Conditional strategies]
Wrap-up
•  Workload and PoD per case may depend on case profile
•  A fixed-LR threshold is optimal in the long run
•  Fixing workload is cheap; fixing PoD is not
Computational aspects
•  Suppose we have a case profile and work with a fixed SI
threshold of 500. How to compute TPR and FPR for this
threshold?
$$\mathrm{TPR}(t) = P(\mathrm{KI} > t \mid H_p), \qquad \mathrm{FPR}(t) = P(\mathrm{KI} > t \mid H_d)$$
R Package: DNAprofiles
Forensic Science International: Genetics 14 (2015) 116–124
journal homepage: www.elsevier.com/locate/fsig
Efficient computations with the likelihood ratio distribution
Maarten Kruijver *
Department of Mathematics, VU University, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands
Article history: Received 12 February 2014; Received in revised form 4 August 2014; Accepted 23 September 2014
Keywords: Likelihood ratio; Exceedance probability; Exact computation; Monte Carlo; Simulation; Importance sampling
Abstract
What is the probability that the likelihood ratio exceeds a threshold t, if a specified hypothesis is true? This question is asked, for instance, when performing power calculations for kinship testing, when computing true and false positive rates for familial searching and when computing the power of discrimination of a complex mixture. Answering this question is not straightforward, since there is a huge number of possible genotypic combinations to consider. Different solutions are found in the literature. Several authors estimate the threshold exceedance probability using simulation. Corradi and Ricciardi [1] propose a discrete approximation to the likelihood ratio distribution which yields a lower and upper bound on the probability. Nothnagel et al. [2] use the normal distribution as an approximation to the likelihood ratio distribution. Dørum et al. [3] introduce an algorithm that can be used for exact computation, but this algorithm is computationally intensive, unless the threshold t is very large.
We present three new approaches to the problem. Firstly, we show how importance sampling can be used to make the simulation approach significantly more efficient. Importance sampling is a statistical technique that turns out to work well in the current context. Secondly, we present a novel algorithm for computing exceedance probabilities. The algorithm is exact, fast and can handle relatively large problems. Thirdly, we introduce an approach that combines the novel algorithm with the discrete approximation of Corradi and Ricciardi. This last approach can be applied to very large problems and yields a lower and upper bound on the exceedance probability. The use of the different approaches is illustrated with examples from forensic genetics, such as kinship testing, familial searching and mixture interpretation. The algorithms are implemented in an R-package called DNAprofiles, which is freely available from CRAN.
© 2014 Elsevier Ireland Ltd. All rights reserved.
The KI follows a probability distribution
•  First obtain distribution of KI per locus
•  Then estimate or compute the exceedance probability
M. Kruijver / Forensic Science International: Genetics 14 (2015) 116–124
Table 1
Probability distribution of the likelihood ratio for pairwise kinship at a locus with a fixed individual. See [16] for a version including subpopulation correction.
$$
\begin{array}{llllll}
G_1 & G_2 & P(G_2 \mid G_1, H_p) & P(G_2 \mid G_1, H_d) & \mathrm{LR} \\
\hline
a/a & z/z & k_0 p_z^2 & p_z^2 & k_0 \\
a/a & a/z & k_0 2 p_a p_z + k_1 p_z & 2 p_a p_z & k_0 + k_1 \tfrac{1}{2} p_a^{-1} \\
a/a & a/a & k_0 p_a^2 + k_1 p_a + k_2 & p_a^2 & k_0 + k_1 p_a^{-1} + k_2 p_a^{-2} \\
\hline
a/b & z/z & k_0 p_z^2 & p_z^2 & k_0 \\
a/b & a/z & k_0 2 p_a p_z + k_1 \tfrac{1}{2} p_z & 2 p_a p_z & k_0 + k_1 \tfrac{1}{4} p_a^{-1} \\
a/b & a/a & k_0 p_a^2 + k_1 \tfrac{1}{2} p_a & p_a^2 & k_0 + k_1 \tfrac{1}{2} p_a^{-1} \\
a/b & b/z & k_0 2 p_b p_z + k_1 \tfrac{1}{2} p_z & 2 p_b p_z & k_0 + k_1 \tfrac{1}{4} p_b^{-1} \\
a/b & b/b & k_0 p_b^2 + k_1 \tfrac{1}{2} p_b & p_b^2 & k_0 + k_1 \tfrac{1}{2} p_b^{-1} \\
a/b & a/b & k_0 2 p_a p_b + k_1 \tfrac{1}{2} (p_a + p_b) + k_2 & 2 p_a p_b & k_0 + k_1 \tfrac{1}{4} (p_b^{-1} + p_a^{-1}) + k_2 \tfrac{1}{2} p_a^{-1} p_b^{-1}
\end{array}
$$
2.1. Example (likelihood ratio for pairwise kinship with a fixed person)
Consider the likelihood ratio for kinship of two persons for m autosomal short tandem repeat (STR) loci. First, we work on a single locus. Suppose person 1 is fixed and has genotype G_1. The likelihood ratio for kinship of person 1 and person 2 compares the hypotheses:
H_p: Person 1 and person 2 are related according to IBD-probabilities k_0, k_1, k_2;
H_d: Person 1 and person 2 are unrelated;
where the identical by descent (IBD)-probabilities k_0, k_1, k_2 denote the probabilities that person 1 and person 2 share 0, 1 or 2 alleles [...]
[...] pair from the population, either related (H_p) or unrelated (H_d). The evidence now consists of two DNA profiles, so there are many more possible likelihood ratios than in the previous example. Assume the alleles at the locus are labelled a_1, ..., a_L. If both person 1 and person 2 are homozygous (a_1, a_1), then the likelihood ratio is equal to $k_0 + k_1 p_{a_1}^{-1} + k_2 p_{a_1}^{-2}$. Under H_p, this happens with probability $p_{a_1}^2 (k_0 p_{a_1}^2 + k_1 p_{a_1} + k_2)$; under H_d with probability $p_{a_1}^2 p_{a_1}^2$. Similar computations are made for all combinations of genotypes, partly shown in Table 2. The last column contains the outcomes of the likelihood ratio. The fourth column lists the probabilities under H_d; the third under H_p.
Table 2 shows that there are many more outcomes for the likelihood ratio than in the previous example, where one of the [...]
We use algorithms to compute or estimate
probabilities (TPRs, FPRs)
•  Estimation:
o  Sample a large number of profiles according to hypothesis;
o  Compute LR for each profile;
o  Estimate exceedance probability by empirical fraction of LRs above threshold.
•  Estimation is problematic when probability is small, since
no sample will be above threshold
•  Importance sampling is an alternative to regular
sampling. Idea: sample from a different hypothesis (Hp
instead of Hd) and correct for the bias.
•  Exact computation is possible when number of markers
is not too large
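The estimation bullets above can be sketched for a toy problem. The snippet below is a minimal illustration, not the DNAprofiles implementation: it estimates FPR(t) once by plain sampling under Hd and once by importance sampling under Hp, using the identity FPR(t) = E_Hp[1{LR > t} / LR], which holds because P(G | Hd) / P(G | Hp) = 1 / LR. It assumes Hardy-Weinberg proportions, independent loci and no mutation; every name, the allele frequencies and the case profile are hypothetical.

```python
import random
from math import prod

FULL_SIBLINGS = (0.25, 0.5, 0.25)   # IBD probabilities (k0, k1, k2)
UNRELATED = (1.0, 0.0, 0.0)


def genotype_prob(g, freqs):
    a, b = g
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def locus_lr(g1, g2, freqs, k):
    """Single-locus LR for relatedness k versus unrelated, with g1 fixed."""
    one_ibd = 0.0
    for shared in g1:
        rest = list(g2)
        if shared in rest:
            rest.remove(shared)
            one_ibd += 0.5 * freqs[rest[0]]
    p_hd = genotype_prob(g2, freqs)
    p_hp = k[0] * p_hd + k[1] * one_ibd + k[2] * (sorted(g1) == sorted(g2))
    return p_hp / p_hd


def sample_genotype(g1, freqs, k, rng):
    """Draw a genotype for a person related to g1 with IBD probabilities k."""
    alleles, weights = list(freqs), list(freqs.values())

    def draw():
        return rng.choices(alleles, weights)[0]

    u = rng.random()
    if u < k[0]:                      # no allele shared IBD
        g = (draw(), draw())
    elif u < k[0] + k[1]:             # one allele copied from g1
        g = (rng.choice(g1), draw())
    else:                             # both alleles IBD
        g = tuple(g1)
    return tuple(sorted(g))


def sample_lr(case, loci_freqs, k_sample, k_test, rng):
    """Sample a profile under k_sample and return its LR for k_test versus unrelated."""
    other = [sample_genotype(g, f, k_sample, rng) for g, f in zip(case, loci_freqs)]
    return prod(locus_lr(g, o, f, k_test) for g, o, f in zip(case, other, loci_freqs))


def fpr_naive(case, loci_freqs, k, t, n, rng):
    """Fraction of unrelated samples whose LR exceeds t."""
    return sum(sample_lr(case, loci_freqs, UNRELATED, k, rng) > t for _ in range(n)) / n


def fpr_importance(case, loci_freqs, k, t, n, rng):
    """Sample under Hp and reweight each exceedance by 1 / LR."""
    total = 0.0
    for _ in range(n):
        lr = sample_lr(case, loci_freqs, k, k, rng)
        if lr > t:
            total += 1.0 / lr
    return total / n


rng = random.Random(1)
freqs = {"a": 0.1, "b": 0.2, "c": 0.7}          # made-up frequencies, reused at each locus
loci_freqs = [freqs] * 4
case = [("a", "a"), ("a", "b"), ("b", "b"), ("a", "c")]
print("naive:     ", fpr_naive(case, loci_freqs, FULL_SIBLINGS, 100.0, 20000, rng))
print("importance:", fpr_importance(case, loci_freqs, FULL_SIBLINGS, 100.0, 20000, rng))
```

Under Hd only a small share of samples lands above a high threshold, so the naive estimate rests on few hits; under Hp exceedances are common and each contributes a weight below 1/t, which is why the importance-sampling estimate tends to be more stable for small probabilities.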
Example
Example
Questions?