Familial searching
Maarten Kruijver
[email protected]
VU University, Amsterdam
23 April, 2015

DNA databases often identify suspects, but what if there is no match?
• If there is no match, one can look for a relative of the donor of a crime scene profile through Familial Searching
• ‘Finding Criminals Through DNA of Their Relatives’ – Bieber, Brenner, Lazer (Science, 2006)
• Bieber et al. showed that familial searching is feasible using a simulation study

POLICYFORUM
HUMAN GENETICS
Finding Criminals Through DNA of Their Relatives

Analyses of the DNA databases maintained by criminal justice systems might enable criminals to be caught by recognizing their kin, but this raises civil liberties issues.

DNA methods are now widely used for many forensic purposes, including routine investigation of serious crimes and for identification of persons killed in mass disasters or wars (1–4). DNA databases of convicted offenders are maintained by every U.S. state and nearly every industrialized country, allowing comparison of crime scene DNA profiles to one another and to known offenders (5). The policy in the United Kingdom stipulates that almost any collision with law enforcement results in the collection of DNA (6). Following the U.K. lead, the United States has shifted steadily toward inclusion of all felons, and federal and six U.S. state laws now include some provision for those arrested or indicted. At present, there are over 3 million samples in the U.S. offender/arrestee state and federal DNA databases (7). Statutes governing the use of such samples and protection against misuse vary from state to state (8). Although direct comparisons of DNA profiles of known individuals and unknown biological evidence are most common, indirect genetic kinship analyses, using the DNA of biological relatives, are often necessary for humanitarian mass disaster and missing person identifications (1, 2, 9).
Such methods could potentially be applied to searches of the convicted offender/arrestee DNA databases. When crime scene samples do not match anyone in a search of forensic databases, the application of indirect methods could identify individuals in the database who are close relatives of the potential suspects. This raises compelling policy questions about the balance between collective security and individual privacy (10). To date, searching DNA databases to identify close relatives of potential suspects has been used in only a small number of cases, if sometimes to dramatic effect. For example, the brutal 1988 murder of 16-year-old Lynette White, in Cardiff, Wales, was finally solved in 2003. A search of the U.K. National DNA Database for individuals with a specific single rare allele found in the crime scene evidence identified a 14-year-old boy with a similar overall DNA profile. This led police to his paternal uncle, Jeffrey Gafoor (11). Investigation of the 1984 murder of Deborah Sykes revealed a close, but not perfect, match to a man in the North Carolina DNA offender database, which led investigators to his brother, Willard Brown (12). Both Gafoor and Brown matched the DNA from the respective crime scenes, confessed, and were convicted. Although all individuals have some genetic similarity, close relatives have very similar DNA profiles because of shared ancestry.

1 Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA. 2 DNAVIEW, Oakland, CA 94611, and School of Public Health, University of California, Berkeley, CA 94720, USA. 3 John F. Kennedy School of Government, Harvard University, Cambridge, MA 02138, USA. Authors are alphabetized to reflect equal contributions. Comments and ideas expressed herein are their own. *Author for correspondence. E-mail: [email protected]. Credit: Barbara D. Cummings.
We demonstrate the potential value of kinship analysis for identifying promising leads in forensic investigations on a much wider scale than has been used to date. Let us assume that a sample from a crime scene has been obtained that is not an exact match to the profile of anyone in current DNA databases. Using Monte Carlo simulations (13, 14), we investigated the chances of successfully identifying a biological relative of someone whose profile is in the DNA database as a possible source of crime scene evidence (15). Each Monte Carlo trial simulates a database of known offenders, a sample found at a crime scene, and a search. The search compares the crime sample with each catalogued offender in turn by computing likelihood ratios (LRs) that assess the likelihood of parent-child or of sibling relationships (1, 16). We used published data on allele frequencies of the 13 short tandem repeat (STR) loci on which U.S. offender databases are based and basic genetic principles (17–19). A high LR is characteristic of related individuals and is an unusual but possible coincidence for unrelated individuals. The analysis of each simulation therefore assumes that investigators would follow these leads in priority order, starting with those in the offender database with the highest LR for being closely related to the owner of the crime scene DNA sample. Our simulations demonstrate that kinship analysis would be valuable now for detecting potential suspects who are the parents, children, or siblings of those whose profiles are in forensic databases. For example, assume that the unknown sample is from the biological child of one of the 50,000 offenders in a typical-sized state database. Of the 50,000 LRs comparing the “unknown” sample to each registered offender in the database, the child corresponds to the largest LR about half the time, and has a 99% chance of appearing among the 100 largest LRs (see chart). 
An analysis of potential sibling relationships produced a similar curve (13). These results could be refined by additional data, for example, large numbers of single-nucleotide polymorphisms (SNPs). Better and immediately practical, a seven-locus Y-STR haplotype analysis on the crime scene and the list of database leads would eliminate 99% of those not related by male lineage (20). Data mining (vital records, genealogical and geographical data) for the existence of suitable suspects related to the leads can also help to refine the list. The potential for improving effectiveness of DNA database searches is large. Consider a hypothetical state in which the “cold-hit” rate (the chance of finding a match between a crime scene sample and someone in the offender database) is 10%. Suppose that [...]

Frederick R. Bieber,1* Charles H. Brenner,2 David Lazer3
Science, Vol. 312, 2 June 2006, p. 1315 (www.sciencemag.org). Published by AAAS. Downloaded from www.sciencemag.org on June 2, 2013.

Overview
Familial searching is the process of looking for close relatives of an offender in a DNA database
• Basic principles
• The search process
• Strategies for generating the candidate list
• Simulation studies of search strategies
• Wrap-up
• Computational aspects
• Exercises

Familial searching works by conducting kinship tests on a large scale
• For each database profile, compute a Kinship Index (KI) with the case profile
• When searching for full siblings, use the Sibling Index (SI). For parent/offspring searches, use the Parent Index (PI).
• For crime scene profile x and database profile y,
  $\mathrm{SI} = \dfrac{P(x, y \mid \text{full siblings})}{P(x, y \mid \text{unrelated})}, \qquad \mathrm{PI} = \dfrac{P(x, y \mid \text{parent/offspring})}{P(x, y \mid \text{unrelated})}.$

The familiar interpretation and caveats of likelihood ratios apply to kinship indices
Interpretation
• Kinship Index is a likelihood ratio
• Larger LR means stronger evidence
• Bayes rule: posterior odds = prior odds × likelihood ratio
• LR is sufficient: it completely conveys the strength of the evidence with regard to the two hypotheses. It does not matter, for example, whether 10 or 15 loci were typed if LRs are equal
Caveats
• Large LR does not imply large posterior odds, since prior odds may be very small
• Large posterior odds do not imply large posterior probability, since the two competing hypotheses need not be exhaustive. For instance, a parent/offspring pair often has a large SI
• ‘Law of truly large numbers’: any outrageous thing is likely to happen at some point. Large LRs may very well be false positives

Bieber et al. find that a true relative often has the largest kinship index of all database members
• A number of test searches are simulated
• For a number of case profiles, a search is simulated
• 62% of the parent/offspring searches had the true relative ranked first
• Searching is feasible. How do we proceed?

Finding Criminals Through DNA of Their Relatives (Bieber et al., continued)

[...] base themselves, even 5% of them have a close (parent/child or sibling) relative who is. From our projections that up to 80% (counting the 10 best leads) of those 5% could be indirectly identified, it follows that the kinship analyses we describe could increase a 10% cold-hit rate to 14%, that is, by 40%. There have been 30,000 cold hits in the United States up to now (5). Kinship searching has the potential for thousands more. Success of kinship searching depends most saliently on a close relative of the perpetrator actually being in the offender database. Studies clearly indicate a strong probabilistic dependency between the chances of conviction of parents and their children, as well as among siblings (21). Consistent with these studies, in a U.S. Department of Justice survey, 46% of jail inmates indicated that they had at least one close relative who had been incarcerated (22). Such observations do not define or delineate [...]

[...] considerable controversy, especially in light of recent trends to expand collections to arrestees and those convicted of minor crimes and misdemeanors (25). Although use of retained samples for other purposes is prohibited by federal and several state laws, sample retention also has been a controversial practice. Debates on the expansion of the scope of DNA collections for offender and arrestee databases, as well as collections of volunteer samples, e.g., through DNA dragnets, have concentrated on the balance between society’s interests in security and privacy interests of individuals who might be included in the database and on the fairness and equity of including some in the databases but not others (26, 27). Privacy interests include genetic privacy [as DNA samples can yield medical and other information (28)] and locational privacy (where the contributor has been and left DNA). As with any investigative technique, these DNA matching strategies will lead to investigation of the innocent. Existing state and federal statutes do not specifically address familial searches, and it is unlikely such search strategies were even considered at the time original statutes were written. Use of familial searching methods described herein could raise new legal challenges, as a new category of people effectively would be placed under lifetime genetic surveillance. Its composition would reflect existing demographic disparities in the criminal justice system, in which arrests and convictions differ widely based on race, ethnicity, geographic location, and social class. Familial searching potentially amplifies these existing disparities. These issues need to be confronted, [...] ical, and legal implications, in addition to their valuable investigatory potential.

[Figure: probability to find this relative if one exists (0 to 1) against number of leads investigated (k; 0, 10, 100, 1000).]
Finding the genetic needle in a large haystack. The probability of identifying a close relative (i.e., parent/child) of a known offender by kinship searching is shown. Crime scene evidence would be searched against each profile in a simulated offender DNA database. A parent/child would be identified 62% of the time as the very first lead, and 99% of the time among the first 100 leads. Although these familial searching methods do not invariably distinguish parent/child from siblings, they have a high chance of identifying close relatives, if they exist, among the database samples with the highest LRs.

References and Notes
1. C. H. Brenner, B. S. Weir, Theor. Popul. Biol. 63, 173 (2003).
2. C. H. Brenner, Forensic Sci. Int. 157, 172 (2006).
3. F. R. Bieber, in DNA and the Criminal Justice System, D. Lazer, Ed. (MIT Press, Cambridge, MA, 2004), pp. 23–72.
4. L. Biesecker et al., Science 310, 1122 (2005).
5. F. Bieber, J. Law Med. Ethics 34, 222 (2006).
6. U.K. Criminal Justice Act, 2003 (www.opsi.gov.uk/acts/en2003/2003en44.htm).
7. See (www.fbi.gov/hq/lab/codis/index1.htm).
8. For a summary of DNA database legislation in the United States, see (www.aslme.org).
9. B. Budowle, F. R. Bieber, A. Eisenberg, Legal Med. 7, 230 (2005).
10. D. Lazer, M. Meyer, in DNA and the Criminal Justice System, D. Lazer, Ed. (MIT Press, Cambridge, MA, 2004), pp. 357–390.
11. BBC News, “How police found Gafoor,” 4 July 2003 (http://news.bbc.co.uk/1/hi/wales/3038138.stm).
12. R. Willing, USA Today, 7 June 2005, p. 1 (www.usatoday.com/news/nation/2005-06-07-dna-cover_x.htm).
13. Materials and methods are available as supporting material on Science Online.
14. N. Metropolis, S. Ulam, J. Am. Stat. Assoc. 44, 335 (1949).
15. In the simulations, we made a variety of simplifying assumptions (e.g., regarding random mating, mutation rates). These results are thus, of course, approximations that will need experimental validation.
16. C. C. Li, L. Sacks, Biometrics 10, 347 (1954).
17. B. Budowle et al., J. Forensic Sci. 44, 1277 (1999).
18. J. Butler, Forensic DNA Typing (Elsevier Academic Press, Burlington, MA, ed. 2, 2005).
19. A. J. F. Griffiths et al., An Introduction to Genetic Analysis (Freeman, New York, ed. 2, 2004).
20. Data from Y-Chromosome Haplotype Reference STR database (YHRD), see (www.yhrd.org).
21. C. Smith, D. Farrington, J. Child Psychol. Psychiatr. 45, 230 (2004).
22. U.S. Bureau of Justice Statistics, Correctional Populations in the United States, 1996 (NJC 170013, U.S. Department of Justice, Washington, DC, April 1999), p. 62 (www.ojp.usdoj.gov/bjs/pub/pdf/cpius964.pdf).
23. See United States v. Kincade, 379 F. 3d 813 (9th Cir. 2004) (en banc).
24. State v. Raines, 875 A. 2d 19 (Md. 2004) (collecting cases).
25. D. Cardwell, New York Times, 4 May 2006.
26. A. Etzioni, in DNA and the Criminal Justice System, D. Lazer, Ed. (MIT Press, Cambridge, MA, 2004), pp. 197–224.

The familial searching process
Case profile → Database LRs → Candidate list → Eliminate false positives → Final candidate list
• Case profile does not match any database profile
• Compute LR for kinship (full siblings or parent/offspring) with all database members
• Top k: select the top k LRs, for some k
• Fixed LR: select DB members with LR > t, for some threshold t
• Profile-dependent: select DB members with LR > t_α, s.t.
P(LR > t_α) = α for true sibs
• Conditional: select DB members such that posterior probability ≥ α
• Lineage markers (Y-STR typing) can eliminate many false positives
• Typing additional loci can also eliminate false positives
• The final candidate list is investigated tactically by the authorities

There is a trade-off between length of the candidate list and power of detection
Generating the candidate list:
• Select a number of database profiles with the highest LRs
• Trade-off between workload (eliminating false positives) and probability of detection (PoD)
Workload and PoD per case are driven by:
• Case profile (rare alleles or common alleles?);
• Search strategy and tuning parameters (k, t, α);
• Database size (N).

Likelihood ratio distributions depend on the case profile
• For 1,000 simulated SGMplus profiles, the SI-distribution is obtained with respect to a true full sibling and an unrelated person
[Figure: SI distributions per case profile, with curves labelled "unrelated" and "sibling".]
• Distribution differs a lot between case profiles. Large effect on TPR and FPR.
• Variation is caused by rarity of the profile. Profiles with rare alleles are especially amenable to familial searching.
• Effect on search strategies is discussed next.

We discuss four search strategies
• Top k:
  o select the top k LRs for some fixed k;
  o workload is fixed, PoD depends on case profile.
• Fixed LR threshold:
  o select DB members with LR > t, for some fixed threshold t;
  o both workload and PoD depend on case profile;
  o optimal in the long run.
• Profile-dependent:
  o select DB members with LR > t_α, s.t. P(LR > t_α) = α for true sibs;
  o PoD is fixed, workload depends on case profile.
• Conditional:
  o select DB members such that the posterior probability that the relative is on the candidate list, conditional on a relative being present and on the observed LRs, is at least α;
  o both workload and PoD depend on case profile, but average PoD ≥ α.
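The first two strategies above can be sketched in a few lines. This is a minimal illustration, assuming the kinship indices for all database members have already been computed; the function names and the toy LR values are hypothetical and not taken from the slides.

```python
from typing import List, Sequence


def top_k_candidates(lrs: Sequence[float], k: int) -> List[int]:
    """Top-k strategy: indices of the k largest LRs (ties broken by index)."""
    ranked = sorted(range(len(lrs)), key=lambda i: (-lrs[i], i))
    return ranked[:k]


def fixed_lr_candidates(lrs: Sequence[float], t: float) -> List[int]:
    """Fixed-LR strategy: indices with LR > t, largest LR first."""
    hits = [i for i, lr in enumerate(lrs) if lr > t]
    return sorted(hits, key=lambda i: (-lrs[i], i))


# Hypothetical kinship indices for a six-member database.
toy_lrs = [0.02, 1500.0, 3.5, 0.0, 740.0, 12.0]

print(top_k_candidates(toy_lrs, 3))         # list length is always k
print(fixed_lr_candidates(toy_lrs, 500.0))  # list length depends on the LRs
```

With top k the candidate list always has length k, whereas with a fixed threshold its length varies with the case profile, which is the workload trade-off the slide describes.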
Conditional
• Suppose the database contains at most one relative
• Prior probability that database contains a relative: π_D
• Prior probability that database member i is a relative: π_i
• Likelihood ratio for database member i: r_i
• Posterior probability that database member i is the relative:
  $\dfrac{r_i \pi_i}{\sum_{k=1}^{N} r_k \pi_k + 1 - \pi_D}$
• Strategy: select database members such that the sum of the posterior probabilities is at least α

Simulation: top k strategy (1,000 profiles)
• Large variation in PoD for fixed k
• Increasing k gives quickly diminishing returns in terms of PoD
• Using 15 instead of 10 loci makes it possible to increase DB size ~10 times, while retaining the PoD

Simulation: fixed LR strategy (1,000 profiles)
• Large variation in PoD for fixed t
• Variance of # candidates is larger for smaller thresholds
• Number of false positives increases linearly with database size

Simulation: conditional (1,000 profiles)
• Uniform prior on database members
• For each case profile, simulate 1,000 times with a full sibling added to the database
• Number of candidates has high variance
• When all LRs are small, many false leads have to be excluded

The fixed-LR strategy is optimal in the long run
• Fixed-LR strategy is most efficient in the long-run: lowest FPR for given TPR.
• How many more false positives with top k or profile-dependent threshold?
• Take top 168 strategy as point of reference in fixed DB (N=100,000). Tuning parameters (t, α) such that the average PoD coincides with top 168.
• Fixing workload is cheap; fixing PoD is not.
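The conditional strategy from the slide on posterior probabilities can be sketched as follows. This is an illustrative sketch only: it assumes π_D equals the sum of the per-member priors π_i, and it adds members in order of decreasing posterior until the cumulative posterior reaches α. The function names and the toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def posterior_relative_probs(lrs: Sequence[float], priors: Sequence[float]) -> List[float]:
    """Posterior r_i*pi_i / (sum_k r_k*pi_k + 1 - pi_D) for each database member."""
    pi_d = sum(priors)  # assumption: pi_D is the sum of the pi_i
    denom = sum(r * p for r, p in zip(lrs, priors)) + 1.0 - pi_d
    return [r * p / denom for r, p in zip(lrs, priors)]


def conditional_candidates(
    lrs: Sequence[float], priors: Sequence[float], alpha: float
) -> Tuple[List[int], float]:
    """Add members by decreasing posterior until the summed posterior is >= alpha.

    Returns the chosen indices and the posterior mass they cover. If alpha
    cannot be reached, all members are returned and the mass stays below alpha.
    """
    post = posterior_relative_probs(lrs, priors)
    order = sorted(range(len(post)), key=lambda i: (-post[i], i))
    chosen: List[int] = []
    covered = 0.0
    for i in order:
        if covered >= alpha:
            break
        chosen.append(i)
        covered += post[i]
    return chosen, covered


# Hypothetical four-member database with a uniform prior and pi_D = 0.5.
toy_lrs = [1000.0, 10.0, 1.0, 0.1]
toy_priors = [0.125] * 4

print(conditional_candidates(toy_lrs, toy_priors, 0.95))
print(conditional_candidates(toy_lrs, toy_priors, 0.99))
```

When one LR dominates, a single candidate already covers most of the posterior mass. When all LRs are small, many members are needed, which matches the remark that many false leads then have to be excluded.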
[Figure: density of PoD (0.3 to 1.0) and density of # candidates selected (0 to 800) for the Top k, SI-threshold, Profile-centered and Conditional strategies.]

Wrap-up
• Workload and PoD per case may depend on case profile
• A fixed-LR threshold is optimal in the long run
• Fixing workload is cheap; fixing PoD is not

Computational aspects
• Suppose we have a case profile and work with a fixed SI threshold of 500. How to compute TPR and FPR for this threshold?
  TPR(t) = P(KI > t | H_p)
  FPR(t) = P(KI > t | H_d)

Forensic Science International: Genetics 14 (2015) 116–124
Efficient computations with the likelihood ratio distribution
Maarten Kruijver
Department of Mathematics, VU University, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands
Article history: Received 12 February 2014; received in revised form 4 August 2014; accepted 23 September 2014
Keywords: Likelihood ratio; Exceedance probability; Exact computation; Monte Carlo simulation; Importance sampling; Familial searching
R package: DNAprofiles

Abstract
What is the probability that the likelihood ratio exceeds a threshold t, if a specified hypothesis is true? This question is asked, for instance, when performing power calculations for kinship testing, when computing true and false positive rates for familial searching and when computing the power of discrimination of a complex mixture. Answering this question is not straightforward, since there are a huge number of possible genotypic combinations to consider. Different solutions are found in the literature. Several authors estimate the threshold exceedance probability using simulation. Corradi and Ricciardi [1] propose a discrete approximation to the likelihood ratio distribution which yields a lower and upper bound on the probability. Nothnagel et al. [2] use the normal distribution as an approximation to the likelihood ratio distribution. Dørum et al. [3] introduce an algorithm that can be used for exact computation, but this algorithm is computationally intensive, unless the threshold t is very large. We present three new approaches to the problem. Firstly, we show how importance sampling can be used to make the simulation approach significantly more efficient. Importance sampling is a statistical technique that turns out to work well in the current context. Secondly, we present a novel algorithm for computing exceedance probabilities. The algorithm is exact, fast and can handle relatively large problems. Thirdly, we introduce an approach that combines the novel algorithm with the discrete approximation of Corradi and Ricciardi. This last approach can be applied to very large problems and yields a lower and upper bound on the exceedance probability. The use of the different approaches is illustrated with examples from forensic genetics, such as kinship testing, familial searching and mixture interpretation. The algorithms are implemented in an R-package called DNAprofiles, which is freely available from CRAN.
© 2014 Elsevier Ireland Ltd. All rights reserved.

The KI follows a probability distribution
• First obtain distribution of KI per locus

M. Kruijver / Forensic Science International: Genetics 14 (2015) 116–124, p. 118
Table 1 Probability distribution of the likelihood ratio for pairwise kinship at a locus with a fixed individual. See [16] for a version including subpopulation correction.
$G_1$ | $G_2$ | $P(G_2 \mid G_1, H_p)$ | $P(G_2 \mid G_1, H_d)$ | LR
a/a | z/z | $k_0 p_z^2$ | $p_z^2$ | $k_0$
a/a | a/z | $k_0\, 2 p_a p_z + k_1 p_z$ | $2 p_a p_z$ | $k_0 + k_1 \tfrac{1}{2} p_a^{-1}$
a/a | a/a | $k_0 p_a^2 + k_1 p_a + k_2$ | $p_a^2$ | $k_0 + k_1 p_a^{-1} + k_2 p_a^{-2}$
a/b | z/z | $k_0 p_z^2$ | $p_z^2$ | $k_0$
a/b | a/z | $k_0\, 2 p_a p_z + k_1 \tfrac{1}{2} p_z$ | $2 p_a p_z$ | $k_0 + k_1 \tfrac{1}{4} p_a^{-1}$
a/b | a/a | $k_0 p_a^2 + k_1 \tfrac{1}{2} p_a$ | $p_a^2$ | $k_0 + k_1 \tfrac{1}{2} p_a^{-1}$
a/b | b/z | $k_0\, 2 p_b p_z + k_1 \tfrac{1}{2} p_z$ | $2 p_b p_z$ | $k_0 + k_1 \tfrac{1}{4} p_b^{-1}$
a/b | b/b | $k_0 p_b^2 + k_1 \tfrac{1}{2} p_b$ | $p_b^2$ | $k_0 + k_1 \tfrac{1}{2} p_b^{-1}$
a/b | a/b | $k_0\, 2 p_a p_b + k_1 \tfrac{1}{2} (p_a + p_b) + k_2$ | $2 p_a p_b$ | $k_0 + k_1 \tfrac{1}{4} (p_b^{-1} + p_a^{-1}) + k_2 \tfrac{1}{2} p_a^{-1} p_b^{-1}$
(z: an allele other than a or b)

2.1. Example (likelihood ratio for pairwise kinship with a fixed person)
Consider the likelihood ratio for kinship of two persons for m autosomal short tandem repeat (STR) loci. First, we work on a single locus. Suppose person 1 is fixed and has genotype $G_1$. The likelihood ratio for kinship of person 1 and person 2 compares the hypotheses:
$H_p$: Person 1 and person 2 are related according to IBD-probabilities $k_0, k_1, k_2$;
$H_d$: Person 1 and person 2 are unrelated;
where the identical by descent (IBD)-probabilities $k_0, k_1, k_2$ denote the probabilities that person 1 and person 2 share 0, 1 or 2 alleles [...]

[...] pair from the population, either related ($H_p$) or unrelated ($H_d$). The evidence now consists of two DNA profiles, so there are many more possible likelihood ratios than in the previous example. Assume the alleles at the locus are labelled $a_1, \ldots, a_L$. If both person 1 and person 2 are homozygous $(a_1, a_1)$, then the likelihood ratio is equal to $k_0 + k_1 p_{a_1}^{-1} + k_2 p_{a_1}^{-2}$. Under $H_p$, this happens with probability $p_{a_1}^2 (k_0 p_{a_1}^2 + k_1 p_{a_1} + k_2)$; under $H_d$ with probability $p_{a_1}^2 p_{a_1}^2$. Similar computations are made for all combinations of genotypes, partly shown in Table 2. The last column contains the outcomes of the likelihood ratio. The fourth column lists the probabilities under $H_d$; the third under $H_p$. Table 2 shows that there are many more outcomes for the likelihood ratio than in the previous example, where one of the [...]

• Then estimate or compute the exceedance probability

We use algorithms to compute or estimate probabilities (TPRs, FPRs)
• Estimation:
  o Sample a large number of profiles according to hypothesis;
  o Compute LR for each profile;
  o Estimate exceedance probability by empirical fraction of LRs above threshold.
• Estimation is problematic when probability is small, since no sample will be above threshold
• Importance sampling is an alternative to regular sampling. Idea: sample from a different hypothesis (H_p instead of H_d) and correct for the bias.
• Exact computation is possible when number of markers is not too large

Example

Example

Questions?
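As a rough illustration of the estimation and importance sampling ideas from the computational slides, the sketch below estimates FPR(t) = P(SI > t | H_d) for a fixed case profile in three ways: plain Monte Carlo under H_d, importance sampling under H_p with each exceeding sample weighted by 1/LR, and brute-force enumeration as a check. It is not the DNAprofiles implementation. The allele frequencies, the four-locus case profile and all function names are hypothetical, and Hardy-Weinberg proportions with independent loci are assumed.

```python
import itertools
import math
import random

K_SIB = (0.25, 0.5, 0.25)  # IBD probabilities (k0, k1, k2) for full siblings


def p_unrelated(g, freqs):
    """P(genotype g) for an unrelated person (Hardy-Weinberg assumed)."""
    c, d = g
    return freqs[c] ** 2 if c == d else 2 * freqs[c] * freqs[d]


def p_one_ibd(g, case, freqs):
    """P(g | exactly one allele is shared IBD with the case genotype)."""
    c, d = g
    total = 0.0
    for s in case:  # each case allele is the shared one with probability 1/2
        if c == d:
            total += 0.5 * freqs[c] * (s == c)
        else:
            total += 0.5 * (freqs[d] * (s == c) + freqs[c] * (s == d))
    return total


def locus_lr(g, case, freqs, k=K_SIB):
    """Single-locus kinship index P(g | case, Hp) / P(g | Hd)."""
    same = 1.0 if tuple(sorted(g)) == tuple(sorted(case)) else 0.0
    p_d = p_unrelated(g, freqs)
    return (k[0] * p_d + k[1] * p_one_ibd(g, case, freqs) + k[2] * same) / p_d


def profile_lr(profile, case_profile, loci, k=K_SIB):
    """Multi-locus kinship index: product over independent loci."""
    return math.prod(locus_lr(g, c, f, k) for g, c, f in zip(profile, case_profile, loci))


def sample_genotype(case, freqs, k, rng):
    """Sample a genotype related to `case` with IBD probabilities k."""
    alleles = list(freqs)
    weights = [freqs[a] for a in alleles]
    ibd = rng.choices((0, 1, 2), weights=k)[0]
    if ibd == 2:
        return tuple(sorted(case))
    first = rng.choice(case) if ibd == 1 else rng.choices(alleles, weights=weights)[0]
    return tuple(sorted((first, rng.choices(alleles, weights=weights)[0])))


def estimate_fpr(case_profile, loci, t, n, rng, importance=False, k=K_SIB):
    """Estimate P(LR > t | Hd); with importance=True, sample under Hp and weight by 1/LR."""
    sample_k = k if importance else (1.0, 0.0, 0.0)
    total = 0.0
    for _ in range(n):
        prof = [sample_genotype(c, f, sample_k, rng) for c, f in zip(case_profile, loci)]
        lr = profile_lr(prof, case_profile, loci, k)
        if lr > t:
            total += 1.0 / lr if importance else 1.0
    return total / n


def exact_fpr(case_profile, loci, t, k=K_SIB):
    """Exact P(LR > t | Hd) by enumerating all profiles (small problems only)."""
    per_locus = [list(itertools.combinations_with_replacement(sorted(f), 2)) for f in loci]
    fpr = 0.0
    for prof in itertools.product(*per_locus):
        if profile_lr(prof, case_profile, loci, k) > t:
            fpr += math.prod(p_unrelated(g, f) for g, f in zip(prof, loci))
    return fpr


# Hypothetical example: four loci, three alleles each, case homozygous for the rare allele.
loci = [{"a": 0.1, "b": 0.3, "c": 0.6}] * 4
case_profile = [("a", "a")] * 4
threshold = 100.0
rng = random.Random(1)

exact = exact_fpr(case_profile, loci, threshold)
plain = estimate_fpr(case_profile, loci, threshold, 20000, rng)
weighted = estimate_fpr(case_profile, loci, threshold, 20000, rng, importance=True)
print(exact, plain, weighted)
```

Under H_d only a handful of the sampled profiles exceed the threshold, so the plain estimate is noisy. Sampling siblings makes exceedances common, and the 1/LR weight removes the resulting bias because P(y | H_d) = P(y | H_p) / LR(y) whenever LR(y) > 0, which holds for siblings since k0 > 0.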