1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 A Pairwise Relatedness Estimator bias-corrected for sample kin structure and size. Dik Heg* and Dmitry A. Konovalov† *Department of Behavioural Ecology, Zoological Institute, University of Bern, Hinterkappelen, Switzerland, †School of Mathematics, Physics & Information Technology, James Cook University, Townsville, Queensland, Australia RUNNING TITLE: Unbiased relatedness estimator Keywords: bias correction, relatedness, rare alleles, allele frequencies, microsatellites Correspondence: Dik Heg, Fax: +41 31 6319141; E-mail [email protected] 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Abstract Currently used pairwise relatedness (r) estimators (based on codominant DNA markers) are only unbiased when population allele frequencies are known exactly. In practice, the allele frequencies must often be estimated from a population sample of limited size and of unknown kin structure. A new estimator of r is presented, which uses population heterozygosity but not the frequencies of individual alleles. Based on P-values and mean squared errors obtained from Monte Carlo simulations it is shown that the new estimator outperforms the other considered estimators for small samples and samples with internally high average pairwise relatedness. For large samples with low internal relatedness the new estimator is comparable to the existing estimators. The P-values calculated using the new estimator are independent of the estimated heterozygosity. 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 Introduction 45 (e.g. BROMAN 2001; MCPEEK et al. 2004). Equivalently, the sample X could be Estimation of pairwise relatedness using microsatellite (codominant) markers is used extensively in biology, e.g. in studies of kin structure (e.g. Dierkes et al. 2005) and population management (e.g. Liebert & Starks 2006; Romo et al. 2006). There is a number of relatedness estimators available with the LR (Lynch & Ritland 1999), QG (Queller & Goodnight 1989) and W (Wang 2002) estimators being the most commonly used in practice. All three (LR, QG and W) estimators require population allele frequencies which would normally be estimated from a population sample. Such simultaneous estimation of relatedness and allele frequencies affects reliability of the estimators, e.g. the estimators become biased with the amount of bias depending on the sample kin (pedigree) structure and size (Wang 2002). Wang (2002) has started to address the problem by developing the W relatedness estimator which is bias-corrected for the sample size. The purpose of the present study is to estimate pairwise relatedness, when unknown biases due to the sample kin structure may be present, i.e. the allele frequencies and pairwise relatedness values are estimated from the same sample. Note that this is the situation most commonly experienced by workers in the field using microsatellite DNA markers, where due to time and/or money constraints no large independent random sample of individuals of one generation is available to capture all alleles in the population and to calculate population frequencies of the alleles with high accuracy. Here we show that the commonly used (LR, QG and W) relatedness estimators become severely biased when sample size is small and complex kin relationships are present in the sample. We derive a new regression-based relatedness estimator (HK) which depends only on the population heterozygosity averaged across observed loci (in addition to the actual genotypes). Using Monte Carlo simulations the HK estimator is compared to the LR, QG and W estimators and found to be less biased than the other three estimators regardless of the sample kin structure and size. All four estimators have been integrated into the KINGROUP v2 program (Konovalov et al. 2004) which is freely available from www.kingroup.org and could be run on all java-enabled computer platforms such as Unix/Linux, Windows and Mac-OSX (java release 1.5 or higher is required). Methods Terminology and definitions Let X be a sample from an outbred population containing n individuals who are genotyped at a single locus (generalization for multiple loci is given below) containing k codominant alleles { A1 , A2 ,..., Ak } . The sample genotypes are recorded as a matrix X [ xm,i ] , where xm ,i is equal to the number of Am alleles observed in the i‘th individual 3 1 2 3 expressed as X (x1 , x 2 ,..., x n ) , where xi ( x1i , x2i ,..., xki ) is the i’th column of the matrix X , i.e. the i’th individual of the sample. For a diploid population each sample k genotype xi contains exactly two alleles, m 1 xmi 2 . For example, genotypes ( A1 , A2 ) 4 5 and ( A3 , A3 ) are encoded as (1,1, 0, 0) and (0, 0, 2, 0) , respectively, at a locus with four (k 4) alleles { A1 , A2 , A3 , A4 } . The i’th index (second index in xm,i ) refers to an 6 7 individual in the sample and could be omitted when working with individuals identified by other means, e.g. x x1 , x2 ,..., xk or y y1 , y2 ,..., yk . 8 9 10 11 12 13 14 15 If no further information is available about the sample, the m’th allele frequency is estimated by (e.g. Nei 1978) 1 n qm (1) xmi 2n i 1 and will be referred to as sample allele frequency. Queller & Goodnight (1989) discussed an approach where two focal individuals (dyad) are excluded from the allele frequencies calculation, i.e. n 1 qm (i, j ) xm,i , (2) 2(n 2) ii j 1 and will be referred to as outer-pairs allele frequencies. Another allele frequency biascorrection of Queller & Goodnight (1989) is the outer-groups allele frequencies, where the individuals from the focal group (or groups) are excluded from the allele frequencies calculation for that group(s). The outer-groups correction uses (or assumes) additional information about the sample, i.e. the knowledge of statistically uncorrelated (unrelated) groups. When such information is not available (this study), the correction is also unavailable. 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 The second form could be obtained using common notation (Lynch & Ritland 1999; Wang 2002) and Kronecker delta ij defined as i , j i 1 and i , j i 0 (e.g. Wang 32 33 34 35 2004). Let individuals X {x(l 1), x(2),..., x( L)} and Y {y (1), y (2),..., y ( L)} be genotyped at L unlinked loci and kl alleles at l’th locus. At locus l , denoting two alleles observed in x(l ) and y (l ) genotypes as (a, b) and (c, d ) , respectively, the QG estimator is given by One of the most commonly used relatedness estimator is the QG (Queller & Goodnight 1989) estimator in its symmetric form as implemented in the RELATEDNESS, KINSHIP (Goodnight & Queller 1999) and IDENTIX (Belkhir et al. 2002) programs. The QG estimator could be expressed in different forms yielding identical estimates of r. The first (original) form is given by Queller & Goodnight (1989). L rQG ( X , Y ) 36 S (ab, cd ) p 2 l 1 37 a l 1 L ab pb pc pd cd pa pb pc pd , (3) where 4 1 2 3 4 5 6 7 8 9 10 11 S (ab, cd ) ac ad bc bd , (4) and where the values of (a, b, c, d ) alleles together with the corresponding population allele frequencies ( pa , pb , pc , pd ) are locus specific. The explicit dependency on the locus index (l ) is omitted hereafter to simplify the notation if the locus is implied by context. Note that the QG estimator permits a number of variations (not just different forms of the same variant), e.g. a variant studied in Lynch & Ritland (1999) is different from Equation 3. The QG estimator could also be rewritten in vector notation (the third form) using S (a, b, c, d ) x y , pa pb p x , pc pd p y , 1 ab 12 x x and 1 cd 12 y y , obtaining L 12 T rQG ( X , Y ) B x y p x p y l 1 L l 1 1 2 xx y y p x py , (5) 1 2 13 where the dot-product notation x y ml 1 xm ym is used to sum over allele index m and 14 where p p(l ) ( p1 , p2 ,..., pkl ) is the vector of (locus specific) population allele 15 frequencies. Population locus homozygosity and heterozygosity are defined by k k 16 (l ) p(l ) p(l ) pm2 (l ) and h(l ) 1 (l ) , (6) m 1 17 18 19 20 21 22 23 24 25 respectively. New relatedness estimator The QG estimator becomes biased when the sample or outer-pairs allele frequencies are used instead of the population allele frequencies, see Appendix. From the derivation of the expected value of the QG estimator (see Appendix) and after some rearrangement it is found that E (xi x j ) 2 E (xi x j ) (xi x j ) 4(1 rij )h(l ) depends only on the 30 31 locus heterozygosity h(l ) . Then an unbiased estimator rijHK of relatedness can be found within a class of best linear unbiased estimators (BLUE) which is given by, in most general terms (e.g. McPeek et al. 2004), L d 2 (l ) rijHK 1 wl ij , (7) h(l ) l 1 where dij2 (l ) 14 (xi x j ) 2 , (8) 32 33 34 and where w1 , w2 ,..., wL are the locus weights and the given L loci are assumed to be statistically independent (unlinked). The estimator is only approximately unbiased since it 2 contains a division, i.e. E A / B E A / E B cov A, B / E B . Without knowing 26 27 28 29 5 1 the true relatedness matrix the covariance term cov dij2 (l ), h(l ) can not be calculated 2 3 4 exactly and is omitted. Then the weights are selected proportional to the locus heterozygosity h(l ) obtaining rijHK (h ) 1 dij2 / h , (9) 5 where dij2 1 L L l 1 dij2 (l ) and h 1 L L l 1 h(l ) . The estimator is approximately (in the 6 7 8 9 10 11 12 13 14 15 Taylor expansion sense) unbiased but having the advantage of not depending on the frequencies of individual alleles. The estimator depends only on the population heterozygosity h (averaged across loci). Moreover the dependency is trivial, where h is just a scaling coefficient. The new estimator essentially reduces the problem of pairwise relatedness estimation to the problem of heterozygosity estimation which is well studied especially in relation to inbreeding (e.g. Rousset & Raymond 1995). In the case of outbred populations, a simple heterozygosity estimator could be obtained by observing that 1 2 E x i x i 1 (l ) 2 h(l ) is independent of kin structure (and a specific allele 16 17 frequency). A heterozygosity indicator could be defined in vector notation as (10) hii 2 12 xi xi , 18 where its expected value yields the locus heterozygosity E hii h(l ) . Then an unbiased 19 estimator he (l ) of heterozygosity at a locus can be found in the BLUE terms via n 20 he (l ) ui hii , (11) i 1 21 where the weights u1 , u2 ,..., un are normalized by 22 hii 0 for a heterozygote and homozygote, respectively . If the relatedness matrix rij 23 were known, the most optimal BLUE weights could be found by minimizing var he . 24 Since rij is not known, equal weights ui 25 26 27 28 29 30 necessarily the most efficient) estimator of heterozygosity in the absence of allelespecific frequencies. The he estimate at a locus is simply equal to the number of observed heterozygotes averaged over the sample size n. 31 1 n n u 1 , and where hii 1 and i 1 i are used, which yield an unbiased (but not Since the new estimator requires averaged heterozygosity, the he (l ) estimator is averaged across loci obtaining 1 L 1 L n he he (l ) hii (l ) , (12) L l 1 nL l 1 i 1 32 where E he h and var he 33 34 35 36 number of observed heterozygotes averaged over the sample size n and number of loci L. Note that the new relatedness estimator is unbiased only if it is used together with a heterozygosity estimator that is at least equally unbiased. For example, the commonly used heterozygosity estimator (e.g. Nei 1978) 1 L2 L l 1 var he (l ) , i.e. the he estimate is equal to the 6 hNei 1 k 2n 1 qm2 , (13) 2n 1 m 1 2 where qm are the sample allele frequencies, is bias corrected for sample size but not for 3 the sample kin structure, i.e. E hNei 1 h , where (n 1)r /(2n 1) and where 4 r 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 2 n ( n 1) n n i 1 j i 1 ij r . Simulations Following Milligan (2002), three different population allele-frequency distributions are considered: equally-frequent – all alleles occur at equal frequency; one-common – a single allele occurs with a frequency of 0.6 and the remainder are equally frequent; Dirichlet – allele frequencies are independently drawn from the same Dirichlet distribution (p.209, Stuart & Ord 1987) with all parameters set to unity. The uniform Dirichlet distribution is equivalent to the random distribution used in Belkhir et al. (2002) and produces results very similar to those obtained with the triangular distribution (Belkhir et al. 2002; Wang 2002). Four degrees of pairwise relationships are considered: FS – full-sibs (r 12 ) , HS – half-sibs (r 14 ) , NR – non-related (r 0) and PO - parent-offspring (r 12 ) . In order to obtain the desired pairwise relationships two sample kin structures are used. The first structure is inspired by Dierkes et al. (2005) and denoted by G ( g , s) , where g is the number of parental pairs (groups), s is the number of full-sibs in each group. In order to examine half-sibling relationships g 1 unrelated genotypes U1 ,U 2 ,...,U g 1 are 23 generated and then the full-sibs of the i’th group are generated by the U i , U i 1 parental 24 25 pair. For example, the family structure G (2,9) encodes n g ( s 1) 1 21 genotypes U1 ,U 2 ,U 3 , F1 (1, 2),..., F9 (1, 2), F1 (2,3),..., F9 (2,3) , where Fj (i, i 1) denotes the j’th 26 offspring of the U i , U i 1 parents. Each sample generated according to the structure 27 consists of N 28 29 30 31 32 33 34 35 36 37 38 39 N HS ( g 1) s 2 81 , N PO 2sg 36 , and N NR N N FS N HS N PO 21 yielding 210 21 210 90% of the sample having non-zero pairwise relatedness. n ( n 1) 2 210 unique dyads including N FS 12 s ( s 1) g 72 , The second structure (denoted single-family) is inspired by Figure 4 of Wang (2002). A single-family sample consists of n individuals including s full-sibs together with the two founding parents (in order to study the PO dyads) and n 2 s pairwise unrelated genotypes. Therefore a single-family sample yields N FS 12 s ( s 1) , N HS 0 , N PO 2s , and N NR 12 n(n 1) 12 s ( s 1) 2s . Statistical behaviour of the new estimator in relation to the existing LR, QG and W estimators is assessed by comparing: sampling variance and bias (e.g. Wang 2002); the mean-squared error (MSE) via its square root (RMSE) (e.g. Milligan 2003); and P-values 7 1 2 3 4 5 (e.g. Belkhir et al. 2002). Following Milligan (2002) the RMSE (for a chosen relatedness estimator) is broken down into the relationship specific components. For example the RMSE for ’th kinship relationship (e.g. PO for parent-offspring) in a sample is defined by RMSE(r ) 1 N N i 1 (r (i ) r ) 2 , (14) 6 7 8 where: r (i ) is the i’th estimate of r (e.g. rPO 0.5 ); r is the actual value used to generate the corresponding dyad in the sample; N is the number of ’th relationships in the sample. The RMSE is calculated via MSE’s equivalent expression in terms of 9 10 11 12 sampling variance and bias, RMSE(r ) var(r ) (bias(r )) 2 . Therefore the RMSE represents an overall estimation error and shows if an estimator is optimized for bias at the expense of variance and vice-versa. The kinship specific sampling variance and bias N N are calculated via var(r ) ( N11) i 1 (r (i ) r ) 2 and bias(r ) N1 i 1 (r (i ) r ) , 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 respectively. The P-values are calculated based on the resampling procedure of Guo & Thompson (1992). For each simulated sample and at each locus, 17000 null-hypothesis dyads are generated by randomly selecting alleles without replacement from the pool of alleles available in the sample. A chosen relatedness estimator is then used to calculate: the null-distribution from the null dyads; the pairwise relatedness estimates from the sample; and the corresponding pairwise P-values. Once a relatedness estimator, population allele-frequency distribution, and sample kin structure are selected, a single simulation trial is performed as follows: (1) a sample is generated based on the selected population allele-frequency distribution and sample kin structure; (2) sample allele frequencies are calculated; (3) all pairwise relatedness estimates are calculated using the selected relatedness estimator and the sample (not population) allele frequencies; (4) sampling variance, bias and RMSE are calculated from the sample for each of the kinship relationships present in the sample; (5) nulldistribution is calculated and used to calculate P-values. When the Dirichlet distribution is used, in step 1 a freshly generated set of Dirichlet allele frequencies is used for each sample. In step 3 the HK estimator does not use the sample allele frequencies but rather the sample heterozygosity is estimated and used to calculate the HK estimates. Note that all sampling statistics are calculated on a per-sample basis (single simulation trial) and then averaged over 10,000 simulated samples. This is done to simulate as close as possible what actually happens in practice where, effectively, only a single sample is analysed for a given population at a given location. The bias, variance and RMSE (broken down by the types of kinship relationships) values could only be calculated in simulation studies. On the other hand, the pairwise P-values could be calculated for real data (where the sample kin structure may not be known a priori) separating unrelated and non-zero related dyads. The P-values for all considered estimators can be calculated via the KINGROUP v2 (Konovalov et al. 2004) program. 8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 Results and Discussion Figure 1 displays results for the single-family sample structure, where each simulated sample consists of 100 individuals and a varied number of full-sibs. Even though the equally-frequent allele frequency distribution is chosen in this case, other considered allele frequency distributions (and sample sizes) produce qualitatively similar results and hence not shown. The same is applicable to the rest of the Figures where no special optimisation for the sample structure or allele frequency distribution is performed. A variety of samples and frequencies are presented to verify that the statistical properties of the new estimator are consistent across the sample structures and allele frequency distributions (as expected from the derivation of the estimator). The W estimates are calculated based on the variant of the estimator which is bias-corrected for a sample size as per Wang (2002). First, Figure 1 shows that the LR, QG and W estimators become severely biased (second column in Figure 1) with the sample allele frequencies if a large part of the sample consists of related individuals. The reason for the negativity of the bias is discussed extensively in Queller & Goodnight (1989). In summary, the relatedness estimators use allele frequencies which are overestimated from the sample due to the presence of one (as in Figure 1) or more groups of highly related individuals. Therefore the relatedness values are calculated to be smaller (hence the negative bias) than those calculated with the correct (but unknown in practice) population allele frequencies. Second, the bias is entirely due to the error in the estimation of allelic frequencies, which can be seen by comparing the dashed lines (the QG estimates with the sample allele frequencies) with the dotted lines (the QG estimates with the correct population allele frequencies). The classic allele frequency bias-correction (the outer-pairs allele frequencies) yields little improvement in the QG estimates (compare the dashed and solid lines in Figure 1). Moreover the outer-pairs frequencies can not be applied to the LR estimator without further assumptions since the occurrence of rare alleles yield infinite LR estimates (Wang 2002). The new HK estimator is significantly less biased (some small residual bias may always be present due to the statistical nature of the simulations) than the other estimators throughout. The only possible critique is that the new estimator has slightly larger variance for the full-sibling dyads (FS in Figure 1) but the absolute magnitude of the increase is biologically irrelevant (e.g. var(rFS(HK) ) var(rFS(W) ) 0.00001 for 60 full-sibs), 38 while the improvement in the bias is manyfold better (e.g. bias(rFS(HK) ) bias(rFS(W) ) 0.1 39 40 41 42 for 60 full-sibs). The overall estimation error is measured by the RMSE (third column in Figure 1) and is dominated by the bias while the contribution from the variance is hardly (if at all) noticeable. Note that Figure 1 reproduces the previously reported bias bias(rPO(W) ) 0.04 in the parent-offspring (PO) dyads with 40 full-sibs (Wang 2002). 43 44 45 The forth column in Figure 1 shows the P-values. In the case of the FS and PO dyads the new estimator rejects the null hypothesis at the lower levels of significance 9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 (smaller P-values). The P-values for the unrelated dyads remains around 0.5 which could not lead to the unrelated dyads being misclassified as related. Moreover the average Pvalue of p 0.5 is the by-definition correct value for an unbiased estimator applied to the unrelated dyads which are equivalent to the null-hypothesis dyads. This could be seen in Figure 2 where all considered estimators become unbiased as the sample size increases while the number of related individuals in the sample remains constant. 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 derivation of the new estimator. A similar residual RMSE is present in all other estimators. However, such low quality loci are most likely to be rejected by not passing the Hardy Weinberg test, e.g. using GENEPOP (Raymond & Rousset 1995). Figure 2 reveals that the new estimator becomes asymptotically equivalent to the other estimators when applied to the large samples with small average pairwise relatedness. As in Figure 1, the new estimator achieves the lowest RMSE. Interestingly, for the one-common allele frequency distribution (h 0.6) , the LR estimator achieves the lowest P-values for the PO and FS dyads starting from sample sizes of about 70. On the other hand the LR estimates are the most biased (largest RMSE) for the same dyads (second column in Figure 2). Figure 2 also reveals that the P-values may be hard to improve upon (see FS and PO in the last column of Figure 2) by just increasing the sample size. This is due to the null-distribution being mainly controlled by the number of available loci and their heterozygosities. Figure 2 reminds that one needs to be cautious with loci of low heterozygosity (e.g. around 0.5 or less). In the case of the HK estimator the residual RMSE is mainly due to the omission of cov dij2 (l ), h(l ) / h 2 (l ) in the Figure 2 confirms that the HK estimator is comparable to the other estimators on large samples (with low average pairwise relatedness). However the main advantage of the new estimator is revealed for the samples which are small and/or highly related internally. In particular, Figure 3 shows that the new estimator is the only estimator that can resolve the half-sibling (HS) dyads (HS P-values in Figure 3) in the G(3,9) sample structure. The main reason why the other estimators fail is because the estimators remain severely biased regardless of the number of loci. The new estimator remains (see also Figure 1) to be more powerful in resolving the PO and FS dyads (the lowest P-values). The problem of rare alleles has been already effectively solved by the W (Wang 2002) estimator. The HK estimator is also unaffected by the presence or absence of rare alleles, e.g. in Figure 3 in nearly every sample there are one or more rare alleles (alleles present only in a single group of related individuals). Another advantage of the HK estimator is that the P-values calculated using the estimator are independent of the estimated heterozygosity, i.e. the estimated heterozygosity is only required as a scaling coefficient to reduce the bias of the HK estimates. Finally, on the conceptual level, the main reason why the new estimator outperforms the other considered estimators could be explained within the framework of the Statistical Learning Theory of Vapnik (1995): given a limited amount of information “when solving a given problem, try to avoid solving a more general problem as an 10 1 2 3 4 5 6 7 8 intermediate step” (Vapnik 1995). The new estimator does just that by not estimating the population allele frequencies which is by far a much more difficult task when working with the small sample sizes. To some extent the W estimator followed a similar path by not using frequencies of individual alleles. Nevertheless, the W estimator requires the k following three observed sums of powers of allele frequencies: a m 1 qm for {2,3, 4} . 11 1 References 2 Belkhir K, Castric V, Bonhomme F (2002) IDENTIX, a software to test for relatedness in 3 4 5 6 a population using permutation methods. Molecular Ecology Notes, 2, 611-614. Broman KW (2001) Estimation of allele frequencies with data on sibships. Genetic Epidemiology, 20, 307-315. Dierkes P, Heg D, Taborsky M, Skubic E, Achmann R (2005) Genetic relatedness in 7 groups is sex-specific and declines with age of helpers in a cooperatively breeding 8 cichlid. Ecology Letters, 8, 968-975. 9 10 11 12 13 Goodnight KF, Queller DC (1999) Computer software for performing likelihood tests of pedigree relationship using genetic markers. Molecular Ecology, 8, 1231-1234. Guo SW, Thompson EA (1992) Performing the Exact Test of Hardy-Weinberg Proportion for Multiple Alleles. Biometrics, 48, 361-372. Konovalov DA, Manning C, Henshaw MT (2004) KINGROUP: a program for pedigree 14 relationship reconstruction and kin group assignments using genetic markers. 15 Molecular Ecology Notes, 4, 779-782. 16 Liebert AE, Starks PT (2006) Taming of the skew: transactional models fail to predict 17 reproductive partitioning in the paper wasp Polistes dominulus. Animal 18 Behaviour, 71, 913-923. 19 20 21 22 Lynch M, Ritland K (1999) Estimation of Pairwise Relatedness With Molecular Markers. Genetics, 152, 1753-1766. McPeek MS, Wu XD, Ober C (2004) Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics, 60, 359-367. 12 1 2 3 4 5 6 7 Milligan BG (2003) Maximum-likelihood estimation of relatedness. Genetics, 163, 11531167. Nei M (1978) Estimation of Average Heterozygosity and Genetic Distance from a Small Number of Individuals. Genetics, 89, 583-590. Queller DC, Goodnight KF (1989) Estimating relatedness using genetic markers. Evolution, 43, 258-275. Raymond M, Rousset F (1995) Genepop (Version-1.2) - Population-Genetics Software 8 for Exact Tests and Ecumenicism. Journal of Heredity, 86, 248-249. 9 Romo M, Suzuki S, Nakajima M, Taniguchi N (2006) Genetic evaluation of 10 interindividual relatedness for broodstock management of the rare species barfin 11 flounder Verasper moseri using microsatellite DNA markers. Fisheries Science, 12 72, 33-39. 13 14 15 16 Rousset F, Raymond M (1995) Testing Heterozygote Excess and Deficiency. Genetics, 140, 1413-1419. Stuart A, Ord JK (1987) Kendall's Advanced Theory of Statistics, Vol. 1: Distribution Theory, 5 edn. Oxford University Press, New York. 17 Vapnik V (1995) The Nature of Statistical Learning Theory Springer, New York. 18 Wang J (2002) An estimator for pairwise relatedness using molecular markers. Genetics, 19 20 21 160, 1203-1215. Wang J (2004) Sibship reconstruction from genetic data with typing errors. Genetics, 166, 1963-1979. 22 13 1 2 3 4 5 6 7 8 9 10 Acknowledgements This paper was partly written when D.K. was on sabbatical leave at the University of Bern. We would like to thank University of Bern and James Cook University and, in particular, Michael Taborsky and Bruce Litow for supporting this collaborative project, and Peter Stettler for his hospitality. We thank Francois Rousset, Ross Crozier, Laurent Excoffier, Nigel Bajema and three anonymous referees for helpful comments on earlier versions of this manuscript. D.H. is supported by SNF Grant 3100A0-108473. 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Figure Legends Fig. 1 Sampling variance, bias, root mean-square error (RMSE) and P-values of relatedness estimates for kin relationships as shown in the row titles. The curves are based on 10,000 samples, where each sample consists of 100 individuals in the singlefamily kin structure. The number of full-sibs in each sample is varied (x-axis). The dashed and dotted lines denote the QG estimates calculated with the sample and population allelic frequencies, respectively. The bold dots, crosses and circles denote the new HK, LR (Lynch & Ritland 1999) and W (Wang 2002) estimates, respectively. The solid line denotes the QG estimates with the outer-pairs allele-frequency correction. Ten loci are used, each having ten alleles with their population frequencies in the equally-frequent distribution. Fig. 2 The same as in Figure 1 but for varied sample sizes and 10 full-sibs (including their two parents) in each sample. Ten loci are used, each having five alleles with their population frequencies in the one-common allele frequency distribution. Fig. 3 The same as in Figure 1 but for samples with G (3,9) kin structure. The number of loci is varied (x-axis), each locus having 20 alleles with their population frequencies in the Dirichlet distribution. 15 1 2 3 Figure 1. 16 1 2 3 Figure 2 17 1 2 3 4 Figure 3 18 1 2 3 4 5 6 7 8 9 Appendix Expected value of the QG estimator Having expressed the QG estimator in vector notation, its expected value could be obtained as follows. Let an outbred population be in Hardy Weinberg Equilibrium (HWE) and described by the population allele frequencies p p1 , p2 ,..., pk which are assumed to be known exactly. Then each observed (sample) genotype xi ( x1,i , x2,i ,..., xk ,i ) could be represented as a sum of two statistically independent 10 , obtaining E mi2 E mi pm , gamete vectors xi ε i εi , i.e. xmi mi mi 11 2 var mi pm 1 pm , E xmi 2 pm , E xmi 2 pm 1 pm , 12 var xmi 2 pm 1 pm (e.g. Broman 2001), where p p m 1 pm2 and h 1 13 14 15 16 17 1 2 E xi xi 1 and k are homozygosity and heterozygosity at the given locus, respectively. A new form of pairwise relatedness matrix could be defined in the vector notation by x j rij xi (1 rij )z ij , where rii 1 , and where z ij is statistically independent of xi and x j , obtaining cov xmi , xmj 2rij pm 1 pm , E xmi xmj 2 rij pm (1 pm ) 2 pm2 and 18 E xi x j 2 rij h 2 . Then the expected values of the nominator and denominator in 19 Equation 5 for individuals xi and x j (from an outbred population sample) become 20 T E (T ) 2rij h and B E ( B ) 2h , respectively. In the case of multiple loci, B 2 Lh 21 and T 2rij Lh , where h 22 23 the l’th locus. The expected value of the QG estimator is approximately calculated via a Taylor series by omitting higher orders E T / B T / B obtaining 24 25 26 27 28 29 30 31 32 33 34 35 E rQG ( X i , X j ) 1 L L l 1 h(l ) , and where h(l ) is population heterozygosity of 2rij LH rij , (15) 2 LH where the expected value of the estimator is independent of population heterozygosity (and the population allele frequencies, for that matter). Bias in the QG estimator due to estimated allele frequencies In practice population allele frequencies are rarely known and the frequencies must be estimated from a sample. Two such frequency estimators are considered to reveal the effect of frequency estimation error on the QG relatedness estimator. The first frequency estimator is the sample frequency estimator (Equation 1) which yields the following expected value of the QG estimator L E xi x j q xi q x j rij Cij l 1 eij E rQG X i , X j , q , (16) C 1 L 1 ij E 2 xi xi 12 x j x j q xi q x j l 1 19 n 1 where the sample allele frequencies qm 2 allele frequencies pm p1 , p2 ,..., pk , and where Cij 3 4 An unbiased relatedness estimates could then be constructed from the QG estimates via rij eij Cij 1 eij . (17) 5 6 It is interesting to examine some examples. The first example is where all genotypes are in fact unrelated rij ij . Then for the QG estimator to become unbiased it must be 7 8 calculated via rij (NR) eij 21n 1 eij . (18) 9 10 11 12 13 14 15 1 2n i 1 xmi are used instead of the population 1 2 R R and i j Ri 1 n n r . j 1 ij The second example is where all sample genotypes are full sibs rij 12 (1 ij ) and the QG estimator must be bias-corrected via rij (FS) eij 12 21n 1 eij . (19) The second frequency estimator is the outer-pairs allele frequency estimator (Equation 2) which yields r Cij , (20) eij E rQG X i , X j , q(i, j ) ij 1 C ij n 16 where the outer-pairs allele frequencies qm (i, j ) 17 the population allele frequencies pm , and where Cij 18 Ri 19 and rij (FS) eij 12 1 eij revealing the fact that the outer-pairs frequencies yield 20 21 22 23 24 25 26 1 n2 n 1 2( n 2) 1 2 x i i j 1 m ,i are used instead of R R and i j r . The two examples of all unrelated and all full-sibs yield rij (NR) eij i i j 1 ii unbiased estimates only if the outer (in relation to the focal pair) individuals are unrelated to either of the paired genotypes. In summary, the QG pairwise relatedness estimator becomes biased if either one or both of the two focal individuals are related to one or more individuals in the rest of the sample. The outer-pairs frequency correction does not solve the problem of simultaneous estimation of pairwise relatedness and allele frequencies. 20
© Copyright 2024