Large scale kinship: Familial Searching and DVI
Seoul, ISFG workshop, 29 August 2017
Large scale kinship
- Familial searching: search a DNA database for a relative of an unidentified offender, with the intention of identifying the offender indirectly
- DVI (disaster victim identification): search a list of unidentified persons against each other and against a list of missing persons, with the intention of making identifications
- Familial searching is used to generate investigative leads; it is generally impossible to achieve very high power (the probability of finding relatives if present) without also getting too many false positives
- In DVI it is important not to overlook any identifications
- Both are large scale applications; their statistical properties need to be understood for optimal application and for understanding the results
Familial Searching
- Compute likelihood ratios for paternity (PI) or for being sibs (SI)
- The most efficient strategy (in terms of the number of false positives per true positive) is to extract everyone whose LR exceeds a prespecified threshold
  - Large threshold: fewer false positives, but also a lower probability of finding a relative
  - Small threshold: a higher probability of finding a relative, but also more false positives
- The theory of Block 2 can be used to make ROC curves, in which we plot $(\log_{10} P(LR > 10^t \mid H_d),\ P(LR > 10^t \mid H_p)) = (\log_{10} \mathrm{FPR}(10^t),\ \mathrm{TPR}(10^t))$ as a function of t, where the LR threshold is written as $10^t$
- This can be done for a specific profile, or averaged over profiles
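The ROC construction can be sketched numerically as follows. This is a minimal illustration, not the method of Block 2: it assumes that samples of LRs under $H_p$ (true relatives) and under $H_d$ (unrelated persons) are already available, for example from simulation. The function name `roc_points` and the toy LR values are hypothetical.

```python
def roc_points(lr_related, lr_unrelated, log10_thresholds):
    """Return (t, FPR, TPR) for each log10 threshold t.

    lr_related   : LRs obtained for true relatives (samples under H_p)
    lr_unrelated : LRs obtained for unrelated persons (samples under H_d)
    The LR threshold is 10**t, as on the slide.
    """
    points = []
    for t in log10_thresholds:
        threshold = 10.0 ** t
        # TPR: fraction of true relatives whose LR exceeds the threshold
        tpr = sum(lr > threshold for lr in lr_related) / len(lr_related)
        # FPR: fraction of unrelated persons whose LR exceeds the threshold
        fpr = sum(lr > threshold for lr in lr_unrelated) / len(lr_unrelated)
        points.append((t, fpr, tpr))
    return points


# Toy LR values, invented purely for illustration (not real casework data).
related = [5e5, 2e3, 40.0, 1e4]
unrelated = [0.01, 0.5, 20.0, 1500.0]
points = roc_points(related, unrelated, [1, 2, 3])
```

The function returns the FPR itself rather than its logarithm, because an empirical FPR can be exactly zero; the $\log_{10}$ can be taken when plotting.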
ROC curve, Sibling Index (averaged over profiles)
[Figure: ROC curves. Black: NGM. Dotted: SGMPlus. TPR: True Positive Rate. FPR: False Positive Rate.]
E.g., on NGM profiles, an LR threshold of 10,000 corresponds to an FPR of about $10^{-5}$ and a TPR of about 0.6.
ROC curve, Paternity Index (averaged over profiles)
Different case, different challenge
- If someone has very common alleles, then he'll share common alleles with his relatives.
  - These give rise to a low LR, since it's fairly easy to obtain them by chance
  - Since it's easy to obtain them by chance, relatively many unrelated people will genetically look like they could be related
  - Hence, relatives are hard to find
- If someone has very rare alleles, then he's likely to share rare alleles with his relatives.
  - These give rise to a high LR, since it's fairly hard to obtain them by chance
  - Since it's hard to obtain them by chance, relatively few unrelated people will genetically look like they could be related
  - Hence, relatives are easier to find
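The effect of allele frequency on the LR can be made concrete with a single-locus example. The sketch below uses the textbook parent-child LR for two heterozygous persons who share exactly one allele with population frequency p, which is 1/(4p). It assumes Hardy-Weinberg proportions, no mutation, no subpopulation correction and no genotype for the other parent; the function name and the two frequencies are chosen only for illustration.

```python
def pi_one_shared_allele(p_shared):
    """Single-locus paternity index for two heterozygotes sharing one allele.

    P(child = AC | parent = AB) = 0.5 * p_C   (parent must transmit A)
    P(child = AC | unrelated)   = 2 * p_A * p_C
    so the ratio is 1 / (4 * p_A), independent of p_C.
    """
    if not 0.0 < p_shared <= 1.0:
        raise ValueError("allele frequency must be in (0, 1]")
    return 1.0 / (4.0 * p_shared)


# Illustrative frequencies only: a common allele and a rare allele.
pi_common = pi_one_shared_allele(0.30)  # below 1: sharing is weak evidence against
pi_rare = pi_one_shared_allele(0.01)    # well above 1: sharing is informative
```

With a shared allele of frequency 0.30 the locus contributes an LR below 1, whereas a shared allele of frequency 0.01 contributes an LR of about 25, which is why profiles with rare alleles make relatives easier to find.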
SI, SGMPlus loci, various profiles
Siblings: LR ranking in 100,000 SGM+ profiles
Fully Bayesian interpretation
- Assume that, in a database with N individuals:
  - there is at most one relative of a given kind;
  - the probability of this relative being person i is $\pi_i$, and the probability of there not being any relative is $\pi_0 = 1 - \sum_{i=1}^{N} \pi_i$.
- Suppose the obtained likelihood ratios are $r_1, \dots, r_N$. Set $r_0 = 1$.
- Then the probability that person i is the relative we look for is $\frac{\pi_i r_i}{\sum_{j=0}^{N} \pi_j r_j}$.
- Interpretation: the posterior probability is proportional to the prior probability and to the LR; for people outside the database no information is obtained.
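The posterior formula can be evaluated directly. The following is a minimal sketch under the slide's assumption of at most one relative in the database; the function name and the example priors and LRs are hypothetical.

```python
def posterior_probabilities(priors, lrs):
    """Posterior probabilities [post_0, post_1, ..., post_N].

    priors : [pi_1, ..., pi_N], prior probability that person i is the relative
    lrs    : [r_1, ..., r_N], likelihood ratio obtained for person i
    Index 0 of the result is "no relative in the database" (pi_0, with r_0 = 1).
    """
    if len(priors) != len(lrs):
        raise ValueError("priors and lrs must have the same length")
    pi_0 = 1.0 - sum(priors)
    if pi_0 < -1e-12:
        raise ValueError("priors sum to more than 1")
    # Unnormalised weights pi_j * r_j for j = 0..N, with r_0 = 1.
    weights = [pi_0 * 1.0] + [p * r for p, r in zip(priors, lrs)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: N = 4, each person has prior 0.1, so pi_0 = 0.6.
post = posterior_probabilities([0.1, 0.1, 0.1, 0.1], [1000.0, 2.0, 0.5, 0.0])
```

A person with LR 0 (an exclusion) gets posterior probability 0, and the remaining probability mass is shared out in proportion to prior times LR.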
Case of equal priors
- If all $\pi_i$ (i = 1, ..., N) are the same, then the LR for the hypotheses
  1. the database contains a relative of the specified type (H1),
  2. the database does not contain a relative of the specified type (H2),
  is equal to $\frac{1}{N} \sum_{i=1}^{N} r_i$, i.e., the average of the LRs obtained for all individuals.
- Note that, if there is no relative in the whole database, each $r_i$ has expectation 1, and so the LR for the two hypotheses above is also 1 in expectation, as general theory requires.
- If there is no relative in the database, we expect the sum of all LRs that we obtain to be equal to the number of people searched against.
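The average-LR result follows from the posterior formula of the fully Bayesian interpretation in two steps. The derivation below is a sketch that writes $\pi$ for the common value of $\pi_1, \dots, \pi_N$, so that $\pi_0 = 1 - N\pi$, and uses $r_0 = 1$.

```latex
\begin{align*}
\text{prior odds}(H_1 : H_2)
  &= \frac{\sum_{i=1}^{N} \pi_i}{\pi_0} = \frac{N\pi}{\pi_0}, \\
\text{posterior odds}(H_1 : H_2)
  &= \frac{\sum_{i=1}^{N} \pi_i r_i}{\pi_0 r_0} = \frac{\pi \sum_{i=1}^{N} r_i}{\pi_0}, \\
LR(H_1 : H_2)
  &= \frac{\text{posterior odds}}{\text{prior odds}} = \frac{1}{N} \sum_{i=1}^{N} r_i .
\end{align*}
```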
Analogy with possibly tricked deck of cards
Familial search:
- Database size N
- Person 1 has PI = N and all others have PI = 0
- Person 1 is parent/child of the unknown offender
- $\sum_{i=1}^{N} PI_i = N$
- No evidence that the database contains a relative
- If it does contain a relative, it has to be person 1
Deck of cards:
- Deck of N = 52 cards
- First card drawn is the ace of spades
- Deck is tricked as aces of spades only
- LR in favour of "tricked as aces of spades" is 52; LR in favour of other tricks is 0; the sum of the 52 LRs is 52
- No evidence that the deck is tricked
- If it is tricked, it has to be tricked as aces of spades
Example
Strategy at the NFI
1. Autosomal search by PI and SI
   - LR threshold for further investigation equal to 1000, irrespective of the number of loci compared
   - This is enough to warrant interest but not nearly enough for identification
   - Carry out additional DNA testing until either LR < 1000 (stop) or LR > 1,000,000,000
   - When no more additional testing can or needs to be done, also compute the half-sib index (HSI) for other types of relatedness
   - Report any PI/SI/HSI equal to 100,000 or more
2. Y-chromosomal search: further type all profiles with at most one difference
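The autosomal part of this strategy can be read as a simple decision rule. The sketch below is one possible reading of the slide, not the NFI's actual procedure: how the three thresholds combine, the function name `next_step` and the returned labels are assumptions made for illustration.

```python
SEARCH_THRESHOLD = 1e3   # below this: stop investigating the candidate
UPPER_THRESHOLD = 1e9    # above this: no further testing needed
REPORT_THRESHOLD = 1e5   # report when the final LR is at least this

def next_step(lr, more_testing_possible):
    """Suggest the next action for a candidate with current LR `lr` (PI, SI or HSI)."""
    if lr < SEARCH_THRESHOLD:
        return "stop"
    if lr <= UPPER_THRESHOLD and more_testing_possible:
        # Between the two thresholds: keep adding loci while that is possible.
        return "test further"
    # Testing is finished, either because the LR exceeds the upper threshold
    # or because no additional testing can be done.
    return "report" if lr >= REPORT_THRESHOLD else "do not report"
```

For example, `next_step(5e6, True)` asks for further testing, whereas the same LR with no further testing available would be reported because it exceeds 100,000.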
By the book
- Utrecht: serial sex offender, unknown
- FS: top-ranked SI equals 5 million
- Additional profiling: SI > $10^9$, Y-STR 22/23 match, mtDNA 1 mismatch
- A brother of the database person was arrested and turned out to yield a direct DNA match. Convicted.
Not by the book
- Familial search yields a woman with PI = 39,000
- Further testing: parent-child relation excluded
- Further testing: mitochondrial profiles match
- Age of woman + moment of crime: full or maternal half siblings unlikely; paternal half-sibs?
- Mito match best explained by a maternal relationship
- Hypothesis best supported by the autosomal and mitochondrial profiles: the woman is a sister of the offender's mother. This was indeed verified with a full match.
Familial Searching illustrates
- The utility of LR distributions for case pre-assessment: ROC curves for the specific profile at hand, or averaged over profiles, to judge the applicability of the method
- The irrelevance of LR distributions once the LRs have been calculated: posterior probabilities depend only on the evidence we have, not on evidence we could have had
- That evidence should not be interpreted in terms of error rates: even if the false positive rate is very small, it may be that most of the positives are false positives
- That one should be careful with an equal prior for paternity testing: doing this with familial searching could lead to a probability of paternity > 99.99% for several candidate fathers, which is clearly absurd. An LR alone cannot be used to draw conclusions about the true relationship.
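The last two points can be checked with a few lines of arithmetic. The sketch below uses the FPR of about $10^{-5}$ and TPR of about 0.6 quoted earlier for an LR threshold of 10,000 on NGM profiles; the database size of 1,000,000 and the assumption of exactly one true sibling in the database are hypothetical, chosen only for illustration.

```python
def expected_positives(n_database, fpr, tpr, n_relatives=1):
    """Expected numbers of false and true positives above the LR threshold."""
    false_pos = (n_database - n_relatives) * fpr
    true_pos = n_relatives * tpr
    return false_pos, true_pos

def equal_prior_probability(lr):
    """Posterior probability of the relationship under 50/50 prior odds."""
    return lr / (lr + 1.0)


# Hypothetical database of one million profiles containing one true sibling.
false_pos, true_pos = expected_positives(1_000_000, 1e-5, 0.6)
share_false = false_pos / (false_pos + true_pos)

# With a 50/50 prior, every candidate with LR = 10,000 gets the same high probability.
prob_per_candidate = equal_prior_probability(1e4)
```

Under these assumptions about ten false positives are expected against at most 0.6 true positives, so most hits are false even though the false positive rate is tiny. Applying a 50/50 prior to each hit separately would give every one of them a probability above 99.99%, which cannot all be right when at most one of them is the relative.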
Disaster Victim Identification
- Large lists of missing persons $MP_i$ and of unidentified individuals $UI_j$
- Usually software would compute $LR_{i,j}$, the LR for "$MP_i = UI_j$" versus "$MP_i$ is not related to $UI_j$"
- If missing persons are related to each other, then neither hypothesis may be true
- In order not to overlook a possible identification, NFI uses a uniform mutation model
- Real mutations are often single step, but a uniform model can be helpful when inconsistencies between genotypes are due to silent alleles or clerical errors
Pedigree with several MPs
[Figure: pedigree with members MN, MG, FN, FG, FGS, MS1, MS2, MS3, M, F, FS1, FS1P, FS2, FS2P, V, X, Y, Z, J, K, L. Legend: squares: men; circles: women; markers for Reported Killed, Not Available and Reference Sample. V1, V2, V3, V4, V5: 5 victims.]
Hypothesis choosing
- With our 5 victims and 12 missing persons, we can construct many hypotheses.
- $H_1$ specifies that some victims are some MPs. The number of possible such propositions is:
  - 1 victim: 60 possibilities
  - 2 victims: 1320 possibilities
  - 3 victims: 13200 possibilities
  - 4 victims: 59400 possibilities
  - 5 victims: 95040 possibilities
- $H_2$ can specify that all victims are unrelated to the MPs, but relations between the victims are also possible.
- Huge number of combinations!
- For conceptual and computational reasons: choice to start with 1 victim versus 1 MP, equality vs. unrelatedness. Result: 60 likelihood ratios.
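The counts on this slide can be reproduced by choosing which k victims are identified and then assigning them to k distinct missing persons in order. The short sketch below does this with the standard library; the function name is hypothetical, and `math.perm` requires Python 3.8 or later.

```python
from math import comb, perm

def count_identification_hypotheses(n_victims, n_missing, k):
    """Number of ways to identify exactly k victims as k distinct missing persons."""
    # comb: which k victims are identified; perm: ordered assignment to k MPs.
    return comb(n_victims, k) * perm(n_missing, k)


counts = {k: count_identification_hypotheses(5, 12, k) for k in range(1, 6)}
total = sum(counts.values())
```

For 5 victims and 12 missing persons this gives 60, 1320, 13200, 59400 and 95040 for k = 1 to 5, matching the slide.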
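Resulting LR > 100
(m: male, f: female; candidates and LRs are listed in corresponding order)
- victim 1-m: V-f; LR = $6 \times 10^4$
- victim 2-m: FS2-m / F-m, FN-f, J-m / K-m / L-f; LR = $2 \times 10^4$, $8 \times 10^3$, $4 \times 10^2$
- victim 3-f: M-f; LR = $1 \times 10^9$
- victim 4-m: Y-m / Z-f; LR = $2 \times 10^5$
- victim 5-f: FS2-m / F-m, J-m / K-m / L-f, FN-f; LR = $4 \times 10^4$, $5 \times 10^2$, $1 \times 10^2$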
Identification
Since
$$\frac{P(\text{DNA} \mid \text{Victim 3 is M})}{P(\text{DNA} \mid \text{Victim 3 is unrelated to M})} = 10^9$$
and no other LRs relating Victim 3 to the pedigree are large, we decide that Victim 3 is M. Now we can continue in the same way as before, calculating 4 × 11 = 44 LRs:
Calculation 2: 4 victims with pedigree
(victim 3 = M; candidates and LRs are listed in corresponding order)
- victim 1-m: V-f (gender mismatch!); LR = $2 \times 10^6$
- victim 2-m: FS2-m / F-m, FN-f, J-m / K-m / L-f; LR = $2 \times 10^4$, $8 \times 10^3$, $4 \times 10^2$
- victim 4-m: Y-m / Z-f; LR = $2 \times 10^5$
- victim 5-f: FS2-m / F-m, J-m / K-m / L-f, FN-f; LR = $4 \times 10^4$, $5 \times 10^2$, $1 \times 10^2$
Combined Likelihood Ratio
For analogous reasons, decide that Victim 1 is V. The joint LR for (Victim 3 = M, Victim 1 = V) vs. unrelated is the product of the two LRs that we've computed. Let G(V3) be the DNA profile of victim 3, and analogously G(V1) for victim 1. Then
$$\frac{P(G(V3), G(V1) \mid V3{=}M, V1{=}V)}{P(G(V3), G(V1) \mid \text{unrelated})} = \frac{P(G(V3) \mid V3{=}M, V1{=}V)}{P(G(V3) \mid \text{unrelated})} \cdot \frac{P(G(V1) \mid G(V3), V3{=}M, V1{=}V)}{P(G(V1) \mid G(V3), \text{unrelated})} = 10^9 \cdot 2 \cdot 10^6 = 2 \cdot 10^{15}.$$
However: the unrelatedness assumption is questionable, since V3 and V1 have a large LR for being parent-child! If we choose as alternative hypothesis that they are parent-child but unrelated to the pedigree, then the combined LR is again one billion.
Pedigree so far
[Figure: pedigree with members MN, MG, FN, FG, FGS, MS1, MS2, MS3, V3, F, FS1, FS1P, FS2, FS2P, V1, X, Y, Z, J, K, L. Legend: Missing persons, Not Available, Reference Sample, Victim added. V2, V4, V5: 3 victims left.]
Calculation 3: 3 victims with pedigree
(victim 1 = V, victim 3 = M; candidates and LRs are listed in corresponding order)
- victim 2-m: F-m, LR = $1 \times 10^9$; FS2-m, LR = $5 \times 10^6$; FN-f, LR = $4 \times 10^6$ (gender mismatch!)
- victim 4-m: Y-m / Z-f; LR = $2 \times 10^5$
- victim 5-f: FS2-m / F-m, J-m / K-m / L-f, FN-f; LR = $2 \times 10^5$, $1 \times 10^3$, $6 \times 10^2$
Branching
- LR for (V2 = F) versus (V2 unrelated to pedigree) is $10^9$
- LR for (V2 = FS2) versus (V2 unrelated to pedigree) is $5 \cdot 10^6$
- This is not a contradiction: it just shows that the alternative hypothesis (V2 unrelated to pedigree) is far less likely than either of these two propositions.
- Conclusion: without other information (i.e., equal prior odds) the probability that V2 = F is (almost) 99.5%; the probability that V2 = FS2 is (almost) 0.5%.
- Continuing with both pedigrees does not alter these probabilities in this case: we get high LRs for Victim 4 = Y and Victim 5 = FN. These positions in the pedigree cannot distinguish between F and FS2 either.
- So: high LRs alone are not sufficient to draw hard conclusions.
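The 99.5% and 0.5% figures follow from normalising the two LRs, since both are computed against the same alternative. A minimal sketch, assuming equal priors and that only the two competing positions are considered; the function name and hypothesis labels are hypothetical.

```python
def posteriors_from_lrs(lrs, priors=None):
    """Posterior probabilities for competing hypotheses.

    lrs    : dict mapping hypothesis name to its LR against one common alternative
    priors : optional dict of prior weights; equal priors are used if omitted
    """
    if priors is None:
        priors = {name: 1.0 for name in lrs}
    # Posterior is proportional to prior times LR; normalise over the hypotheses given.
    weights = {name: priors[name] * lr for name, lr in lrs.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


post = posteriors_from_lrs({"V2 = F": 1e9, "V2 = FS2": 5e6})
```

Adding the alternative "V2 unrelated to pedigree" to the dictionary with LR 1 changes these values only negligibly, which is why the slide qualifies the percentages with "almost".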