University of Washington, TOPMed DCC July 2018

Module 12: Computational Pipeline for WGS
Relatedness Inference from Genetic Data
Timothy Thornton (tathornt@uw.edu) & Stephanie Gogarten (sdmorris@uw.edu)
University of Washington, TOPMed DCC, July 2018

Incomplete Genealogy
Widely used statistical methods for the analysis of large-scale genetic data, such as standard linkage analysis and association analysis methods, often assume independent samples or samples with known pedigree relationships.

Misspecified and cryptic relationships can invalidate many of these methods if correlated genotypes among relatives are not properly accounted for in the analysis.

Identifying Relative Pairs
In principle, we could determine the relationship between two individuals by simply looking at the percentage of the genome that is identical by descent (IBD) for the pair:
  parent-offspring: 50% of the genome IBD
  siblings: 50% of the genome IBD (on average)
  avuncular: 25% of the genome IBD (on average)
However, we do not directly observe IBD sharing. With SNP genotyping data or DNA sequencing data, we can estimate IBD sharing.

IBD Sharing Probabilities and Kinship Coefficients
IBD sharing probabilities and kinship coefficients are commonly used measures of relatedness for pairs of individuals.
For any pair of outbred individuals $i$ and $j$, let $\delta^{ij}_k$ be the probability that $i$ and $j$ share $k$ alleles IBD at a locus, where $k$ is 0, 1, or 2.
Let $\phi_{ij}$ be the kinship coefficient for $i$ and $j$: the probability that a random allele selected from individual $i$ and a random allele selected from individual $j$ are IBD.
Note that in outbred populations,
$$\phi_{ij} = \frac{1}{2}\,\delta^{ij}_2 + \frac{1}{4}\,\delta^{ij}_1$$
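
The relation between the kinship coefficient and the IBD sharing probabilities can be checked numerically. The weights come from drawing one allele at random from each individual: if the pair shares 2 alleles IBD, the two drawn alleles are IBD with probability 1/2, and if the pair shares 1 allele IBD, with probability 1/4. The following is a minimal sketch; the function name `kinship_from_ibd` is chosen here for illustration only.

```python
from fractions import Fraction


def kinship_from_ibd(delta2, delta1):
    """Kinship coefficient of an outbred pair from its IBD sharing probabilities.

    delta2, delta1: probabilities of sharing 2 and 1 alleles IBD at a locus.
    """
    return Fraction(1, 2) * delta2 + Fraction(1, 4) * delta1


# Full siblings: delta2 = 1/4, delta1 = 1/2
full_sibs = kinship_from_ibd(Fraction(1, 4), Fraction(1, 2))
# Parent-offspring: delta2 = 0, delta1 = 1
parent_offspring = kinship_from_ibd(Fraction(0), Fraction(1))
print(full_sibs, parent_offspring)  # 1/4 1/4
```

Exact fractions are used so that the results can be compared directly with tabulated values such as 1/4 or 1/16.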

Relatedness Measures for a Few Relationships
Relationship                 $\phi_{ij}$   $\delta^{ij}_2$   $\delta^{ij}_1$   $\delta^{ij}_0$
Parent-Offspring             1/4           0                 1                 0
Full Siblings                1/4           1/4               1/2               1/4
Half Siblings                1/8           0                 1/2               1/2
Uncle-Nephew                 1/8           0                 1/2               1/2
First Cousins                1/16          0                 1/4               3/4
First Cousins Once Removed   1/32          0                 1/8               7/8
Unrelated                    0             0                 0                 1
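
In practice an estimated kinship coefficient rarely equals one of the expected values exactly, so estimates are usually binned. The sketch below assumes a commonly used convention in which the cutoffs are the powers of two $2^{-(d+1.5)}$, which fall between the expected values 1/4, 1/8, and 1/16 for degrees 1, 2, and 3. The function name and the labels are illustrative, and these cutoffs are not necessarily the ones used elsewhere in this module.

```python
def relationship_degree(phi_hat):
    """Map an estimated kinship coefficient to a coarse relationship label.

    Cutoffs are 2 ** -(d + 1.5) for d = 0, 1, 2, 3 (an assumed convention).
    """
    if phi_hat > 2 ** -1.5:          # about 0.354
        return "duplicate/MZ twin"
    for degree in (1, 2, 3):
        if phi_hat > 2 ** -(degree + 1.5):
            return f"degree {degree}"
    return "unrelated or more distant"


for phi in (0.5, 0.25, 0.125, 0.0625, 0.0):
    print(phi, relationship_degree(phi))
```

The kinship coefficient alone cannot separate parent-offspring pairs from full siblings, since both have an expected value of 1/4; the probability of sharing 0 alleles IBD (0 versus 1/4) is needed for that.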

Relatedness Inference in Homogeneous Population Samples
Genome Screen Data to Identify Relative Pairs
High-throughput genotyping data have created new opportunities for detecting pedigree errors as well as for assessing the degree of relatedness among sampled individuals in genetic studies.
A number of methods have been proposed for identifying relatives using genome-screen data for samples from a single homogeneous population.

Inference for Close Relatives from a Homogeneous Population
McPeek and Sun (2000) developed an approximate likelihood method (using an HMM) to identify relative pairs for close relationships.
Purcell et al. (2007) proposed a method of moments estimator for IBD sharing probabilities using genome-screen data (implemented in the PLINK software).
Choi, Wijsman, and Weir (2009) proposed using an EM algorithm to estimate the IBD probabilities and kinship coefficients.
Thornton and McPeek (2010) proposed a method of moments estimator for kinship coefficients by estimating genotypic correlations across the genome.

Identification of More Distant Relatives from a Homogeneous Population
A variety of methods have been developed for identifying long shared IBD segments for inference on more distant relatives:
  Stankovich et al. (2005) extended the HMM method of McPeek and Sun to more distant relative pairs
  Huff et al. (2011)
  Browning and Browning (2013)
  and others...

Estimating Kinship Coefficients via Genotypic Correlations
Thornton and McPeek (2010) proposed a method of moments approach, based on genotypic correlations, for estimating kinship coefficients from SNP genotyping data in samples from homogeneous populations.
Consider two individuals $i$ and $j$ in the sample.
Assume genome-screen data are available for $i$ and $j$ at $M$ autosomal markers, indexed by $m = 1, 2, \ldots, M$.
Let $g_{im}$ be the genotype value at marker $m$ for individual $i$, where $g_{im}$ takes values 0, 1, or 2, corresponding to the number of reference alleles individual $i$ has.
Let $p_m$ be the frequency of the reference allele at marker $m$, where $0 < p_m < 1$.

Estimating Kinship Coefficients in Homogeneous Populations
It can be shown that the covariance of $g_{im}$ and $g_{jm}$ at marker $m$ is
$$\operatorname{Cov}(g_{im}, g_{jm}) = 4\,p_m(1 - p_m)\,\phi_{ij},$$
where $\phi_{ij}$ is the kinship coefficient for $i$ and $j$.
Rearranging terms, we see that
$$\phi_{ij} = \frac{\operatorname{Cov}(g_{im}, g_{jm})}{4\,p_m(1 - p_m)}$$
This relationship holds for markers across the genome (with the allele frequency $p_m$ varying from marker to marker).
It follows that kinship coefficients can be estimated for pairs of individuals using genotype data from a genome screen.
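
One way to see where the covariance identity comes from is the following sketch, which assumes outbred individuals from a homogeneous population. Write each genotype as the sum of two allele indicators, $g_{im} = a_{i1} + a_{i2}$ and $g_{jm} = a_{j1} + a_{j2}$ (notation introduced here for illustration), where each indicator equals 1 for a reference allele. If two alleles are IBD they are identical, so their covariance is $p_m(1 - p_m)$; if they are not IBD they are independent, so their covariance is 0. The kinship coefficient is the average of the four allele-pair IBD probabilities.

```latex
\begin{align*}
\operatorname{Cov}(g_{im}, g_{jm})
  &= \sum_{k=1}^{2}\sum_{l=1}^{2} \operatorname{Cov}(a_{ik}, a_{jl}) \\
  &= p_m(1 - p_m) \sum_{k=1}^{2}\sum_{l=1}^{2}
     P(a_{ik} \text{ and } a_{jl} \text{ are IBD}) \\
  &= 4\,p_m(1 - p_m)\,\phi_{ij}
\end{align*}
```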

For any pair of individuals $i$ and $j$, we can estimate $\phi_{ij}$ using the method of moments:
$$\hat{\phi}_{ij} = \frac{1}{M} \sum_{m=1}^{M} \frac{(g_{im} - 2\hat{p}_m)(g_{jm} - 2\hat{p}_m)}{4\,\hat{p}_m(1 - \hat{p}_m)}$$
where $\hat{p}_m$ is an allele frequency estimate for the reference allele at marker $m$.
Note that this estimator is essentially the same (up to a constant factor) as the previously discussed GRM estimator used for population structure inference with PCA!
It is also the same estimator that is used to construct a GRM for association testing with linear/logistic mixed models, which will be discussed later.
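
The method of moments estimator can be written in a few lines. The sketch below uses plain Python lists for a single pair of individuals; the function name `kinship_mom` and the choice to skip monomorphic markers are assumptions made for this illustration, not part of any particular software package.

```python
def kinship_mom(g_i, g_j, p_hat):
    """Method-of-moments kinship estimate for one pair of individuals.

    g_i, g_j: genotype values (0, 1, or 2 reference alleles) at M markers.
    p_hat:    estimated reference allele frequency at each marker.
    Markers with p_hat equal to 0 or 1 are skipped, because the
    denominator 4 * p * (1 - p) is zero there (assumption of this sketch).
    """
    total, used = 0.0, 0
    for gi, gj, p in zip(g_i, g_j, p_hat):
        if not 0.0 < p < 1.0:
            continue
        total += (gi - 2 * p) * (gj - 2 * p) / (4 * p * (1 - p))
        used += 1
    if used == 0:
        raise ValueError("no polymorphic markers")
    return total / used


g_i = [0, 1, 2, 1]
g_j = [0, 1, 2, 2]
p_hat = [0.5, 0.5, 0.5, 0.5]
print(kinship_mom(g_i, g_j, p_hat))  # 0.5
```

With only four markers the estimate is extremely noisy; the estimator is meant to be averaged over many thousands of markers.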

Relatedness Inference in Diverse Populations
Relatedness Inference in Structured Populations
The aforementioned algorithms for relatedness inference assume population homogeneity.
This assumption is often untenable: many genetic studies, such as TOPMed, have samples from populations with different ancestries.
Relationship estimation methods that assume homogeneity can give extremely biased results in the presence of population structure.
The degree of relatedness among sample individuals with similar ancestry, whether related or unrelated, can be systematically inflated.

KING: Relatedness Inference with Distinct Ancestral Subpopulations
The KING estimator (Manichaikul et al., 2010) discussed in the previous lecture was developed for estimating kinship coefficients for pairs of individuals from ancestrally distinct subpopulations.
KING-robust estimates the kinship coefficient for a pair of individuals by using shared genotype counts as a measure of the genetic distance between the pair.
The method does not require allele frequency estimates at the markers; it is based on allele sharing counts for the pair of individuals.
A limitation of the method is that it gives biased kinship estimates for individuals with different ancestry, including close relatives who are admixed.
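
To illustrate how genotype counts alone can yield a kinship estimate, note that for a pair from a homogeneous population in Hardy-Weinberg proportions, $E[(g_{im} - g_{jm})^2] = 4p_m(1 - p_m)(1 - 2\phi_{ij})$, while each individual is heterozygous with probability $2p_m(1 - p_m)$. Dividing the summed squared genotype distance by the summed heterozygote counts therefore cancels the allele frequencies. The sketch below implements that ratio. It illustrates the idea only: the published KING-robust estimator (Manichaikul et al., 2010) differs in its details, and the function name is hypothetical.

```python
def kinship_from_genotype_distance(g_i, g_j):
    """Kinship estimate from the squared genotype distance of a pair.

    Uses phi = 1/2 - sum((g_i - g_j)^2) / (2 * (het_i + het_j)), where
    het_i and het_j count the heterozygous genotypes of each individual.
    No allele frequency estimates are needed.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(g_i, g_j))
    het_i = sum(1 for a in g_i if a == 1)
    het_j = sum(1 for b in g_j if b == 1)
    if het_i + het_j == 0:
        raise ValueError("no heterozygous genotypes in either individual")
    return 0.5 - sq_dist / (2 * (het_i + het_j))


# Identical genotypes (e.g. a duplicate sample) give the maximum value 0.5.
print(kinship_from_genotype_distance([0, 1, 2, 1], [0, 1, 2, 1]))  # 0.5
```

Opposite homozygotes (0 versus 2) contribute 4 to the squared distance, so pairs with many such markers receive low or negative estimates.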

Relatedness Inference in Admixed Samples
Genetic models used to identify related individuals from large-scale genetic data often make simplifying assumptions about population structure: either random mating or simple structures.
In reality, human populations do not mate at random, nor do they consist of simple endogamous subgroups.
While large-scale genetic studies have primarily examined populations of European ancestry, more recent studies, including TOPMed, involve multi-ethnic cohorts with samples from admixed populations.

Recent versus Distant Genetic Relatedness
Distinguishing familial relatedness from population structure using genotype data is difficult, as both manifest as genetic similarity through the sharing of alleles.
It is important to note that relatedness and ancestry form a continuum.
Whether two alleles are considered identical copies of an ancestral allele (i.e., IBD) is relative to a chosen reference point in the past; allele sharing that traces back further than that reference point is not counted as IBD.

[Figure: Recent versus Distant Genetic Relatedness]

[Figure: Deconvolution of Recent and Distant Genetic Relatedness]

PC-Relate: Relatedness Inference in Diverse Samples
The PC-Relate method of Conomos et al. (2016) estimates IBD sharing probabilities and kinship coefficients in the presence of unknown population structure.
Let $g_{im}$ and $g_{jm}$ be the previously defined genotype values at marker $m$ for individuals $i$ and $j$, respectively.
For every individual $i$ in the sample, PC-Relate uses a regression model to estimate the expected genotype value at each marker $m$ conditional on $i$'s inferred ancestry.
The top principal components (PCs) from PC-AiR are included as predictors in this regression.
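
The regression step can be sketched for a single marker as follows. To keep the example free of dependencies it uses only one PC as the predictor and fits the line by ordinary least squares in closed form; in practice several PCs would be included. The function name and inputs are hypothetical, and this is not the PC-Relate implementation itself.

```python
def individual_allele_freqs(genotypes, pc):
    """Fitted individual-specific allele frequencies at one marker.

    genotypes: genotype values (0, 1, 2) for N individuals at the marker.
    pc:        one principal-component score per individual (hypothetical
               input; several PCs would normally be used as predictors).
    Fits g = b0 + b1 * pc by ordinary least squares and returns fitted / 2.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_pc = sum(pc) / n
    sxx = sum((x - mean_pc) ** 2 for x in pc)
    sxy = sum((x - mean_pc) * (g - mean_g) for x, g in zip(pc, genotypes))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_g - slope * mean_pc
    return [(intercept + slope * x) / 2 for x in pc]


# Two ancestry groups separated along the PC, with different allele frequencies.
print(individual_allele_freqs([0, 1, 1, 2], [-1, -1, 1, 1]))
# [0.25, 0.25, 0.75, 0.75]
```

Because the fit is linear, fitted values can fall outside the interval (0, 1) for some individuals, and such markers need separate handling before they are used in a kinship estimate.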

We denote
$$\mu_{im} = \frac{1}{2}\,E[g_{im} \mid \text{PCs}]$$
to be the individual-specific allele frequency for individual $i$ at marker $m$ based on the PCs.
Note that $2\mu_{im}$ is the expected value of $g_{im}$ conditional on $i$'s ancestry, as represented by the PCs in the regression model.
The PC-Relate estimator of $\phi_{ij}$ for $i$ and $j$ is obtained via the method of moments:
$$\hat{\phi}^{A}_{ij} = \frac{\sum_{m=1}^{M} (g_{im} - 2\hat{\mu}_{im})(g_{jm} - 2\hat{\mu}_{jm})}{\sum_{m=1}^{M} 4\sqrt{\hat{\mu}_{im}(1 - \hat{\mu}_{im})\,\hat{\mu}_{jm}(1 - \hat{\mu}_{jm})}}$$
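
Given individual-specific allele frequencies, the kinship estimate is a ratio of two sums over markers, with the square-root normalization in the denominator as in Conomos et al. (2016). The sketch below assumes those frequencies have already been estimated; the function name and the rule for skipping markers with frequencies outside (0, 1) are assumptions of this illustration.

```python
import math


def pc_relate_kinship(g_i, g_j, mu_i, mu_j):
    """Ratio-of-sums kinship estimate using individual-specific allele frequencies.

    g_i, g_j:   genotype values (0, 1, 2) at M markers.
    mu_i, mu_j: individual-specific allele frequencies at the same markers.
    Markers where either frequency is not strictly inside (0, 1) are
    skipped (assumption of this sketch).
    """
    num, den = 0.0, 0.0
    for gi, gj, mi, mj in zip(g_i, g_j, mu_i, mu_j):
        if not (0.0 < mi < 1.0 and 0.0 < mj < 1.0):
            continue
        num += (gi - 2 * mi) * (gj - 2 * mj)
        den += math.sqrt(mi * (1 - mi) * mj * (1 - mj))
    if den == 0.0:
        raise ValueError("no usable markers")
    return num / (4 * den)


g_i = [0, 1, 2, 1]
g_j = [0, 1, 2, 2]
mu = [0.5, 0.5, 0.5, 0.5]
print(pc_relate_kinship(g_i, g_j, mu, mu))  # 0.5
```

When both individuals have the same frequency $p_m$ at every marker, the square root reduces to $p_m(1 - p_m)$, which matches the normalization used in the homogeneous-population setting.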

PC-Relate: Relatedness Inference in Diverse Populations
PC-Relate also estimates IBD sharing probabilities.
In all calculations, baseline differences in genotypic values that are due to ancestry differences (i.e., the PCs) are regressed out (adjusted for).
PC-Relate kinship coefficients and IBD sharing probabilities are robust to population structure, admixture, and departures from HWE.
See Conomos et al. (2016) for more details.

Relatedness Inference in TOPMed
TOPMed Phase I: Relatedness Inference
TOPMed cohorts are multi-ethnic.
There is also extensive relatedness in TOPMed cohorts due to family-based sampling and samples from founder populations (Amish).
There are also likely a number of cryptic relationships within and among the cohorts.
The PC-AiR algorithm was first applied to TOPMed Phase I data for inference on population structure in the presence of relatedness.
PC-Relate was then used to estimate relatedness in the TOPMed Phase I samples, adjusting for the top PCs from PC-AiR.

TOPMed Phase I: PC-Relate Kinship by Study
[Figure: PC-Relate kinship estimate plotted against k0, one panel per study; legend: Duplicates, Degree 1, Degree 2, Degree 3]

TOPMed Phase I: PC-Relate Kinship Across Studies
[Figure: PC-Relate kinship estimate plotted against k0, with panels by study; legend: Duplicates, Degree 1, Degree 2, Degree 3]

References
Browning BL, Browning SR (2013). Improving the Accuracy and Efficiency of Identity by Descent Detection in Population Data. Genetics 194, 459-471.
Choi Y, Wijsman EM, Weir BS (2009). Case-control association testing in the presence of unknown relationships. Genet. Epi. 33, 668-678.
Conomos MP, Miller M, Thornton T (2015). Robust Inference of Population Structure for Ancestry Prediction and Correction of Stratification in the Presence of Relatedness. Genetic Epidemiology 39, 276-93.
Conomos MP, Reiner AP, Weir BS, Thornton TA (2016). Model-free estimation of recent genetic relatedness. American Journal of Human Genetics 98, 127-148.

Huff CD, Witherspoon DJ, Simonson TS, Xing J, et al. (2011). Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Research 21, 768-774.
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010). Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867-2873.
McPeek MS, Sun L (2000). Statistical Tests for Detection of Misspecified Relationships by Use of Genome-Screen Data. Am. J. Hum. Genet. 66, 1076-1094.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC (2007). PLINK: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet. 81, 559-575.
Stankovich J, Bahlo M, Rubio JP, Wilkinson CR, Thomson R, Banks A, Ring M, Foote SJ, Speed TP (2005). Identifying nineteenth century genealogical links from genotypes. Hum. Genet. 117, 188-199.
Thornton T, McPeek MS (2010). ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure. Am. J. Hum. Genet. 86, 172-184.