Ancestral Recombination Graphs

Jay Taylor (ASU), 16 Feb 2017

Ancestral relationships among a sample of recombining sequences usually cannot be described accurately by a single genealogy. Linked sites have similar, but not necessarily identical, genealogies. Recombination leads to variation in the time to the most recent common ancestor (TMRCA) between sites, which in turn can lead to variation in genetic polymorphism. The ancestral recombination graph (ARG) is a generalization of the coalescent that describes the sequence of genealogies along a sample of recombining sequences.

Nordborg (2000)

Variation in Total Tree Length in a Sample of 10 Chromosomes

Figure from Hudson (2000).

Applications of the ARG

The ARG has several uses:

- Recombining sequences are potentially much more informative about demography, admixture and selection than a single completely linked locus.
- Fine-scale recombination rate estimation is possible if we can exploit high-density SNP data.
- Statistical inference for GWAS can be improved if we can accurately account for the complex correlations that exist between multiple linked loci.

Meiotic recombination generates mosaic chromosomes

Figure from Strachan & Read (1996).

The Two-locus Ancestral Recombination Graph

For simplicity, consider a sample of $n$ sequences containing just two loci. We make the following assumptions:

- The population evolves according to the diploid Wright-Fisher model with population size $N$. Each individual samples two chromosomes uniformly at random from the previous generation.
- When a chromosome is sampled from a parent, it is either inherited intact (with probability $1 - r$) or it undergoes a recombination between the two loci (with probability $r$).
- When a recombination event occurs, the two loci are inherited from the two different homologous chromosomes of that parent.
- No recombination occurs within the loci.

Suppose that $N$ is large and that $r$ is of order $O(1/N)$. Then, looking backwards in time, the ancestral relationships between the sampled sequences are determined by two processes:

- With probability $\binom{n}{2}\frac{1}{2N}$ per generation, a randomly chosen pair of sequences coalesces.
- With probability $nr$ per generation, a randomly chosen sequence is produced by a recombination event.

More complex scenarios involving multiple coalescences or recombinations in the same generation have negligible probability under the above assumptions.

When a sequence is produced by a recombination event, the two loci have different ancestors. In this case, the branch experiencing the recombination splits in two, with each emerging branch corresponding to one of the two ancestral sequences.

Hudson (2000)

If $N$ is large and time is measured in units of $2N$ generations, the ancestry of the sampled sequences can be modeled by a continuous-time Markov chain known as the two-locus ancestral recombination graph (Griffiths 1981). While the graph contains $n$ branches:

- At rate $\binom{n}{2}$, a randomly chosen pair of sequences coalesces and the number of branches in the ARG decreases by one.
- At rate $n\rho/2$, a randomly chosen sequence is produced by recombination. The corresponding branch splits into two branches, each containing material ancestral to one of the two loci. This increases the number of branches by one.

The process terminates when both loci have reached their MRCA. Because the branching rate is linear in $n$ while the coalescent rate is quadratic in $n$, the ARG is certain to reach an ultimate ancestor (UA) at some finite time. If the two loci reach their MRCAs at different times, then it may be possible to terminate the process before reaching the UA.

The parameter $\rho = 4N_e r$ is known as the population recombination rate.
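The competing coalescence and recombination rates can be explored with a small simulation. The following is a minimal sketch, not the slides' own code: it tracks only the number of branches (not which ancestral material each branch carries) and always runs to the ultimate ancestor. The function name `simulate_arg_lineages` and its return values are illustrative assumptions.

```python
import random

def simulate_arg_lineages(n, rho, rng=None):
    """Simulate the branch count of the ARG backwards in time.

    Time is in units of 2N generations. With k branches, coalescence
    occurs at rate k*(k-1)/2 and recombination at rate k*rho/2.
    Returns (time_to_ultimate_ancestor, n_coalescences, n_recombinations).
    """
    rng = rng or random.Random()
    k, t = n, 0.0
    n_coal = n_rec = 0
    while k > 1:
        coal_rate = k * (k - 1) / 2.0   # binom(k, 2)
        rec_rate = k * rho / 2.0
        total = coal_rate + rec_rate
        t += rng.expovariate(total)     # waiting time to the next event
        if rng.random() < coal_rate / total:
            k -= 1                      # two branches merge
            n_coal += 1
        else:
            k += 1                      # one branch splits in two
            n_rec += 1
    return t, n_coal, n_rec
```

Because every recombination adds a branch and every coalescence removes one, the number of coalescences always exceeds the number of recombinations by exactly n - 1 when the ultimate ancestor is reached.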

The two-locus ARG is readily extended to multiple loci. Suppose that each of the $n$ sampled sequences contains $m$ loci, let $r_i$ be the recombination rate per generation between locus $i$ and locus $i + 1$, and let $r = r_1 + \dots + r_{m-1}$ be the total recombination rate, with $\rho = 4N_e r$.

- At rate $\binom{n}{2}$, a randomly chosen pair of sequences coalesces and the number of branches in the ARG decreases by one.
- At rate $n\rho/2$, a randomly chosen sequence is generated by recombination. In this case, the recombination breakpoint falls between locus $i$ and locus $i + 1$ with probability $P(B = i) = r_i / r$. The affected branch splits into two branches, one containing material ancestral to loci $1, \dots, i$ and the other containing material ancestral to loci $i + 1, \dots, m$. This increases the number of branches by one.

The process terminates when all loci have reached their MRCA.
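The breakpoint rule can be illustrated in a few lines. This is a sketch under the assumption that the per-interval recombination rates are given as a list, where entry `i - 1` is the rate between locus `i` and locus `i + 1`; the helper names are hypothetical.

```python
import random

def sample_breakpoint(r, rng=None):
    """Return the breakpoint index i (1-based) with probability r[i-1] / sum(r)."""
    rng = rng or random.Random()
    return rng.choices(range(1, len(r) + 1), weights=r, k=1)[0]

def split_ancestral_material(num_loci, i):
    """Loci carried by the two parental branches for a breakpoint after locus i."""
    left = list(range(1, i + 1))               # loci 1..i
    right = list(range(i + 1, num_loci + 1))   # loci i+1..num_loci
    return left, right
```

For example, with four loci and a breakpoint after locus 2, one parental branch carries loci 1 and 2 and the other carries loci 3 and 4.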

Example: An ARG with Mutation

Figure from Arenas et al. (2010).

Although the ARG provides an accurate description of the ancestral relationships among a sample of recombining sequences, using it to analyze sequence data is computationally challenging for three reasons:

- Curse of dimensionality: The number of possible ARGs for a sample of $n$ sequences containing $L$ sites is $((2n-3)!!)^L$.
- Weakly informative data: In general, the ARG is only weakly determined by the sequence data.
- Long-range dependence: The genealogies at flanking sites remain correlated even if we condition on the genealogy at an intermediate site.

Because of these difficulties, a major research focus has been the development of computationally efficient approximations to the ARG.
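To get a feel for how quickly this count grows, it can be evaluated directly with Python's arbitrary-precision integers. The sketch below uses the fact that $(2n-3)!! = 1 \cdot 3 \cdot 5 \cdots (2n-3)$ is the number of rooted, labelled binary tree topologies on $n$ leaves; the function names are illustrative.

```python
def double_factorial_odd(n):
    """Return (2n-3)!! = 1*3*5*...*(2n-3) for n >= 2."""
    result = 1
    for k in range(3, 2 * n - 2, 2):
        result *= k
    return result

def num_tree_sequences(n, L):
    """Number of ways to assign one tree topology to each of L sites."""
    return double_factorial_odd(n) ** L
```

Even a sample of 10 sequences has 34,459,425 topologies per site, so for 1000 sites the count exceeds $10^{7000}$.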

Composite Likelihoods and the Two-locus ARG

Hudson (2000) proposed a way of approximating the likelihood of recombinant sequence data based on the two-locus ARG. In this approach, a likelihood function $L_{ij}(\rho)$ is computed for each pair of segregating sites $i$ and $j$ using Monte Carlo methods. The composite likelihood of the complete data $D$ is then calculated by multiplying all of the pairwise likelihoods:

$L_{\mathrm{comp}}(\rho \mid D) = \prod_{i < j} L_{ij}(\rho)$

The population recombination rate $\rho$ can be estimated by finding the value $\hat{\rho}$ that maximizes $L_{\mathrm{comp}}$. The composite MLE is known to be consistent, but the composite likelihood function itself is too peaked around $\hat{\rho}$, i.e., confidence intervals calculated from the curvature of $L_{\mathrm{comp}}$ will be too narrow.
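The structure of the composite-likelihood estimator can be sketched as a grid search over candidate values of the recombination rate. In the sketch below, `pair_loglik` is a hypothetical callback standing in for the Monte Carlo computation of the log pairwise likelihood, which is not implemented here; the toy callback in the usage note is purely illustrative and has no biological meaning.

```python
from itertools import combinations

def composite_loglik(rho, sites, pair_loglik):
    """Sum of log pairwise likelihoods over all unordered pairs of sites."""
    return sum(pair_loglik(rho, i, j) for i, j in combinations(sites, 2))

def composite_mle(rho_grid, sites, pair_loglik):
    """Grid value of rho that maximizes the composite log-likelihood."""
    return max(rho_grid, key=lambda rho: composite_loglik(rho, sites, pair_loglik))

def toy_pair_loglik(rho, i, j):
    """Toy stand-in (an assumption): every pair prefers rho = 2."""
    return -(rho - 2.0) ** 2
```

With the toy callback, `composite_mle([0.5, 1.0, 2.0, 4.0], [10, 20, 30], toy_pair_loglik)` picks 2.0. Working on the log scale turns the product of pairwise likelihoods into a sum, which avoids numerical underflow when there are many pairs.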

A practical implementation of the composite likelihood approach was first made by McVean et al. (2002) and later updated by Auton & McVean (2007) to handle recombination rate variation (LDhat2).

Figure panels: Chr 19 (A) and 22 (B); HLA Region. Source: McVean et al. (2004)

The Sequentially Markov Coalescent (SMC)

An alternative simplification of the ARG, proposed by McVean & Cardin (2005), removes the long-range dependence of genealogies at different sites. In this approach, known as the sequentially Markov coalescent (SMC), the ARG is approximated by a process that iteratively determines the genealogy at each position along a chromosome.

The SMC starts at one end of the chromosome and samples a coalescent tree $T_1$ using the ordinary coalescent. It then generates a sequence of breakpoints $b_1, \dots, b_m$ and coalescent trees $T_1, \dots, T_m$ such that $T_i$ is the genealogy of the $n$ sequences in the nonrecombinant segment $(b_{i-1}, b_i)$.

The procedure for generating the breakpoints and coalescent trees is such that the sequence $T_1, \dots, T_m$ is a Markov process, i.e., conditional on $T_i$, the trees $T_1, \dots, T_{i-1}$ are independent of the trees $T_{i+1}, \dots, T_m$.

The sequence of breakpoints and coalescent trees is generated using the following procedure:

- Given $b_i$ and $T_i$, the distance to the next breakpoint $b_{i+1}$ is exponentially distributed with rate $\rho\,|T_i|/2$, where $|T_i|$ is the total branch length of $T_i$.
- Given $T_i$, the next tree $T_{i+1}$ is obtained by sampling a location uniformly at random along $T_i$ and detaching this lineage (and its subtree) from $T_i$. This generates a floating lineage, which then coalesces with the remaining parts of $T_i$.

McVean & Cardin (2005)

The Pairwise Sequentially Markov Coalescent (PSMC)

To the extent that different loci have different genealogies, even a single diploid genome can be used to make inferences about demographic history and selection. With just two copies of each locus, the genealogy at each site is fully specified by the pairwise coalescence time at that site. In this case, the ARG along a chromosome can be represented by a sequence of breakpoints $b_1, \dots, b_m$ and pairwise coalescence times $t_1, \dots, t_m$.

Under the SMC, the sequence of pairwise coalescence times becomes a discrete-time Markov chain. These sequences $(b_i, t_i)$ can be used to make inferences about demography and selection. Li & Durbin (2011) introduced a method called the PSMC, which uses a hidden Markov model to infer $(b_i, t_i)$ from the sequence data.
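For two sequences the SMC recursion becomes very simple, and a sequence of breakpoints and pairwise coalescence times can be simulated directly. The following is a minimal sketch under several assumptions that are not stated in the slides: constant population size, time in units of 2N generations (so a pair coalesces at rate 1), and `rho` interpreted as the scaled recombination rate per unit of chromosome length. It is a simulation of the pairwise process, not the PSMC inference method, and the function name is illustrative.

```python
import random

def simulate_pairwise_smc(rho, length, rng=None):
    """Simulate (segment_start, segment_end, t) triples under the SMC with n = 2."""
    rng = rng or random.Random()
    t = rng.expovariate(1.0)   # initial pairwise coalescence time
    pos = 0.0
    segments = []
    while True:
        # The two-leaf tree has total branch length 2t, so the rate of the
        # distance to the next breakpoint is rho * 2t / 2 = rho * t.
        rate = rho * t
        step = rng.expovariate(rate) if rate > 0 else float("inf")
        end = min(pos + step, length)
        segments.append((pos, end, t))
        if end >= length:
            return segments
        pos = end
        # The recombination point is uniform on the tree; by symmetry only
        # its height u matters. The floating lineage then coalesces with the
        # remaining lineage at rate 1, starting from height u.
        u = rng.uniform(0.0, t)
        t = u + rng.expovariate(1.0)
```

Segments with a large coalescence time tend to be short, because the breakpoint rate is proportional to the height of the current tree.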

Hidden Markov Models (HMM)

Hidden Markov models can be used to analyze data generated by processes in which the state of the system is hidden from the observer.

- The state of the system evolves according to a discrete-time Markov chain: $X_1, X_2, X_3, \dots$
- Whereas $X_t$ is hidden, at each time $t$ the observer can measure some variable $Y_t$ that depends only on $X_t$.
- The objective is to use the observations $(Y_t)$ to learn about $(X_t)$, which can be done using dynamic programming algorithms.

Diagram: a hidden chain $X_0, X_1, \dots, X_5$ with transition probabilities $p$ between successive states; each hidden state $X_t$ emits the observation $Y_t$ with emission probability $e$.
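One such dynamic programming algorithm is the forward algorithm, which computes the probability of an observation sequence by summing over all hidden paths. The sketch below is a generic, unscaled version for a finite state space; the argument names (`init`, `trans`, `emit`) and the numbers in the usage note are illustrative assumptions.

```python
def forward_likelihood(init, trans, emit, obs):
    """Probability of the observations obs under a finite-state HMM.

    init[k]     = P(X_0 = k)
    trans[k][l] = P(X_{t+1} = l | X_t = k)
    emit[k][y]  = P(Y_t = y | X_t = k)
    """
    n_states = len(init)
    # alpha[k] = P(Y_0..Y_t, X_t = k), updated one observation at a time
    alpha = [init[k] * emit[k][obs[0]] for k in range(n_states)]
    for y in obs[1:]:
        alpha = [
            emit[l][y] * sum(alpha[k] * trans[k][l] for k in range(n_states))
            for l in range(n_states)
        ]
    return sum(alpha)
```

For instance, with `init = [0.6, 0.4]` and `emit = [[0.9, 0.1], [0.3, 0.7]]`, the probability of observing the single symbol 0 is 0.6 * 0.9 + 0.4 * 0.3 = 0.66. For long sequences the values in `alpha` underflow, so practical implementations rescale at each step or work with logarithms.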

Under the PSMC, the ancestral recombination graph and the sequence data can be represented by an HMM.

- Both processes $(X_i)$ and $(Y_i)$ are indexed by position along a chromosome.
- The hidden variable $X_i$ is the TMRCA at that position.
- The observed variable $Y_i$ is the pair of nucleotides in the sampled genome at that position.

For practical reasons, the state space of the hidden variables is usually required to be finite. To this end, Li & Durbin (2011) replace the continuous interval $[0, \infty)$ by a discrete set $\{s_0, s_1, \dots, s_m\}$.
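The discretization step can be illustrated with one simple choice of boundaries. The scheme below is an assumption made for illustration and is not the scheme used by Li & Durbin (2011): it places the boundaries at quantiles of the Exp(1) distribution, which is the distribution of the pairwise coalescence time in a constant-size population when time is measured in units of 2N generations, so that each interval has equal prior probability.

```python
import bisect
import math

def quantile_boundaries(m):
    """Boundaries s_0 = 0 < s_1 < ... < s_m cutting [0, inf) into m + 1
    intervals that each have probability 1/(m + 1) under Exp(1)."""
    return [math.log((m + 1) / (m + 1 - k)) for k in range(m + 1)]

def state_index(t, boundaries):
    """Index k of the interval [s_k, s_{k+1}) containing t (last one is open-ended)."""
    return bisect.bisect_right(boundaries, t) - 1
```

With `m = 3` the boundaries are 0, log(4/3), log 2 and log 4, and any coalescence time at or above log 4 falls in the last, unbounded interval.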

Applications of the PSMC from Li & Durbin (2011)

Extensions to multiple sequences

In principle, these ideas can be extended to samples containing more than two sequences.

- Hobolth et al. (2007) developed a coalescent HMM to handle data sampled from two or three species.
- Because the number of genealogies grows super-exponentially with the number of sampled sequences, this approach quickly becomes intractable.
- Rasmussen et al. (2014) proposed a novel approach based on the SMC-HMM, which uses a clever MCMC algorithm ("threading") to generate a sample of ARGs from the posterior distribution given the sequence data and model parameters.
- Threading works by stochastically building up the ARG sequence by sequence.
- This approach is implemented in the software package ARGweaver.

Application: Distinguishing Background Selection from Selective Sweeps

Rasmussen et al. (2014)