Ancestral Recombination Graphs

Ancestral relationships among a sample of recombining sequences usually cannot be described accurately by a single genealogy. Linked sites have similar, but not necessarily identical, genealogies. Recombination leads to variation in the time to the most recent common ancestor (TMRCA) between sites, which in turn can lead to variation in genetic polymorphism. The ancestral recombination graph (ARG) is a generalization of the coalescent that describes the sequence of genealogies along a sample of recombining sequences.

Nordborg (2000)

Jay Taylor (ASU) Ancestral Recombination Graphs 16 Feb 2017

Variation in Total Tree Length in a Sample of 10 Chromosomes. Hudson (2000)

Applications of the ARG

The ARG has several uses:

- Recombining sequences are potentially much more informative about demography, admixture and selection than a single completely linked locus.
- Fine-scale recombination rate estimation is possible if we can exploit high-density SNP data.
- Statistical inference for GWAS can be improved if we can accurately account for the complex correlations that exist between multiple linked loci.

Meiotic recombination generates mosaic chromosomes. Strachan & Read (1996)

The Two-locus Ancestral Recombination Graph

For simplicity, consider a sample of $n$ sequences containing just two loci. We make the following assumptions:

- The population evolves according to the diploid Wright-Fisher model with population size $N$. Each individual samples two chromosomes uniformly at random from the previous generation.
- When a chromosome is sampled from a parent, it is either inherited intact (with probability $1 - r$) or it undergoes a recombination between the two loci (with probability $r$).
- When a recombination event occurs, the two loci are inherited from the two different homologous chromosomes of that parent.
- No recombination occurs within the loci.

Suppose that $N$ is large and that $r$ is of order $O(1/N)$. Then, looking backwards in time, the ancestral relationships between the sampled sequences are determined by two processes, each with a small probability per generation:

- With probability $\binom{n}{2} \frac{1}{2N}$, a randomly chosen pair of sequences coalesces.
- With probability $nr$, a randomly chosen sequence is produced by a recombination event.

More complex scenarios involving multiple coalescences or recombinations in the same generation have negligible probabilities under the above assumptions.

Hudson (2000)

When a sequence is produced by a recombination event, the two loci have different ancestors. In this case, the branch experiencing the recombination splits in two, with each emerging branch corresponding to one of the two ancestral sequences.

If $N$ is large and time is measured in units of $2N$ generations, the ancestry of the sampled sequences can be modeled by a continuous-time Markov chain known as the two-locus ancestral recombination graph (Griffiths 1981):

- At rate $\binom{n}{2}$, a randomly chosen pair of sequences coalesces and the number of branches in the ARG decreases by one.
- At rate $n\rho/2$, a randomly chosen sequence is produced by recombination. The corresponding branch splits into two branches, each containing material ancestral to one of the two loci. This increases the number of branches by one.
- The process terminates when both loci have reached their MRCA.

Here $n$ denotes the current number of branches. Because the branching rate is linear in $n$ while the coalescence rate is quadratic in $n$, the ARG is certain to reach an ultimate ancestor (UA) at some finite time. If the two loci reach their MRCAs at different times, then it may be possible to terminate the process before reaching the UA.

The parameter $\rho = 4 N_e r$ is known as the population recombination rate.
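The competition between coalescence and recombination can be illustrated with a minimal Python sketch. It tracks only the number of branches and runs until a single branch (the ultimate ancestor) remains; it does not record which locus each branch is ancestral to, so it cannot stop early at the per-locus MRCAs. The function name `simulate_branch_counts` and its return values are illustrative choices, not part of the original slides.

```python
import random

def simulate_branch_counts(n, rho, rng):
    """Simulate the number of branches in the graph, back to the ultimate ancestor.

    Time is in units of 2N generations. With k branches, coalescence occurs
    at rate k(k-1)/2 and recombination at rate k*rho/2.
    Returns (time_to_ua, max_branches, n_recombinations).
    """
    k = n
    t = 0.0
    max_k = n
    n_rec = 0
    while k > 1:
        coal_rate = k * (k - 1) / 2.0
        rec_rate = k * rho / 2.0
        total = coal_rate + rec_rate
        t += rng.expovariate(total)          # waiting time to the next event
        if rng.random() < coal_rate / total:
            k -= 1                           # two branches merge
        else:
            k += 1                           # one branch splits in two
            n_rec += 1
            max_k = max(max_k, k)
    return t, max_k, n_rec

if __name__ == "__main__":
    rng = random.Random(2017)
    print(simulate_branch_counts(10, 1.0, rng))
```

Because the coalescence rate grows quadratically in the number of branches while the recombination rate grows only linearly, the loop terminates with probability one, mirroring the argument for the existence of the UA.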

The two-locus ARG is readily extended to multiple loci. Suppose that the sampled sequences contain $L$ loci (with $n$ still denoting the number of sequences), let $r_i$ be the recombination rate per generation between loci $i$ and $i+1$, and let $r = r_1 + \dots + r_{L-1}$ be the total recombination rate.

- At rate $\binom{n}{2}$, a randomly chosen pair of sequences coalesces and the number of branches in the ARG decreases by one.
- At rate $n\rho/2$, a randomly chosen sequence is generated by recombination. In this case, the recombination breakpoint falls between loci $i$ and $i+1$ with probability $P(B = i) = r_i / r$. The affected branch splits into two branches, one containing the material ancestral to loci $1, \dots, i$ and the other containing the material ancestral to loci $i+1, \dots, L$. This increases the number of branches by one.
- The process terminates when all loci have reached their MRCA.
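As a small illustration of the breakpoint rule, the sketch below draws the breakpoint position with probability proportional to the per-interval recombination rates and then splits the loci into the parts to the left and right of the breakpoint. The function names are hypothetical, and the split assumes the branch still carries all loci.

```python
import random

def sample_breakpoint(interval_rates, rng):
    """Draw i in 1..(number of loci - 1) with probability r_i / r.

    interval_rates[i-1] is the recombination rate between loci i and i+1.
    """
    positions = range(1, len(interval_rates) + 1)
    return rng.choices(positions, weights=interval_rates)[0]

def split_loci(n_loci, breakpoint):
    """Return the loci left and right of a breakpoint between i and i+1."""
    left = list(range(1, breakpoint + 1))
    right = list(range(breakpoint + 1, n_loci + 1))
    return left, right

if __name__ == "__main__":
    rng = random.Random(0)
    rates = [1e-8, 5e-8, 1e-8, 2e-8]   # illustrative values for 5 loci
    b = sample_breakpoint(rates, rng)
    print(b, split_loci(5, b))
```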

Example: An ARG with Mutation. Arenas et al. (2010)

Although the ARG provides an accurate description of the ancestral relationships among a sample of recombining sequences, using it to analyze sequence data is computationally challenging for three reasons.

- Curse of dimensionality: The number of possible ARGs for a sample of $n$ sequences containing $L$ sites is $((2n-3)!!)^L$.
- Weakly informative data: In general, the ARG is only weakly determined by the sequence data.
- Long-range dependence: The genealogies at flanking sites remain correlated even if we condition on the genealogy at an intermediate site.

Because of these difficulties, a major research focus has been the development of computationally efficient approximations to the ARG.
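To get a feel for how quickly this count grows, the following sketch evaluates the double factorial $(2n-3)!!$ (the number of rooted, labeled binary tree topologies on $n$ tips) and raises it to the power of the number of sites. The function names are illustrative.

```python
def odd_double_factorial(m):
    """Return m!! = m * (m - 2) * ... * 3 * 1 for odd m >= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def num_tree_topologies(n):
    """Number of rooted, labeled binary tree topologies on n >= 2 tips."""
    return odd_double_factorial(2 * n - 3)

def num_tree_sequences(n, n_sites):
    """Number of ways to assign one topology to each of n_sites sites."""
    return num_tree_topologies(n) ** n_sites

if __name__ == "__main__":
    for n in (3, 5, 10):
        print(n, num_tree_topologies(n))
    # Python integers are arbitrary precision, so large counts do not overflow.
    print(len(str(num_tree_sequences(10, 1000))), "digits")
```

Even for 10 sequences there are already 34,459,425 topologies at a single site, which is why exhaustive enumeration across many sites is not an option.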

Composite Likelihoods and the Two-locus ARG

Hudson (2000) proposed a way of approximating the likelihood of recombinant sequence data based on the two-locus ARG. In this approach, a likelihood function $L_{ij}(\rho)$ is computed for each pair of segregating sites $i$ and $j$ using Monte Carlo-based methods. The composite likelihood of the complete data $D$ is then calculated by multiplying all of the pairwise likelihoods:

$$L_{\mathrm{comp}}(\rho \mid D) = \prod_{i < j} L_{ij}(\rho)$$

The population recombination rate $\rho$ can be estimated by finding the value $\hat{\rho}$ that maximizes $L_{\mathrm{comp}}$. The composite MLE is known to be consistent, but the composite likelihood function itself is too peaked around $\hat{\rho}$, i.e., confidence intervals calculated using the curvature of $L_{\mathrm{comp}}$ will be too narrow.
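The structure of the composite likelihood calculation can be sketched as a grid search over candidate values of the recombination rate, summing pairwise log-likelihoods over all pairs of sites. The pairwise function `toy_pair_loglik` below is a hypothetical placeholder, not Hudson's two-locus likelihood; in practice the pairwise likelihoods would come from Monte Carlo computations or precomputed lookup tables.

```python
from itertools import combinations

def composite_loglik(rho, sites, pair_loglik):
    """Sum the pairwise log-likelihoods over all pairs i < j of sites."""
    return sum(pair_loglik(rho, i, j) for i, j in combinations(sites, 2))

def composite_mle(rho_grid, sites, pair_loglik):
    """Return the grid value of rho that maximizes the composite log-likelihood."""
    return max(rho_grid, key=lambda rho: composite_loglik(rho, sites, pair_loglik))

def toy_pair_loglik(rho, i, j):
    """Hypothetical stand-in for log L_ij(rho): a quadratic with a pair-specific peak."""
    peak = 2.0 + 0.1 * (j - i)
    return -(rho - peak) ** 2

if __name__ == "__main__":
    grid = [0.5 * k for k in range(1, 11)]   # candidate rho values 0.5, 1.0, ..., 5.0
    print(composite_mle(grid, [0, 1, 2, 3], toy_pair_loglik))
```

Working on the log scale turns the product of pairwise likelihoods into a sum, which avoids numerical underflow when the number of site pairs is large.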

A practical implementation of the composite likelihood approach was first made by McVean et al. (2002) and later updated by Auton & McVean (2007) to handle recombination rate variation (LDhat2).

Figure panels: Chr 19 (A) and Chr 22 (B); HLA region. Source: McVean et al. (2004)

The Sequentially Markov Coalescent (SMC)

An alternative simplification of the ARG, proposed by McVean & Cardin (2005), removes the long-range dependence of genealogies at different sites. In this approach, which is known as the sequentially Markov coalescent (SMC), the ARG is approximated by a process that iteratively determines the genealogy at each position along a chromosome.

- The SMC starts at one end of the chromosome and samples a coalescent tree $T_1$ using the ordinary coalescent.
- It then generates a sequence of breakpoints $b_1, \dots, b_m$ and coalescent trees $T_1, \dots, T_m$ such that $T_i$ is the genealogy of the $n$ sequences in the nonrecombinant segment $(b_{i-1}, b_i)$.
- The procedure for generating the breakpoints and coalescent trees is such that the sequence $T_1, \dots, T_m$ is a Markov process, i.e., conditional on $T_i$, the trees $T_1, \dots, T_{i-1}$ are independent of the trees $T_{i+1}, \dots, T_m$.

The sequence of breakpoints and coalescent trees is generated using the following procedure:

- Given $b_i$ and $T_i$, the distance to the next breakpoint $b_{i+1}$ is exponentially distributed with rate $\rho |T_i| / 2$, where $|T_i|$ is the total branch length in $T_i$.
- Given $T_i$, the next tree $T_{i+1}$ is obtained by sampling a location uniformly at random along $T_i$ and detaching this lineage (and its subtree) from $T_i$. This generates a floating lineage, which then coalesces with the remaining parts of $T_i$.

McVean & Cardin (2005)

The Pairwise Sequentially Markov Coalescent (PSMC)

To the extent that different loci have different genealogies, even a single diploid genome can be used to make inferences about demographic history and selection.

- With just two copies of each locus, the genealogy at each site is fully specified by the pairwise coalescent time at that site.
- In this case, the ARG along a chromosome can be represented by a sequence of breakpoints $b_1, \dots, b_m$ and pairwise coalescent times $t_1, \dots, t_m$.
- Under the SMC, the sequence of pairwise coalescent times becomes a discrete-time Markov chain.
- The sequences $(b_i, t_i)$ can be used to make inferences about demography and selection.

Li & Durbin (2011) introduced a method called the PSMC, which uses a hidden Markov model to infer $(b_i, t_i)$ from the sequence data.
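For two sequences, the SMC procedure can be written in a few lines. The sketch below is a simulation under stated assumptions rather than the PSMC inference method itself: it assumes a constant population size, time measured in units of 2N generations (so a pair coalesces at rate 1), and a recombination rate `rho` per unit of sequence length. With two sequences the tree at a position is a single coalescence time t with total branch length 2t, so the distance to the next breakpoint has rate rho * t. The recombination point is placed uniformly on the tree at height u, and the detached lineage re-coalesces with the other lineage after a further exponential waiting time. The function name is illustrative.

```python
import random

def simulate_pairwise_smc(rho, length, rng):
    """Simulate segments (start, end, coalescence_time) along [0, length]."""
    t = rng.expovariate(1.0)                 # initial pairwise coalescence time
    pos = 0.0
    segments = []
    while True:
        rate = rho * t                       # rho * (total branch length 2t) / 2
        nxt = pos + rng.expovariate(rate) if rate > 0 else float("inf")
        if nxt >= length:
            segments.append((pos, length, t))
            return segments
        segments.append((pos, nxt, t))
        u = rng.uniform(0.0, t)              # height of the recombination point
        t = u + rng.expovariate(1.0)         # floating lineage re-coalesces above u
        pos = nxt

if __name__ == "__main__":
    for start, end, t in simulate_pairwise_smc(5.0, 2.0, random.Random(42)):
        print(f"{start:.3f}-{end:.3f}: t = {t:.3f}")
```

Each new coalescence time depends only on the previous one, which is the Markov property along the sequence that the hidden Markov model exploits.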

Hidden Markov Models (HMM)

Hidden Markov models can be used to analyze data generated by processes in which the state of the system is hidden from the observer.

- The state of the system evolves according to a discrete-time Markov chain $X_1, X_2, X_3, \dots$
- Although $X_t$ is hidden, at each time $t$ the observer can measure some variable $Y_t$ that depends only on $X_t$.
- The objective is to use the observations $(Y_t)$ to learn about $(X_t)$, which can be done using dynamic programming algorithms.

Diagram: hidden states $X_0 \to X_1 \to \dots \to X_5$, with each transition labeled $p$; each hidden state $X_t$ emits the observation $Y_t$ along an arrow labeled $e$.
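One of the standard dynamic programming algorithms for HMMs is the forward algorithm, which computes the probability of the observations by summing over all hidden paths in time linear in the sequence length. The sketch below assumes a finite state space, with `init`, `trans` and `emit` given as plain Python lists; these names and the two-state example values are illustrative.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return P(Y_1, ..., Y_T) for a finite-state HMM.

    init[s]     : probability that the first hidden state is s
    trans[r][s] : probability of moving from hidden state r to s
    emit[s][y]  : probability of observing symbol y in hidden state s
    """
    n_states = len(init)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for y in observations[1:]:
        alpha = [
            emit[s][y] * sum(alpha[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)

if __name__ == "__main__":
    init = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.2, 0.8]]
    emit = [[0.9, 0.1], [0.3, 0.7]]
    print(forward_likelihood(init, trans, emit, [0, 1, 1, 0]))
```

For long sequences the entries of `alpha` shrink towards zero, so practical implementations rescale them at each step or work with logarithms.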

Under the PSMC, the ancestral recombination graph and the sequence data can be represented by an HMM.

- Both processes $(X_i)$ and $(Y_i)$ are indexed by position along a chromosome.
- The hidden variable $X_i$ is the TMRCA at that position.
- The observed variable $Y_i$ is the pair of nucleotides in the sampled genome at that position.

For practical reasons, the state space of the hidden variables is usually required to be finite. To this end, Li & Durbin (2011) replace the continuous interval $[0, \infty)$ by a discrete set $\{s_0, s_1, \dots, s_m\}$.
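One way to picture such a discretization is sketched below. It is an assumption made for illustration and is not the spacing used by Li & Durbin (2011): the boundaries are chosen so that each interval has equal probability under an exponential distribution with rate 1, which is the distribution of the pairwise coalescence time in a constant-size population when time is measured in units of 2N generations. The function names are hypothetical.

```python
import bisect
import math

def quantile_boundaries(m):
    """Return s_0 = 0 < s_1 < ... < s_m splitting [0, inf) into m + 1 intervals.

    Assumption for illustration: each interval has probability 1 / (m + 1)
    under an Exp(1) distribution; the last interval is [s_m, inf).
    """
    return [-math.log(1.0 - j / (m + 1)) for j in range(m + 1)]

def interval_index(t, boundaries):
    """Return j such that boundaries[j] <= t < boundaries[j + 1]."""
    return bisect.bisect_right(boundaries, t) - 1

if __name__ == "__main__":
    s = quantile_boundaries(3)
    print([round(x, 4) for x in s])
    print(interval_index(0.5, s))
```

Each hidden state of the HMM then corresponds to one interval index rather than to an exact coalescence time.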

Applications of the PSMC from Li & Durbin (2011)

Extensions to multiple sequences

In principle, these ideas can be extended to samples containing more than two sequences.

- Hobolth et al. (2007) developed a coalescent HMM to handle data sampled from two or three species.
- Because the number of genealogies grows super-exponentially with the number of sampled sequences, this approach quickly becomes intractable.
- Rasmussen et al. (2014) proposed a novel approach based on the SMC-HMM which uses a clever MCMC algorithm ("threading") to generate a sample of ARGs from the posterior distribution given the sequence data and model parameters.
- Threading works by stochastically building up the ARG sequence by sequence.
- This approach is implemented in the software package ARGweaver.

Application: Distinguishing Background Selection from Selective Sweeps. Rasmussen et al. (2014)