Algorithms for Genetics: Basics of the Wright-Fisher Model and Coalescent Theory

Vineet Bafna, Harish Nagarajan, and Nitin Udpa
Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
Bioinformatics Graduate Program, University of California San Diego, La Jolla, CA, USA

1 Disclaimer

Please note that a lot of the text and figures here are copied from Nordborg's excellent tutorial on the subject [6]. This writeup is to be used only for the class. Please do NOT use it for any purpose other than class preparation.

2 Wright-Fisher Model of Evolution

When studying population genetic data, how can we decide if we are seeing something unexpected? One way is to simulate populations evolving under neutral circumstances, and then check whether parameters computed from the observed data depart from their values under neutrality. Wright and Fisher introduced a simple model of populations focusing on the genealogical relationships. It is based on a number of simplifying assumptions.

2.1 Insights of the model

The Wright-Fisher (WF) model of evolution simulates genetic drift. Each individual in a generation produces an infinite and equal number of gametes, and the subsequent generation is a random draw of 2N gametes from this pool. In other words, the gene frequency in the next generation is determined by 2N draws from the gene frequency of the current generation. Mutations are randomly dropped into the gamete pool at a rate of µ mutations per individual per generation. The transition probability, which is the probability of a population drifting from a state having i copies to a state having j copies of a particular allele, is given by the following expression [4].

\[
T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(\frac{2N - i}{2N}\right)^{2N - j} \tag{1}
\]

2.2 Assumptions of the Wright-Fisher Model

The WF model of evolution rests mainly on the following six simplifying assumptions.

1. Discrete and non-overlapping generations are needed to separate the gamete pools of each generation. Although some mixing exists between generations in real populations, this assumption generally reduces computational complexity without a significant loss in accuracy.

2. Constant population size (2N haplotypes) across generations is assumed to simplify the problem computationally. Although this is not accurate in certain regions of the world (with an exponentially growing population), the simplification is generally appropriate.

3. Equal fitness of all individuals is another simplifying assumption. It is accurate if all mutations that occur are selectively neutral.

4. No geographical or social structure, together with random mating, is yet another assumption that simplifies the problem computationally. This is not really accurate for large populations, but it keeps the simulation tractable.

5. No recombination is also assumed for computational simplification. This is not necessarily accurate, but recombination rates are generally quite low, so it is probably a valid assumption in most cases. It makes the lineage of members of a population easy to trace, as each child is a direct copy of a parent (except for mutations).

6. The infinite sites assumption states that each site can mutate at most once. It yields a simplified tree: once a site is mutated, every descendant of the individual in question has the mutation, and every individual not descended from it has the ancestral form of the site. This is a reasonable assumption, as most sites rarely mutate multiple times, particularly over a small number of generations.

2.3 Simulating populations under the WF model

In order to generate a small sample of n individuals from a population of 2N haplotypes, where n << N, the following strategy is used. First, the genealogy is simulated for T generations. Mutations are then dropped at a fixed rate µ, each at a new site, after which the haplotypes are generated and n individuals are sampled. The time complexity of this strategy for generating a random population is O(NT). An example of the WF simulation is shown in Figure 1.

Figure 1: In this model, the genealogy is simulated for T generations of 2N individuals each. The green lines on the right panel indicate mutations being dropped at a rate µ.

3 Coalescent Theory

Coalescent theory is an efficient way of simulating populations under some assumptions. Indeed, a sample of n << N individuals can be simulated using only O(n) random values corresponding to coalescent times, as against O(NT) in the case of the WF model.

3.1 Insights of Coalescent Theory

One of the key insights of coalescent theory is that the genealogy can be separated from the allelic states. Under neutrality, the mutations or allelic states that occur in a population have no effect on the fitness of its individuals. As a result, the genealogy can be created randomly without any input from the genotype. In Nordborg's words, the state can be separated from the descent. As shown in Figure 1, after creating the genealogical tree, the allelic states of any group of individuals can be generated by assigning an allelic state to their most recent common ancestor (MRCA), and then dropping mutations along the branches of the genealogical tree that lead to them.

Coalescent theory also rests on the fact that much of the genealogy is irrelevant. The genealogy of a group of individuals can be modeled backwards in time, without worrying about the members of previous generations that left no offspring in the sample. This is depicted in Figure 2. Further, as a general consequence of the assumption of selective neutrality, each individual in a generation can be viewed as picking its parent at random from the previous generation.

Figure 2: As one goes backwards in time, fewer individuals contribute to the current gamete pool as coalescent events occur.

The topology of the genealogy can be generated by randomly picking pairs of lineages to coalesce. In other words, all topologies are equally likely. The branch lengths, on the other hand, are defined by the coalescent times, which are independent exponential random variables. From this stems the third insight of coalescent theory: the topology of the genealogy is independent of the coalescent times.

Finally, there is no need to go back generation by generation to get coalescent times. Given some k individuals (1 ≤ k ≤ n) from a WF population of 2N,

\begin{align}
\Pr[\text{no coalescence one generation back}] &= \prod_{i=0}^{k-1} \frac{2N - i}{2N} \tag{2} \\
&= \prod_{i=0}^{k-1} \left(1 - \frac{i}{2N}\right) \tag{3} \\
&= 1 - \frac{\binom{k}{2}}{2N} + O\!\left(\frac{1}{N^2}\right) \tag{4} \\
\Pr[\text{no coalescence } t \text{ generations back}] &\approx e^{-\binom{k}{2} \frac{t}{2N}} \tag{5} \\
&= e^{-\binom{k}{2} \tau} \tag{6}
\end{align}

where

\[
\tau = \frac{t}{2N} \tag{7}
\]

is time measured in units of 2N generations. This is convenient as it takes an unknown quantity (N) out of the equation.

\[
E(\text{coalescence time for } k \text{ individuals}) = \frac{1}{\binom{k}{2}} \tag{8}
\]

In other words, the time until two of k lineages coalesce is an exponentially distributed random variable with parameter \binom{k}{2}. It is easy to see now that coalescent theory (CT) offers an implicit, efficient algorithm for generating a random genealogy of n individuals.

1. Generate a random binary topology for n individuals.
2. For k = n, ..., 2, generate random times t_k exponentially distributed with parameter \binom{k}{2}.
3. For k = n, ..., 2, set the length of the interval during which the genealogy has k lineages to t_k.

Figure 3 illustrates this for a particular case of 6 individuals.

Figure 3: Genealogical trees for 6 individuals are randomly generated using an exponential random variable with parameter \binom{k}{2}, where k is the number of distinct individuals in a generation.

3.2 Coalescent Properties

We can use CT to compute some parameters. Let T_MRCA denote the time to reach the most recent common ancestor. The expected time to reach the MRCA is the sum of the expected coalescence times at each stage of coalescence (i.e. from n individuals down to 2 individuals).

\begin{align}
E(T_{MRCA}) &= \sum_{i=2}^{n} \frac{1}{\binom{i}{2}} \tag{9} \\
&= \sum_{i=2}^{n} \frac{2}{i(i-1)} \tag{10} \\
&= 2 \sum_{i=2}^{n} \left(\frac{1}{i-1} - \frac{1}{i}\right) \tag{11} \\
&= 2 \left(1 - \frac{1}{n}\right) \tag{12}
\end{align}

All times are in units of 2N generations. Note also that the expected time of the last step (2 individuals coalescing into 1) is 1, so it accounts for at least half of the expected total time to the MRCA. Mutations that fall on these two deep branches are ancient and are shared by large parts of the sample.

Let T_tot denote the sum of all branch lengths.

\begin{align}
E(T_{tot}) &= \sum_{i=2}^{n} \frac{i}{\binom{i}{2}} \tag{13} \\
&= \sum_{i=2}^{n} \frac{2}{i-1} \tag{14} \\
&\approx 2 (\gamma + \log n) \tag{15}
\end{align}

Here, γ is Euler's constant. Equation 15 implies that with increasing n, the sum of all branch lengths grows only logarithmically in n. This limits the benefit of drawing larger samples.

3.3 Simulating Populations

3.3.1 Simulating Populations of Constant Size

Once we have a genealogy, we can use it to simulate a population. Each allele travels from the root to a leaf (individual), possibly mutating on the way. Suppose that for the genomic region we are simulating, mutations occur at a fixed rate µ per generation. On any branch of length τ (in units of 2N generations), select the number of mutations by sampling from a Poisson distribution with mean 2Nτµ. In practice, we usually make the infinite sites assumption that each mutation hits a new site. Therefore, each mutation is labeled with a new site label. This set of mutations allows us to generate a population of variant sites. An example of this strategy is shown in Figure 4.
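The genealogy algorithm of Section 3.1 and the mutation-dropping step above can be combined into one short simulation. The following is a minimal sketch, not a reference implementation: the function names and the parameter `mut_rate` (standing for 2Nµ, the expected number of mutations per branch per unit of scaled time) are assumptions made for illustration, and the topology is built on the fly by merging a random pair of lineages at each coalescence.

```python
import random


def poisson(mean, rng):
    """Poisson draw: count unit-rate exponential arrivals that land before `mean`."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_sample(n, mut_rate, rng):
    """Simulate n haplotypes under the constant-size coalescent.

    Time is in units of 2N generations. `mut_rate` (hypothetical name) is
    2*N*mu, so a branch of length tau receives Poisson(mut_rate * tau)
    mutations, each at a new site (infinite sites).
    Returns (haplotypes, t_mrca, total_branch_length).
    """
    # Each lineage: (set of sampled leaves below it, time at which it started).
    lineages = [(frozenset([i]), 0.0) for i in range(n)]
    now, total_length = 0.0, 0.0
    sites = []  # one entry per mutated site: the set of leaves carrying it
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2)  # rate C(k, 2), as in Eq. 8
        now += wait
        total_length += k * wait
        a, b = rng.sample(range(k), 2)  # pair of lineages that coalesce
        merged = frozenset()
        for idx in (a, b):
            leaves, born = lineages[idx]
            # Drop mutations on the branch that ends at this coalescence.
            for _ in range(poisson(mut_rate * (now - born), rng)):
                sites.append(leaves)
            merged |= leaves
        lineages = [l for i, l in enumerate(lineages) if i not in (a, b)]
        lineages.append((merged, now))
    haplotypes = [
        "".join("1" if leaf in carriers else "0" for carriers in sites)
        for leaf in range(n)
    ]
    return haplotypes, now, total_length


if __name__ == "__main__":
    haps, t_mrca, t_tot = simulate_sample(6, 2.0, random.Random(1))
    print(t_mrca, t_tot)
    print("\n".join(haps))
```

Averaging `t_mrca` over many runs should approach 2(1 - 1/n) from Equation 12, which gives a quick sanity check on the waiting times.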

Figure 4: Simulating populations by generating a coalescent topology and branch lengths, and dropping mutations with rate µt for every branch of length t. The numbers on each branch indicate the sites that have been mutated on that particular branch.

3.3.2 Simulating Populations under exponential growth

Equation 8 shows that the expected time to coalesce is directly proportional to the population size N (as the time is measured in units of 2N generations). Under conditions of exponential population growth, N increases (and thus, the time to coalesce increases) as one goes forward in time. As a result, the branches closer to the leaves are longer and those closer to the root are shorter than the corresponding branches of a tree created under an unchanging population size. As the expected number of mutations on a branch is proportional to its length, one would expect more mutations to be unique to single individuals under conditions of exponential growth (see Figure 5 for an illustration). One of the most common tools for simulating coalescent populations, Hudson's ms program, was used to generate the samples in Table 1, which further illustrate this.
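One way to see the effect of growth on branch lengths is to rescale the constant-size coalescent times. The sketch below rests on assumptions that are not part of these notes: the population is taken to shrink backwards in time as N(t) = N_0 e^{-αt} (t in units of 2N_0 generations, α a hypothetical growth parameter), in which case a constant-size waiting time s that starts at time t_0 maps to the time t solving (e^{αt} - e^{αt_0})/α = s. It is an illustration of the idea only, not a description of how ms is implemented.

```python
import math
import random


def next_event_time(now, s, alpha):
    """Map a constant-size waiting time s, starting at `now`, to growth time.

    Assumes N(t) = N0 * exp(-alpha * t) backwards in time (an assumption of
    this sketch). alpha = 0 recovers the constant-size coalescent.
    """
    if alpha == 0:
        return now + s
    return math.log(math.exp(alpha * now) + alpha * s) / alpha


def coalescent_times(n, alpha, rng):
    """Times at which the sample goes from k to k-1 lineages, k = n, ..., 2."""
    times, now = [], 0.0
    for k in range(n, 1, -1):
        s = rng.expovariate(k * (k - 1) / 2)  # constant-size waiting time
        now = next_event_time(now, s, alpha)
        times.append(now)
    return times


if __name__ == "__main__":
    constant = coalescent_times(10, 0.0, random.Random(3))
    growth = coalescent_times(10, 10.0, random.Random(3))
    # Share of the tree height that elapses before the first coalescence.
    print(constant[0] / constant[-1], growth[0] / growth[-1])
```

With the same random draws, the growth tree is never taller than the constant-size tree, and a larger share of its height lies before the first coalescence, which is the "long branches near the leaves" shape described above.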

Figure 5: Typical genealogical trees under constant size and exponential growth conditions

Table 1: Sample Populations Generated using Hudson's ms Program

Constant Population Size    Exponential Growth
1000101                     0000000
1000101                     1000000
1000101                     0010000
0111000                     0010000
0111000                     0000100
1000101                     0000000
1000101                     0000000
1000111                     0010000
1000101                     0000001
1000101                     0000000
1000101                     0101000
1000101                     0010000
1000101                     0000000
1000101                     0000011
1000101                     0000000
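The claim that mutations are "more unique" under growth can be checked directly on Table 1 by counting, for each site, how many haplotypes carry the mutated state. The sketch below assumes that Table 1 is read as 15 rows, each holding one constant-size haplotype followed by one exponential-growth haplotype; the function names are illustrative.

```python
# Table 1, read as 15 (constant size, exponential growth) pairs.
CONSTANT = [
    "1000101", "1000101", "1000101", "0111000", "0111000",
    "1000101", "1000101", "1000111", "1000101", "1000101",
    "1000101", "1000101", "1000101", "1000101", "1000101",
]
GROWTH = [
    "0000000", "1000000", "0010000", "0010000", "0000100",
    "0000000", "0000000", "0010000", "0000001", "0000000",
    "0101000", "0010000", "0000000", "0000011", "0000000",
]


def site_counts(haplotypes):
    """Number of haplotypes carrying a 1 at each site."""
    n_sites = len(haplotypes[0])
    return [sum(h[j] == "1" for h in haplotypes) for j in range(n_sites)]


def singleton_sites(haplotypes):
    """Number of sites whose mutation appears in exactly one haplotype."""
    return sum(1 for count in site_counts(haplotypes) if count == 1)


if __name__ == "__main__":
    print(site_counts(CONSTANT), singleton_sites(CONSTANT))
    print(site_counts(GROWTH), singleton_sites(GROWTH))
```

Under this reading, only one of the seven sites is a singleton in the constant-size sample, while five of the seven are singletons in the exponential-growth sample.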

3.4 Coalescent with Recombination

Coalescent theory with recombination is identical to the theory without recombination, with the addition of one more scaled parameter, ρ, which governs how often lineages recombine. As a result, going back one generation from a generation with k individuals, one of four things happens: an individual arises from a recombination event between two individuals (and thus has 2 parents, each contributing a part of its genome); two individuals coalesce; neither occurs (each individual has a distinct parent); or multiple events occur (a case that is generally ignored due to its low probability). In summary, the evolutionary history can no longer be considered a tree, but is instead an Ancestral Recombination Graph (ARG). The ARG is represented pictorially in Figure 6.

Figure 6: Pictorial representation of the Ancestral Recombination Graph

3.4.1 Generating sequences of the ARG

The first step in generating sequences is to create the ARG. In order to simulate the ARG, the branch lengths and topology can be calculated as follows. Assume there are k individuals in a generation, r is the recombination rate, the population size is 2N, and the mutation rate is µ.

\begin{align}
\Pr[\text{a recombination event occurs}] &= kr \tag{16} \\
\Pr[\text{a coalescence event occurs}] &= \frac{\binom{k}{2}}{2N} \tag{17} \\
\Pr[\text{no individual recombines and no pair coalesces}] &= e^{-\left(kr2N + \binom{k}{2}\right)} \tag{18} \\
\Pr[\text{recombination} \mid \text{coalescence or recombination}] &= \frac{kr2N}{kr2N + \binom{k}{2}} \tag{19} \\
&= \frac{k\rho}{k\rho + 2\binom{k}{2}} \tag{20} \\
&= \frac{\rho}{\rho + (k - 1)} \tag{21}
\end{align}

Equations 16 and 17 are per-generation probabilities; Equation 18, as written, corresponds to one unit of scaled time (2N generations). When time is measured in scaled units of 2N generations, the number of individuals increases at a rate of kr2N and decreases at a rate of \binom{k}{2}. The scaled recombination rate ρ is thus defined as 4Nr, so that kr2N = kρ/2, which gives Equations 20 and 21.

The following iteration is carried out starting with k equal to n, until k reaches a value of 1. A recombination event is picked with the probability given by Equation 21. If the event is a recombination, an individual and a position at which to recombine are chosen randomly; otherwise a pair to coalesce is chosen.

Once the ARG is simulated, the constituent coalescents are generated and the scaled mutation rate θ (which is equal to 4Nµ) is revised by the following procedure. Every position less than the randomly chosen locus of recombination comes from the left parent, and every position greater than the locus of recombination comes from the right parent. The fraction of the positions that go along a particular path is equal to the fraction of the mutation rate (compared to θ) of that path. The sequence is generated by splitting the sequence into several trees based on the locations of recombinations, dropping mutations, and concatenating the resulting subsequences. The process described above is summarized in Figure 7.
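The event loop above can be sketched in a few lines. This is a deliberately reduced illustration: it tracks only the number of lineages k and the event times, and it omits which individual recombines, the breakpoint position, and the pair that coalesces. The function names and the `max_events` safety cap are assumptions of the sketch; waiting times use the total scaled rate kρ/2 + \binom{k}{2}.

```python
import random


def prob_recombination(k, rho):
    """Equation 21: chance that the next event is a recombination."""
    return rho / (rho + (k - 1))


def simulate_arg_events(n, rho, rng, max_events=100000):
    """Simulate the sequence of ARG events, tracking only the lineage count.

    Returns a list of (time, event, lineages_after) tuples, with time in
    units of 2N generations. `max_events` is a safety cap for large rho.
    """
    k, now, events = n, 0.0, []
    while k > 1 and len(events) < max_events:
        # Total scaled rate: recombination k*rho/2 plus coalescence C(k, 2).
        now += rng.expovariate(k * rho / 2 + k * (k - 1) / 2)
        if rng.random() < prob_recombination(k, rho):
            k += 1  # one lineage splits into two parents
            events.append((now, "recombination", k))
        else:
            k -= 1  # two lineages merge into one
            events.append((now, "coalescence", k))
    return events


if __name__ == "__main__":
    for event in simulate_arg_events(5, 1.0, random.Random(11)):
        print(event)
```

When the loop ends with k = 1, the number of coalescences exceeds the number of recombinations by exactly n - 1, since each recombination adds a lineage that must later be removed.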

Figure 7: Generating the sequences of the ARG

4 Perfect Phylogeny

The perfect phylogeny algorithm can be used to reconstruct a genealogical tree from a population that has undergone a coalescent process. It works as follows. The infinite sites assumption dictates that only two states are possible for any locus (the mutated state and the ancestral state), so these can be represented computationally as 1 and 0 (assigned arbitrarily, if the ancestral state is unknown). The sites are sorted in decreasing order of the number of mutated states present in the population. A tree is created with the individuals as leaves of a root containing the ancestral state of all sites. Following this, each locus is used to split the individuals into those containing the mutated state and those containing the ancestral state. More loci with different patterns create different dichotomies and thus define the relative locations of mutations on the tree more accurately.

Another way to look at this is through pairs of columns of the sorted genotype matrix: a perfect phylogeny exists if and only if, for every pair of columns i and j, the sets of rows with the value one are disjoint or one is a subset of the other. Implementing this algorithm by comparing the rows of every pair of columns yields a complexity of O(nm^2) for n individuals and m sites [2]. Given the genealogical tree, the branch lengths can be estimated using the expected time to coalesce, which is shown in Equation 8. The perfect phylogeny algorithm is represented pictorially in Figure 8.
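The pairwise condition translates almost directly into code. The following is a minimal sketch of the O(nm^2) test, assuming haplotypes are given as equal-length strings of 0s and 1s with 0 as the ancestral state; the function name is illustrative.

```python
from itertools import combinations


def has_perfect_phylogeny(haplotypes):
    """Pairwise test: every two sites must have disjoint or nested carrier sets.

    `haplotypes` is a non-empty list of equal-length 0/1 strings, one per
    individual, with 0 as the ancestral state. Runs in O(n * m^2) time for
    n individuals and m sites.
    """
    n_sites = len(haplotypes[0])
    # For each site, the set of individuals (rows) carrying the mutated state.
    carriers = [
        {i for i, h in enumerate(haplotypes) if h[j] == "1"}
        for j in range(n_sites)
    ]
    for a, b in combinations(carriers, 2):
        if not (a.isdisjoint(b) or a <= b or b <= a):
            return False
    return True


if __name__ == "__main__":
    print(has_perfect_phylogeny(["110", "100", "001"]))  # nested and disjoint sets
    print(has_perfect_phylogeny(["11", "10", "01"]))     # overlapping, not nested
```

In the second example, the two sites are carried by rows {0, 1} and {0, 2}, which overlap without either containing the other, so no perfect phylogeny exists.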

Figure 8: The steps involved in constructing the genealogy using the perfect phylogeny algorithm

Given a population and a pair of sites in the population, the linkage disequilibrium (LD) can be calculated as follows.

\[
D = P_{00} - P_{0\cdot} P_{\cdot 0} \tag{22}
\]

Here P_{00} is the frequency of haplotypes with state 0 at both sites, and P_{0\cdot} and P_{\cdot 0} are the frequencies of state 0 at the first and second site, respectively. The assumptions of the perfect phylogeny algorithm include a lack of recombination and of recurrent mutations, so any pair of sites in the population will be linked. As a result, this causes a state of perfect disequilibrium (represented by a high D value).

4.1 Applications of Perfect Phylogeny

The perfect phylogeny algorithm generates a genealogical tree from members of a population. This can be useful in several ways. For instance, in the Genographic Project, loci from human mitochondrial DNA and Y-chromosome DNA (neither of which recombines) from several different ethnic groups are used to create a genealogical tree. This tree can be used to place the coalescence events of the groups in time and thus suggest a migration pattern for the human species (https://www5.nationalgeographic.com/genographic/).

4.2 Linear time algorithm for perfect phylogeny

The perfect phylogeny algorithm described above has complexity O(nm^2). In 1991, Dan Gusfield proposed a linear time algorithm for perfect phylogeny based on a graph theoretic approach [1]. Assuming an input of an n × m genotype array M of a population, this algorithm is summarized below.

1. Treat the columns of M as binary numbers (with the most significant bit in row 1) and sort them in descending order. Delete any duplicate columns.
2. Store all cells with value 1 in a separate structure (called O in the paper). Create an associated matrix L. For every cell (i, j) in O, let k be the largest value less than j such that (i, k) is in O, and set L(i, j) = k (if j is the first column with a 1 in row i, let L(i, j) = 0).
3. A perfect phylogeny exists if and only if, for every column j, every cell (i, j) in O has the same value stored in L(i, j).

4.3 Special cases of the perfect phylogeny problem

In the unrooted case, if a perfect phylogeny exists, a root can be determined. Under a perfect phylogeny, the actual tree does not change when the 1s and 0s of a column are interchanged. Therefore, if the values in each column are switched such that 0 is the majority element, the same tree as in the rooted case can be obtained.

In the event of missing data, the perfect phylogeny problem becomes intractable. As discussed by Kimmel and Shamir [5], the perfect phylogeny problem with missing data turns out to be NP-hard. A special, polynomial case of this, described by Halperin and Karp [3], arises if the input satisfies the rich data hypothesis (that is, if enough information exists to infer the missing haplotypes). In addition, if the root is known, the perfect phylogeny algorithm can be modified to account for the missing data.

In the event of a recurrent mutation occurring, the infinite sites assumption becomes invalid. This would probably make the task of reconstructing the genealogy nonpolynomial, and thus impractical to solve exactly for any realistic sample.

References

[1] D. Gusfield. Efficient Algorithms for Inferring Evolutionary Trees. Networks, 21:19-28, 1991.
[2] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge, UK, 1997.
[3] E. Halperin and R.M. Karp. Perfect phylogeny and haplotype assignment. In Proceedings of the 8th RECOMB. ACM Press.
[4] D.L. Hartl and A.G. Clark. Principles of Population Genetics. Sinauer Associates, Inc., Sunderland, MA, USA, 2007.
[5] G. Kimmel and R. Shamir. The incomplete perfect phylogeny haplotype problem. J Bioinform Comput Biol, 3:359-384, Apr 2005.
[6] M. Nordborg. Handbook of Statistical Genetics, chapter Coalescent Theory. John Wiley & Sons, Ltd, 2001.