Algorithms for Genetics: Basics of the Wright-Fisher Model and Coalescent Theory

Vineet Bafna
Harish Nagarajan and Nitin Udpa

Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
Bioinformatics Graduate Program, University of California San Diego, La Jolla, CA, USA

1 Disclaimer

Please note that much of the text and many of the figures here are copied from Nordborg's excellent tutorial on the subject [6]. This writeup is to be used only for the class. Please do NOT use it for any purpose other than class preparation.

2 Wright-Fisher Model of Evolution

When studying population genetic data, how can we decide if we are seeing something unexpected? One way is to simulate populations evolving under neutral conditions and then look for statistics of the observed data that depart from the neutral expectation. Wright and Fisher introduced a simple model of populations focusing on the genealogical relationships. It is based on a number of simplifying assumptions.

2.1 Insights of the model

The Wright-Fisher (WF) model of evolution simulates genetic drift: each individual in a generation contributes an equal, effectively infinite number of gametes to a pool, and the subsequent generation is a random draw of 2N gametes from this pool. In other words, the allele frequencies in the next generation are determined by 2N draws (with replacement) from the allele frequencies of the current generation. Mutations are randomly dropped into the gamete pool at a rate of µ mutations per individual per generation. The transition probability, that is, the probability that the population drifts from a state with i copies of a particular allele to a state with j copies, is binomial [4]:

$$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(\frac{2N-i}{2N}\right)^{2N-j} \tag{1}$$

2.2 Assumptions of the Wright-Fisher Model

The WF model of evolution rests mainly on the following six simplifying assumptions.

1. Generations are discrete and non-overlapping, which keeps the gamete pool of each generation separate. Although real generations overlap to some extent, this assumption reduces computational complexity without a significant loss in accuracy.

2. The population size is constant (2N haplotypes) across generations. This assumption simplifies the computation. Although it is not accurate in certain regions of the world (those with an exponentially growing population), the simplification is generally appropriate.

3. All individuals have equal fitness. This is accurate if any mutations that occur are selectively neutral.

4. The population has no geographical or social structure and mates at random. This is not really accurate for large populations, but it is needed to keep the simulation tractable.

5. There is no recombination. This is not necessarily accurate, but recombination rates are generally quite low, so it is probably a valid assumption in most cases. It makes the lineage of each member of the population easy to trace, as each child is a direct copy of a single parent (except for mutations).

6. The infinite sites assumption holds: each site can mutate at most once. This yields a simplified tree, because once a site has mutated, every descendant of the individual in question carries the mutation and every individual not descended from it carries the ancestral form. This is a reasonable assumption, as most sites rarely mutate multiple times, particularly over a small number of generations.

2.3 Simulating populations under the WF model

To generate a small sample of n individuals from a population of 2N haplotypes, where $n \ll N$, the following strategy is used. First, the genealogy is simulated for T generations. Mutations are then dropped at a fixed rate µ, each at a new site, after which the haplotypes are generated and n individuals are sampled. The time complexity of this strategy is O(NT). An example of the WF simulation is shown in Figure 1.

Figure 1: In this model, the genealogy is simulated for T generations of 2N individuals each. The green lines on the right panel indicate mutations being dropped at a rate µ.

3 Coalescent Theory

Coalescent theory is an efficient way of simulating populations under some assumptions. Indeed, a sample of $n \ll N$ individuals can be simulated using only O(n) random values corresponding to coalescent times, as against O(NT) in the case of the WF model.

3.1 Insights of Coalescent Theory

One of the key insights of coalescent theory is that the genealogy can be separated from the allelic states. Under neutrality, the mutations or allelic states that occur in a population have no effect on the fitness of the individuals carrying them. As a result, the genealogy can be created randomly without any input from the genotype. In Nordborg's words, the state can be separated from the descent. As shown in Figure 1, after creating the genealogical tree, the allelic states of any group of individuals can be generated by assigning an allelic state to their most recent common ancestor (MRCA) and then dropping mutations along the branches of the genealogical tree that lead to them.

Coalescent theory also rests on the fact that much of the genealogy is irrelevant. The genealogy of a group of individuals can be modeled backwards in time without worrying about other members of the previous generations that did not contribute offspring to the sample. This is depicted in Figure 2. Further, as a general consequence of the assumption of selective neutrality, each individual in a generation can be viewed as picking its parent at random from the previous generation.

Figure 2: As one goes backwards in time, fewer individuals contribute to the current gamete pool as coalescent events occur.

The topology of the genealogy can be generated by randomly picking lineages to coalesce: at each step, every pair of lineages is equally likely to be the one that merges. The branch lengths, on the other hand, are defined by the coalescent times, which are independent exponential random variables. From this stems the third insight of coalescent theory: the topology of the genealogy is independent of the coalescent times.

Finally, there is no need to go back generation by generation to get coalescent times. Given some k individuals ($1 \le k \le n$) from a WF population of size 2N,

$$\Pr[\text{no coalescence one generation back}] = \prod_{i=0}^{k-1} \frac{2N-i}{2N} \tag{2}$$

$$= \prod_{i=0}^{k-1} \left(1 - \frac{i}{2N}\right) \tag{3}$$

$$= 1 - \frac{\binom{k}{2}}{2N} + O\!\left(\frac{1}{N^{2}}\right) \tag{4}$$

$$\Pr[\text{no coalescence } t \text{ generations back}] \approx e^{-\binom{k}{2}\frac{t}{2N}} \tag{5}$$

$$= e^{-\binom{k}{2}\tau} \tag{6}$$

where

$$\tau = \frac{t}{2N} \tag{7}$$

is time measured in units of 2N generations. This is convenient, as it takes an unknown quantity (N) out of the equation. It follows that

$$E(\text{coalescence time for } k \text{ individuals}) = \frac{1}{\binom{k}{2}} \tag{8}$$

In other words, the time for k individuals to undergo their next coalescence is an exponentially distributed random variable with parameter $\binom{k}{2}$. It is easy to see now that coalescent theory (CT) offers an implicit, efficient algorithm for generating a random genealogy of n individuals.

1. Generate a random binary topology for n individuals.
2. For k = n, ..., 2, generate random times $t_k$, exponentially distributed with parameter $\binom{k}{2}$.
3. For k = n, ..., 2, set the coalescent time (the time during which there are k lineages) to $t_k$.

Figure 3 illustrates this for a particular case of 6 individuals.

Figure 3: Genealogical trees for 6 individuals are randomly generated using an exponential random variable with parameter $\binom{k}{2}$, where k is the number of distinct individuals in a generation.

3.2 Coalescent Properties

We can use CT to compute some parameters. Let $T_{MRCA}$ denote the time to reach the most recent common ancestor. The expected time to reach the MRCA is the sum of the expected coalescence times at each stage of coalescence (i.e. from n individuals down to 2 individuals).

$$E(T_{MRCA}) = \sum_{i=2}^{n} \frac{1}{\binom{i}{2}} \tag{9}$$

$$= \sum_{i=2}^{n} \frac{2}{i\,(i-1)} \tag{10}$$

$$= 2 \sum_{i=2}^{n} \left(\frac{1}{i-1} - \frac{1}{i}\right) \tag{11}$$

$$= 2\left(1 - \frac{1}{n}\right) \tag{12}$$

All times are in units of 2N generations. Note also that the expected time of the last step (2 individuals coalescing into 1) is 1, which is more than half of the expected time to the MRCA. About half of the depth of the tree therefore lies on the two oldest branches, and mutations that arise there are ancient and shared by many individuals in the sample.

Let $T_{tot}$ denote the sum of all branch lengths.

$$E(T_{tot}) = \sum_{i=2}^{n} \frac{i}{\binom{i}{2}} \tag{13}$$

$$= \sum_{i=2}^{n} \frac{2}{i-1} \tag{14}$$

$$\approx 2\,(\gamma + \log n) \tag{15}$$

Here, γ is Euler's constant. Equation 15 implies that the sum of all branch lengths grows only logarithmically with n, which limits the benefit of sampling more individuals.

3.3 Simulating Populations

3.3.1 Simulating Populations of Constant Size

Once we have a genealogy, we can use it to simulate a population. Each allele travels from the root to a leaf (individual), possibly mutating on the way. Suppose that, for the genomic region we are simulating, mutations occur at a fixed rate µ per generation. On any branch of length τ (in units of 2N generations), select the number of mutations by sampling from a Poisson distribution with mean 2Nτµ. In practice, we usually make the infinite sites assumption that each mutation hits a new site. Therefore, each mutation is labeled with the site label. This set of mutations allows us to generate a population of variant sites. An example of this strategy is shown in Figure 4.
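The three-step algorithm of Section 3.1 and the expectations in Equations 12 and 14 can be checked with a short simulation. The following is a minimal sketch, not the authors' implementation: all function names are hypothetical, `two_n_mu` stands for the product 2Nµ, and the small Poisson sampler is included only because Python's standard `random` module does not provide one.

```python
import random


def simulate_coalescent(n, rng):
    """Simulate a random genealogy for n sampled lineages.

    Returns (times, merges). times[k] is the scaled time (units of 2N
    generations) during which exactly k lineages exist; merges lists
    (child_a, child_b, parent) in the order in which pairs coalesce.
    """
    times, merges = {}, []
    lineages = list(range(n))           # leaves are labelled 0..n-1
    next_label = n                      # internal nodes get fresh labels
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2          # "k choose 2"
        times[k] = rng.expovariate(rate)
        a, b = rng.sample(lineages, 2)  # every pair is equally likely
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_label)
        merges.append((a, b, next_label))
        next_label += 1
    return times, merges


def tmrca(times):
    """Height of the tree: the sum of all coalescent intervals."""
    return sum(times.values())


def total_branch_length(times):
    """Sum of all branch lengths: k lineages persist for times[k]."""
    return sum(k * t for k, t in times.items())


def expected_tmrca(n):
    """Equation 12."""
    return 2 * (1 - 1 / n)


def expected_total_length(n):
    """Equation 14."""
    return sum(2 / (i - 1) for i in range(2, n + 1))


def poisson(mean, rng):
    """Poisson sample: count unit-rate exponential arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def count_mutations(times, two_n_mu, rng):
    """Total mutations on the tree: Poisson with mean 2N*mu*T_tot."""
    return poisson(two_n_mu * total_branch_length(times), rng)


if __name__ == "__main__":
    rng = random.Random(42)
    times, merges = simulate_coalescent(6, rng)
    print("TMRCA:", tmrca(times), "expected:", expected_tmrca(6))
    print("mutations:", count_mutations(times, 2.5, rng))
```

Averaging `tmrca` and `total_branch_length` over many replicates should approach the values returned by `expected_tmrca` and `expected_total_length`. Drawing the total number of mutations in one step is equivalent to drawing an independent Poisson count per branch, since independent Poisson counts add.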

Figure 4: Simulating populations by generating a coalescent topology and branch lengths, and dropping mutations with rate µt on every branch of length t. The numbers on each branch indicate the sites that mutated on that particular branch.

3.3.2 Simulating Populations under exponential growth

Equation 8 shows that the expected time to coalesce is directly proportional to the population size N (as the time is measured in units of 2N generations). Under exponential population growth, N increases (and thus the time to coalesce increases) as one goes forward in time. As a result, the branches closer to the leaves are longer and those closer to the root are shorter than the corresponding branches of a tree created under an unchanging population size. As the expected number of mutations on a branch is proportional to its length, one would expect more of the mutations to be unique to single individuals under exponential growth (see Figure 5 for an illustration). One of the most common tools for simulating coalescent populations, Hudson's ms program, was used to generate Table 1, which illustrates this further.
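The effect of growth on the shape of the tree can be illustrated by rescaling the constant-size coalescent times. The sketch below rests on assumptions that are not in the text above: the population size going back in time is taken to be N(t) = N_0 e^{-βt}, with t in units of 2N_0 generations and β a hypothetical growth parameter, and the waiting time is obtained by the usual time-rescaling argument (solve for the s at which the accumulated coalescence rate equals the constant-size draw u). Function names are illustrative.

```python
import math
import random


def coalescent_times_growth(n, beta, rng):
    """Coalescent intervals under exponential growth.

    Returns a dict times[k]: the time (units of 2*N_0 generations) spent
    with k lineages when N(t) = N_0 * exp(-beta * t) going backwards.
    beta = 0 reproduces the constant-size model of Equation 8.
    """
    t = 0.0                                   # time elapsed so far
    times = {}
    for k in range(n, 1, -1):
        # Waiting time the constant-size model would produce.
        u = rng.expovariate(k * (k - 1) / 2)
        if beta == 0:
            s = u
        else:
            # Solve (exp(beta*(t+s)) - exp(beta*t)) / beta = u for s.
            s = math.log1p(beta * u * math.exp(-beta * t)) / beta
        times[k] = s
        t += s
    return times


def root_share(times):
    """Fraction of the tree height spent with only 2 lineages left."""
    return times[2] / sum(times.values())


if __name__ == "__main__":
    const = coalescent_times_growth(10, 0.0, random.Random(7))
    grow = coalescent_times_growth(10, 100.0, random.Random(7))
    print("constant size: root share =", root_share(const))
    print("growth:        root share =", root_share(grow))
```

Using the same seed for both calls feeds both models the same underlying draws, so the comparison isolates the effect of growth: the deep intervals shrink the most, which is the star-like shape sketched in Figure 5.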

Figure 5: Typical genealogical trees under constant size and exponential growth conditions.

Table 1: Sample populations generated using Hudson's ms program.

Constant Population Size    Exponential Growth
1000101                     0000000
1000101                     1000000
1000101                     0010000
0111000                     0010000
0111000                     0000100
1000101                     0000000
1000101                     0000000
1000111                     0010000
1000101                     0000001
1000101                     0000000
1000101                     0101000
1000101                     0010000
1000101                     0000000
1000101                     0000011
1000101                     0000000

3.4 Coalescent with Recombination

The coalescent theory with recombination is identical to the theory without recombination, with the addition of one more scaled parameter, ρ, which captures the rate of recombination. As a result, in a generation with k individuals, one of four things can happen: an individual arises from a recombination event between two individuals (and thus has 2 parents, each contributing a part of its genome); two individuals coalesce; neither occurs (each individual has a distinct parent); or multiple events occur at once (which is generally ignored due to its low probability). Consequently, the evolutionary history can no longer be considered a tree; it is an Ancestral Recombination Graph (ARG). The ARG is represented pictorially in Figure 6.

Figure 6: Pictorial representation of the Ancestral Recombination Graph.

3.4.1 Generating sequences of the ARG

The first step in generating sequences is to create the ARG. To simulate the ARG, the branch lengths and topology can be calculated as follows. Assume there are k individuals in a generation, r is the recombination rate (per individual per generation), the population size is 2N, and the mutation rate is µ. Per generation,

$$\Pr[\text{a recombination event occurs}] = kr \tag{16}$$

$$\Pr[\text{a coalescence event occurs}] = \frac{\binom{k}{2}}{2N} \tag{17}$$

and, over 2N generations (one unit of scaled time),

$$\Pr[\text{no individual recombines and no pair coalesces}] \approx e^{-\left(2Nkr + \binom{k}{2}\right)} \tag{18}$$

$$\Pr[\text{recombination} \mid \text{coalescence or recombination}] = \frac{2Nkr}{2Nkr + \binom{k}{2}} \tag{19}$$

$$= \frac{k\rho}{k\rho + 2\binom{k}{2}} \tag{20}$$

$$= \frac{\rho}{\rho + (k-1)} \tag{21}$$

When time is considered in scaled units of 2N generations, the number of individuals increases at a rate of 2Nkr and decreases at a rate of $\binom{k}{2}$. The scaled recombination rate ρ is thus defined as 4Nr, so that 2Nkr = kρ/2.

The following iteration is carried out, starting with k equal to n and continuing until k reaches a value of 1. A recombination event is picked with the probability given by Equation 21. If the event is a recombination, an individual and a position at which to recombine are chosen randomly; otherwise a pair to coalesce is chosen.

Once the ARG is simulated, the constituent coalescent trees are generated and the scaled mutation rate θ (which is equal to 4Nµ) is apportioned among them by the following procedure. Every position less than the randomly chosen locus of recombination comes from the left parent, and every position greater than the locus of recombination comes from the right parent. The fraction of the positions that go along a particular path is equal to the fraction of the mutation rate (compared to θ) of that path. The sequence is generated by splitting the sequences into several trees based on the locations of the recombinations, dropping mutations, and concatenating the resulting subsequences. The process described above is summarized in Figure 7.
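The iteration over k can be sketched as a simple event loop. This is a simplified illustration under stated assumptions: it tracks only the number of lineages and the event times, not which individual recombines or at which position, and the function names and the `max_events` safety cap are hypothetical.

```python
import random


def prob_recombination(k, rho):
    """Equation 21: chance that the next event is a recombination."""
    return rho / (rho + (k - 1))


def simulate_arg_events(n, rho, rng, max_events=100000):
    """Simulate the sequence of ARG events for a sample of size n.

    Returns a list of (time, kind, k_after) tuples, with time in units
    of 2N generations. Stops when a single lineage remains, or after
    max_events events as a safety cap.
    """
    k, t, events = n, 0.0, []
    while k > 1 and len(events) < max_events:
        coal_rate = k * (k - 1) / 2     # lineages decrease at this rate
        rec_rate = k * rho / 2          # lineages increase at this rate
        t += rng.expovariate(coal_rate + rec_rate)
        if rng.random() < prob_recombination(k, rho):
            k += 1                      # one lineage splits into two parents
            kind = "recombination"
        else:
            k -= 1                      # two lineages merge
            kind = "coalescence"
        events.append((t, kind, k))
    return events


if __name__ == "__main__":
    for time, kind, k in simulate_arg_events(5, 1.0, random.Random(3)):
        print(f"{time:.4f}  {kind:13s}  lineages={k}")
```

With ρ = 0 the loop reduces to the ordinary coalescent and produces exactly n - 1 coalescence events. A full ARG simulator would also record the parent lineages and the breakpoint of each recombination, which this sketch omits.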

Figure 7: Generating the sequences of the ARG.

4 Perfect Phylogeny

The perfect phylogeny algorithm can be used to reconstruct a genealogical tree from a sample of a population that evolved under a coalescent process. It works as follows. The infinite sites assumption dictates that only two states are possible for any locus (the ancestral state and the mutated state), so these can be represented computationally as 0 and 1 (assigned arbitrarily if the ancestral state is unknown). The sites are sorted in decreasing order of the number of individuals carrying the mutated state. A tree is created with the individuals as leaves of a root containing the ancestral state of all sites. Following this, each locus is used to split the individuals into those carrying the mutated state and those carrying the ancestral state. Additional loci with different patterns create different dichotomies and thus define the relative locations of mutations on the tree more accurately.

Another way to look at this is that, for every pair of columns i and j in the sorted genotype matrix, a perfect phylogeny exists if and only if the sets of rows with the value one are disjoint or one is a subset of the other. Implementing this algorithm by comparing the rows of every pair of columns, for n individuals (rows) and m sites (columns), yields a complexity of O(nm^2) [2].

Given the genealogical tree, the branch lengths can be estimated using the expected time to coalesce, which is shown in Equation 8. The perfect phylogeny algorithm is pictorially represented in Figure 8.
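The pairwise column condition translates directly into code. The sketch below implements only the O(nm^2) existence test described above, assuming 0 marks the ancestral state in every column; it does not build the tree, and the function name is illustrative.

```python
def has_perfect_phylogeny(matrix):
    """Pairwise test for a rooted perfect phylogeny.

    matrix is a list of rows (individuals), each a list of 0/1 values
    (sites), where 0 is assumed to be the ancestral state. Returns True
    if, for every pair of columns, the sets of rows holding a 1 are
    disjoint or nested.
    """
    if not matrix:
        return True
    n_sites = len(matrix[0])
    # For each column, the set of rows that carry the mutated state.
    ones = [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(n_sites)
    ]
    for a in range(n_sites):
        for b in range(a + 1, n_sites):
            first, second = ones[a], ones[b]
            nested = first <= second or second <= first
            if not (first.isdisjoint(second) or nested):
                return False
    return True


if __name__ == "__main__":
    # The three distinct haplotypes in the constant-size column of Table 1.
    haplotypes = ["1000101", "0111000", "1000111"]
    sample = [[int(c) for c in h] for h in haplotypes]
    print(has_perfect_phylogeny(sample))
```

Two columns fail the test exactly when some row has a 1 in both, some row has a 1 only in the first, and some row has a 1 only in the second, which cannot happen if each site mutated once on a single tree.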

Figure 8: The steps involved in constructing the genealogy using the perfect phylogeny algorithm.

Given a population and a pair of sites in the population, the linkage disequilibrium (LD) can be calculated as follows:

$$D = P_{00} - P_{0\cdot}\,P_{\cdot 0} \tag{22}$$

where $P_{00}$ is the frequency of haplotypes carrying state 0 at both sites, and $P_{0\cdot}$ and $P_{\cdot 0}$ are the frequencies of state 0 at the first and second site, respectively. The perfect phylogeny algorithm assumes that there is neither recombination nor recurrent mutation, so any pair of sites in the population is completely linked: at most three of the four possible two-site haplotypes occur. As a result, the pair is in a state of perfect disequilibrium, and D is as extreme as the allele frequencies allow.

4.1 Applications of Perfect Phylogeny

The perfect phylogeny algorithm generates a genealogical tree from members of a population. This can be useful in several ways. For instance, in the Genographic Project, loci from human mitochondrial DNA and Y-chromosome DNA (neither of which recombines) from several different ethnic groups are used to create a genealogical tree. This tree can be used to place the coalescence events of the groups in time and thus suggest a migration pattern for the human species (https://www5.nationalgeographic.com/genographic/).

4.2 Linear time algorithm for perfect phylogeny

The perfect phylogeny algorithm described above has complexity O(nm^2). In 1991, Dan Gusfield proposed a linear time algorithm for perfect phylogeny based on a graph theoretic approach [1]. Assuming an input of an n × m genotype array M of a population, this algorithm is summarized below.

1. Treat each column of M as a binary number (with the most significant bit in row 1) and sort the columns in descending order. Delete any duplicate columns.
2. Store all cells with value 1 in a separate structure (called O in the paper). Create an associated matrix L. For every cell (i, j) in O, let k be the largest value less than j such that (i, k) is in O, and set L(i, j) = k (if j is the first column with a 1 in row i, let L(i, j) = 0).
3. A perfect phylogeny exists if and only if, for every column j, every cell (i, j) in O has the same value stored in L(i, j).

4.3 Special cases of the perfect phylogeny problem

In the unrooted case, if a perfect phylogeny exists, a root can be determined. Under a perfect phylogeny, the actual tree does not change when the 1s and 0s in a column are interchanged. So if the values in each column are switched such that 0 is the majority element, the same tree as in the rooted case can be obtained.

In the event of missing data, the perfect phylogeny problem becomes intractable. As discussed by Kimmel and Shamir [5], the perfect phylogeny problem with missing data turns out to be NP-hard. A special, polynomial case, described by Halperin and Karp [3], arises if the input satisfies the rich data hypothesis (that is, if enough information exists to infer the missing haplotypes). In addition, if the root is known, the perfect phylogeny algorithm can be modified to account for the missing data.

If a recurrent mutation occurs, the infinite sites assumption becomes invalid. This would probably make the task of reconstructing the genealogy non-polynomial, and thus impractical to solve exactly for any realistic sample.

References

[1] D. Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21:19-28, 1991.
[2] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge, UK, 1997.
[3] E. Halperin and R.M. Karp. Perfect phylogeny and haplotype assignment. In Proceedings of the 8th RECOMB. ACM Press.
[4] D.L. Hartl and A.G. Clark. Principles of Population Genetics. Sinauer Associates, Inc., Sunderland, MA, USA, 2007.
[5] G. Kimmel and R. Shamir. The incomplete perfect phylogeny haplotype problem. J Bioinform Comput Biol, 3:359-384, Apr 2005.
[6] M. Nordborg. Handbook of Statistical Genetics, chapter Coalescent Theory. John Wiley & Sons, Ltd, 2001.