A Likelihood Method to Estimate/Detect Gene Flow and A Distance Method to Estimate Species Trees in the Presence of Gene Flow


Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Lingfei Cui, M.S., B.S.
Graduate Program in Evolution, Ecology and Organismal Biology
The Ohio State University
2014

Committee: Dr. Laura Kubatko, Advisor; Dr. Bryan Carstens; Dr. John Freudenstein

Copyright by Lingfei Cui 2014

Abstract

In this thesis, I develop two methods for extracting population genetic and phylogenetic information from multi-locus data. The first comprises a maximum likelihood method to jointly estimate speciation times, population sizes and migration rates, and a likelihood ratio test to detect gene flow. It uses an IM (Isolation-with-Migration) model and takes a summary statistics approach: sequence data are first summarized into gene trees, which are then fed to the method. Simulations show that this method is very powerful when all the gene trees are known. However, improvement is needed in order to deal with estimated gene tree data.

The second method is STEST (Species Tree Estimation from Speciation Times), which uses estimated speciation times as distances to build species trees. The original STEST is designed to estimate species trees from data subject to gene flow. Zhu and Yang's (2012) method SIM3s is therefore chosen to build the distance matrix, because it estimates speciation times in the presence of gene flow. However, SIM3s can be replaced by any speciation time estimation method. For example, STEST (M0) uses Yang's (2002) method M0 to estimate speciation times and STEST (DB) uses Yang's (2010) method DB. Simulations show that in many cases STEST-based methods perform better than *BEAST and STEM.

Dedication

This document is dedicated to my family.

Acknowledgments

I would like to express my deepest gratitude to my advisor, Dr. Laura Kubatko. She has always been so nice and patient with me, ever since the first day I came here. Whenever I came across a problem she was always there to help me out. Whenever I made any progress, she was always generous with her compliments. She was always supportive of me, even after I decided to quit the Ph.D. program. My biggest regret here is that I didn't become her first Ph.D. graduate in the EEOB Department. Thank you for your understanding and help! I hope one day I can become a person and a mentor as good as you are.

I also want to thank the members of my committee, Dr. Bryan Carstens and Dr. John Freudenstein. You have always been so nice to me and gave me so many good suggestions. I feel grateful and honored to have you on my committee.

My appreciation also goes to Dr. Michael Davis and Dr. Ronald Solomon. Thank you for your guidance and help in my career.

I want to thank all my dear friends. Thank you for bringing me so much happiness so that I don't feel lonely in a foreign country.

Finally, I want to thank my parents. Thank you for understanding and supporting me in traveling this far to finish my education.

Vita

Guangbei High
B.S. Biotechnology, Shandong University
M.S. Mathematics, The Ohio State University
2010 to present: University Fellow/Graduate Associate, Department of EEOB, The Ohio State University

Fields of Study

Major Field: Evolution, Ecology and Organismal Biology

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Fields of Study
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
  1.0 Phylogeny
  1.1 Phylogenetic Inference Problem
  1.2 Gene Trees and Species Trees Paradox
  1.3 Phylogenetic Estimation in the Presence of Deep Coalescence
  1.5 Phylogenetic Estimation in the Presence of Gene Flow
  1.6 IM Model
  1.7 Software to Estimate Migration Rate
  1.8 Outline of This Thesis
Chapter 2: A Likelihood Method to Estimate/Test Gene Flow
  2.1 Methods
    IM model
    Probability Density Function and Likelihood Function
    Likelihood Ratio Test (LRT)
  2.2 Simulation Study and Result
    Simulation
    Results
    Maximum Likelihood Estimation
    Likelihood Ratio Test
  2.3 Discussion
  2.4 Future Work
Chapter 3: A Distance Method to Estimate Species Trees in the Presence of Gene Flow
  3.1 Methods
    Distance Matrix Building
    Tree Construction
    An Example
  3.2 Simulation Study
    3.2.1 Simulation Study 1: Four-taxon Tree Under the n-island Model
    Simulation Study 2: Nine-Taxon Trees Under the n-island Model
  3.3 Empirical Data Analysis
  3.4 Results
    Results of Simulation Study 1
    Results of Simulation Study 2
    Results of Empirical Data Analysis
  3.5 Discussion
    Small Tree Estimation
    Large Tree Estimation
    Empirical Data Analysis
    Summary
  3.6 Future Work
Chapter 4: Discussion and Future Directions
  4.1 Discussion
  4.2 Future Direction
    An Extension to Method 1
    A Word on This Extension
Appendix A. C Code to Compute the Probability Density of the Time to MRCA Under a Two-population IM Model
Appendix B. R Code to Implement the Sequential Clustering Algorithm for Species Tree Reconstruction
References

List of Tables

Table 1. Number of possible species trees given n species
Table 2. The comparison of different species tree estimation methods
Table 3. Instantaneous rate matrix Q
Table 4. Instantaneous rate matrix Q
Table 5. Settings for simulation study 1. Each entry in this table provides a label for a set of model parameters, e.g., A3 corresponds to the setting in which the differences $\tau_3 - \tau_2$ and $\tau_2 - \tau_1$ are both 0.5 and all migration rates are ...
Table 6. Settings for simulation study 2. Each entry in this table provides a label for a set of model parameters, e.g., C2 corresponds to the setting in which the speciation times are $\tau_1 = 2$, $\tau_2 = 3$, $\tau_3 = 4$, $\tau_4 = 5$, $\tau_5 = 6$ and $\tau_6 = 10$ and all of the migration rates are ...
Table 7. Simulation results (short speciation interval scenario)
Table 8. Simulation results (long speciation interval scenario)
Table 9. Results of simulation study
Table 10. Instantaneous rate matrix Q
Table 11. Instantaneous rate matrix Q
Table 12. Instantaneous rate matrix Q

List of Figures

Figure 1. Example phylogenies. (a) Illustration of a rooted tree, (b) illustration of an unrooted tree, (c) illustration of bifurcating pattern, (d) a species tree, (e) a gene tree, (f) illustration of an underlying species tree and its embedded gene tree

Figure 2. The number of possible rooted trees increases faster than exponentially as the number of species increases

Figure 3. Factors responsible for the incongruence of gene trees and the species tree. (a) Deep coalescence, (b) Gene flow, (c) A-Gene duplication, B-Gene loss

Figure 4. A two population IM model. (a) Model species tree and parameters; (b) Illustration of state change: A. S(1,1) to S(2,0) by migration, B. S(2,0) to S(1,0) by coalescence, C. S(1,0) to S(0,1) by migration

Figure 5. Zhu and Yang's SIM3s model. (a) Model species tree and parameters; (b) Coalescence of 1,2 happens first in population 1, $t_1 < \tau_1$; (c) Coalescence of 1,2 happens first in population 4, $\tau_1 < t_1 < \tau_0$; (d) Coalescence of 1,2 happens first in population 5, $\tau_0 < t_1 < t_0$; (d) Coalescence of 2,3 happens in population 5, $\tau_0 < t_1 < t_0$; (e) Coalescence of 1,3 happens in population 5, $\tau_0 < t_1 < t_0$. (f) Both coalescence events happen in population

Figure 6. Population parameter estimates vs. the length of the genes used to estimate gene trees when l=100, τ=2 and θ=4. (a) Speciation time estimates, (b) population size estimates, (c) migration rate estimates. Each error bar is constructed as mean ± standard deviation. Accuracy of the estimates increases when the length of the genes is increased. When the length of the genes approaches infinity, known trees can be recovered from sequence data. Thus we denote the performance of our method given known gene tree data as the performance when the length of the genes is infinite

Figure 7. Population parameter estimates for various numbers of genes with sequences of length 3,000 bp when τ=2 and θ=4. (a) Speciation time estimates, (b) population size estimates, (c) migration rate estimates. Each error bar is constructed as mean ± standard deviation. Accuracy of the estimates increases as the number of gene trees is increased.

Figure 8. Population parameter estimates vs. number of known gene trees when τ=2 and θ=4. (a) Speciation time estimates, (b) population size estimates, (c) migration rate estimates. Each error bar is constructed as mean ± standard deviation. Accuracy of the estimates increases when the number of the genes is increased

Figure 9. Power of the LRT for (a) length=, τ=2, θ=4 as a function of migration rate and (b) l=100, τ=2, θ=4 as a function of migration rate

Figure 10. Illustration of the tree building algorithm. See text for details

Figure 11. Model species tree and parameters for simulation study

Figure 12. Model species tree for simulation study

Figure 13. Results for simulation study 1 plot. Red: Results from STEM; Green: Results from *BEAST; Purple: Results from STEST; Gray: Results from STEST (DB); Pink: Results from STEST (M0)

Figure 14. An extended model for method 1. (a) Model species tree and the parameters. (b) $G_{1a}$: two coalescent events happen in population 1. (c) $G_{1b}$: the first coalescent event happens in population 1, the second in population 2. (d) $G_{1c}$: two coalescent events happen in population 2. (e) $G_{1d}$: the first coalescent event happens in population 2, the second in population 1. (f) $G_{2a}$: the first coalescent event happens in population 1, the second in population A. (g) $G_{2b}$: the first coalescent event happens in population 2, the second in population A. (h) $G_3$: two coalescent event happens in population 1, the second in population A

Chapter 1: Introduction

1.0 Phylogeny

A phylogeny, or phylogenetic tree, is a bifurcating tree with real-valued branch lengths showing the evolutionary relationships among different species, which are represented by the leaves of the tree (see Figure 1). A bifurcating tree with a root, a node depicting the common ancestor of all species in the tree, is called a rooted tree (Figure 1a). It is called an unrooted tree if the root node is not identified (Figure 1b).

Two types of trees are of interest in phylogenetic studies, namely gene trees and species trees. They represent the evolutionary histories of genes and of species, respectively.

For a species tree, each leaf represents a currently existing species. Each internal node is a common ancestor of its descendant species (looking from the root to the leaves, i.e., from past to present). The root is the ancestor of all of the current species on the tree. Bifurcating means that each internal node has exactly two direct descendants (see Figure 1c). Each bifurcation represents a speciation event. The time back from the present to a bifurcation on a species tree is called a speciation time. Each branch is a line segment that connects two species (ancestral or current). The branch length is proportional to the time interval between these two speciation events.

For a gene tree, each leaf represents a currently existing allele of the same gene. Each internal node is a common ancestor of its descendant alleles (looking from the root to the leaves). The root is the ancestor of all of the current alleles on the tree. Each bifurcation represents a coalescent event, in which two lineages coalesce into a common ancestor. The time from the present to a bifurcation is called a coalescent time. The branch length is proportional to the time interval between two coalescent events.

A species tree or a gene tree alone can be represented in a figure similar to Figure 1a. Figure 1f is often used to illustrate the underlying species tree and its embedded gene trees together.

Figure 1. Example phylogenies. (a) Illustration of a rooted tree. (b) Illustration of an unrooted tree. (c) Illustration of bifurcating pattern. (d) A species tree. (e) A gene tree. (f) Illustration of an underlying species tree and its embedded gene tree.

1.1 Phylogenetic Inference Problem

The phylogenetic inference problem is to estimate an evolutionary history from currently existing evidence. Estimation of the species tree is the primary goal in such problems. Two categories of data are available for phylogenetic inference, namely discrete characters and distances. For example, sequence data belong to the first category and DNA-DNA hybridization data belong to the second. All phylogenetic inference methods, such as parsimony methods, distance methods, maximum likelihood and Bayesian inference, fall into one of three categories: (1) algorithmic methods for direct tree construction, (2) comparison criteria that pick the best tree in a given tree space, and (3) Bayesian methods (Hillis et al. 1996).

Phylogenetic inference problems are usually hard, especially when the number of species involved is large. Given n species, the number of possible species tree topologies is $(2n-3)!! = \frac{(2n-3)!}{2^{n-2}(n-2)!}$ $(n \ge 2)$ for rooted trees and $(2n-5)!! = \frac{(2n-5)!}{2^{n-3}(n-3)!}$ $(n \ge 3)$ for unrooted trees. The size of the species tree space therefore increases faster than exponentially as n increases (see Table 1 and Figure 2), so the computational load increases dramatically with even a small increase in the number of species.

Table 1. Number of possible species trees given n species. Columns: number of species (n), number of rooted trees, number of unrooted trees.
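As a quick check on the two counting formulas above, the following Python sketch evaluates $(2n-3)!!$ and $(2n-5)!!$ directly. The function names are illustrative and do not come from the thesis.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ... for odd k >= -1 (empty product is 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating topologies on n >= 2 labeled species: (2n-3)!!."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return double_factorial(2 * n - 3)


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating topologies on n >= 3 labeled species: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 species")
    return double_factorial(2 * n - 5)


if __name__ == "__main__":
    # Print n, rooted count, unrooted count for a few small n.
    for n in range(3, 11):
        print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

For n = 10 this gives 34,459,425 rooted and 2,027,025 unrooted topologies, which illustrates how quickly exhaustive search over tree space becomes impractical.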

Figure 2. The number of possible rooted trees increases faster than exponentially as the number of species increases. (Axes: number of species versus base-10 logarithm of the number of rooted species trees.)

1.2 Gene Trees and Species Trees Paradox

It is well appreciated that a single gene tree can differ from its underlying species tree (e.g., Fitch 1970, Tajima 1983, Pamilo and Nei 1988, Maddison 1997, Felsenstein 2004). The causes of such incongruence include deep coalescence, horizontal gene transfer (HGT) or lateral gene transfer (LGT), and gene duplication/loss.

Deep coalescence, also called incomplete lineage sorting (ILS), refers to the case in which the coalescent time is deeper in the past than the previous speciation time (Figure 3a), which might result from large population sizes or short speciation times (Maddison 1997).

HGT or LGT refers to gene transfer between organisms that does not occur through reproduction (Figure 3b). The probability of HGT varies among species. It is widely accepted that HGT happens frequently in prokaryotes and thus plays an important role in their evolution (Boto 2010). In addition, more evidence for HGT is being found in other cases, such as HGT between Bacteria and Eukarya (Watkins & Gray 2006, Guljamow et al. 2007) and within Eukarya (Nedelcu et al. 2008).

Gene duplication is an important mechanism in molecular evolution. It usually refers to the duplication of regions of DNA that contain at least one gene (Figure 3c-A). Gene deletion is the loss of DNA sequences of genes (Figure 3c-B). Many factors can lead to gene loss, such as unequal crossing over and losses from translocation. Both gene duplication and gene deletion are very common and their mechanisms have been extensively studied (Dittmar and Liberles 2011).

Figure 3. Factors responsible for the incongruence of gene trees and the species tree. (a) Deep coalescence. (b) Gene flow. (c) A-Gene duplication, B-Gene loss.

1.3 Phylogenetic Estimation in the Presence of Deep Coalescence

Two classes of methods have been introduced to infer species trees in the presence of deep coalescence. The first consists of gene tree parsimony methods, which choose the species tree with the minimal number of deep coalescent events given a set of embedded gene trees, which are usually first estimated from sequence data. The parsimony principle dates all the way back to Aristotle (see his Posterior Analytics). It was first introduced into phylogenetics by Camin and Sokal (1965) and further developed by Kluge and Farris (1969) and Farris (1970). A detailed description of the gene tree parsimony method that accommodates deep coalescence can be found in Maddison (1997). The idea was later implemented in PhyloNet (Than et al. 2008).

The second class contains a number of methods that statistically model the population genetic process of deep coalescence (*BEAST, BEST, STEM, STAR, STEAC, SNAPP). The technique they use is Kingman's coalescent theory, a retrospective model in which the genes sampled from individuals are traced back to a common ancestor known as the most recent common ancestor (MRCA). Kingman (1982 a,b) proved that this continuous-time coalescent model can be derived as the large-sample limit of several classic discrete population models, such as the Wright-Fisher model and the Moran model (Wakeley 2009). Under the basic coalescent model, coalescent times follow an exponential distribution with a mean that depends on the effective population size and the size of the sample. Given a population with effective population size $N_e$, the waiting time $t$ until $n$ ($n \ge 2$) lineages coalesce into $n-1$ lineages follows an exponential distribution, $t \sim \mathrm{Exp}\left(\binom{n}{2}\right)$, that is, with rate $\binom{n}{2}$, with $t$ properly scaled by the effective population size $N_e$. The distribution of the time to the most recent common ancestor has been computed for bigger trees (Rosenberg & Feldman 2002, Efromovich & Kubatko 2008).

1.4 Software Review

Here I review the most commonly used software packages for species tree inference.

The program STEM (Species Tree Estimation using Maximum likelihood) (Kubatko et al. 2009) takes a set of gene trees as input and returns the maximum tree (MT) as an estimate of their underlying species tree. Under the assumption of no gene flow, any species tree in which a given set of gene trees is embedded must have speciation times smaller (i.e., more recent) than the corresponding coalescent times in every gene tree. The maximum tree (MT) is a bifurcating tree built so that it has the largest possible speciation times that satisfy this requirement for all gene trees. It is also the maximum likelihood estimate when all the population sizes are equal. Liu (2006; see also Liu and Pearl 2010) has shown that the MT is statistically consistent if the gene trees are known. This method was also developed independently by Mossel and Roch (2010) under the name GLASS (Global LAteSt Split). STEM can be viewed both as a maximum likelihood method and as a distance method, because the MT is a maximum likelihood estimate under suitable conditions and because the MT is built from a distance matrix in which each entry is the smallest coalescent time of genes from a pair of species.

STEAC (Species Tree Estimation Using Average Coalescence Times) is another distance method, developed by Liu et al. (2009). It first fills in a distance matrix with the average coalescent time between every pair of species. A clustering method is then applied to construct the species tree from this distance matrix. Liu and Edwards (2009) have proven that the STEAC tree is also a consistent estimator.

Both STEM and STEAC are able to produce full trees (tree topology plus branch lengths or speciation intervals). In contrast, the method STAR (Species Tree estimation using Average Ranks of coalescences) (Liu et al. 2009), which constructs a species tree from a set of gene trees, gives only a tree topology without information on speciation times. STAR first ranks the nodes of each gene tree, working from the root to the leaves. The root is assigned the highest rank, which is the number of species. The rank decreases by 1 at each successive node along every branch. The expected rank of the node connecting any two species is the average rank of the nodes connecting alleles sampled from the same two species. A neighbor-joining (NJ) method is then applied to a matrix filled in with twice the expected ranks of each node to obtain the STAR tree estimate.

Another species tree estimation method, MP-EST (Maximum Pseudo-likelihood for Estimating Species Trees), was developed by Liu et al. (2010). It uses a maximum pseudo-likelihood approach to estimate the species tree from gene trees. A simulation study shows that the MP-EST tree is also statistically consistent (Liu et al. 2010). However, MP-EST is limited in estimating branch lengths when only 1 lineage is sampled from each species and when there is not enough topological variation among gene triples.
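The distance matrices described in this section differ only in how the per-gene coalescent times for a species pair are summarized: the maximum tree (STEM/GLASS) uses the minimum, while STEAC uses the average. The Python sketch below illustrates that idea under a simplifying assumption: each gene tree has already been reduced to a dictionary of pairwise coalescent times. This data format and the function name are hypothetical, and the code is not the implementation used by either program.

```python
from statistics import mean


def distance_matrix(gene_trees, summarize):
    """Summarize per-gene pairwise coalescent times into one distance per species pair.

    gene_trees: list of dicts mapping a (species, species) tuple to the
                coalescent time of that pair in one gene tree (hypothetical format).
    summarize:  function applied to the list of times for each pair,
                e.g. min (maximum-tree style) or mean (STEAC style).
    """
    times = {}
    for tree in gene_trees:
        for pair, t in tree.items():
            key = tuple(sorted(pair))  # treat (A, B) and (B, A) as the same pair
            times.setdefault(key, []).append(t)
    return {pair: summarize(ts) for pair, ts in times.items()}


# Three toy gene trees on species A, B, C (times in arbitrary coalescent units).
gene_trees = [
    {("A", "B"): 1.2, ("A", "C"): 3.0, ("B", "C"): 3.0},
    {("A", "B"): 0.8, ("A", "C"): 2.5, ("B", "C"): 2.5},
    {("A", "C"): 1.5, ("A", "B"): 2.9, ("B", "C"): 2.9},
]

min_distances = distance_matrix(gene_trees, min)    # minimum coalescent times
avg_distances = distance_matrix(gene_trees, mean)   # average coalescent times
```

In this toy data set the third gene tree groups A with C, yet both summaries still give A and B the smallest distance, which is the pattern a subsequent clustering step would pick up.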

*BEAST (Drummond & Rambaut 2007, Drummond et al. 2012), BEST (Bayesian Estimation of Species Trees Under the Coalescent Model) (Liu 2008) and SNAPP (SNP and AFLP Phylogenies) (Bryant et al. 2012) are three widely used Bayesian inference programs.

*BEAST uses Markov Chain Monte Carlo (MCMC) to jointly estimate the posterior distribution of the target species tree as well as all the gene trees and other population parameters such as mutation rates and population sizes. It is part of the ambitious evolutionary analysis software package BEAST, which is the first to allow a relaxed molecular clock model (Drummond & Rambaut, 2007).

BEST deploys a Bayesian model consisting of sequences, gene trees and a species tree. Hierarchical MCMC is used to estimate the posterior distribution of a set of target parameters. In the MCMC implementation, for each new proposal, gene trees are updated by a tree rearrangement scheme in MrBayes and then the species tree is updated by the maximum tree (MT) derived from the updated gene trees with a Poisson number of modified nodes.

SNAPP is a package that employs a finite-sites model likelihood algorithm within a Bayesian framework. Unlike *BEAST and BEST, which are designed for linked data, SNAPP infers species trees from unlinked biallelic markers. It is therefore not applicable to site data within a single gene sequence, but it can handle unlinked SNP data.

Both *BEAST and BEST deal with full sequence data and a large number of parameters, which makes them computationally intensive. SNAPP bypasses the estimation of gene trees and is thus computationally easier than *BEAST and BEST. However, its use is limited by the types of data it handles. Bayesian methods also require specification of prior information in the form of prior distributions on model parameters. Researchers can take advantage of this if accurate prior information is known; if no information is available, they need to avoid the adverse influence that an inappropriate prior can have on the posterior distribution of the target parameters.

| Method | Category | Data | Output | Advantage/Disadvantage |
|---|---|---|---|---|
| PhyloNet | Parsimony | Gene trees | Topology | Fast; doesn't consider errors in gene trees; nonparametric; often returns a large number of equally parsimonious trees. |
| STEM | ML/distance | Gene trees | Full tree | Fast; doesn't consider errors in gene trees; often returns unresolved trees (especially when the number of lineages is large). |
| STAR | Distance | Gene trees (topology only) | Topology | Fast; doesn't consider errors in gene trees; nonparametric. |
| STEAC | Distance | Gene trees | Full tree | Fast; doesn't consider errors in gene trees. |
| *BEAST | Bayesian | Full sequence | Full tree | Computationally intensive; needs a prior distribution assigned to each parameter. |
| BEST | Bayesian | Full sequence | Full tree | Computationally intensive; needs a prior distribution assigned to each parameter. |
| SNAPP | Bayesian | SNP data | Full tree | Does not explicitly sample gene trees; employs dynamic programming; can only deal with unlinked biallelic loci. |
| MP-EST | Maximum pseudo-likelihood | Gene trees | Full tree | Faster than Bayesian methods but slower than distance methods; doesn't consider errors in gene trees; fails to estimate branch lengths when topological variation is lacking. |

Table 2. Comparison of different species tree estimation methods.

1.5 Phylogenetic Estimation in the Presence of Gene Flow

All current methods for species tree estimation assume that there is no gene flow following speciation. The estimation methods described above attribute the discrepancy between gene trees and the species tree in which they are embedded exclusively to deep coalescence. For this reason, it is not appropriate to use any of them if the discrepancy is caused by other factors such as HGT. Indeed, species tree estimation in the presence of gene flow has been less studied and remains a big challenge in phylogenetics. Maddison (1997) described a parsimony method that picks the species tree with the minimal number of migration events. More than a decade later, Eckert and Carstens (2008) and Leache and Rannala (2013) examined the accuracy of species tree estimates from data subject to gene flow for several existing species tree estimation methods, none of which models the process of gene flow. They concluded that the existence of migration may complicate the phylogenetic inference problem in many situations.

1.6 IM Model

A powerful tool for modeling gene flow is the isolation-with-migration (IM) model, which allows study of the timing of coalescent events between populations in the presence of possible gene flow (Hey 2010). To introduce the IM model, let us consider a two-species scenario with six parameters (Figure 4): $\theta_1$, $\theta_2$, $\theta_A$, $\tau$, $m_{12}$ and $m_{21}$. We define $\theta_\lambda = 4N_\lambda\mu$, $\lambda \in \{1, 2, A\}$, where $N_\lambda$ is the effective population size of the corresponding population $\lambda$ and $\mu$ is the mutation rate per generation. $\tau$ is the speciation time. We further define $m_{ij}$ to be the migration rate from population i to population j per generation per locus.

Assume one lineage is sampled from each population, so there are two lineages in total. We use S(i,j) to indicate i alleles in population 1 and j alleles in population 2. Before time $\tau$ (viewing the process from present to past), or equivalently after speciation (viewing the process from past to present), there are 5 possible states: S(1,0), S(0,1), S(1,1) (the initial state), S(2,0) and S(0,2). We can then formulate an instantaneous rate matrix Q, which gives the rates of transitions between these states. For example, the rate of going from state S(0,1) to state S(1,0) is $m_{21}$, since the only possibility is that 1 gene moves from population 2 to population 1, while the rate from state S(0,2) to state S(1,1) is $2m_{21}$, since there are 2 lineages in population 2, either of which can move to population 1. The rate from state S(0,2) to state S(0,1) is $2/\theta_2$, which is the rate at which a coalescent event happens in population 2. By this rule, we can fill in all of the off-diagonal entries of the rate matrix Q. Each diagonal entry is then set so that the sum of its row is 0 (Table 3).

Figure 4. A two population IM model. (a) Model species tree and parameters. (b) Illustration of state change: A. S(1,1) to S(2,0) by migration, B. S(2,0) to S(1,0) by coalescence. C. S(1,0) to S(0,1) by migration.

| From \ To | S(1,1) | S(2,0) | S(0,2) | S(1,0) | S(0,1) |
|---|---|---|---|---|---|
| S(1,1) | $-m_{21}-m_{12}$ | $m_{21}$ | $m_{12}$ | 0 | 0 |
| S(2,0) | $2m_{12}$ | $-2m_{12}-2/\theta_1$ | 0 | $2/\theta_1$ | 0 |
| S(0,2) | $2m_{21}$ | 0 | $-2m_{21}-2/\theta_2$ | 0 | $2/\theta_2$ |
| S(1,0) | 0 | 0 | 0 | $-m_{12}$ | $m_{12}$ |
| S(0,1) | 0 | 0 | 0 | $m_{21}$ | $-m_{21}$ |

Table 3. Instantaneous rate matrix Q.
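The rules used to fill in Q (migration moves one lineage at rate $m_{ij}$ per lineage, two lineages in population $\lambda$ coalesce at rate $2/\theta_\lambda$, and each diagonal entry makes its row sum to zero) can be written down mechanically. The following Python sketch does so for the five-state, two-lineage model. The function and variable names are illustrative only and are not taken from the thesis code in Appendix A.

```python
STATES = ["S(1,1)", "S(2,0)", "S(0,2)", "S(1,0)", "S(0,1)"]


def build_rate_matrix(theta1, theta2, m12, m21):
    """Return Q as a list of rows, ordered as in STATES.

    m12 is the rate at which a lineage currently in population 1 moves to
    population 2 (and m21 the reverse), following the convention in the text.
    """
    index = {state: i for i, state in enumerate(STATES)}
    q = [[0.0] * len(STATES) for _ in STATES]

    def set_rate(src, dst, rate):
        q[index[src]][index[dst]] = rate

    # Migration events: one lineage changes population.
    set_rate("S(1,1)", "S(2,0)", m21)      # the lineage in population 2 moves to 1
    set_rate("S(1,1)", "S(0,2)", m12)      # the lineage in population 1 moves to 2
    set_rate("S(2,0)", "S(1,1)", 2 * m12)  # either of two lineages in population 1 moves
    set_rate("S(0,2)", "S(1,1)", 2 * m21)  # either of two lineages in population 2 moves
    set_rate("S(1,0)", "S(0,1)", m12)
    set_rate("S(0,1)", "S(1,0)", m21)

    # Coalescent events: two lineages in the same population merge.
    set_rate("S(2,0)", "S(1,0)", 2 / theta1)
    set_rate("S(0,2)", "S(0,1)", 2 / theta2)

    # Diagonal entries: make every row sum to zero.
    for i in range(len(STATES)):
        q[i][i] = -sum(q[i])
    return q


Q = build_rate_matrix(theta1=4.0, theta2=2.0, m12=0.1, m21=0.3)
```

Keeping the transitions as named state pairs rather than hard-coded indices makes it easier to extend the same construction to larger state spaces, at the cost of a few more lines.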

After time $\tau$ (looking backwards), i.e., before speciation, there are 2 cases:

I. There is only 1 gene in the ancestral population, which means that a coalescent event has happened before time $\tau$. Therefore, the state at time $\tau$ could be either S(1,0) or S(0,1).

II. There are 2 genes at time $\tau$ in the ancestral population. The state at time $\tau$ could be S(1,1), S(2,0) or S(0,2).

Following Hobolth et al. (2011), we can use the continuous-time Markov chain representation to obtain the matrix of probabilities of transitions between the states as a function of time. This transition probability matrix is obtained as the solution $P(t)$ to the system of differential equations $P'(t) = QP(t)$ with initial condition $P(0) = I$. The solution is $P(t) = e^{Qt}$, which we use to derive the probability density function for each of the two cases listed above.

I. The coalescent event could happen in population 1 or in population 2 more recently than the speciation event, so the coalescent time satisfies $t < \tau$. Let $a$ be the initial state. The density for coalescence at time $t < \tau$ in population 1 is

$f_1(t \mid \Theta) = (e^{Qt})_{a,S(2,0)} \cdot \frac{2}{\theta_1}$,

where $(e^{Qt})_{a,S(2,0)}$ represents the entry of the matrix $e^{Qt}$ at the row corresponding to state $a$ and the column corresponding to state S(2,0).
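To make the role of $P(t) = e^{Qt}$ concrete, the Python sketch below evaluates $f_1$ numerically for the two-lineage model. It is an illustration under stated assumptions rather than the thesis implementation (which is the C code of Appendix A): the matrix exponential is computed with a small hand-written scaling-and-squaring routine so that the example needs only the standard library (in practice a library routine such as `scipy.linalg.expm` would be the usual choice), and the corresponding term for population 2 is written by symmetry, using column S(0,2) and rate $2/\theta_2$. All function names are hypothetical.

```python
import math

STATES = ["S(1,1)", "S(2,0)", "S(0,2)", "S(1,0)", "S(0,1)"]


def rate_matrix(theta1, theta2, m12, m21):
    """Rate matrix Q of the two-lineage IM model, rows and columns ordered as STATES."""
    return [
        [-(m21 + m12), m21, m12, 0.0, 0.0],
        [2 * m12, -(2 * m12 + 2 / theta1), 0.0, 2 / theta1, 0.0],
        [2 * m21, 0.0, -(2 * m21 + 2 / theta2), 0.0, 2 / theta2],
        [0.0, 0.0, 0.0, -m12, m12],
        [0.0, 0.0, 0.0, m21, -m21],
    ]


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=20):
    """Approximate exp(q * t) by scaling and squaring with a truncated Taylor series."""
    n = len(q)
    norm = max(sum(abs(x) for x in row) for row in q) * t
    # Halve the matrix often enough that its norm is at most 0.5 before expanding.
    squarings = max(0, math.ceil(math.log2(norm))) + 1 if norm > 0 else 0
    a = [[x * t / 2 ** squarings for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def density_before_tau(t, theta1, theta2, m12, m21, start="S(1,1)"):
    """Return (f1, f2): densities of coalescing at time t < tau in population 1 or 2.

    f1 follows the formula in the text; f2 is the symmetric counterpart
    (an assumption of this sketch).
    """
    p = mat_exp(rate_matrix(theta1, theta2, m12, m21), t)
    row = p[STATES.index(start)]
    f1 = row[STATES.index("S(2,0)")] * 2 / theta1
    f2 = row[STATES.index("S(0,2)")] * 2 / theta2
    return f1, f2
```

A useful sanity check on such code is that, with nonzero migration and no speciation cutoff, the two terms together integrate to 1 over $t \in (0, \infty)$, because the two lineages must eventually meet in one population and coalesce.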

30 The density that coalescence happens at t in population 2 is: f 2 (t Θ) = (e Qt ) as(0,2) 2 θ 2, where (e Qt ) as(0,2) represents the entry of the matrix (e Qt ) at the row corresponding to state a and the column corresponding to state S(0,2). The total density that a coalescent event happens at t < τ : f (t Θ) = f 1 (t Θ) + f 2 (t Θ). II. The density for a coalescent event in the ancestral population t > τ is: f (t Θ) = ((e Qτ ) as(1,1) + (e Qτ ) as(2,0) + (e Qτ ) as(0,2) )) 2 e ( 2/θ A )( t τ ) θ A, where (e Qτ ) ab represents the entry of the matrix (e Qτ ) at the row corresponding to state a and the column corresponding to state b, b is in S(1,1), S(2,0), S(0,2). 1.7 Software to Estimate Migration Rate Under the IM model, Nielsen and Wakeley (2001) proposed the first procedure to jointly estimate three parameters: the relative population sizes, the migration rates, and the speciation times. They used an MCMC method within a Bayesian framework. Hey and Nielsen (2004, 2007) and Hey (2010) then implemented this idea in the software IM, IMa and IMa2. Under the same model, Zhu and Yang (2009) were able to formulate a 16

31 likelihood function in a simple setting (SIM3s model, see Figure 5a): there are three populations 1, 2 and 3 (outgroup); gene flow exists between population 1 and population 2 with migration rates m 1 and m 2 ; θ i (i=1, 2, 3, 4, 5) are the population sizes; τ 0 and τ 1 are speciation times; and one lineage is sampled from each species. Under this setting, there are 5 possible gene trees, G 1a, G 1b, G 1c, G 2, and G 3 (Figure 5 b~f). That gene flow only exists between population 1 and population 2 before time τ 1 fits a 2-population IM model. Zhu and Yang thus obtained the density function of coalescent times t 0, t 1 and gene tree G x, where x 1a, 1b, 1c, 2, 3. They further wrote down a likelihood function to estimate the speciation times and migration rates from multi-locus sequence data within the maximum likelihood framework. In order to study gene flow between populations, many researchers first obtain a species tree estimate by a species tree estimation program that doesn t allow the possibility of gene flow. Then they treat this species tree estimate as the correct phylogeny and use IMa2, 3s (Zhu and Yang, 2012), or other programs to evaluate the magnitude of migration. Their supportive argument for this approach is that these population parameter estimation methods are robust in many cases even when a wrong species tree is specified, which, however, may not be true. They are risking the chance that an incorrect species tree topology may collapse the population parameter estimation. 1.8 Outline of This Thesis This study consists of two parts. In the first part, I develop a maximum likelihood method under the IM model to jointly estimate speciation times, population sizes and 17


Figure 5. Zhu and Yang's SIM3s model. (a) Model species tree and parameters. (b) Coalescence of 1,2 happens first in population 1, $t_1 < \tau_1$. (c) Coalescence of 1,2 happens first in population 4, $\tau_1 < t_1 < \tau_0$. (d) Coalescence of 1,2 happens first in population 5, $\tau_0 < t_1 < t_0$. (d) Coalescence of 2,3 happens in population 5, $\tau_0 < t_1 < t_0$. (e) Coalescence of 1,3 happens in population 5, $\tau_0 < t_1 < t_0$. (f) Both coalescence events happen in population

33 migration rates from gene tree data. Using the same likelihood function, I design a likelihood ratio test (LRT) to detect gene flow. The performance of this method is then evaluated by applying it to simulated data (Chapter 2). In the second part, I propose a new distance method to estimate species trees in the presence of gene flow. Instead of jointly estimating all of the parameters together, I first calculate an estimate of the speciation time between every pair of species using a method that accommodates gene flow. These estimates are treated as distances between different species. The species with the shortest speciation interval are considered as sister species. Finally a clustering algorithm is applied to build a species tree. Technical details are explained in Chapter 3. I also design a simulation study to evaluate the performance of this method (Chapter 3). Chapter 4 gives a short summary of the significance of this work and some future directions. 19

Chapter 2: A Likelihood Method to Estimate/Test Gene Flow

In this chapter, we extend the method of Hobolth et al. (2011) to compute the probability density function of the time to the most recent common ancestor (MRCA) between two or more populations in the presence of migration. We then use a likelihood framework to obtain joint MLEs of the population sizes, the speciation time, and the migration rates. Using the same likelihood function, we design a likelihood ratio test to detect gene flow. We test our method by applying it to simulated data. In the end, we discuss the performance of our method and possible future directions to extend our work.

2.1 Methods

IM model

We first generalize the method of Hobolth et al. (2011) to more complicated situations. To illustrate the idea, we extend the two-lineage case introduced in Section 1.6 to a four-lineage case under a two-species IM model. The notation and basic setting are the same as in Section 1.6. We assume that two genes are sampled from each population, i.e., the initial state is S(2,2). Before time τ (viewing

the process from present to past) or after speciation (viewing the process from past to present), for 4 lineages there are 14 possible states, S(0,1), S(1,0), S(0,2), S(1,1), S(2,0), S(0,3), S(1,2), S(2,1), S(3,0), S(0,4), S(1,3), S(2,2), S(3,1), S(4,0), where S(i,j) indicates i genes in population 1 and j genes in population 2. We then enumerate each scenario to formulate the instantaneous rate matrix Q. The basic idea is the same as in Section 1.5. For instance, the rate of going from state S(0,2) to state S(0,1) is $\frac{2}{\theta_2}$, since the only possible event is that the 2 genes in population 2 coalesce, while the rate from state S(0,3) to state S(0,2) is $\binom{3}{2}\frac{2}{\theta_2}$, since there are 3 genes in population 2, each pair of which is equally likely to be the two that coalesce. Each non-diagonal entry can be determined by the following rule:

$$
Q(S(i,j), S(s,t)) =
\begin{cases}
\binom{i}{2}\frac{2}{\theta_1} & \text{if } i \ge 2,\ i = s+1,\ j = t; \\
\binom{j}{2}\frac{2}{\theta_2} & \text{if } i = s,\ j \ge 2,\ j = t+1; \\
i\,m_{12} & \text{if } i, t \ge 1,\ i = s+1,\ j = t-1; \\
j\,m_{21} & \text{if } j, s \ge 1,\ i = s-1,\ j = t+1; \\
0 & \text{otherwise.}
\end{cases}
$$

Each diagonal entry is then set so that the sum of its corresponding row is 0, so the instantaneous rate matrix is obtained (Table 4).
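The rule for the off-diagonal entries can be turned into a short routine. The following Python sketch is only an illustration and is not the C program referred to in this thesis; the function name `build_rate_matrix`, the list-of-lists layout and the numerical parameter values are assumptions made for the example. It enumerates the 14 states in the order listed in the text and fills Q one row at a time.

```python
from math import comb


def build_rate_matrix(theta1, theta2, m12, m21, n_lineages=4):
    """Return (states, Q) for the two-population IM model before time tau.

    states[r] = (i, j) stands for S(i, j): i genes in population 1 and
    j genes in population 2. Q is a dense list of rows.
    """
    # Same ordering as in the text: S(0,1), S(1,0), S(0,2), S(1,1), S(2,0), ...
    states = [(i, k - i) for k in range(1, n_lineages + 1) for i in range(k + 1)]
    index = {state: r for r, state in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]

    for (i, j), r in index.items():
        moves = {}
        if i >= 2:  # coalescence in population 1
            moves[(i - 1, j)] = comb(i, 2) * 2.0 / theta1
        if j >= 2:  # coalescence in population 2
            moves[(i, j - 1)] = comb(j, 2) * 2.0 / theta2
        if i >= 1:  # one of the i genes moves from population 1 to 2
            moves[(i - 1, j + 1)] = i * m12
        if j >= 1:  # one of the j genes moves from population 2 to 1
            moves[(i + 1, j - 1)] = j * m21
        for target, rate in moves.items():
            Q[r][index[target]] = rate
        # Diagonal entry makes the row sum to zero.
        Q[r][r] = -sum(moves.values())
    return states, Q


# Illustrative parameter values only (theta = 4 is used in the simulations,
# m = 0.1 is one of the simulated migration rates).
states, Q = build_rate_matrix(theta1=4.0, theta2=4.0, m12=0.1, m21=0.1)
index = {state: r for r, state in enumerate(states)}
print(len(states))
print(Q[index[(0, 3)]][index[(0, 2)]])
```

With these values the entry for S(0,3) to S(0,2) is 3 · 2/4 = 1.5, in line with the example rate given in the text.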

36 Table 4. Instantaneous rate matrix Q

After time τ (looking from present to past) or before speciation, there are 4 cases:
I. There is only 1 gene in the ancestral population, which means that all three coalescent events have happened before time τ. Therefore, the state at time τ could be either S(0,1) or S(1,0).
II. There are 2 genes in the initial ancestral population, resulting in 1 coalescent event in this population. Thus the state at time τ could be S(0,2), S(1,1) or S(2,0).
III. There are 3 genes in the initial ancestral population, resulting in 2 coalescent events in this population. Hence, the state at time τ could be S(0,3), S(1,2), S(2,1) or S(3,0).
IV. There are 4 genes in the initial ancestral population, resulting in 3 coalescent events in this population. In this case, there are 5 possible states at time τ: S(0,4), S(1,3), S(2,2), S(3,1) and S(4,0).

We will use these cases to derive the overall probability density function for the time to the most recent common ancestor in the next section.

Probability Density Function and Likelihood Function

Similar to the procedure described in Section 1.6, we use the continuous-time Markov chain representation to obtain the matrix of probabilities $P(t) = e^{Qt}$ of transitions between the states as a function of time. We then use this matrix to derive the probability

density function for each of the four cases listed above given the initial state S(2,2).

Case I: Note that this case corresponds to $t \le \tau$, and we must consider two possibilities:

(1) If the third coalescent event occurs in population 1, the density for time $t \le \tau$ is
$$f_{(S(2,2),S(1,0))}(t) = (e^{Qt})_{(S(2,2),S(1,0))} \left(\frac{2}{\theta_1}\right),$$
where $(e^{Qt})_{(S(i,j),S(s,t))}$ is the entry (a,b) in the matrix $e^{Qt}$ if S(i,j) is in the a-th row and S(s,t) is in the b-th column.

(2) If the third coalescent event occurs in population 2, the density for time $t \le \tau$ is
$$f_{(S(2,2),S(0,1))}(t) = (e^{Qt})_{(S(2,2),S(0,1))} \left(\frac{2}{\theta_2}\right).$$

Therefore the total density for the coalescent time t when $t \le \tau$ and the initial state is S(2,2) is
$$f(t \mid \Theta) = f_{(S(2,2),S(1,0))}(t) + f_{(S(2,2),S(0,1))}(t) = (e^{Qt})_{(S(2,2),S(1,0))} \left(\frac{2}{\theta_1}\right) + (e^{Qt})_{(S(2,2),S(0,1))} \left(\frac{2}{\theta_2}\right).$$

Case II: If the first and second coalescent events occur in population 1 or 2 and the third coalescent event occurs in the ancestral population, the density for $t > \tau$ is
$$f_{(S(2,2),\,S(i,j)_{i+j=2})}(t) = \sum_{i+j=2} (e^{Qt})_{(S(2,2),S(i,j))}\, g_{2 \to 1}(t-\tau),$$

where $g_{n \to 1}(y)$ is the probability density function of the time y for n genes to coalesce to 1 gene. This density is well known (see, e.g., Efromovich & Kubatko 2008) and is given by
$$g_{n \to 1}(y) = \sum_{j=1}^{n-1} \mu_{n,j}\, p_{n,j}(y),$$
where $p_{n,j}(x) = \lambda_{n,j} e^{-\lambda_{n,j} x} \chi(x \ge 0)$, j = 1, ..., n-1, and $\chi(\cdot)$ is the indicator function; $\mu_{n,j} = 1$ if n = 2; and
$$\mu_{n,j} = \prod_{i \ne j,\, i=1}^{n-1} \lambda_{n,i} (\lambda_{n,i} - \lambda_{n,j})^{-1} \quad \text{if } n > 2.$$
We note that $\sum_{j=1}^{n-1} \mu_{n,j} = 1$.

Case III: If the first coalescent event occurs in population 1 or 2 and the second and third coalescent events occur in the ancestral population, the density for $t > \tau$ is
$$f_{(S(2,2),\,S(i,j)_{i+j=3})}(t) = \sum_{i+j=3} (e^{Qt})_{(S(2,2),S(i,j))}\, g_{3 \to 1}(t-\tau).$$

Case IV: If no events occur in population 1 or 2 and all the coalescent events occur in the ancestral population, the density for $t > \tau$ is
$$f_{(S(2,2),\,S(i,j)_{i+j=4})}(t) = \sum_{i+j=4} (e^{Qt})_{(S(2,2),S(i,j))}\, g_{4 \to 1}(t-\tau).$$

Therefore the total density for the coalescent time t when $t > \tau$ is

$$f(t \mid \Theta) = \sum_{k=2}^{4} f_{(S(2,2),\,S(i,j)_{i+j=k})}(t) = \sum_{k=2}^{4} \sum_{i+j=k} (e^{Qt})_{(S(2,2),S(i,j))}\, g_{k \to 1}(t-\tau).$$

To summarize, we can write the probability density function for t as
$$
f(t \mid \Theta) =
\begin{cases}
(e^{Qt})_{(S(2,2),S(1,0))} \left(\frac{2}{\theta_1}\right) + (e^{Qt})_{(S(2,2),S(0,1))} \left(\frac{2}{\theta_2}\right) & \text{if } 0 \le t \le \tau; \\
\sum_{k=2}^{4} \sum_{i+j=k} (e^{Qt})_{(S(2,2),S(i,j))}\, g_{k \to 1}(t-\tau) & \text{if } t > \tau; \\
0 & \text{otherwise.}
\end{cases}
$$

Let $\mathbf{t} = (t_1, t_2, \ldots, t_l)$ be a vector of l observed times to the most recent common ancestor (MRCA) in a sample of l genes, where $t_i$ is the time to the MRCA of the i-th gene. With the assumption that all genes are independent, the probability density function for the data $\mathbf{t}$ is
$$f(\mathbf{t} \mid \Theta) = \prod_{i=1}^{l} f(t_i \mid \Theta).$$
The likelihood function is therefore
$$L(\Theta \mid \mathbf{t}) = \prod_{i=1}^{l} f(t_i \mid \Theta).$$

Likelihood Ratio Test (LRT)

Based on the likelihood function, the following likelihood ratio test is designed to examine whether there is gene flow between the two populations under study. Specifically, we consider the following hypotheses:

$H_0: \theta \in \Theta_0$, where $\Theta_0 = \{\Theta : m_{12} = m_{21} = 0\}$
$H_1: \theta \in \Theta_0^c$

The null hypothesis $H_0$ is that there is no gene flow. The alternative hypothesis is that there is gene flow. The likelihood ratio test statistic is
$$\lambda(\mathbf{t}) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid \mathbf{t})}{\sup_{\theta \in \Theta} L(\theta \mid \mathbf{t})},$$
where $\mathbf{t}$ is the data, $\Theta_0$ is the parameter space where the migration rates are zero, and $\Theta$ is the full parameter space. In order to assess significance, we base our test on the statistic $-2l = -2\log\lambda(\mathbf{T})$. Under the null hypothesis, $m_{12} = m_{21} = 0$ is at the boundary of the parameter space of the full model, and thus $-2l$ should follow a 50:50 mixture distribution of a point mass at zero and a $\chi^2_1$ random variable (Self & Liang 1987). The critical values are therefore 2.71 at 5% and 5.41 at 1%. The above ideas are implemented in a C program; see Appendix A.

2.2 Simulation Study and Result

Simulation

To simplify the problem, we assume that all the population sizes are equal as are the migration rates, although such a restriction is not necessary in the development of our

42 method. To examine the performance of our method, we first simulated data under a variety of conditions. In each case, we used the program ms (Hudson 2002) to generate a sample of gene trees with the same underlying population structure, namely the same speciation time, population sizes, and migration rates. From each gene tree, we obtained the observed MRCA, and used these as input into our method to obtain the maximum likelihood estimates (MLEs) of all parameters and to apply the LRT. In practice, however, gene trees are not observed directly and instead must first be estimated from sequence data. To simulate this situation, we took each of the gene trees sampled above and evolved sequence data along them using the program Seq-Gen (Rambaut 1997). The simulated alignments from Seq-Gen were then used to estimate a gene tree in PAUP* (Swofford 2003) under the maximum likelihood criterion. The MRCAs were extracted from the estimated gene trees and were again used as input to our method. As in the case of known gene trees, we considered both parameter estimation and hypothesis testing via the LRT. We considered data sets of various sizes. Specifically, we considered l = 10, 25, 50, and 100 genes. We considered speciation times τ = 2 and population size θ = 4. Several migration rates were considered: m = 0, 0.001, 0.01, 0.1, 0.5, and 1. In the process of estimating gene trees, we considered genes of length 1,000, 2,000 and 3,000 bp. 28
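Before turning to the results, the LRT critical values quoted earlier in this chapter (2.71 at 5% and 5.41 at 1%) can be reproduced numerically. The following Python sketch is an illustration only; the function name `mixture_critical_value` is an assumption, and the calculation simply takes the stated 50:50 mixture of a point mass at zero and a $\chi^2_1$ variable at face value. Because a $\chi^2_1$ variable is the square of a standard normal variable, the critical value c at level α (for α < 0.5) solves 0.5 · P(χ²₁ > c) = α, which gives c = (Φ⁻¹(1 − α))².

```python
from statistics import NormalDist


def mixture_critical_value(alpha):
    """Critical value for a 50:50 mixture of a point mass at 0 and chi-square(1).

    Valid for 0 < alpha < 0.5. We need 0.5 * P(chi2_1 > c) = alpha, and since
    chi2_1 = Z**2 with Z standard normal, P(Z**2 > c) = 2 * (1 - Phi(sqrt(c))),
    so sqrt(c) is the (1 - alpha) quantile of the standard normal distribution.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    return z * z


for alpha in (0.05, 0.01):
    print(alpha, round(mixture_critical_value(alpha), 2))
```

Rounded to two decimals, these are 2.71 and 5.41, matching the values stated in the text.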

43 2.2.2 Results Maximum Likelihood Estimation When the number of genes is 100, the MLEs of the speciation times, population sizes and migration rates from estimated gene tree data get closer to the true values as the length of the genes increases (Figure 6). When the migration rate is small (m=0, 0.1), all of the estimates converge to the true value as the length of the genes approaches infinity (the length of the genes equal to infinity means that the gene trees can be fully recovered from sequence data). When the migration rate is moderate (m=0.5), estimates of speciation times and migration rates converge to the true value as the length of the genes approaches infinity, while estimates of the population sizes are biased. When the migration rate is large (m =1), all of the estimates are biased, even when known gene tree data are used (Figure 6). When the length of the simulated genes is fixed to be 3,000bp (which is longer than typically-observed genes, but is used to assess performance of the method), the MLEs from the estimated gene tree data get closer to the true values as the number of genes increases. All of the estimates are closer to the true value when the migration rate is small than those when the migration rate is large under the same setting (Figure 7). The MLEs from known gene tree data follow a similar pattern when the number of gene trees increases. In most of the cases, the MLEs are very close to the true value of the parameters when the number of genes is 100 (Figure 8). 29

Likelihood Ratio Test

The same simulation process as described above is performed to generate simulated data, which are fed to both the full model and the null model to calculate $-2l$. For the known gene tree data, the asymptotic result is applicable, while for the estimated gene tree data, the null distribution should be simulated in order to get the correct critical value. The power of the LRT using known gene tree data shows that our test is more powerful when the migration rate is larger and the data contain more gene trees, as expected. For example, with the number of genes fixed at 100, the power of our test is 20% when the migration rate is 0.01 and 70% when the migration rate is 0.1. When the number of genes is 5000, the power of our test is 70% even when the migration rate is just 0.01 (Figure 9a). For the estimated gene tree data, when the number of genes is 100, the power of our LRT improves as the length of the genes increases. However, the power is still below 50% when the length of genes is 5,000 bp. The power is around 60% for the case m=0.1 when the length of genes increases to 10,000 bp (Figure 9b).

2.3 Discussion

Our method takes coalescent times (when our method is extended to cases with more than two species, both coalescent times and tree topologies are needed) as the input and we do not model the errors in the gene tree estimation process. Therefore it is not a surprise that estimates from known gene tree data are more accurate than those from estimated

[Figure 6: three panels plotting (a) speciation time, (b) population size and (c) migration rate against the length of the genes; series: mig=1, mig=0.5, mig=0.1, mig=0 and the true value.]

Figure 6. Population parameter estimates vs. the length of the genes used to estimate gene trees when l=100, τ=2 and θ=4. (a) Speciation time estimates. (b) Population size estimates. (c) Migration rate estimates. Each error bar is constructed as mean ± standard deviation. Accuracy of the estimates increases when the length of the genes is increased. When the length of the genes approaches infinity, known trees can be recovered from sequence data. Thus we denote the performance of our method given known gene tree data as the performance when the length of the genes is infinite.

[Figure 7: three panels plotting (a) speciation time, (b) population size and (c) migration rate against the number of gene trees; series: mig=1, mig=0.5, mig=0.1, mig=0 and the true value.]

Figure 7. Population parameter estimates for various numbers of genes with sequences of length 3,000 bp when τ=2 and θ=4. (a) Speciation time estimates. (b) Population size estimates. (c) Migration rate estimates. Each error bar is constructed as mean ± standard deviation. Accuracy of the estimates increases as the number of gene trees is increased.

[Figure 8: three panels plotting (a) speciation time, (b) population size and (c) migration rate against the number of gene trees; series: mig=1, mig=0.5, mig=0.1, mig=0 and the true value.]

Figure 8. Population parameter estimates vs. number of known gene trees when τ=2 and θ=4. (a) Speciation time estimates. (b) Population size estimates. (c) Migration rate estimates. Each error bar is constructed as mean ± standard deviation. Accuracy of the estimates increases when the number of the genes is increased.

[Figure 9: two panels plotting power against migration rate. Series in panel (a) include l=10, l=25, l=50 and l=100; series in panel (b) include length=1000, length=2000, length=3000, length=5000 and length=10000.]

Figure 9. Power of the LRT for (a) length=∞, τ=2, θ=4 as a function of migration rate and (b) l=100, τ=2, θ=4 as a function of migration rate.

49 gene tree data. The longer the genes, the more information can be used to recover the gene trees, and the method performs better, as expected. Parameter estimates and LRT test results are better as the length of the genes increases. The number of genes is also an important factor, since our method is obtaining information from the variation in the individual gene trees. More gene trees provide more observations. For this reason, as the number of genes increases, parameter estimates from both known gene tree data and estimated gene tree data get closer to the true values. The power of the LRT improves as well. For parameter estimation, it can be observed that when other settings are the same, estimates from data with larger migration rate deviate more from their true value than those from data with smaller migration rate. A possible reason is that larger migration rate makes the likelihood function flatter, which makes it difficult for our method to find the global maximum. When the number of gene trees is 100 (which is the largest number in our simulation), estimates from known gene tree data with the migration rate equal to 1 are still biased (note that 1 is a very large migration, which means no boundaries between populations). This implies that either our method cannot handle data with large migration rate well or 100 gene trees do not contain enough information for us to recover the history when the migration rate is large. For the LRT from estimated gene tree data, the length of genes required for a reasonable power is around 10,000 bp, which is much longer than the (non-concatenated) gene 35

sequences that are typically available in large-scale genome sequencing projects (~300-1,500 bp). Therefore this method will not be generally useful in practice unless further improved.

2.4 Future Work

From the simulation study, we find that our method does not perform well with estimated gene tree data. One may argue that it is not a good method in practice. However, we cannot ignore the fact that our method does look promising with known gene tree data. Therefore we believe it can be improved to handle some cases well. Possible future directions are:

(1) Increase the number of the genes. As discussed above, 100 genes do not seem to be enough to get good estimates, even from known gene tree data. As simulation has shown, increasing the number of genes does improve the performance of our method in all cases.

(2) Increase the number of lineages within each species. This will provide more information about each population, which certainly helps in the parameter estimation as well as in the LRT.

(3) Add an outgroup. This is an idea borrowed from Zhu and Yang's (2012) method. We can use the same setting as in Section 1.7 to introduce a distant species. This distant species may provide us with enough information to improve the estimates of the coalescent times.

(4) Consider the topologies of different gene trees. Again, this is borrowed from Zhu and Yang's (2012) method. We should enumerate the probability of all the different possible gene trees given a species tree (just like Figure 5) and estimate the species trees from the distribution of gene trees.

Chapter 3: A Distance Method to Estimate Species Trees in the Presence of Gene Flow

In this chapter, I introduce a new class of distance methods, called STEST. I apply the new methods to simulated data sets and an empirical data set to evaluate their performance. In the end, I discuss the advantages and disadvantages of my method and the future directions.

3.1 Methods

Our method, called STEST (Species Tree Estimation from Speciation Times), consists of two parts: creation of a distance matrix and use of a clustering algorithm based on this matrix to construct the tree. We will describe each of these steps in detail in the following sections.

Distance Matrix Building

Let $\Omega_0 = \{S_i\}$ (i = 0, 1, 2, ..., n; $n \ge 3$) be a set of species. Let $S_0$ be the outgroup. Within each species $S_i$ multiple genes are sampled. Let $\mathbf{g}_i$ be the vector of DNA sequences for each sampled gene from species $S_i$. For each pair of species ($S_i$, $S_j$) ($i \ne j$, $1 \le i, j \le n$), ($\mathbf{g}_i$,

$\mathbf{g}_j$, $\mathbf{g}_0$) is used to estimate the speciation time $t_{i,j}$ between species $S_i$ and $S_j$ using Zhu and Yang's (2012) maximum likelihood method under the SIM3s model (we denote this method SIM3s). We define $D = (t_{i,j})$ to be the distance matrix, which is a symmetric n x n matrix that contains the speciation times for all pairs of species.

The idea in STEST is to use the estimated speciation time as the distance between two species. We use the method SIM3s to obtain the estimated speciation times because it models gene flow and it is fast (when compared to Bayesian methods). Nonetheless, this method can be replaced with any method that is able to estimate speciation times. We also evaluate the performance of the methods when the method SIM3s is replaced with Yang's (2002) method (we denote it M0), which assumes no gene flow, and with Yang's (2010) method (DB), which assumes gene flow.

The method M0 uses a simplified model of SIM3s, which assumes no gene flow, i.e., $m_1 = m_2 = 0$ (Figure 5a). Under this setting, there are only 4 possible gene trees, $G_{1b}$, $G_{1c}$, $G_2$, and $G_3$ (Figure 5c~g). Let $D_i$ be the sequence data at locus i. Let $\tau_1$, $\tau_0$, and the $\theta_i$'s have the same meaning as in Figure 5. By convention, Θ is used to denote the parameter vector $\Theta = (\tau_1, \tau_0, \theta_4, \theta_5)$. The probability density function $f(t_1, t_2, G \mid \Theta)$ of the coalescent times $t_1$, $t_2$ and gene topology G can be derived in the same way as described in Section 1.7. This density function and Yang's (1994) formula for the conditional probability $P(D_i \mid t_1, t_2, G)$ of $D_i$ given $t_1$, $t_2$ and G are then combined to derive the likelihood function
$$L(\Theta \mid D_i) = \int P(D_i \mid t_1, t_2, G)\, f(t_1, t_2, G \mid \Theta)\, dt_1\, dt_2\, dG,$$
which is used in M0 to obtain MLEs

of the speciation time in the absence of gene flow.

DB refers to discrete beta. The difference between M0 and DB is that DB allows the species divergence time $\tau_1$ of species 1 and species 2 to vary among loci at random due to possible gene flow. Yang (2010) chooses a beta distribution to model this. The density of $\tau_1$ is
$$f(\tau_1 \mid \tau_0, p, q) = \frac{1}{B(p,q)} \left(\frac{\tau_1}{\tau_0}\right)^{p-1} \left(1 - \frac{\tau_1}{\tau_0}\right)^{q-1} \frac{1}{\tau_0}, \quad 0 < \tau_1 < \tau_0,$$
where $\tau_0$, p, and q are parameters of the distribution. To understand this formula, let's change the variables by making $x_1 = \tau_1/\tau_0$. Then $x_1 \sim \text{beta}(p,q)$, $0 < x_1 < 1$. The mean of $x_1$ is $\bar{x}_1 = \bar{\tau}_1/\tau_0 = \frac{p}{p+q}$ and the variance is $\frac{pq}{(p+q)^2(p+q+1)}$. So $p = \frac{\bar{x}_1}{1-\bar{x}_1}\, q$. Let's treat $\bar{x}_1$, q instead of p, q as the parameters of the distribution of $x_1$. The density of $x_1$ can be written as $f(x_1 \mid \bar{x}_1, q)$. The likelihood function is
$$L(\Theta \mid D_i) = \int_0^1 \left[ \int P(D_i \mid t_1, t_2, G)\, f(t_1, t_2, G \mid \theta_4, \theta_5, \tau_0, \tau_1 = \tau_0 x_1)\, dt_1\, dt_2\, dG \right] f(x_1 \mid \bar{x}_1, q)\, dx_1,$$
where $\Theta = (\bar{x}_1, q, \tau_0, \theta_4, \theta_5)$. The estimate of the speciation time $\tau_1$ is $\bar{\tau}_1 = \bar{x}_1 \tau_0$.

Tree Construction

Let $T_0 = \{t_{ij}\}_{i<j}$ be the set of all the entries of the distance matrix D (distance between

every two species). The following algorithm is performed:

1. Pick the smallest time $t_{i_1 j_1}$ in $T_0$ by a searching algorithm, write $\tau_1 = t_{i_1 j_1}$, and add a new node at time $\tau_1$ connecting $S_{i_1}$ and $S_{j_1}$.

2. Suppose that k nodes have been added and a set $\Omega \subset \Omega_0$ of species has been connected. Let $T \subset T_0$ be the set of times that haven't been picked. Pick the smallest time $t_{i_a j_b}$ from T.
Case 1. If $\{S_{i_a}, S_{j_b}\} \cap \Omega = \emptyset$, add a new node at time $\tau_{k+1} = t_{i_a j_b}$ connecting $S_{i_a}$ and $S_{j_b}$.
Case 2. If $\{S_{i_a}, S_{j_b}\} \cap \Omega = \{S_{j_b}\}$, then add a new node at time $\tau_{k+1} = t_{i_a j_b}$ connecting $S_{i_a}$ and node $\tau_{m_b,k}$, where $\tau_{m_b,k}$ is the node with the largest time that is connected to $S_{j_b}$ after k nodes are constructed.
Case 3. If $\{S_{i_a}, S_{j_b}\} \subset \Omega$, then (i) if $S_{i_a}$ and $S_{j_b}$ share an ancestor, then this step is finished; (ii) if $S_{i_a}$ and $S_{j_b}$ don't share an ancestor, add a new node at time $\tau_{k+1} = t_{i_a j_b}$ connecting node $\tau_{m_a,k}$ and node $\tau_{m_b,k}$.

3. Continue until all species share a common ancestor, i.e., the root is reached.

This algorithm has been implemented in the statistical language R; see Appendix B.
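The three steps can be sketched compactly in code. The following Python sketch is an illustration only and is not the R implementation in Appendix B; the function name `stest_cluster`, the dictionary input format and the node labels `tau1`, `tau2`, ... are assumptions made for the example. It visits the pairwise times in increasing order and, whenever the two species do not yet share an ancestor, joins the oldest nodes currently above them, which covers Cases 1, 2 and 3(ii) at once.

```python
def stest_cluster(dist):
    """Build a rooted tree from pairwise speciation-time estimates.

    dist maps a pair of species names (a, b) to the estimated speciation
    time between them; every pair is assumed to be present exactly once,
    and species names are assumed not to look like "tau1", "tau2", ...
    Returns (newick_string, node_times) with node_times in the order added.
    """
    species = sorted({s for pair in dist for s in pair})
    top = {s: s for s in species}        # species -> oldest node above it
    members = {s: [s] for s in species}  # node label -> species below it
    newick = {s: s for s in species}     # node label -> subtree as text
    node_times = []

    # Steps 1 and 2: visit the pairwise times from smallest to largest.
    for (a, b), t in sorted(dist.items(), key=lambda item: item[1]):
        na, nb = top[a], top[b]
        if na == nb:
            continue  # Case 3(i): a and b already share an ancestor
        # Cases 1, 2 and 3(ii): join the oldest nodes above a and b at time t.
        label = f"tau{len(node_times) + 1}"
        node_times.append(t)
        members[label] = members.pop(na) + members.pop(nb)
        newick[label] = f"({newick.pop(na)},{newick.pop(nb)})"
        for s in members[label]:
            top[s] = label
        if len(members[label]) == len(species):
            break  # Step 3: the root has been reached

    return newick[top[species[0]]] + ";", node_times


# Five of these times (0.5, 0.75, 1, 1.2, 2) are the ones used in the worked
# example in the text; the remaining five are hypothetical fill-ins chosen
# only so that every pair of species has a distance.
example = {
    ("S1", "S2"): 0.5, ("S4", "S5"): 0.75, ("S1", "S3"): 1.0,
    ("S2", "S3"): 1.2, ("S3", "S4"): 2.0,
    ("S1", "S4"): 2.1, ("S1", "S5"): 2.2, ("S2", "S4"): 2.3,
    ("S2", "S5"): 2.4, ("S3", "S5"): 2.5,
}
tree, times = stest_cluster(example)
print(tree)   # (((S1,S2),S3),(S4,S5));
print(times)  # [0.5, 0.75, 1.0, 2.0]
```

Because a pair is skipped whenever both species already sit under the same node, the procedure behaves like single-linkage clustering on the matrix of estimated speciation times.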

3.1.3 An Example

To illustrate this method, let's consider a set $S = \{S_1, S_2, S_3, S_4, S_5\}$ consisting of 5 species. Suppose that the distance matrix $D = (t_{ij})$, with rows and columns indexed by $S_1$, $S_2$, $S_3$, $S_4$, $S_5$, is given. Note that the distance matrix is symmetric, therefore we only need to fill in the upper triangular part. Then $T_0 = \{t_{ij}\}_{i<j} = \{0.5, 0.75, 1, 1.2, 2, 2.1, 2.2, 2.3\}$. Let's perform the clustering algorithm step by step (see Figure 10).

1. Pick the smallest element in T, $t_{12} = 0.5$. Add a new node at time $\tau_1 = t_{12} = 0.5$ to connect $S_1$ and $S_2$. After this step, $\Omega = \{S_1, S_2\}$, $T = T_0 \setminus \{0.5\}$.

2. Pick the smallest element in T, $t_{45} = 0.75$. Since $\{S_4, S_5\} \cap \Omega = \emptyset$, add a new node at time $\tau_2 = t_{45}$ connecting $S_4$ and $S_5$. After this step, $\Omega = \{S_1, S_2, S_4, S_5\}$, $T = T_0 \setminus \{0.5, 0.75\}$.

3. Pick the smallest element in T, $t_{13} = 1$. Then $\{S_1, S_3\} \cap \Omega = \{S_1\}$, and $\tau_1$ is the largest node which is connected to $S_1$, so add a new node at time $\tau_3 = t_{13} = 1$ connecting $S_3$ and node $\tau_1$. After this step, $\Omega = \{S_1, S_2, S_3, S_4, S_5\}$, $T = T_0 \setminus \{0.5, 0.75, 1\}$.

4. Pick the smallest element in T, $t_{2,3} = 1.2$, but $\{S_2, S_3\} \subset \Omega$ and $S_2$ and $S_3$ share a common

node $\tau_1$, so nothing is done. After this step, $\Omega = \{S_1, S_2, S_3, S_4, S_5\}$, $T = T_0 \setminus \{0.5, 0.75, 1, 1.2\}$.

5. Pick the smallest element in T, $t_{3,4} = 2$. Since $\{S_3, S_4\} \subset \Omega$ and $S_3$ and $S_4$ do not share a common ancestor, we need to add a new node to connect the two largest nodes that are connected to $S_3$ and $S_4$ respectively, i.e., $\tau_3$ and $\tau_2$. After this step, $\Omega = \{S_1, S_2, S_3, S_4, S_5\}$, $T = T_0 \setminus \{0.5, 0.75, 1, 1.2, 2\}$, and $\tau_4 = 2$.

6. Root is reached! The estimated tree is shown in Figure 10, Step 5, and the speciation times are: $\tau_1 = 0.5$, $\tau_2 = 0.75$, $\tau_3 = 1$ and $\tau_4 = 2$.

3.2 Simulation Study

To assess the performance of the new method STEST, we first carry out a simulation study. In the first case, we compare our method to *BEAST and STEM for the estimation of small species trees in the presence of gene flow. In the second case, we compare our method to STEM for the estimation of large species trees, again with gene flow. In both cases, we also study the performance of the methods when Zhu and Yang's (2012) method for estimating the distance matrix is replaced with the method M0 (Yang 2002) and with the method DB (Yang 2010). We call the two methods STEST (M0) and STEST (DB) respectively.

[Figure 10: six panels (Step 1 to Step 6) showing the tree over $S_1$ to $S_5$ as nodes $\tau_1$, $\tau_2$, $\tau_3$ and $\tau_4$ are added; the root is reached at Step 6.]

Figure 10. Illustration of the tree building algorithm. See text for details.

3.2.1 Simulation Study 1: Four-taxon Tree Under the n-island Model

Figure 11 shows the model species tree and parameters for the first simulation study. One lineage is sampled from each species under a four-taxon species tree. Since Zhu and Yang's method requires an outgroup, we specify species 0 to be the outgroup (Figure 11). 100 genes are sampled. All of the population sizes are assumed to be the same and equal to θ. Three bifurcating speciation events happen at times $\tau_1$, $\tau_2$, and $\tau_3$. Gene flow exists among all but population 0 before $\tau_1$ with $m_{ij}$ the migration rate from population i to population j ($1 \le i, j \le 3$). All the migration rates are equal. Thus the migration pattern follows an n-island model (Wright 1943).

To simulate the data, gene trees are first sampled from ms (Hudson 2002) under 20 different settings (see Table 5). Seq-Gen (Rambaut & Grassly 1997) is then used to generate full sequence data from the simulated gene trees under the JC69 model (Jukes and Cantor 1969). The length of each gene is set to be 1,000 bp. STEST, STEST (M0), STEST (DB) and *BEAST are used to get species tree estimates directly from sequence data. Gene trees are first estimated by PAUP* and are then used as the input to STEM. For each setting, the same procedure is repeated 80 times.

Simulation Study 2: Nine-Taxon Trees Under the n-island Model

Figure 12 shows the model species tree and parameters for the second simulation study. One lineage is sampled from each species under a nine-taxon species tree. We specify species 0 to be the outgroup (Figure 12). 100 genes are sampled. All of the population

Figure 11. Model species tree and parameters for simulation study 1.

Speciation interval | Migration rate (all equal)
                    | 0   0.05  0.1  0.2  0.3  0.4  0.5  0.6  0.7  1.0
0.5                 | A1  A2    A3   A4   A5   A6   A7   A8   A9   A10
2                   | B1  B2    B3   B4   B5   B6   B7   B8   B9   B10

Table 5. Settings for simulation study 1. Each entry in this table provides a label for a set of model parameters, e.g., A3 corresponds to the setting in which the differences $\tau_3 - \tau_2$ and $\tau_2 - \tau_1$ are both 0.5 and all migration rates are 0.1.

Figure 12. Model species tree for simulation study 2.

Speciation times: $\tau_1 = 2$, $\tau_2 = 3$, $\tau_3 = 4$, $\tau_4 = 5$, $\tau_5 = 6$, $\tau_6 = 10$

Migration rate | 0   0.1  0.5  1   5
Setting        | C1  C2   C3   C4  C5

Table 6. Settings for simulation study 2. Each entry in this table provides a label for a set of model parameters, e.g., C2 corresponds to the setting in which the speciation times are $\tau_1 = 2$, $\tau_2 = 3$, $\tau_3 = 4$, $\tau_4 = 5$, $\tau_5 = 6$ and $\tau_6 = 10$ and all of the migration rates are 0.1.

sizes are the same and are equal to θ. Eight bifurcating speciation events happen at times $\tau_1$, $\tau_1$, $\tau_2$, $\tau_2$, $\tau_3$, $\tau_4$, $\tau_5$ and $\tau_6$. Gene flow exists among all but population 0 before $\tau_1$ with $M_{ij}$ the migration rate from population i to population j ($1 \le i, j \le 8$). All of the migration rates are assumed to be equal. Thus the migration pattern again follows an n-island model.

Just as in the previous section, gene trees are sampled from ms and then Seq-Gen is used to generate sequence data from these simulated gene trees under the JC69 model. The length of the simulated sequences is 1,000 bp. STEST, STEST (M0) and STEST (DB) are applied to the sequence data directly. PAUP* is used to estimate ML gene trees from sequence data before STEM is applied. The parameter settings are listed in Table 6. We use 100 replicates for each setting.

3.3 Empirical Data Analysis

We apply the methods STEST, STEST (M0) and STEST (DB) to the HGCOR (Human, Chimpanzee, Gorilla, Orangutan and Rhesus) data set obtained by Ebersberger et al. (2007), who showed that the species tree topology is ((((H,C),G),O),R). R (Rhesus) is the outgroup. The data set contains 28,160 sequence alignments. Of these, 249 alignments are longer than 1,000 bp and are used in this analysis.

3.4 Results

Results of Simulation Study 1

Results from simulation study 1 are given in Tables 7 and 8 and are plotted in Figure 13. For each setting, the number of correct tree estimates (out of 80) is recorded and translated into a percentage.

In the short speciation interval scenarios (A1~A10), *BEAST outperforms STEM and STEST in all cases. When gene flow does not exist, STEM estimates 85% of the total trees correctly, which is better than STEST's 71% correct. However, in the presence of gene flow, the method STEST consistently yields better results than STEM. The performance of all three methods decreases as the migration rate increases. The difference in the percentage of correct estimates between *BEAST and STEST is large when the migration rate is small, but it tends to 0 when the migration rate is large (>1). STEM starts to perform poorly (percentage correct < 30%) once the migration rate is larger than 0.2. When the migration rate is larger than 1, the percentage correct for STEM is less than 10%. The performance of STEST (M0) and STEST (DB) is better than STEST but worse than *BEAST. STEST (M0)'s performance is very close to *BEAST's, and STEST (DB) is very close to STEST (Figure 13a).

In the long speciation interval scenarios (B1~B10), STEST outperforms *BEAST and STEM in almost all cases. When the migration rate is 0, all of the methods perform well (≥ 94% correct). The performance of STEM decreases dramatically in the presence of gene

flow (decrease in accuracy from 100% to 44% when the migration rate changes from 0 to 0.05). The same thing happens with *BEAST (decrease in accuracy from 91% to 54% when the migration rate changes from 0.05 to 0.1). The performance of STEST starts to drop after the migration rate is increased to 0.1. Its accuracy remains at 68% when the migration rate is 0.2 and falls below 30% when the migration rate is > 0.5. Again, STEST (M0)'s performance is very close to *BEAST's, and STEST (DB) is very close to STEST (Figure 13b).

[Figure 13: two panels plotting the percentage of correct trees against migration rate; (a) short speciation interval, (b) long speciation interval.]

Figure 13. Results for simulation study 1. Red: Results from STEM; Green: Results from *BEAST; Purple: Results from STEST; Gray: Results from STEST (DB); Pink: Results from STEST (M0).

Table 7. Simulation results (short speciation interval scenario): % correct trees for settings A1 to A10 (columns) and for *BEAST, STEM, STEST, STEST (DB) and STEST (M0) (rows).

Table 8. Simulation results (long speciation interval scenario): % correct trees for settings B1 to B10 (columns) and for *BEAST, STEM, STEST, STEST (DB) and STEST (M0) (rows).

3.4.2 Results of Simulation Study 2

The results of the second simulation study are given in Table 9. As expected, when there is no migration, STEST performs the worst (52%), and STEST (M0) performs the best (100%). When the migration rate is small (0.1), STEM begins to fail all the time (0%) and STEST (M0) has 12% correct. However, both STEST and STEST (DB) still have around 30% correct. When the migration rate is moderate or large (≥ 0.5), all four methods fail all the time.

Table 9. Results of simulation study 2: % correct trees for settings C1 to C5 (columns) and for STEM, STEST, STEST (DB) and STEST (M0) (rows).

3.4.3 Results of Empirical Data Analysis

The species tree estimate (((H,C),G),O) obtained using STEST, STEST (M0) and STEST (DB) agrees with the species tree topology obtained by Rannala and Yang (2003). The running times are 130s, 14s and 95s for STEST, STEST (M0) and STEST (DB), respectively, on a Linux machine with two eight-core Xeon E (2.8 GHz) CPUs and 384 GB RAM.

3.5 Discussion

3.5.1 Small Tree Estimation

In the absence of gene flow, STEST performs the worst (only 71% correct) among all of the methods in the short speciation interval scenario, which can be explained by the fact that the short speciation interval causes many deep coalescent events. STEST cannot handle the data well when there are many deep coalescent events and the discrepancy between gene trees and species trees is exclusively due to deep coalescence. This is counterintuitive, because the SIM3s model uses coalescent theory. However, under the SIM3s model, only three species are studied and one of them is the outgroup, which means that the coalescent event regarding the two ingroup species is expected to happen in population 4 (Figure 5a). This implies that no deep coalescent events are considered. *BEAST deals with this case well with 100% correct and STEM has 85% correct. In the long speciation interval scenario, when there are not as many deep coalescent events, all of the methods perform similarly well (all above 90% correct). This again demonstrates the influence of deep coalescence in phylogenetic inference problems.

In the short speciation interval scenario, when the migration rate is small (≤ 0.3), *BEAST performs excellently (≥ 90% correct) because the incongruence between gene trees and the species tree is influenced more by deep coalescence than by gene flow. When the migration rate becomes larger (>0.3), *BEAST's performance starts to drop as the incongruence between gene trees and the species tree is affected more and more by gene flow. STEST follows a similar pattern, with its performance worse than *BEAST's when gene flow is small because it cannot deal with deep coalescence. However, when the migration rate is large enough (>1.0), i.e. the incongruence is heavily influenced by gene flow, STEST has similar performance to *BEAST. STEST (DB) has a similar performance curve to STEST, which shows that the methods DB and SIM3s have similar performance when only a portion of a large tree is studied. STEM's performance curve is different. It decreases dramatically as the migration rate increases. The reason is that as the gene flow increases, the minimal coalescent time tends to zero. Therefore, STEM produces a lot of unresolved species trees, which implies that the data do not have enough information for species tree estimation through STEM's approach. STEST (M0) can be treated as an extension of STEM, because both of them use speciation time estimates in the absence of gene flow to build species trees. It outperforms STEM in all the cases.

In the long speciation interval scenario, deep coalescence is no longer a problem. The incongruence between gene trees and species trees is mostly due to gene flow. Therefore STEST outperforms *BEAST and STEM almost everywhere. When the migration rate falls in the interval (0, 0.3), STEST performs significantly better than *BEAST and STEM do. When the migration rate is larger than 0.6, all three of the methods perform poorly (< 30% accuracy). The performance curve also decreases as the migration rate increases. However, the slope of the performance curve is steeper than that in the short speciation interval scenario. A possible reason is that a longer speciation interval allows more migration events when the migration rates are the same. Therefore, in the long speciation interval scenario, the same increment in migration rate produces a larger increase in the number of migration events, which makes the performance of these methods decrease more. Again the performance of STEST (DB) is very close to that of STEST, and STEST (M0) always outperforms STEM.

There are multiple possible reasons why the performance of STEST decreases when the migration rate increases. The first one is that the assumption that no other populations are exchanging genes with the focal populations in the IM model is violated. When the migration rate is small, such a violation does not matter much. However, when the migration rate between the focal populations and the non-focal populations is large, Zhu and Yang's method or any IM-based method is no longer applicable to estimate speciation times between two species. The second possible reason is that when the migration rate is large, the likelihood surface becomes flat. Therefore it is more difficult to locate the global maximum of the likelihood function. To improve this, Zhu and Yang's method could be replaced by better methods (if any were developed) to estimate speciation times in the presence of gene flow.

3.5.2 Large Tree Estimation

The performance of STEST is much worse than in the small tree estimation case. A possible reason is that there are too many non-focal populations exchanging genes with the two focal populations in the speciation time estimation process. These genes disturb the signal that SIM3s uses to estimate speciation times. The performance of STEST (DB) is better than that of STEST when no gene flow exists and similar to that of STEST when migration is small (both ~30% when the migration rate is 0.1). STEM's performance is similar to the small species tree case. It has ~80% correct when no gene flow exists and collapses even when the migration rate is as small as 0.1. STEST (M0) again outperforms STEM. When the migration rate is 0, it always gives the correct estimates, and when the migration rate is 0.1, it has 12% correct. All four methods collapse when the gene flow is moderate or large (migration rate ≥ 0.5).
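The explanation given above for STEM, that the minimal coalescent time tends to zero as gene flow increases, can be illustrated with a toy simulation. This is a sketch under deliberately crude assumptions and is neither the IM model nor STEM itself: each locus is assumed to be affected by migration with a hypothetical probability `p_migrant`, in which case its between-species coalescent time is drawn uniformly below the speciation time `tau`; otherwise it is `tau` plus an exponential waiting time. All names and parameter values are made up for illustration.

```python
import random


def min_coalescent_time(tau, p_migrant, n_loci, rng):
    """Minimum between-species coalescent time across loci under a toy mixture model."""
    times = []
    for _ in range(n_loci):
        if rng.random() < p_migrant:
            # Toy assumption: a migrant lineage lets the two samples
            # coalesce at a uniform time before the speciation time.
            t = tau * rng.random()
        else:
            # Without migration, coalescence can only happen in the
            # ancestral population, i.e. at or after the speciation time.
            t = tau + rng.expovariate(1.0)
        times.append(t)
    return min(times)


rng = random.Random(2014)
for p in (0.0, 0.1, 0.5, 1.0):
    # Average the minimum over repeated toy data sets of 20 loci each.
    mean_min = sum(min_coalescent_time(1.0, p, 20, rng) for _ in range(200)) / 200
    print(f"p_migrant={p:.1f}  mean minimum coalescent time={mean_min:.3f}")
```

Because the minimum is taken over all loci, a single locus affected by migration is enough to pull the distance below the speciation time, which is why a distance built from minimal coalescent times loses its signal quickly in this toy setting.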

3.5.3 Empirical Data Analysis

In the empirical study, all three newly proposed methods estimate the correct tree topology within three minutes. This is expected because there is not much gene flow in the data set.

3.5.4 Summary

In summary, when the discrepancy between gene trees and the species tree is heavily influenced by gene flow (under the n-island model), and the migration rate is not very large, STEST provides a better way to estimate species trees than *BEAST and STEM. When the migration rate is 0, STEST (M0) provides a better way to estimate species trees (it is faster than *BEAST and more accurate than STEM). Furthermore, some of these restrictions may be removed if better speciation time estimation methods can be developed.

3.6 Future Work

There are two directions in which to extend our study in the future. The first direction is to develop better speciation time estimation methods. Our idea is to use the speciation time estimates as distances to estimate species trees. The better the quality of the speciation time estimates, the more accurate our method will be. The second direction is to explore the situations in which our current methods STEST, STEST (M0) and STEST (DB) are applicable. In our study, we only examined the performance of these methods on data generated under the simplest migration model, the n-island model. There are far more cases where gene flow could exist in nature. It is worthwhile to investigate as many situations as possible and to find the best applicable domain for each method.
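The idea of using speciation time estimates as distances can be sketched as follows. This is a minimal illustration rather than the STEST implementation: it assumes that the pairwise speciation times have already been estimated (for example by SIM3s, M0 or DB), and it assumes UPGMA-style average-linkage clustering as the tree-building step, which the text does not specify. The function name and all numeric values are hypothetical.

```python
from itertools import combinations


def build_tree_from_times(taxa, times):
    """Join clusters by smallest average pairwise speciation time (UPGMA-style, assumed)."""
    # Each cluster is (subtree, tuple of leaf names in that subtree).
    clusters = [(name, (name,)) for name in taxa]

    def distance(a, b):
        # Average of the estimated speciation times over all leaf pairs.
        pairs = [(x, y) for x in a[1] for y in b[1]]
        return sum(times[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Hypothetical pairwise speciation time estimates for four taxa.
times = {
    frozenset(("H", "C")): 0.004,
    frozenset(("H", "G")): 0.006,
    frozenset(("C", "G")): 0.006,
    frozenset(("H", "O")): 0.012,
    frozenset(("C", "O")): 0.012,
    frozenset(("G", "O")): 0.012,
}
tree = build_tree_from_times(["H", "C", "G", "O"], times)
print(tree)
```

With these made-up values, H and C are joined first, then G, then O, so the result has the topology (((H,C),G),O). Any improvement in the speciation time estimates enters this procedure only through the `times` table, which is why the estimation method can be swapped without changing the tree-building step.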

Chapter 4: Discussion and Future Directions

4.1 Discussion

In this thesis, we proposed two methods. The first one includes a maximum likelihood estimation method to jointly estimate speciation times, population sizes and migration rates, and a likelihood ratio test to detect gene flow from gene tree data. This method is faster than the Bayesian methods. Simulation has shown that it is very powerful in dealing with known gene tree data, but cannot handle estimated gene tree data well. Further development is thus needed for this method (Chapter 2).

The second method is called STEST (Species Tree Estimation from Speciation Times). It uses the speciation times as the distances to estimate species trees. The motivation for developing this method is to accommodate possible gene flow. We chose Zhu and Yang (2012)'s method SIM3s for the original STEST method because SIM3s estimates speciation times in the presence of gene flow and it is fast enough to produce the estimates within a reasonable time. However, SIM3s can be replaced with any speciation time estimation method. There are two examples in this thesis. STEST (M0) uses speciation time estimates from Yang (2002)'s method M0, and STEST (DB) uses speciation time estimates from Yang (2010)'s method DB. The method M0 estimates speciation times in the absence of gene flow. Therefore, STEST (M0) can be treated as a direct extension of STEM, because both of them use speciation time estimates in the absence of gene flow as distances to build species trees. Simulation has shown that STEST (M0) almost always outperforms STEM. Like SIM3s, DB also estimates speciation times in the presence of gene flow. STEST (DB) has a similar performance to the original STEST method. Simulation has shown that under certain conditions STEST-based methods yield more accurate estimates than *BEAST and STEM (Chapter 3).

4.2 Future Direction

There are many directions in which to extend our study in the future. For example, increasing the number of sampled genes, increasing the number of lineages within each population, adding an outgroup and explicitly checking the distribution of gene tree topologies may improve the performance of the first method. For the second method, we can either work to improve its performance by developing better methods to estimate speciation times, or investigate the situations in which the method is applicable. In the next section, we will extend our first method by explicitly checking the distribution of gene tree topologies, i.e., the fourth direction listed above.

An Extension to Method 1

To reduce the complexity, let us consider three instead of four lineages in a two-species IM model. The model species tree and parameters are shown in Figure 14a. Two lineages are

Figure 14. An extended model for method 1. (a) Model species tree and the parameters. (b) G1a: two coalescent events happen in population 1. (c) G1b: the first coalescent event happens in population 1, the second in population 2. (d) G1c: two coalescent events happen in population 2. (e) G1d: the first coalescent event happens in population 2, the second in population 1. (f) G2a: the first coalescent event happens in population 1, the second in population A. (g) G2b: the first coalescent event happens in population 2, the second in population A. (h) G3: two coalescent event happens in population 1, the second in population A
