Population Structure and Genealogies


One of the key properties of Kingman's coalescent is that each pair of lineages is equally likely to coalesce whenever a coalescent event occurs. This condition is violated in structured populations, since lineages will on average be more closely related to other lineages belonging to the same subpopulation. Consequently, Kingman's coalescent does not correctly describe the distribution of the genealogy of a random sample of chromosomes from a structured population. Instead, we must work with a generalization called the structured coalescent.

Jay Taylor (ASU), Structured Coalescents and Applications, 16 Feb 2017

Example: Two Demes

Suppose that a population of diploid organisms is subdivided into two demes, where:

- The population sizes are $N_1$ and $N_2$, respectively.
- Generations are non-overlapping.
- Reproduction in each deme follows the Wright-Fisher model, modified to allow for migration.

Specifically, a chromosome in deme $i$ is descended either from a randomly sampled chromosome from the same deme, with probability $m_{ii}$, or from a randomly sampled chromosome from the other deme $j$, with probability $m_{ij}$.

Backward vs. Forward Migration Rates

The quantities $m_{ii}$ and $m_{ij}$ introduced on the previous slide are called backward migration rates, since they describe the rate at which lineages in deme $i$ either remain in deme $i$ or move to deme $j$ when we look backwards in time. In contrast, in ecology we usually think in terms of forward migration rates, which describe the rate at which individuals born in one location migrate to a new location. Under our modified Wright-Fisher model, the two are related as follows. If $q_{ij}$ is the probability that an individual born in deme $i$ migrates to deme $j$, then

$$m_{ij} = \frac{N_j}{N_i}\, q_{ji}.$$

Caveat: Most coalescent-based analyses of migration report backward migration rates.
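
The relation between forward and backward rates can be made concrete with a short sketch. The function name and the numerical values below are illustrative assumptions and do not come from the slides; the only thing taken from the text is the formula $m_{ij} = (N_j / N_i)\, q_{ji}$.

```python
def backward_rate(N_i, N_j, q_ji):
    """Backward migration rate m_ij for lineages currently in deme i.

    N_i, N_j : sizes of demes i and j
    q_ji     : forward probability that an individual born in deme j
               migrates to deme i
    """
    return (N_j / N_i) * q_ji


# Hypothetical example: a small deme 1 and a large deme 2 that exchange
# migrants with the same forward probability.
N1, N2 = 1000, 4000
q12 = q21 = 0.001

m12 = backward_rate(N1, N2, q21)  # lineages in deme 1 tracing back to deme 2
m21 = backward_rate(N2, N1, q12)  # lineages in deme 2 tracing back to deme 1
print(m12, m21)
```

Under these assumed numbers, symmetric forward migration gives asymmetric backward rates: a lineage in the small deme is more likely to trace back to the large deme than the reverse.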

Scaling Assumptions

It is customary to make the following assumptions.

1. The population sizes $N_1$ and $N_2$ are large enough (say > 100) that multiple mergers and other complex coalescent events can be neglected (i.e., the genealogy is almost certain to be a binary tree).
2. The backward migration rates are of order $(N_1 + N_2)^{-1}$. This ensures that it is very unlikely that two or more events will happen in the same generation.

Remark: If the second condition fails because the migration rates are much larger than this, then migration events happen much more rapidly than coalescent events, so that the population is effectively panmictic on the coalescent time scale. In this case, Kingman's coalescent applies with a suitably defined $N_e$.

Suppose that we have sampled $n_1$ chromosomes from deme 1 and $n_2$ chromosomes from deme 2. Looking backwards in time, the genealogy of this sample can be influenced by the following events:

- At rate $\binom{n_1}{2} \frac{1}{2N_1}$, a randomly chosen pair of lineages in deme 1 coalesces, reducing $n_1$ to $n_1 - 1$.
- At rate $\binom{n_2}{2} \frac{1}{2N_2}$, a randomly chosen pair of lineages in deme 2 coalesces, reducing $n_2$ to $n_2 - 1$.
- At rate $n_1 m_{12}$, a randomly chosen lineage moves from deme 1 to deme 2, reducing $n_1$ to $n_1 - 1$ and increasing $n_2$ to $n_2 + 1$.
- At rate $n_2 m_{21}$, a randomly chosen lineage moves from deme 2 to deme 1, reducing $n_2$ to $n_2 - 1$ and increasing $n_1$ to $n_1 + 1$.

This process continues until there is only one lineage remaining, at which point the entire genealogy of the sample is determined, including information about the locations of ancestral lineages. This is known as the structured coalescent.
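
As a worked illustration of these four rates, take the hypothetical values $n_1 = 3$, $n_2 = 2$, $N_1 = 1000$, $N_2 = 500$ and $m_{12} = m_{21} = 0.001$. None of these numbers come from the slides; they are chosen only to make the arithmetic easy to follow.

```latex
\begin{align*}
\text{coalescence in deme 1:} \quad & \binom{3}{2}\frac{1}{2 \cdot 1000} = 0.0015 \\
\text{coalescence in deme 2:} \quad & \binom{2}{2}\frac{1}{2 \cdot 500} = 0.001 \\
\text{migration } 1 \to 2: \quad & 3 \times 0.001 = 0.003 \\
\text{migration } 2 \to 1: \quad & 2 \times 0.001 = 0.002 \\
\text{total rate:} \quad & 0.0015 + 0.001 + 0.003 + 0.002 = 0.0075 \text{ per generation}
\end{align*}
```

Treating the four events as competing exponential clocks, the expected waiting time to the next event is $1/0.0075 \approx 133$ generations, and that event is a coalescence with probability $0.0025/0.0075 = 1/3$.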

Structured Coalescents

We can also define structured coalescents when there are more than two demes. Suppose that there are $D$ demes of sizes $N_1, \dots, N_D$ and that the backward migration rate from deme $i$ to deme $j \neq i$ is $m_{ij}$. Given a random sample of $n_1$ chromosomes from deme 1, $n_2$ chromosomes from deme 2, etc., the following events can occur:

- Coalescent events: At rate $\binom{n_i}{2} \frac{1}{2N_i}$, a randomly chosen pair of lineages in deme $i$ coalesces, reducing $n_i$ to $n_i - 1$.
- Migration events: At rate $n_i m_{ij}$, a randomly chosen lineage moves from deme $i$ to deme $j$, reducing $n_i$ to $n_i - 1$ and increasing $n_j$ to $n_j + 1$.
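
The event rates above can be turned into a simple continuous-time simulation. The sketch below is one possible way to do this and is not taken from MIGRATE, BEAST or any other package: the function name, its arguments and the example values are all hypothetical, and it tracks only the number of lineages in each deme (enough to obtain the time to the most recent common ancestor), not the tree topology.

```python
import random
from math import comb


def simulate_structured_coalescent(n, N, m, rng):
    """Simulate lineage counts under the structured coalescent.

    n   : list of sample sizes, one per deme
    N   : list of diploid deme sizes
    m   : matrix of backward migration rates, m[i][j] for i != j
          (diagonal entries are ignored)
    rng : a random.Random instance
    Returns (time to the MRCA in generations, list of events), where each
    event is a tuple (time, kind, i, j) with kind "coal" or "mig".
    """
    n = list(n)
    D = len(n)
    t = 0.0
    events = []
    while sum(n) > 1:
        # Enumerate every event that is currently possible, with its rate.
        choices, rates = [], []
        for i in range(D):
            coal_rate = comb(n[i], 2) / (2 * N[i])
            if coal_rate > 0:
                choices.append(("coal", i, i))
                rates.append(coal_rate)
            for j in range(D):
                if j != i and n[i] > 0 and m[i][j] > 0:
                    choices.append(("mig", i, j))
                    rates.append(n[i] * m[i][j])
        total = sum(rates)
        if total == 0:
            raise ValueError("lineages are isolated and can never coalesce")
        # Exponential waiting time, then pick an event in proportion to its rate.
        t += rng.expovariate(total)
        kind, i, j = rng.choices(choices, weights=rates)[0]
        n[i] -= 1              # both event types remove a lineage from deme i
        if kind == "mig":
            n[j] += 1          # a migrating lineage reappears in deme j
        events.append((t, kind, i, j))
    return t, events


# Hypothetical two-deme example.
rng = random.Random(42)
tmrca, history = simulate_structured_coalescent(
    n=[3, 2], N=[1000, 500], m=[[0, 0.001], [0.001, 0]], rng=rng
)
print(f"TMRCA: {tmrca:.1f} generations after {len(history)} events")
```

A sample of five lineages always needs exactly four coalescent events to reach a single ancestor, while the number of migration events varies from run to run.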

Pairwise Coalescent Times

Suppose that the population contains $D$ demes, each containing $N$ diploid individuals, and that the migration rate between any pair of demes is $m$. This is Wright's finite island model. Let $T_w$ and $T_b$ be the pairwise coalescent times (in generations) when two chromosomes are sampled from the same deme or from two different demes, respectively. Then

$$E[T_w] = 2ND, \qquad E[T_b] = 2ND + \frac{D - 1}{2m}.$$

Notice that $E[T_w]$ does not depend on $m$, but does depend on $D$, i.e., even if we sample chromosomes from the same deme, their expected coalescent time is affected by population structure. The pairwise coalescent time of two chromosomes sampled from different demes does depend on $m$ and is greater than that of two chromosomes sampled from the same deme.
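
These two expectations can be checked with a first-step analysis. The sketch below assumes that $m$ is the total per-generation probability that a lineage leaves its deme, spread evenly over the other $D - 1$ demes, and that events can be treated as competing exponential clocks (the scaling assumptions made earlier). Both points are assumptions about the convention behind the formulas, and the function name is hypothetical.

```python
from fractions import Fraction


def pairwise_times(N, D, m):
    """Expected pairwise coalescence times (E[T_w], E[T_b]) in generations.

    First-step equations, with a = 1/(2N):
        T_w = 1/(a + split) + split/(a + split) * T_b
        T_b = 1/join + T_w
    where split = 2m is the rate at which two lineages in the same deme
    are separated and join = 2m/(D - 1) is the rate at which two lineages
    in different demes end up in the same deme.
    """
    a = Fraction(1, 2 * N)
    m = Fraction(m)
    split = 2 * m
    join = 2 * m / (D - 1)
    # Substituting the second equation into the first gives
    # T_w * a = 1 + split/join.
    T_w = (1 + split / join) / a
    T_b = 1 / join + T_w
    return T_w, T_b


# Hypothetical example: 5 demes of 100 diploids, m = 1/1000.
N, D, m = 100, 5, Fraction(1, 1000)
T_w, T_b = pairwise_times(N, D, m)
print(T_w, T_b)
print(T_w == 2 * N * D, T_b == 2 * N * D + (D - 1) / (2 * m))
```

Exact rational arithmetic is used so that the comparison with the closed-form expressions is not blurred by floating-point rounding.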

Inference using the Structured Coalescent

Several software packages are available that use the structured coalescent to analyze sequence data sampled from subdivided populations.

MIGRATE is a program by Peter Beerli (Beerli 2009) that can be used to estimate backward migration rates and effective population sizes from sequence data sampled from a subdivided population.

- It is based on the modified Wright-Fisher model described above.
- It can carry out both Bayesian and maximum likelihood inference (one at a time) using Monte Carlo algorithms.
- Substitution models: infinite alleles model; F84 model for DNA sequences; infinite sites model for SNP data; stepwise mutation model (SMM) for microsatellites.
- Inference can be done using multiple unlinked loci, but each locus is assumed to be non-recombining.

Bayesian Phylogeography using BEAST

BEAST is able to analyze data from subdivided populations using an approximation to the structured coalescent known as the discrete phylogeographical model (Lemey et al. 2009).

- In this approach, location is treated as a neutrally evolving character that takes values in a discrete set (the set of locations).
- Pairs of lineages having the same location can coalesce and individual lineages can move between locations.
- A variable selection procedure (stochastic search variable selection; SSVS) is used to sample sparse migration rate matrices, i.e., this method tends to minimize the number of migration rates that are positive.
- It is also possible to run analyses that use a GLM to identify relationships between the migration rates and other variables of interest such as temperature, transport networks, etc. (Lemey et al. 2014).

Regional migration rates of P. vivax inferred from mtDNA

BEAST estimates both the inclusion probability and the magnitude of each rate.

- The inclusion probability is the posterior probability that the rate is non-zero.
- Inclusion probabilities can be converted to Bayes factors; customarily, a rate is reported only if BF > 3.
- The magnitude is expressed as migrations per lineage per unit time.

Taylor et al. (2013)
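
The conversion from an inclusion probability to a Bayes factor uses the general definition of a Bayes factor as posterior odds divided by prior odds. The sketch below assumes that the prior inclusion probability is known from how the analysis was set up. The function name and the example numbers are hypothetical and are not part of BEAST.

```python
def inclusion_bayes_factor(posterior_prob, prior_prob):
    """Bayes factor in favor of a non-zero rate.

    posterior_prob : posterior probability that the rate is included
    prior_prob     : prior probability that the rate is included
                     (an input that depends on the prior chosen for the run)
    Both probabilities must lie strictly between 0 and 1.
    """
    for p in (posterior_prob, prior_prob):
        if not 0 < p < 1:
            raise ValueError("probabilities must lie strictly between 0 and 1")
    posterior_odds = posterior_prob / (1 - posterior_prob)
    prior_odds = prior_prob / (1 - prior_prob)
    return posterior_odds / prior_odds


# Hypothetical example: a rate included in 90% of posterior samples when
# the prior inclusion probability was 50%.
bf = inclusion_bayes_factor(0.9, 0.5)
print(f"BF = {bf:.2f}; reported under the BF > 3 convention: {bf > 3}")
```

The same posterior inclusion probability yields a larger Bayes factor when the prior inclusion probability is small, so the prior needs to be stated alongside any reported value.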

Inference of Ancestral Locations in BEAST: Rabies in African Wild Dogs

Lemey et al. (2009) Bayesian Phylogeography Finds its Roots. PLoS Computational Biology 5(9): e1000520.

Predictors of global H3N2 diffusion

Lemey et al. (2014) Unifying Viral Genetics and Human Transportation Data to Predict the Global Transmission Dynamics of Human Influenza H3N2. PLoS Pathogens 10(2): e1003932.

Isolation-with-Migration Models

The structured coalescent can also be extended to models that allow the number of populations and their migration rates to change over time. These are known as isolation-with-migration (IM) models and are useful for analyzing biogeographical processes and incipient speciation.

- IM is a program by Hey & Nielsen (2007) that can be used to do inference for IM models using DNA data.
- Hey (2010) extended this to allow for arbitrary numbers of populations and phylogenies (IMa2).
- Inference is carried out within a Bayesian framework using MCMC.

Source: Hey (2010)

Assumptions

- The history of the sampled populations can be represented by a rooted bifurcating tree.
- No gene flow has occurred between unsampled populations and those contained within the tree.
- Each subpopulation is constant in size and follows the Wright-Fisher model.
- Pairwise migration rates are constant, but not necessarily symmetrical.
- No bottlenecks occurred when ancestral populations split.
- The individual loci are unlinked and no recombination occurs within loci.
- Each locus evolves neutrally: no selection at the locus or at other linked loci.
- Substitution models: HKY and infinite sites for sequence data; SMM for microsatellite data.

IMa2 Input and Estimated Parameters

Suppose that we have sequence data sampled from a population subdivided into $k$ subpopulations. To analyze this using IMa2, we need a bifurcating tree which describes a history of population splits terminating in a single ancestral population. The following unknown parameters will be estimated by the program.

- $k$ population mutation rates $\theta_1, \dots, \theta_k$ for the extant populations and $k - 1$ population mutation rates $\theta_{k+1}, \dots, \theta_{2k-1}$ for the ancestral populations. These are all scaled by the average mutation rate $\mu$.
- $2(k - 1)^2$ backward migration rates $m_{ij} = M_{ij}/\mu$ between each pair of coexisting populations.
- $k - 1$ population split times.

Caveat: Even for modest values of $k$, there are many parameters that need to be estimated (e.g., 45 when $k = 5$), which is only possible with substantial data and computer time.
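
The growth in the number of parameters can be tabulated directly from the three counts listed above. The helper below is only arithmetic on those counts; its name is hypothetical and it is not part of IMa2.

```python
def ima2_parameter_count(k):
    """Number of parameters listed above for k sampled populations."""
    thetas = k + (k - 1)            # extant plus ancestral populations
    migration = 2 * (k - 1) ** 2    # backward migration rates
    splits = k - 1                  # population split times
    return thetas + migration + splits


for k in range(2, 8):
    print(k, ima2_parameter_count(k))
```

For $k = 5$ this gives $9 + 32 + 4 = 45$, matching the figure quoted in the caveat. The migration rates dominate the total because their number grows quadratically in $k$.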

Fig. 5: Sample Size - loci vs. individuals

Fig. 7: Samples with low diversity

Fig. 9: Effect of the prior on the posterior distribution of migration rates
