Introduction to Coalescent Models, Biostatistics 666, Lecture 4
Last Lecture
  Linkage Equilibrium
    Expected state for distant markers
  Linkage Disequilibrium
    Association between neighboring alleles
    Expected to decrease with distance
  Measures of linkage disequilibrium
    D, D′ and Δ² (or r²)
Previously
  DNA sequence variation
  Types of DNA variants
  Allele frequencies
  Genotype frequencies
  Hardy-Weinberg Equilibrium
Making predictions
  What allele frequencies do we expect?
  How much variation in a gene?
  How are neighboring variants related?
Simple Approach: Simulation
  1. N starting sequences
  2. Sample N offspring sequences
     Apply mutations according to µ
  3. Increment time
  4. If enough time has passed
     Generate final sample
     Stop.
  5. Otherwise, return to step 1.
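The forward simulation steps can be sketched in Python as below. This is a minimal illustration, not the lecture's own code: the function name `simulate_forward`, the representation of a sequence as a set of mutation IDs (an infinite-sites style bookkeeping), and the simplification of at most one new mutation per sequence per generation (reasonable only when µ is small) are all assumptions.

```python
import random

def simulate_forward(n_seqs, mu, generations, seed=None):
    """Forward simulation of a constant-size haploid population.

    Each sequence is a frozenset of mutation IDs, so every new mutation
    is distinct (assumed infinite-sites bookkeeping).
    """
    rng = random.Random(seed)
    # Step 1: N identical starting sequences (no mutations yet).
    population = [frozenset() for _ in range(n_seqs)]
    next_mutation_id = 0
    # Steps 3-5: repeat until enough generations have passed.
    for _ in range(generations):
        offspring = []
        # Step 2: sample N offspring, each copying a random parent.
        for _ in range(n_seqs):
            child = rng.choice(population)
            # Assumption: at most one mutation per sequence per generation.
            if rng.random() < mu:
                child = child | {next_mutation_id}
                next_mutation_id += 1
            offspring.append(child)
        population = offspring
    # Step 4: the final generation is the sample.
    return population

final_population = simulate_forward(n_seqs=100, mu=0.01, generations=200, seed=666)
print(len(set(final_population)), "distinct sequences in the final population")
```

Every generation touches all N sequences, including lineages that later go extinct, which is the cost the coalescent approach avoids.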
Simulating a Population
  [Figure: simulated population; axes: Sequences, Time]
Today
  Introduce coalescent approach
  Framework for studying genetic variation
    Provides intuition on patterns of variation
    Provides analytical solutions
Aim
  Gene genealogies: descriptions of relatedness between sequences
    Analogous to phylogenetic trees for species
  The shape of the genealogy depends on population history, selection, etc.
  Together with mutation rate, genealogy predicts DNA variation
Genealogy
  History of a particular set of sequences
    Describes their relatedness
    Specifies divergence times
  Includes only a subset of the population
  Most Recent Common Ancestor (MRCA)
Coalescent approach
  Generate genealogy for a sample of sequences.
  Introduces computational and analytical convenience.
  Instead of proceeding forward through time, go backwards!
History of the Population
Genealogy of Final Population
Levels of Complexity
  History of the population
    Includes sequences that are extinct
  History of all modern sequences
    Includes sequences that we haven't sampled
  History of a subset of modern sequences
    Minimalist approach!
Parameters we will focus on
  Mutation rate (µ)
  Population size
    Haploid population (N chromosomes)
    Diploid population (2N chromosomes)
  Time (t)
  Sample size (n)
  Recombination rate (r)
Other Parameters
  Selection
    For gene of interest
    For neighboring gene
  Demographic parameters
    Migration
    Population structure
    Population growth
Mutation Model
  The mutation process is complex
    Rate depends on surrounding sequence
    Reverse mutations are possible
  Two simple models are popular
    Infinite alleles: every mutation generates a different allele
    Infinite sites: every mutation occurs at a different site
Mutation Model
  Focus on infinite sites model
    Mutation rate in genomic DNA is ~$10^{-8}$ / bp
    Recurrent mutations should be very rare
  Scaled mutation rate parameter, e.g.:
    1000 bp sequence
    $10^{-8}$ mutations per base pair per generation
    µ = $10^{-5}$ per sequence per generation
Neutral Variants
  Variants that do not affect fitness
    Accumulate inexorably through time
    Lost through genetic drift
  Do not affect genealogy
Example: Modeling Accumulation of Mutations
  Population of identical sequences
  Sample one descendant after t generations
  How many mutations have accumulated?
    Hint: depends on mutation rate µ and time t
  Tougher questions
    How many mutations have been fixed?
    How much variation in the total population?
So far
  Divergence of a single sequence
  Accumulation of mutations
    Depends on time t
    Depends on mutation rate µ
    Does not depend on population size N
    Does not depend on population growth
  Next: A pair of sequences!
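The single-sequence case can be written out as a short worked equation. This sketch assumes neutral mutations, a per-sequence rate µ per generation, and at most one mutation per generation on the lineage. The symbol K (mutations accumulated on the lineage) and the value of t in the example are introduced here for illustration only.

```latex
% K = number of mutations accumulated on one lineage after t generations
K \sim \mathrm{Binomial}(t, \mu) \approx \mathrm{Poisson}(\mu t),
\qquad E(K) = \mu t

% Example with \mu = 10^{-5} per sequence per generation
% and a hypothetical t = 10^{4} generations:
E(K) = 10^{-5} \times 10^{4} = 0.1
```

Neither N nor the growth history appears in the result, because only one line of descent is followed.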
A tougher example
  Sample of two sequences
    100 bp each
    How many differences are expected?
  Population of size N = 1000
  Mutation rate
    µ = $10^{-8}$ / bp / generation
    µ = $10^{-6}$ / 100 bp / generation
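Genealogy of two sequences
  [Figure: genealogy of Sequence 1 and Sequence 2 back to their MRCA, with time T(2) marked]
  Mutations between MRCA and Sequence 1?
Genealogy of two sequences
  [Figure: genealogy of Sequence 1 and Sequence 2 back to their MRCA, with time T(2) marked]
  Total mutations in genealogy?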
Number of mutations S
  Distributed as Poisson, conditional on total tree length
  $E(S) = \mu E(T_{tot})$
  $\mathrm{Var}(S) = E[\mathrm{Var}(S \mid T)] + \mathrm{Var}[E(S \mid T)] = \mu E(T_{tot}) + \mu^2 \mathrm{Var}(T_{tot})$
  $T_{tot}$ is the total length of all branches
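Estimating T(2)
  Probability that two sequences have distinct ancestors in previous generation
    $P(2) = \frac{N-1}{N} = 1 - \frac{1}{N}$
  Probability of distinct ancestors for t generations is $P(2)^t$
Probability of MRCA at time t+1
  $P(2)^t \, (1 - P(2)) = \left(\frac{N-1}{N}\right)^t \frac{1}{N} = \left(1 - \frac{1}{N}\right)^t \frac{1}{N} \approx \frac{1}{N} e^{-t/N}$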
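As a numerical check, the sketch below compares the exact geometric probability with its exponential approximation and then applies the result to the two-sequence example (N = 1000, µ = 10^-6 per 100 bp per generation). It assumes N counts chromosomes (haploid model); the function names are illustrative.

```python
import math

def p_mrca_exact(t, N):
    """Exact P(two sequences first share an ancestor t+1 generations back)."""
    return (1 - 1 / N) ** t * (1 / N)

def p_mrca_approx(t, N):
    """Exponential approximation (1/N) * exp(-t/N), valid for large N."""
    return math.exp(-t / N) / N

N = 1000      # chromosomes in the population (assumed haploid model)
mu = 1e-6     # mutations per 100 bp sequence per generation

# The waiting time is geometric with success probability 1/N, so its mean is N.
expected_t2 = N
# Mutations can occur on both branches, giving a total tree length of 2 * T(2).
expected_differences = 2 * expected_t2 * mu

print(p_mrca_exact(500, N), p_mrca_approx(500, N))
print("expected differences:", expected_differences)
```

Under these assumptions two 100 bp sequences are expected to differ at about 0.002 sites, so most such pairs would be identical.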
For n > 2
  Coalescence when two sequences have common ancestor
  For simplicity, consider the possibility of multiple simultaneous coalescent events to be negligible
  Requirements for no coalescence:
    Pick one ancestor for sequence 1
    Pick distinct ancestor for sequence 2
    Pick yet another ancestor for sequence 3
Estimating P(n)
  Probability that n sequences have n distinct ancestors in previous generation
    $P(n) = \prod_{i=1}^{n-1} \frac{N-i}{N} \approx 1 - \frac{\binom{n}{2}}{N}$
  Assume:
    N is large
    n is small
    Terms of order $N^{-2}$ can be ignored
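The quality of this approximation is easy to check numerically. The sketch below, with illustrative function names, computes the exact product and the first-order approximation so that the size of the dropped N^-2 terms can be seen directly.

```python
from math import comb

def p_distinct_exact(n, N):
    """Exact probability that n sequences have n distinct ancestors."""
    p = 1.0
    for i in range(1, n):
        p *= (N - i) / N
    return p

def p_distinct_approx(n, N):
    """First-order approximation 1 - C(n, 2) / N, dropping terms of order N^-2."""
    return 1 - comb(n, 2) / N

for n, N in [(2, 1000), (10, 10000), (50, 10000)]:
    exact = p_distinct_exact(n, N)
    approx = p_distinct_approx(n, N)
    print(n, N, exact, approx, exact - approx)
```

The gap grows as n approaches N, which is why the slide requires N large and n small.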
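Probability of Coalescence at Time t+1
  $P(n)^t \, (1 - P(n)) \approx \left(1 - \frac{\binom{n}{2}}{N}\right)^t \frac{\binom{n}{2}}{N} \approx \frac{\binom{n}{2}}{N} e^{-\binom{n}{2} t / N}$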
Time to next coalescent event
  Use an exponential distribution to approximate time to next coalescent event
    Decay rate: $\lambda = \binom{n}{2} / N$
    Mean: $1/\lambda = N / \binom{n}{2}$
T(j)
  For convenience, measure time to next coalescent event in units:
    N generations for haploids
    2N generations for diploids
  $E(T_j) = 1 / \binom{j}{2}$
  How would you calculate time to MRCA of n sequences?
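One way to answer the question is to sum the waiting times while the number of lineages drops from n to 2, which gives E(T_MRCA) = Σ 1/C(j,2) = 2(1 - 1/n) in the scaled time units. The sketch below checks this against simulated exponential waiting times. The function names, seed, and replicate count are illustrative choices.

```python
import random
from math import comb

def expected_tmrca(n):
    """E(T_MRCA) = sum over j = 2..n of 1 / C(j, 2), in scaled time units."""
    return sum(1 / comb(j, 2) for j in range(2, n + 1))

def simulate_tmrca(n, rng):
    """One draw of T_MRCA: add an Exponential(rate C(j, 2)) wait for j = n, ..., 2."""
    return sum(rng.expovariate(comb(j, 2)) for j in range(n, 1, -1))

rng = random.Random(666)
n = 10
draws = [simulate_tmrca(n, rng) for _ in range(20000)]
simulated_mean = sum(draws) / len(draws)

print("theory:", expected_tmrca(n), "closed form:", 2 * (1 - 1 / n))
print("simulated:", simulated_mean)
```

Multiply by N (haploid) or 2N (diploid) to convert the result back to generations.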
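Total Time in Tree
  Sum of all the branch lengths
  Total evolutionary time available, e.g. for mutations to occur
  $E(T_{tot}) = \sum_{j=2}^{n} j \, E(T_j) = \sum_{j=2}^{n} \frac{j}{\binom{j}{2}} = \sum_{j=2}^{n} \frac{2}{j-1} = 2 \sum_{i=1}^{n-1} \frac{1}{i}$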
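The two expectations behave very differently as the sample grows, which the following sketch makes concrete. It evaluates E(T_MRCA) and E(T_tot) in scaled time units for a few sample sizes. The function names are illustrative.

```python
from math import comb

def expected_tmrca(n):
    """E(T_MRCA): each interval with j lineages counted once."""
    return sum(1 / comb(j, 2) for j in range(2, n + 1))

def expected_total_length(n):
    """E(T_tot): each interval with j lineages contributes j branches."""
    return sum(j / comb(j, 2) for j in range(2, n + 1))

def expected_total_length_closed(n):
    """Closed form 2 * sum_{i=1}^{n-1} 1/i."""
    return 2 * sum(1 / i for i in range(1, n))

for n in (2, 10, 100):
    print(n, round(expected_tmrca(n), 3), round(expected_total_length(n), 3))
```

E(T_MRCA) stays below 2 however many sequences are sampled, while E(T_tot) keeps growing roughly like 2 ln(n).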
T_MRCA vs. T_TOT
  [Figure: two panels, relative time to MRCA and relative sum of branch lengths, each plotted against number of sequences (0 to 100)]
Number of Segregating Sites
  Commonly named S
  Total number of mutations in genealogy
    Assuming no recurrent mutation
  A function of the total length of the genealogy, $T_{tot}$
Expected number of mutations
  $E(S) = 2N\mu \, E(T_{tot}) = 4N\mu \sum_{i=1}^{n-1} \frac{1}{i} = \theta \sum_{i=1}^{n-1} \frac{1}{i}$
  Factor N for haploids, 2N for diploids
  Population geneticists define θ = 4Nµ (for diploids)
    For gene mappers, θ is usually the recombination rate
    Population geneticists use r for recombination rates
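A short sketch of the expectation, using the diploid definition θ = 4Nµ. The function names are illustrative. The example values N = 10,000 and µ = 10^-4 are the ones quoted for the lecture's plots.

```python
def theta_diploid(N, mu):
    """Scaled mutation rate theta = 4 * N * mu for a diploid population."""
    return 4 * N * mu

def expected_segregating_sites(n, theta):
    """E(S) = theta * sum_{i=1}^{n-1} 1/i for a sample of n sequences."""
    return theta * sum(1 / i for i in range(1, n))

theta = theta_diploid(N=10_000, mu=1e-4)
for n in (2, 5, 10, 20):
    print(n, round(expected_segregating_sites(n, theta), 2))
```

With θ = 4, E(S) rises from 4 at n = 2 to a little over 14 at n = 20, so each additional sequence adds fewer new segregating sites.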
E(S) as a function of n
  [Figure: expected number of segregating sites (0 to 14) vs. sample size (2 to 20)]
  Parameters: N = 10,000 individuals, µ = $10^{-4}$, θ = 4
More about S
  Very large variance
    $\mathrm{Var}(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2}$
  Most of the variance contributed by early coalescent events (i.e. with small n)
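The variance formula can be checked by simulation. The sketch below draws coalescent waiting times, builds T_tot, and then places a Poisson number of mutations on the tree. It assumes time is scaled so that mutations arrive at rate θ/2 per unit of branch length, which matches E(S) = θ Σ 1/i. The helper `poisson_draw`, the seed, and the parameter values are illustrative choices, not part of the lecture.

```python
import random
from math import comb

def poisson_draw(mean, rng):
    """Poisson draw: count unit-rate exponential arrivals before time `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_s(n, theta, rng):
    """One draw of S: simulate T_tot, then Poisson mutations at rate theta/2."""
    t_tot = sum(j * rng.expovariate(comb(j, 2)) for j in range(2, n + 1))
    return poisson_draw(theta / 2 * t_tot, rng)

def var_s_theory(n, theta):
    """Var(S) = theta * sum 1/i + theta^2 * sum 1/i^2, for i = 1..n-1."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    return theta * a1 + theta ** 2 * a2

rng = random.Random(666)
n, theta, reps = 5, 4.0, 40000
draws = [simulate_s(n, theta, rng) for _ in range(reps)]
mean_s = sum(draws) / reps
var_s = sum((s - mean_s) ** 2 for s in draws) / (reps - 1)

print("mean:", mean_s, "variance:", var_s, "theory:", var_s_theory(n, theta))
```

The θ² term comes from variation in the tree itself, so it does not shrink the way Poisson noise alone would.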
Var(S) as a function of n
  [Figure: variance in number of segregating sites (0 to 70) vs. sample size (2 to 20)]
  Parameters: N = 10,000 individuals, µ = $10^{-4}$, θ = 4
Inferences about θ
  Could be estimated from S
    Divide by expected length of genealogy
    $\hat{\theta} = S / \sum_{i=1}^{n-1} \frac{1}{i}$
  Could then be used to:
    Estimate N, if mutation rate µ is known
    Estimate µ, if population size N is known
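A minimal sketch of this estimator and of the follow-on estimate of N. The function names and the example counts (S = 10 segregating sites in n = 5 sequences, µ = 10^-4) are hypothetical values chosen for illustration, and the conversion to N assumes the diploid definition θ = 4Nµ.

```python
def theta_from_s(num_segregating_sites, n):
    """Estimate theta as S divided by sum_{i=1}^{n-1} 1/i."""
    a1 = sum(1 / i for i in range(1, n))
    return num_segregating_sites / a1

def population_size_from_theta(theta, mu):
    """Solve theta = 4 * N * mu for N (diploid definition assumed)."""
    return theta / (4 * mu)

theta_hat = theta_from_s(num_segregating_sites=10, n=5)   # hypothetical data
n_hat = population_size_from_theta(theta_hat, mu=1e-4)    # hypothetical mu
print("theta_hat:", theta_hat, "N_hat:", n_hat)
```

The same rearrangement gives µ = θ / (4N) when the population size is the known quantity.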
Var($\hat{\theta}$) as a function of n
  [Figure: variance in estimate of theta (0.0 to 1.2) vs. sample size (2 to 50)]
  Parameters: N = 10,000 individuals, µ = $10^{-4}$, θ = 4
Alternative Estimator for θ
  Count pairwise differences between sequences
  Compute average number of differences
    $\tilde{\theta} = \binom{n}{2}^{-1} \sum_{i=1}^{n} \sum_{j=i+1}^{n} S_{ij}$
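The pairwise estimator is straightforward to compute from aligned sequences. The sketch below assumes equal-length strings with no gaps. The function name and the toy sequences are made up for illustration.

```python
from itertools import combinations

def pairwise_theta(sequences):
    """Average number of differences over all C(n, 2) pairs of sequences."""
    n = len(sequences)
    total_differences = sum(
        sum(a != b for a, b in zip(x, y))      # S_ij for one pair
        for x, y in combinations(sequences, 2)
    )
    return total_differences / (n * (n - 1) / 2)

# Hypothetical toy alignment of three 4 bp sequences.
toy_sequences = ["AAAA", "AAAT", "AATT"]
print(pairwise_theta(toy_sequences))
```

Unlike the estimate based on S, this one weights each variant by how many pairs it separates, so rare variants contribute less.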
Today
  Probability of coalescence events
  Length of genealogy and its branches
  Expected number of mutations
  Simple estimates of θ
Recommended Reading
  Richard R. Hudson (1990) Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, Vol. 7. D. Futuyma and J. Antonovics (Eds). Oxford University Press, New York.