LASER server: ancestry tracing with genotypes or sequence reads


Supplementary Data

The LASER method

For each ancestry reference panel of N individuals, LASER applies principal components analysis (PCA) to autosomal SNPs to construct a K-dimensional reference ancestry space. This K-dimensional space defines a common ancestry coordinate system for samples from different studies. Users can specify the value of K based on the reference panel and their research objectives. We typically choose K such that major ethnic groups or populations of interest are well separated.

LASER accepts genotypes or sequence reads for the study samples and projects them into the reference ancestry space one by one. To assign coordinates to a single genotyped individual, LASER uses the SNPs shared between this individual and the N reference panel members to perform a PCA of the N+1 individuals and obtains the top K′ PCs (K′ ≥ K). In general, larger values of K′ lead to more accurate ancestry estimates because information from higher-order PCs is used. However, when K′ is too large (close to N), LASER may suffer from overfitting, leading to poor estimation accuracy.

For example, when using the HGDP reference panel, we set the default value to K=4 because major continental groups are well separated in the top 4 PCs. Users can set K>4 for the HGDP reference panel if they are interested in intra-continental population structure, such as separating different European populations. Alternatively, we recommend using a continental reference panel, such as the POPRES dataset for Europe, for easier interpretation of the results. We set K′=20 because we have found that this provides good results, similar to those for K′>20, and avoids the risk of overfitting (see simulation results in Wang et al. 2015).

LASER then performs a projection Procrustes analysis (Gower and Dijksterhuis 2004) to find a set of transformations (projection, translation, rotation, reflection, and scaling) that project the N reference individuals from the K′-dimensional space to a K-dimensional space. The transformations maximize the Procrustes similarity between the projected coordinates and the pre-defined ancestry coordinates of the reference samples. Finally, LASER uses these transformations to place each study individual into the K-dimensional reference ancestry space. The accuracy of the placement is partly reflected by the Procrustes similarity, which we denote as the individual-specific Procrustes score t.

When analyzing a sequenced individual, LASER simulates read counts for each reference individual conditional on its observed genotypes. The simulated data match the sequencing depth and estimated per-base error rate of the individual being placed (Wang et al. 2015). The simulated read data for the reference individuals and the observed read data of the study individual are then combined to obtain the top K′ PCs of the N+1 individuals. Using these PCs, the analysis proceeds as with genotype data. As long as the same reference panel is used, LASER maps all study individuals to the same K-dimensional ancestry space, regardless of differences in the available data types and variant sets.
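To make the Procrustes step more concrete, the following is a minimal sketch and not LASER's implementation. It covers only the simpler ordinary Procrustes case, in which the sample-specific PCA space and the reference space have the same dimension, so the projection part of projection Procrustes analysis is omitted. The function names `fit_procrustes` and `place` are hypothetical.

```python
import numpy as np

def fit_procrustes(X, Y):
    """Fit Y ~ rho * (X - x_mean) @ A + y_mean with A orthogonal.

    X: (N, K) coordinates of the reference individuals in the
       sample-specific PCA space.
    Y: (N, K) pre-defined reference ancestry coordinates.
    Assumption: both spaces have the same dimension K (ordinary
    Procrustes), unlike the projection variant used by LASER.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean           # translation: center both sets
    U, s, Vt = np.linalg.svd(Xc.T @ Yc)
    A = U @ Vt                                # rotation, possibly with reflection
    rho = s.sum() / np.sum(Xc ** 2)           # least-squares scaling
    # Procrustes similarity: 1 means the two configurations match exactly
    t = s.sum() / np.sqrt(np.sum(Xc ** 2) * np.sum(Yc ** 2))
    return rho, A, x_mean, y_mean, t

def place(z, rho, A, x_mean, y_mean):
    """Apply the fitted transformations to study coordinates z."""
    return rho * (z - x_mean) @ A + y_mean
```

In this sketch the transformations are estimated from the reference individuals only and then applied to the study individual's coordinates, which mirrors the placement step described above.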

Evaluating appropriateness of an ancestry reference panel

When an individual's ancestry is not represented in the reference panel, LASER might cluster the individual with reference populations of distant genetic background, yielding misleading results (Wang et al. 2015). To illustrate this point, we randomly selected 1,000 individuals from the POPRES dataset as a European reference panel and used LASER to place the remaining 385 POPRES individuals (based on 306,469 genotyped SNPs) and all the HGDP individuals (based on 79,583 overlapping SNPs) on the European map (K′=20, K=2). Results are shown in Figure S1A-C. Both HGDP Europeans and POPRES test individuals clustered with their geographically neighboring populations on the POPRES reference map. However, the placement of HGDP non-Europeans was misleading (Figure S1B). For example, HGDP individuals from Oceania were clustered with POPRES Italians, and HGDP East Asians overlapped with Southeastern Europeans in the POPRES reference panel.

In this section, we propose a new statistic Z to capture such artifacts, which arise when the reference panel does not represent the ancestry background of the study individual. Recall that LASER analyzes each study individual independently, together with a set of N reference individuals, using PCA followed by projection Procrustes analysis. The PCA is performed by eigenvalue decomposition of an (N+1) × (N+1) genetic relationship matrix M, where each diagonal element is the variance of the normalized genotypic values (or the normalized reference allele read counts when analyzing sequence reads) of an individual, summed across all loci. Details of the calculation of M can be found in our previous papers (Wang et al. 2014, 2015). We denote the last diagonal element of M as $m_0$, which is the variance for the study individual, and the first N diagonal elements as $m_i$ (i = 1, 2, ..., N), which are the variances for the N reference individuals.
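As a rough illustration of the matrix M and its diagonal, the sketch below builds a relationship matrix from a complete genotype matrix and extracts the top PCs by eigenvalue decomposition. It assumes genotypes coded 0/1/2 with no missing data and per-SNP standardization by mean and standard deviation. The exact normalization is given in Wang et al. (2014, 2015) and may differ. The function names are hypothetical.

```python
import numpy as np

def relationship_matrix(G):
    """G: (N+1, L) genotype matrix coded 0/1/2; last row is the study individual.

    Assumption: each SNP is standardized to mean 0 and unit variance
    across the N+1 individuals; monomorphic SNPs are dropped.
    """
    mean = G.mean(axis=0)
    sd = G.std(axis=0)
    keep = sd > 0
    X = (G[:, keep] - mean[keep]) / sd[keep]
    # Diagonal entry j is the sum over loci of the squared normalized
    # genotypic values of individual j.
    return X @ X.T

def top_pcs(M, n_pcs):
    """Top principal components from the eigenvalue decomposition of M."""
    vals, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_pcs]    # indices of the largest ones
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

def diagonal_terms(M):
    """Return (m_0, m_i): study individual's term and the N reference terms."""
    d = np.diag(M)
    return d[-1], d[:-1]
```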
If the ancestry of a study individual is represented in the reference panel, $m_0$ should be similar to the $m_i$ values of its neighboring reference individuals. We therefore propose the following approach to calculate a statistic indicating whether the ancestry reference panel is appropriate for a study individual.

1. Identify the k nearest reference individuals of the study individual based on Euclidean distances in the reference ancestry space. We set k=10 as the default value.
2. Calculate the mean and standard deviation of $m_i$ for the k nearest neighbors (i.e., i ∈ {indices of the k nearest neighbors}), denoted as $\mu_{kNN}$ and $\sigma_{kNN}$, respectively.
3. Calculate the Z score as $Z = (m_0 - \mu_{kNN}) / \sigma_{kNN}$.

If the study individual has a similar ancestry background to its k nearest neighbors, we expect the Z score to be close to 0.

We evaluated the proposed Z score in our previous illustrative experiment. As shown in Figure S1D, the majority of the POPRES test individuals and HGDP Europeans have Z<4. In contrast, HGDP individuals from East Asia, Oceania, America, and Africa all have Z>11, suggesting that the POPRES reference panel is inappropriate for these samples. HGDP individuals from the Middle East and Central/South Asia have mean Z scores of 9.5 and 7.9, respectively, reflecting their closer genetic relationship to Europeans compared with other non-European populations. The mean and standard deviation of the Z scores for different regions are summarized in Table S1. Overall, the proposed Z score is a good measure of how well a study individual's ancestry is represented in the ancestry reference panel. We recommend that users interpret LASER results with caution when the Z score is greater than 4 or appears to be an outlier among all study samples.
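The three-step Z score calculation described in this section can be sketched as follows. This is an illustrative sketch rather than the server's code. The function name `ancestry_z_score` and its arguments are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator is used.

```python
import numpy as np

def ancestry_z_score(ref_coords, study_coord, m_ref, m_study, k=10):
    """Z score for one study individual.

    ref_coords:  (N, K) reference ancestry coordinates.
    study_coord: (K,) estimated coordinates of the study individual.
    m_ref:       (N,) diagonal elements m_i of M for the reference individuals.
    m_study:     diagonal element m_0 of M for the study individual.
    """
    # Step 1: k nearest reference individuals by Euclidean distance
    dists = np.linalg.norm(ref_coords - study_coord, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: mean and standard deviation of m_i over those neighbors
    mu_knn = m_ref[nearest].mean()
    sigma_knn = m_ref[nearest].std(ddof=1)    # assumption: sample sd
    # Step 3: standardized difference between m_0 and the neighbor mean
    return (m_study - mu_knn) / sigma_knn
```

Following the recommendation above, a caller could flag individuals whose returned value exceeds 4 for closer inspection.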

References

Gower, J.C. and Dijksterhuis, G.B. (2004) Procrustes Problems. Oxford University Press, Oxford, New York.

Wang, C. et al. (2014) Ancestry estimation and control for population stratification for sequence-based association studies. Nat Genet, 46: 409-415.

Wang, C. et al. (2015) Improved ancestry estimation for both genotyping and sequencing data using projection Procrustes analysis and genotype imputation. Am J Hum Genet, 96: 926-937.

Supplementary Tables

Table S1. Summary of Z scores for estimating ancestry of individuals from different geographic regions in the POPRES European reference ancestry space.

Region/dataset    Number of individuals    Z score mean (±sd)
POPRES            385                      2.1 (±1.2)
Europe            156                      2.7 (±1.7)
Middle East       160                      9.5 (±3.5)
C/S Asia          200                      7.9 (±5.4)
East Asia         229                      31.2 (±11.1)
Oceania           28                       37.6 (±8.8)
America           63                       41.5 (±5.5)
Africa            102                      57.1 (±10.8)

Table S2. Computational time required to complete ancestry inference analysis with the LASER server (excluding download and format check).

Dataset                    Input data type    Ancestry reference panel    No. of overlapping SNPs    Computational time per individual
T2D-GENES/GoT2D 80X WES    Genotypes          HGDP                        12,719                     8 seconds
GoT2D 5X WGS               Genotypes          POPRES                      294,217                    37 seconds
GoT2D 80X WES              Sequence reads     Imputed POPRES              4,212,452                  543 seconds

Supplementary Figures

Figure S1. Ancestry estimation of HGDP individuals and a test set of 385 POPRES individuals using a European reference panel of 1,000 POPRES individuals. In panels A-C, colored points represent study individuals and grey points represent reference individuals. (A) Placement of HGDP Europeans. (B) Placement of HGDP non-Europeans. (C) Placement of POPRES test individuals. (D) Violin plot of Z scores for individuals from different regions. The red embedded box shows a zoomed-in view of the Z scores for POPRES and HGDP Europeans.

Figure S2. Comparison of standard PCA against LASER using whole genome sequence data for 2,335 Europeans (1,336 Finnish, 471 British, 341 Swedish, 187 German) from the GoT2D study. (A) Standard PCA: the top two PCs were dominated by the Finnish population, which had the largest sample size. (B) LASER analysis using the POPRES reference panel (reference individuals not shown). The Procrustes similarity score $t_0$ between the PCA and LASER results was 0.52.

[Panels A and B: scatter plots with axes PC1 and PC2; legend: Finnish, British, Swedish, German.]