Genome-Wide Association Exercise - Data Quality Control

The Rockefeller University, New York, June 25, 2016
Copyright 2016 Merry-Lynn McDonald & Suzanne M. Leal

Introduction

In this exercise, you will learn how to perform data quality control by removing markers that fail quality control criteria and by detecting samples that fail quality control criteria because of the amount of missing genotype data. You will also examine your samples for individuals who are related to each other and/or are duplicate samples. Each sample will be tested for excess homozygosity and heterozygosity of genotype data, and each SNP will be tested for deviations from Hardy-Weinberg Equilibrium. You will also examine QQ plots to see the effect of quality control on the analysis of association study data. These exercises will be carried out using PLINK and R.

1. Running the program

You can run the program from the DOS prompt on your computer or on a Linux machine or server (highly recommended for larger datasets). To get to the folder that holds the data we are going to analyze, type the following (the data files you will use may be in a different directory, in which case you will be told where they are located):

    cd plink/exercise/

2. Load the data

PLINK can read data in many different formats. When you have real data, examine it first and then look over the PLINK documentation (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml) to determine which format makes it easiest to get your data into PLINK. Today's data is already formatted for you in the standard file format. You should have two files: a ped file (GWAS.ped) and a map file (GWAS.map). Please spend some time examining these files and the documentation on the PLINK website before you begin.

Navigate via the command prompt to the directory where your data is found. Then type plink at the command prompt and make note of the program output. Next type:

    plink --file GWAS

Note that PLINK writes a file called plink.log that contains the output you see in your command window.

3. Clean the data

PLINK outputs information on the number of people and SNPs in your dataset. You now have the information needed to fill in Oval 1 in the flowchart below.
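
To make the two input formats concrete, the following Python sketch counts people and SNPs in the way the PLINK log reports them. It assumes the standard layout: each .map line holds chromosome, SNP identifier, genetic distance and base-pair position, and each .ped line holds family ID, individual ID, father, mother, sex (1 = male, 2 = female), phenotype, and then two alleles per SNP. The toy records and the function names are invented for illustration and are not taken from GWAS.ped or GWAS.map.

```python
def count_map_snps(map_text):
    """Count markers in .map text: one non-empty line per SNP."""
    return sum(1 for line in map_text.splitlines() if line.strip())


def summarize_ped(ped_text, n_snps):
    """Tally individuals, sexes and missing phenotypes in .ped text."""
    summary = {"individuals": 0, "males": 0, "females": 0, "missing_pheno": 0}
    for line in ped_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Six leading columns, then two alleles for every SNP in the map file.
        if len(fields) != 6 + 2 * n_snps:
            raise ValueError("unexpected number of columns for " + fields[1])
        sex, pheno = fields[4], fields[5]
        summary["individuals"] += 1
        if sex == "1":
            summary["males"] += 1
        elif sex == "2":
            summary["females"] += 1
        # PLINK treats both 0 and -9 as a missing disease phenotype.
        if pheno in ("0", "-9"):
            summary["missing_pheno"] += 1
    return summary


# Hypothetical three-SNP, three-person dataset (0 0 is a missing genotype).
MAP_TEXT = """1 rs0001 0 1000
1 rs0002 0 2000
23 rs0003 0 3000
"""
PED_TEXT = """F1 I1 0 0 1 -9 A A A G C C
F2 I2 0 0 2 -9 A G G G C T
F3 I3 0 0 2 -9 0 0 A G C C
"""

n_snps = count_map_snps(MAP_TEXT)
summary = summarize_ped(PED_TEXT, n_snps)
print(n_snps, summary)
```

Comparing such a tally with the numbers in plink.log is a quick way to confirm that the files were read as you expected.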

a. Sample Quality Initial Screen

In the next step of the cleaning process, you will exclude samples that are missing more than 10% of their genotype calls. Such samples are likely to be low-quality DNA samples with error-ridden genotype calls.

    plink --file GWAS --mind 0.10 --recode --out GWAS_clean_mind

Look at the file GWAS_clean_mind.log to see how many samples are excluded based on this criterion and fill in Box 1.

b. Minor allele frequency, SNP quality and DNA quality criteria

Type the following:

    plink --file GWAS_clean_mind --maf 0.05 --recode --out MAF_greater_5
    plink --file GWAS_clean_mind --exclude MAF_greater_5.map --recode --out MAF_less_5

These commands create two versions of your dataset, one containing the SNPs with minor allele frequencies (MAFs) greater than 5% and one containing the SNPs with MAFs less than 5%. You will now remove SNPs with MAF > 5% that are missing in more than 5% of the samples, and then remove SNPs with MAF < 5% that are missing in more than 1% of the samples. Type the following:

    plink --file MAF_greater_5 --geno 0.05 --recode --out MAF_greater_5_clean

Fill in Box 2a.

    plink --file MAF_less_5 --geno 0.01 --recode --out MAF_less_5_clean

Fill in Box 2b. Finally, recombine the cleaned SNPs with the following command:

    plink --file MAF_greater_5_clean --merge MAF_less_5_clean.ped MAF_less_5_clean.map --recode --out GWAS_MAF_clean

Next, we will apply a more stringent sample quality filter by removing samples that are missing more than 3% of their calls (--mind 0.03).

    plink --file GWAS_MAF_clean --mind 0.03 --recode --out GWAS_clean2

You can now fill in Box 3.

c. Check Sex

Entering clinical variables into a database is a tedious process that can be error-prone. It is not always possible to double-check all the variables you have been given. However, you can use the information from the SNP genotypes to verify the sex of the individuals in your study. This is done by looking at the homozygosity (F) on the X chromosome in each individual. This number is expected to be less than 0.2 in females and greater than 0.8 in males.

Question 1: Why do you expect the homozygosity rate to be higher on the X chromosome in males than in females?

Run the following command in PLINK:

    plink --file GWAS_clean2 --check-sex --out GWAS_sex_checking

Use R to open the file GWAS_sex_checking.sexcheck and determine whether there are any individuals whose sex may be coded incorrectly.

    sexcheck = read.table("GWAS_sex_checking.sexcheck", header=TRUE)
    names(sexcheck)
    sex_problem = sexcheck[which(sexcheck$STATUS=="PROBLEM"),]
    sex_problem

From that file, NA20530 and NA20506 were coded as female (sex code = 2) but from the genotypes appear to be male (sex code = 1). In addition, for 3 individuals (NA20766, NA20771 and NA20757) PLINK does not have enough information to decide whether they are male or female, so the program reports sex = 0 for the genotyped sex. Fill in the table below:

Table 1: Sex check

    FID      IID      PEDSEX  SNPSEX  STATUS  F
    NA20506  NA20506
    NA20530  NA20530
    NA20766  NA20766
    NA20771  NA20771
    NA20757  NA20757

There are two reasons for these kinds of discrepancies. The first is that the records are wrong, so you would ask your collaborators to double-check the sex in the clinical records. The second is that you may not have genotyped enough SNPs on the X chromosome to predict the sex. In this dataset, there are 194 SNPs on the X chromosome.

Question 2: With a limited number of SNPs genotyped on the X chromosome, are you concerned that PLINK is indicating that some females are males? What does this mean for the interpretation of your results?
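
The decision rule described in section c can be sketched in Python as follows. The 0.2 and 0.8 cutoffs come from the text above; the sample IDs and F values are hypothetical, and the function names are illustrative rather than part of PLINK.

```python
def infer_snpsex(f_x, female_max=0.2, male_min=0.8):
    """Return 2 (female), 1 (male) or 0 (undetermined) from X-chromosome F."""
    if f_x < female_max:
        return 2
    if f_x > male_min:
        return 1
    return 0


def check_sex(pedsex, f_x):
    """Compare the recorded sex with the genotype-based call."""
    snpsex = infer_snpsex(f_x)
    # A sample is a problem when the call is ambiguous or disagrees with the record.
    status = "OK" if snpsex != 0 and snpsex == pedsex else "PROBLEM"
    return snpsex, status


# Hypothetical samples: (IID, PEDSEX, F on the X chromosome).
samples = [("S1", 2, 0.03), ("S2", 1, 0.99), ("S3", 2, 0.97), ("S4", 2, 0.22)]
results = {iid: check_sex(pedsex, f) for iid, pedsex, f in samples}
for iid, (snpsex, status) in results.items():
    print(iid, snpsex, status)
```

Here S3 is recorded as female but called male, and S4 falls between the two cutoffs, so both would appear in the list of problem samples.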

d. Duplicates

Your collaborator has informed you that individual NA25001 in family 25985 is a duplicate of individual NA12057 in family 1344. Use the following command in PLINK to check whether there are any duplicate samples in the dataset:

    plink --file GWAS_clean2 --genome --out duplicates

In larger datasets (more people and more markers) this command takes much longer, because it calculates an IBS matrix between all members of the study. However, the --genome command needs to be run only once; in subsequent analyses the --read-genome command can be used to access this information. Open the duplicates.genome file in R with the following command:

    dups = read.table("duplicates.genome", header = TRUE)

We are interested in the value of Pi-Hat (the proportion IBD or, in this dataset, IBS). You may notice that you have more than one duplicate. Also, keep your eyes open for individuals who have high Pi-Hat values.

    problem_pairs = dups[which(dups$PI_HAT > 0.4),]
    problem_pairs

Table 2: Duplicates and relatedness

    FID1  IID1  FID2  IID2  PI_HAT

Question 3: How many duplicate pairs do you find (hint: Pi-Hat = 1)? What proportion would you expect a parent and child to share IBS? Can you find any such relationship?

Word of caution: Pi-Hat can be inflated, and many individuals can appear to be related to each other, if you have samples from different populations. This explains why we see a number of pairs of individuals with Pi-Hat greater than 0.05, since three distinct populations were sampled. The same phenomenon can be observed if a subset of your sample was genotyped using "bad" chips, which creates two or more groups whose members appear to be more closely related than they truly are. In R, see for yourself how many pairs have Pi-Hat greater than 0.05 with the following code:

    problem_pairs = dups[which(dups$PI_HAT > 0.05),]
    myvars = c("FID1", "IID1", "FID2", "IID2", "PI_HAT")
    problem_pairs[myvars]
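
As a rough illustration of how Pi-Hat is read, the Python sketch below computes it from the estimated probabilities of sharing one allele (Z1) and two alleles (Z2) identical by descent, using PI_HAT = Z2 + 0.5 * Z1, and then applies the 0.4 screening cutoff used above. The pair IDs and Z values are hypothetical, and the 0.9 cutoff for calling a possible duplicate is an assumption made for this example only.

```python
def pi_hat(z1, z2):
    """Proportion of the genome shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def describe_pair(value, duplicate_min=0.9, related_min=0.4):
    """Label a pair from its Pi-Hat (both cutoffs are illustrative)."""
    if value >= duplicate_min:
        return "possible duplicate"
    if value >= related_min:
        return "possible close relative"
    return "not flagged"


# Hypothetical pairs: (IID1, IID2, Z1, Z2).
pairs = [
    ("A1", "A2", 0.0, 1.0),
    ("B1", "B2", 1.0, 0.0),
    ("C1", "C2", 0.1, 0.0),
]
labels = {}
for iid1, iid2, z1, z2 in pairs:
    value = pi_hat(z1, z2)
    labels[(iid1, iid2)] = (value, describe_pair(value))
    print(iid1, iid2, value, labels[(iid1, iid2)][1])
```

Any pair flagged this way still needs to be checked against the study records before a sample is removed.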

Make a txt file that looks like this:

    1344 NA12057
    1444 NA12739
    M033 NA19774

Name it IBS_excluded.txt and save it in the folder with your PLINK data. Then type the command:

    plink --file GWAS_clean2 --remove IBS_excluded.txt --recode --out GWAS_clean3

You can now fill in Box 4 and Oval 3.

Question 4: Your collaborators have given you a list of duplicates in the dataset that were sent as quality control for genotyping. You have this list, so why would you bother obtaining the IBS matrix on all the samples in your study?

e. Excess homozygosity and heterozygosity

Type the following:

    plink --file GWAS_clean3 --het

Then open the R program by typing:

    R

and, in R, examine the data for excess homozygosity and heterozygosity by typing:

    Dataset <- read.table("plink.het", header=TRUE, sep="", na.strings="NA", dec=".", strip.white=TRUE)
    mean(Dataset$F)
    sd(Dataset$F)
    jpeg("hist.jpeg", height=1000, width=1000)
    hist(scale(Dataset$F), xlim=c(-4,4))
    dev.off()

These commands give the mean and SD of the inbreeding coefficient (F) across the people in the study and plot a histogram so that you can see whether there are any outliers. The jpeg(), hist() and dev.off() commands together create a new jpeg image file (hist.jpeg) in your working directory. You should be concerned if you have individuals who are more than 4 SDs from the mean of F.

F is calculated from the observed number of homozygous genotype calls an individual has in comparison with the expected number of homozygous genotype calls. If an individual has fewer homozygous calls than expected, F is negative; if the individual has more homozygous calls than expected, F is positive. Individuals whose genotypes show either excess homozygosity (F positive) or excess heterozygosity (F negative), with F more than 4 SD from the mean F, should be removed from the dataset. In our case, you can see from the histogram that we do not have this problem. Fill this information in Box 5.

Question 5: You observe a sample with a negative value for F that is 6 SD outside the mean F for the dataset. Is this excess heterozygosity or excess homozygosity? Give one reason a sample might exhibit this.

Note: You would usually examine the data for outliers by plotting the first and second principal components or multidimensional scaling (MDS) components, using a subset of markers that have been pruned to remove LD (r^2 < 0.5). MDS analysis will be performed in the second part of the exercise to control for population substructure. Outliers can be due to batch effects or to study subjects coming from different populations, e.g. European- and African-Americans. If it is suspected that outliers are due to study subjects having been sampled from different populations, then data from HapMap can be included to elucidate population membership. For example, if a study of European-Americans is being performed and there are African-American study subjects included in the sample, they would cluster between the European HapMap samples and the African HapMap samples, but closer to the African samples. If you perform this type of analysis, you should remove the HapMap samples and re-estimate the MDS components before using them to adjust for population substructure.

Here we are using data from HapMap Phase III, which consists of CEU (Europeans from Utah), MEX (Mexicans from Los Angeles) and TSI (Tuscans from Italy). We can observe three clusters that correspond to the three data sets, but we do not observe any extreme outliers. We are using this data set for demonstration purposes only. You would not want to analyze different populations together; instead, they are usually analyzed separately and the results are combined using meta-analysis. In part two of this exercise you will prune the markers for LD and construct MDS components to control for population substructure, as well as plot the first two MDS components as shown below. Since we are not going to remove outliers, we will not run these commands here but will perform them in the second part of the exercise.
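
The F statistic and the 4 SD rule from section e can be sketched in Python as below. F is computed as (observed homozygous calls - expected homozygous calls) / (non-missing genotypes - expected homozygous calls), which corresponds to the O(HOM), E(HOM) and N(NM) columns of plink.het. The counts are hypothetical and the function names are illustrative.

```python
from statistics import mean, stdev


def inbreeding_f(obs_hom, exp_hom, n_nonmissing):
    """F > 0: more homozygous calls than expected; F < 0: fewer."""
    return (obs_hom - exp_hom) / (n_nonmissing - exp_hom)


def flag_outliers(f_by_sample, n_sd=4.0):
    """Return {sample: z-score} for samples more than n_sd SDs from the mean F."""
    values = list(f_by_sample.values())
    center, spread = mean(values), stdev(values)
    return {
        iid: (f - center) / spread
        for iid, f in f_by_sample.items()
        if abs(f - center) > n_sd * spread
    }


# Hypothetical counts: (observed homozygous, expected homozygous, non-missing).
counts = {
    f"S{i:02d}": (4020 if i % 2 == 0 else 3980, 4000.0, 6000) for i in range(29)
}
counts["S_low"] = (3400, 4000.0, 6000)  # far fewer homozygous calls than expected

f_values = {iid: inbreeding_f(*c) for iid, c in counts.items()}
flagged = flag_outliers(f_values)
print(flagged)
```

In this toy data only S_low is flagged, with a negative z-score, which is the pattern of excess heterozygosity.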

f. Hardy-Weinberg Equilibrium (HWE)

This is the first time we are introducing a trait into the cleaning process. Note that all of the cleaning above took place with the phenotype set to missing in the ped file. Type the following to obtain the output of the HWE test for all SNPs for the trait Aff:

    plink --file GWAS_clean3 --pheno pheno.txt --pheno-name Aff --hardy

Open the file plink.hwe and look for SNPs with p-values on the order of 10^-7 or smaller (the code below uses the cutoff p < 9 x 10^-7). In the TEST column of plink.hwe, ALL, AFF and UNAFF mark the tests in all samples, in the cases and in the controls, respectively. In R:

    hardy = read.table("plink.hwe", header = TRUE)
    names(hardy)
    hwe_prob = hardy[which(hardy$P < 0.0000009),]
    hwe_prob

Using this criterion for a SNP to be out of HWE, how many SNPs fail HWE in the controls? Fill in Box 6. Using the same criterion, how many SNPs fail HWE in the cases? We will not exclude SNPs based on this criterion in the cases; we will only make a note of them. Complete Table 3 with this information.

Table 3: Hardy-Weinberg Equilibrium

    Cases            Controls
    SNP    Pvalue    SNP    Pvalue
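
To show what an HWE test compares, here is a minimal Python sketch based on the Pearson chi-square goodness-of-fit test with one degree of freedom. This is a swapped-in simplification: PLINK's --hardy option reports an exact-test p-value, so the numbers will not match plink.hwe exactly, particularly for rare alleles. The genotype counts are hypothetical.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: nothing to test
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the upper tail is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical control genotype counts (AA, AB, BB).
in_hwe = hwe_chisq_pvalue(25, 50, 25)    # matches the expected proportions
mild = hwe_chisq_pvalue(30, 40, 30)      # slight heterozygote deficit
extreme = hwe_chisq_pvalue(50, 0, 50)    # no heterozygotes at all
print(in_hwe, mild, extreme, extreme < 9e-7)
```

Only the last of these three hypothetical SNPs would fall below the cutoff used in the R code above.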

Create a text file called HWE_out.txt containing the following SNP:

    rs2968487

Then type the following in PLINK:

    plink --file GWAS_clean3 --exclude HWE_out.txt --recode --out GWAS_clean4

There are a number of SNPs with HWE p-values in the range of 10^-5 to 10^-6 in the controls and in the cases. Based on the above criterion they will not be excluded; however, a note should be made of such SNPs, and they should be kept in mind if they reach genome-wide significance during association testing. You can now fill in Box 6 and Oval 4.

Oval 1:  N = ____ DNA samples, N snp = ____ SNPs
Box 1:   ____ DNA samples failed because missing more than 10% of calls (MIND 0.10)
Oval 2:  N = ____ DNA samples, N snp = ____ SNPs
Box 2a:  ____ SNPs (MAF > 5%) failed b/c missing rate per SNP > 5% of DNA samples (GENO 0.05)
Box 2b:  ____ SNPs (MAF < 5%) failed b/c missing rate per SNP > 1% of DNA samples (GENO 0.01)
Box 3:   ____ DNA samples failed because missing more than 3% of calls (MIND 0.03)
Box 4:   ____ Individuals with inconsistent sex (resolve with collaborators)
         ____ Duplicate pairs found (go to Question 3)
         ____ Individuals excluded due to relatedness
Oval 3:  N = ____ DNA samples, N snp = ____ SNPs
Box 5:   ____ DNA samples excluded based on excess heterozygosity or homozygosity (4 SD)
Box 6:   ____ SNPs in controls out of HWE with p < 10^-7
Oval 4:  N = ____ DNA samples, N snp = ____ SNPs
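
The call-rate thresholds that lead from Oval 1 to Box 2b can be sketched in Python as below. The sketch assumes genotypes coded as 0, 1 or 2 copies of one allele, with None for a missing call, and applies the MIND 0.10 sample filter followed by the MAF-dependent GENO filters (0.05 for SNPs with MAF of at least 5%, 0.01 otherwise). The toy matrix and the function names are invented for illustration; the later MIND 0.03 pass would reuse the same sample_missing_rate function with a stricter cutoff.

```python
def sample_missing_rate(row):
    """Fraction of missing genotype calls for one sample."""
    return sum(g is None for g in row) / len(row)


def snp_stats(rows, j):
    """Return (missing rate, minor allele frequency) for SNP column j."""
    calls = [row[j] for row in rows]
    observed = [g for g in calls if g is not None]
    missing = (len(calls) - len(observed)) / len(calls)
    if not observed:
        return missing, 0.0
    freq = sum(observed) / (2 * len(observed))
    return missing, min(freq, 1 - freq)


def qc_filter(genotypes, mind=0.10, maf_split=0.05, geno_common=0.05, geno_rare=0.01):
    """Return indices of samples and SNPs that pass the filters."""
    kept_samples = [i for i, row in enumerate(genotypes)
                    if sample_missing_rate(row) <= mind]
    rows = [genotypes[i] for i in kept_samples]
    kept_snps = []
    for j in range(len(genotypes[0])):
        missing, maf = snp_stats(rows, j)
        limit = geno_common if maf >= maf_split else geno_rare
        if missing <= limit:
            kept_snps.append(j)
    return kept_samples, kept_snps


# Hypothetical data: 21 samples by 20 SNPs.
n_samples, n_snps = 21, 20
toy = [[i % 3 for _ in range(n_snps)] for i in range(n_samples)]
for i in range(n_samples):
    toy[i][0] = 1 if i == 0 else 0   # SNP 0 is rare
for j in range(5):
    toy[20][j] = None                # sample 20: 25% of calls missing
toy[1][1] = toy[2][1] = None         # common SNP 1: missing in 2 of 20 kept samples
toy[3][2] = None                     # common SNP 2: missing in 1 of 20 kept samples
toy[4][0] = None                     # rare SNP 0: missing in 1 of 20 kept samples

kept_samples, kept_snps = qc_filter(toy)
print(len(kept_samples), kept_snps)
```

In this toy data, sample 20 fails the sample filter, SNP 1 fails the 5% limit for common SNPs and SNP 0 fails the stricter 1% limit for rare SNPs, while SNP 2 sits exactly at 5% and is kept.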

Solutions to Questions

Oval 1, Oval 2 and Box 1 information:

    PLINK! v1.07 10/Aug/2009
    (C) 2009 Shaun Purcell, GNU General Public License, v2
    For documentation, citation & bug-report instructions:
    http://pngu.mgh.harvard.edu/purcell/plink/
    Web-based version check ( --noweb to skip )
    Recent cached web-check found...problem connecting to web
    Writing this text to log file [ GWAS_clean_mind.log ]
    Analysis started: Thu Jul 7 13:26:33 2011
    Options in effect: --file GWAS --mind 0.10 --recode --out GWAS_clean_mind
    6424 (of 6424) markers to be included from [ GWAS.map ] [Oval 1]
    248 individuals read from [ GWAS.ped ] [Oval 1]
    0 individuals with nonmissing phenotypes
    Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
    Missing phenotype value is also -9
    0 cases, 0 controls and 248 missing
    125 males, 123 females, and 0 of unspecified sex
    Before frequency and genotyping pruning, there are 6424 SNPs
    248 founders and 0 non-founders found
    Writing list of removed individuals to [ GWAS_clean_mind.irem ]
    1 of 248 individuals removed for low genotyping ( MIND > 0.1 ) [Box 1]
    6 heterozygous haploid genotypes; set to missing
    Writing list of heterozygous haploid genotypes to [ GWAS_clean_mind.hh ]
    1 SNPs with no founder genotypes observed
    Warning, MAF set to 0 for these SNPs (see --nonfounders)
    Writing list of these SNPs to [ GWAS_clean_mind.nof ]
    Total genotyping rate in remaining individuals is 0.996863
    0 SNPs failed missingness test ( GENO > 1 )
    0 SNPs failed frequency test ( MAF < 0 )
    After frequency and genotyping pruning, there are 6424 SNPs [Oval 2]
    After filtering, 0 cases, 0 controls and 247 missing [Oval 2]
    After filtering, 125 males, 122 females, and 0 of unspecified sex
    Writing recoded ped file to [ GWAS_clean_mind.ped ]
    Writing new map file to [ GWAS_clean_mind.map ]
    Analysis finished: Thu Jul 7 13:26:34 2011

Box 2a information:

    Writing this text to log file [ MAF_greater_5_clean.log ]
    Analysis started: Thu Jul 7 13:28:30 2011
    Options in effect: --file MAF_greater_5 --geno 0.05 --recode --out MAF_greater_5_clean
    5867 (of 5867) markers to be included from [ MAF_greater_5.map ]
    247 individuals read from [ MAF_greater_5.ped ]
    0 individuals with nonmissing phenotypes
    Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
    Missing phenotype value is also -9
    0 cases, 0 controls and 247 missing
    125 males, 122 females, and 0 of unspecified sex
    Before frequency and genotyping pruning, there are 5867 SNPs
    247 founders and 0 non-founders found
    6 heterozygous haploid genotypes; set to missing
    Writing list of heterozygous haploid genotypes to [ MAF_greater_5_clean.hh ]
    Total genotyping rate in remaining individuals is 0.997028
    1 SNPs failed missingness test ( GENO > 0.05 ) [Box 2a]
    0 SNPs failed frequency test ( MAF < 0 )
    After frequency and genotyping pruning, there are 5866 SNPs
    After filtering, 0 cases, 0 controls and 247 missing
    After filtering, 125 males, 122 females, and 0 of unspecified sex
    Writing recoded ped file to [ MAF_greater_5_clean.ped ]
    Writing new map file to [ MAF_greater_5_clean.map ]
    Analysis finished: Thu Jul 7 13:28:31 2011

Box 2b information:

    Writing this text to log file [ MAF_less_5_clean.log ]
    Analysis started: Thu Jul 7 13:32:15 2011
    Options in effect: --file MAF_less_5 --geno 0.01 --recode --out MAF_less_5_clean
    557 (of 557) markers to be included from [ MAF_less_5.map ]
    247 individuals read from [ MAF_less_5.ped ]
    0 individuals with nonmissing phenotypes
    Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
    Missing phenotype value is also -9
    0 cases, 0 controls and 247 missing
    125 males, 122 females, and 0 of unspecified sex
    Before frequency and genotyping pruning, there are 557 SNPs
    247 founders and 0 non-founders found
    Total genotyping rate in remaining individuals is 0.995123
    60 SNPs failed missingness test ( GENO > 0.01 ) [Box 2b]
    0 SNPs failed frequency test ( MAF < 0 )
    After frequency and genotyping pruning, there are 497 SNPs
    After filtering, 0 cases, 0 controls and 247 missing
    After filtering, 125 males, 122 females, and 0 of unspecified sex
    Writing recoded ped file to [ MAF_less_5_clean.ped ]
    Writing new map file to [ MAF_less_5_clean.map ]
    Analysis finished: Thu Jul 7 13:32:15 2011

Oval 2 and Box 3 information:

    Writing this text to log file [ GWAS_clean2.log ]
    Analysis started: Thu Jul 7 15:03:38 2011
    Options in effect: --file GWAS_MAF_clean --mind 0.03 --recode --out GWAS_clean2
    6363 (of 6363) markers to be included from [ GWAS_MAF_clean.map ]
    247 individuals read from [ GWAS_MAF_clean.ped ]
    0 individuals with nonmissing phenotypes
    Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
    Missing phenotype value is also -9
    0 cases, 0 controls and 247 missing
    125 males, 122 females, and 0 of unspecified sex
    Before frequency and genotyping pruning, there are 6363 SNPs
    247 founders and 0 non-founders found
    0 of 247 individuals removed for low genotyping ( MIND > 0.03 ) [Box 3]
    6 heterozygous haploid genotypes; set to missing
    Writing list of heterozygous haploid genotypes to [ GWAS_clean2.hh ]
    Total genotyping rate in remaining individuals is 0.99716
    0 SNPs failed missingness test ( GENO > 1 )
    0 SNPs failed frequency test ( MAF < 0 )
    After frequency and genotyping pruning, there are 6363 SNPs
    After filtering, 0 cases, 0 controls and 247 missing
    After filtering, 125 males, 122 females, and 0 of unspecified sex
    Writing recoded ped file to [ GWAS_clean2.ped ]
    Writing new map file to [ GWAS_clean2.map ]
    Analysis finished: Thu Jul 7 15:03:40 2011

Answer to Question 1: Why do you expect the homozygosity rate to be higher on the X chromosome in males than in females?

Because males have only one allele for each SNP on the X chromosome, they will appear homozygous.

Table 1: Sex check

    FID      IID      PEDSEX  SNPSEX  STATUS   F
    NA20506  NA20506  2       1       PROBLEM  1
    NA20530  NA20530  2       1       PROBLEM  1
    NA20766  NA20766  2       0       PROBLEM  0.2292
    NA20771  NA20771  2       0       PROBLEM  0.2234
    NA20757  NA20757  2       0       PROBLEM  0.2141

Answer to Question 2: With a limited number of SNPs genotyped on the X chromosome, are you concerned that PLINK is indicating that some females are males? What does this mean for the interpretation of your results?

No. As a result of the lack of information, PLINK may indicate that the genotyped sex is male when in fact it is female (think back to the answer to Question 1). In our case, our collaborators confirmed that the records are correct. We will not change the sex of any individual based on our analysis, because we do not have enough information to support changing the sex code.

Table 2: Duplicates and relatedness

    FID1   IID1     FID2  IID2     PI_HAT
    M041   NA25000  M033  NA19774  1
    13291  NA25001  1344  NA12057  1.00
    1444   NA12739  1444  NA12749  0.51
    1444   NA12739  1444  NA12748  0.50

Answer to Question 3: How many duplicate pairs do you find (hint: Pi-Hat = 1)? What proportion would you expect a parent and child to share IBS? Can you find any such relationship?

Fill in Box 4. You have one surprise duplicate pair (NA25001 and NA12057). You would expect a parent/child relationship to have a Pi-Hat value of 0.5, so NA12749 and NA12748 are the parents of NA12739. You should exclude NA12739, and also NA19774 and NA12057, because the latter two are duplicate samples of NA25000 and NA25001, respectively.

Oval 3 information:

    Writing this text to log file [ GWAS_clean3.log ]
    Analysis started: Thu Jul 7 16:21:56 2011
    Options in effect: --file GWAS_clean2 --remove IBS_excluded.txt --recode --out GWAS_clean3
    6363 (of 6363) markers to be included from [ GWAS_clean2.map ]
    247 individuals read from [ GWAS_clean2.ped ]
    0 individuals with nonmissing phenotypes
    Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
    Missing phenotype value is also -9
    0 cases, 0 controls and 247 missing
    125 males, 122 females, and 0 of unspecified sex
    Reading individuals to remove [ IBS_excluded.txt ]... 3 read
    3 individuals removed with --remove option
    Before frequency and genotyping pruning, there are 6363 SNPs
    244 founders and 0 non-founders found
    6 heterozygous haploid genotypes; set to missing
    Writing list of heterozygous haploid genotypes to [ GWAS_clean3.hh ]
    Total genotyping rate in remaining individuals is 0.997225
    0 SNPs failed missingness test ( GENO > 1 )
    0 SNPs failed frequency test ( MAF < 0 )
    After frequency and genotyping pruning, there are 6363 SNPs [Oval 3]
    After filtering, 0 cases, 0 controls and 244 missing [Oval 3]
    After filtering, 123 males, 121 females, and 0 of unspecified sex
    Writing recoded ped file to [ GWAS_clean3.ped ]
    Writing new map file to [ GWAS_clean3.map ]
    Analysis finished: Thu Jul 7 16:21:58 2011

Answer to Question 4: Your collaborators have given you a list of duplicates in the dataset that were sent as quality control for genotyping. You have this list, so why would you bother obtaining the IBS matrix on all the samples in your study?

You need to thoroughly check the integrity of your dataset. Yes, you want to know whether the expected duplicates are correct, but you also want to know whether another sample was accidentally sent as a duplicate. A common example of this is when clinical investigators in a large study recruit the same patient at two different time points, so that the same person is given two different sample IDs.

Answer to Question 5: You observe a sample with a negative value for F that is 6 SD outside the mean F for the dataset. Is this excess heterozygosity or excess homozygosity? Give one reason a sample might exhibit this.

Excess heterozygosity. The sample has fewer homozygous calls than expected. One reason is that the sample could be contaminated with DNA from another person; for example, DNA from two different people may have been transferred to the same tube.

Table 3: Hardy-Weinberg Equilibrium

    Fail Cases       Fail Controls
    SNP    Pvalue    SNP        Pvalue
    None             rs2968487  2.262e-007

Oval 4 information:

    Writing this text to log file [ GWAS_clean4.log ]
    Analysis started: Thu Jul 7 16:26:18 2011
    Options in effect: --file GWAS_clean3 --exclude HWE_out.txt --recode --out GWAS_clean4
    6363 (of 6363) markers to be included from [ GWAS_clean3.map ]
    244 individuals read from [ GWAS_clean3.ped ]
    0 individuals with nonmissing phenotypes
    Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
    Missing phenotype value is also -9
    0 cases, 0 controls and 244 missing
    123 males, 121 females, and 0 of unspecified sex
    Reading list of SNPs to exclude [ HWE_out.txt ]... 1 read
    Before frequency and genotyping pruning, there are 6362 SNPs
    244 founders and 0 non-founders found
    6 heterozygous haploid genotypes; set to missing
    Writing list of heterozygous haploid genotypes to [ GWAS_clean4.hh ]
    Total genotyping rate in remaining individuals is 0.997229
    0 SNPs failed missingness test ( GENO > 1 )
    0 SNPs failed frequency test ( MAF < 0 )
    After frequency and genotyping pruning, there are 6362 SNPs [Oval 4]
    After filtering, 0 cases, 0 controls and 244 missing [Oval 4]
    After filtering, 123 males, 121 females, and 0 of unspecified sex
    Writing recoded ped file to [ GWAS_clean4.ped ]
    Writing new map file to [ GWAS_clean4.map ]
    Analysis finished: Thu Jul 7 16:26:20 2011

Oval 1:  N = 248 DNA samples, N snp = 6424 SNPs
Box 1:   1 DNA sample failed because missing more than 10% of calls (MIND 0.10)
Oval 2:  N = 247 DNA samples, N snp = 6424 SNPs
Box 2a:  1 SNP (MAF > 5%) failed b/c missing rate per SNP > 5% of DNA samples (GENO 0.05)
Box 2b:  60 SNPs (MAF < 5%) failed b/c missing rate per SNP > 1% of DNA samples (GENO 0.01)
Box 3:   0 DNA samples failed because missing more than 3% of calls (MIND 0.03)
Box 4:   5 Individuals with inconsistent sex (resolve with collaborators)
         2 Duplicate pairs found (go to Question 3)
         1 Individual excluded due to relatedness
Oval 3:  N = 244 DNA samples, N snp = 6363 SNPs
Box 5:   0 DNA samples excluded based on excess heterozygosity or homozygosity (4 SD)
Box 6:   1 SNP in controls out of HWE with p < 10^-7
Oval 4:  N = 244 DNA samples, N snp = 6362 SNPs
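
The counts in the completed flowchart can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers that appear in the PLINK logs above; the variable names are illustrative.

```python
# Starting point (Oval 1).
samples, snps = 248, 6424

# Box 1: one sample removed for more than 10% missing calls.
samples -= 1
assert samples == 247

# The MAF split must account for every SNP (5867 common + 557 rare).
common, rare = 5867, 557
assert common + rare == snps

# Box 2a and Box 2b: SNPs removed for missingness in each MAF class.
common -= 1
rare -= 60
snps = common + rare
assert (common, rare, snps) == (5866, 497, 6363)

# Box 3: no further samples removed at the 3% threshold.
samples -= 0

# Box 4: two duplicate samples plus one related individual removed (Oval 3).
samples -= 2 + 1
assert (samples, snps) == (244, 6363)

# Box 6: one SNP removed for failing HWE in controls (Oval 4).
snps -= 1
print(samples, snps)
```

If any of these checks failed on your own run, the log of the corresponding PLINK step would be the first place to look.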