Bioinformatics for Evolutionary Biologists

Bernhard Haubold, Angelika Börsch-Haubold
Bioinformatics for Evolutionary Biologists: A Problems Approach

Bernhard Haubold
Department of Evolutionary Genetics
Max-Planck-Institute for Evolutionary Biology
Plön, Schleswig-Holstein, Germany

Angelika Börsch-Haubold
Plön, Schleswig-Holstein, Germany

ISBN 978-3-319-67394-3
ISBN 978-3-319-67395-0 (ebook)
https://doi.org/10.1007/978-3-319-67395-0
Library of Congress Control Number: 2017955660

© Springer International Publishing AG 2017, corrected publication 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Evolutionary biologists have two types of ancestors: naturalists such as Charles Darwin (1809–1882) and theoreticians such as Ronald A. Fisher (1890–1962). The intellectual descendants of these two scientists have traditionally formed quite separate tribes. However, the distinction between naturalists and theoreticians is rapidly fading these days: Many naturalists spend most of their time in front of computers analyzing their data, and quite a few theoreticians are starting to collect their own data. The reason for this coalescence between theory and experiment is that two hitherto expensive technologies have become so cheap that they are now essentially free: computing and sequencing. Computing became affordable in the early 1980s with the advent of the PC. More recently, next-generation sequencing has allowed everyone to sequence the genomes of their favorite organisms.

However, analyzing these data remains difficult. The difficulties are twofold: conceptual (which method should I use?) and practical (how do I carry out a certain computation?). The aim of this book is to help the reader overcome both difficulties. We do this by posing a series of problems. These come in two forms: paper-and-pencil problems and computer problems. Our choice of concepts is centered on the analysis of sequences in an evolutionary context. The aim here is to give the reader a look under the hood of the programs applied in the computer problems.

The computer problems are solved in the same environment used for decades by scientists, the UNIX command line, also known as the shell. This is available on all three major desktop operating systems, Windows, Linux, and OS-X. Like any skill worth learning, using the shell takes practice. The computer problems are designed to give the reader plenty of opportunity for that.

In Chap. 1, we introduce the command line. After explaining how to get started, we deal with plain text files, which serve as input and output of most UNIX operations. Many of these operations are themselves text files containing commands to be executed on some input. Such command files are called scripts, and their treatment concludes Chap. 1.

In Chap. 2, the newly acquired UNIX skills are used to explore a central concept in Bioinformatics: sequence alignment. A sequence alignment represents an evolutionary hypothesis about which residues have a recent common ancestor. This is determined using optimal alignment methods that extract the best out of a very large number of possible alignments. However, this optimal approach consumes a lot of time and memory.

The computation of exact matches, the topic of Chap. 3, is less resource intensive than the computation of alignments. Taken by themselves, exact matches are also less useful than alignments, because exact matches cannot take into account mutations. Nevertheless, exact matching is central to many of the most popular methods for inexact matching. We begin with methods for exact matching in time proportional to the length of the sequence investigated. Then we concentrate on methods for exact matching in time independent of the text length. This is achieved by indexing the input sequence through the construction of suffix trees and suffix arrays.

In Chap. 4, we show how to combine alignment with exact matching to obtain very fast programs. The most famous example of these is BLAST, which is routinely used to find similarities between sequences. Up to that point, we only look at pairwise alignment. At the end of Chap. 4, we generalize this to multiple sequence alignment.

In Chap. 5, multiple sequence alignments are used to construct phylogenies. These are hypotheses about the evolution of a set of species. If we zoom in from evolution between species to evolution within a particular species, we enter the field of population genetics, the topic of Chap. 6. Here, we concentrate on modeling evolution by following the descent of a sample of genes back in time to their most recent common ancestor. These lines of descent form a tree known as the coalescent, the topic of much of modern population genetics.

We conclude in Chap. 7 by introducing two miscellaneous topics: statistics and relational databases. Both would deserve books in their own right, and we restrict ourselves to showing how they fit in with the UNIX command line.
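The idea of indexing a sequence for exact matching, outlined above for Chap. 3, can be illustrated with a small sketch. It is not taken from the book: the function names `build_suffix_array` and `find_occurrences` are hypothetical, the construction simply sorts all suffixes (adequate only for short sequences), and the lookup is a binary search, so its running time still grows logarithmically with the text length rather than being fully independent of it, as it would be with a suffix tree.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted
    lexicographically. Naive construction by sorting; an assumption
    made for brevity, not an efficient algorithm."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of all exact matches of
    pattern in text, using binary search on the suffix array sa."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes between the two bounds start with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    seq = "ACGTACGTGA"          # toy sequence, made up for illustration
    sa = build_suffix_array(seq)
    print(find_occurrences(seq, sa, "ACG"))  # prints [0, 4]
```

The index is built once per sequence and can then be queried for any number of patterns, which is where the approach pays off compared with scanning the text anew for each pattern.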
Our course is sequence-centric, because sequence data permeate modern biology. In addition, these data have attracted a rich set of computer methods for data analysis and modeling. The sequences we analyze can be downloaded from the companion website for this book:

http://guanine.evolbio.mpg.de/problemsbook/

To these sequences, we apply generic tools provided by the UNIX environment, published bioinformatics software, and programs written for this course. The latter are designed to allow readers to analyze a particular computational method. The programs are also available from the companion site.

At the back of the book, we give complete solutions to all the problems. The solutions are an integral part of the course. We recommend you attempt the problems in the order in which they are posed. If you find a solution, compare it to ours. If you cannot find a solution, read ours and try again. If our solution is unclear or you have a better one, please drop us a line at

problemsbook@evolbio.mpg.de

The tongue-in-cheek Algorithm 1 summarizes these recommendations.

Algorithm 1 Using the Solutions
  while problem unsolved do
    solve problem
    read solution
    if solution unclear or your solution is better than ours then
      drop us a line
    end if
  end while

This book has been in the works since 2003, when BH started teaching Bioinformatics at the University of Applied Sciences, Weihenstephan. We thank all the students who gave us feedback on this material as it evolved over the years. We would also like to thank a few individuals who contributed in more specific ways to the gestation of this book: Mike Travisano (University of Minnesota) gave us encouragement at a critical time. Nicola Gaedeke and Peter Pfaffelhuber (University of Freiburg) commented on an early draft, and our students Linda Krause, Xiangyi Li, Katharina Dannenberg, and Lina Urban read large parts of the manuscript in one of the many guises it has taken over the years. We are grateful to all of them.

Plön, Germany, July 2017
Bernhard Haubold
Angelika Börsch-Haubold

The original version of the book backmatter was revised: For detailed information please see Erratum. The erratum to this chapter is available at https://doi.org/10.1007/978-3-319-67395-0_9

Contents

1 The UNIX Command Line
  1.1 Getting Started
  1.2 Files
  1.3 Scripts
    1.3.1 Bash
    1.3.2 Sed
    1.3.3 AWK

2 Constructing and Applying Optimal Alignments
  2.1 Sequence Evolution and Alignment
  2.2 Amino Acid Substitution Matrices
    2.2.1 Genetic Code
    2.2.2 PAM Matrices
  2.3 The Number of Possible Alignments
  2.4 Dot Plots
  2.5 Optimal Alignment
    2.5.1 From Dot Plot to Alignment
    2.5.2 Global Alignment
    2.5.3 Local Alignment
  2.6 Applications of Optimal Alignment
    2.6.1 Homology Detection
    2.6.2 Dating the Duplication of Adh

3 Exact Matching
  3.1 Keyword Trees
  3.2 Suffix Trees
  3.3 Suffix Arrays
  3.4 Text Compression
    3.4.1 Move to Front (MTF)
    3.4.2 Measuring Compressibility: The Lempel-Ziv Decomposition

4 Fast Alignment
  4.1 Alignment with k Errors
  4.2 Fast Local Alignment
    4.2.1 Simple BLAST
    4.2.2 Modern BLAST
  4.3 Shotgun Sequencing
  4.4 Fast Global Alignment
  4.5 Read Mapping
  4.6 Clustering Protein Sequences
  4.7 Position-Specific Iterated BLAST
  4.8 Multiple Sequence Alignment
    4.8.1 Query-Anchored Alignment
    4.8.2 Progressive Alignment

5 Evolution Between Species: Phylogeny
  5.1 Trees of Life
  5.2 Rooted Phylogeny
  5.3 Unrooted Phylogeny

6 Evolution Within Populations
  6.1 Descent from One or Two Parents
    6.1.1 Bi-Parental Genealogy
    6.1.2 Uni-Parental Genealogy
  6.2 The Coalescent

7 Additional Topics
  7.1 Statistics
    7.1.1 The Significance of Single Experiments
    7.1.2 The Significance of Multiple Experiments
    7.1.3 Mouse Transcriptome Data
  7.2 Relational Databases
    7.2.1 Mouse Expression Data
    7.2.2 SQL Queries
    7.2.3 Java
    7.2.4 ENSEMBL

8 Answers and Appendix: Unix Guide
  8.1 Answers
  8.2 Appendix: UNIX Guide
    8.2.1 File Editing
    8.2.2 Working with Files
    8.2.3 Entering Commands Interactively
    8.2.4 Combining Commands: Pipes
    8.2.5 Redirecting Output
    8.2.6 Shell Scripts
    8.2.7 Directories
    8.2.8 Filters
    8.2.9 Regular Expressions

Erratum to: Bioinformatics for Evolutionary Biologists
References
Index