What is Big Data? Jaakko Hollmén Aalto University School of Science Helsinki Institute for Information Technology (HIIT) Espoo, Finland 6.2.2014
Speaker profile Jaakko Hollmén, senior researcher, D.Sc.(Tech.) Department of Information and Computer Science, Aalto University School of Science and Helsinki Institute for Information Technology (HIIT), Finland Industrial and university research in data analysis related topics since 1995, D.Sc.(Tech.) in computer science in 2000 Current research scope: machine learning, data mining, predictive analytics, time series analysis and prediction Exposure to various application areas: process industry, telecommunications, biology, medicine, environmental informatics, analysis of built environment Contact information at the end of the presentation
Speaker, short biography Jaakko Hollmén (b. 1970) received the degrees of M.Sc. (Tech.) in 1996, Lic.Sc. (Tech.) in 1999, and D.Sc. (Tech.) in 2000, all at the Department of Computer Science and Engineering at the Helsinki University of Technology in Finland. Since 2000, he has worked at the Department of Information and Computer Science (formerly Laboratory of Computer and Information Science) at the Aalto University School of Science in Finland. Currently, he is a Chief Research Scientist at Aalto University School of Science. He leads the research group Parsimonious Modelling at Helsinki Institute for Information Technology. The research group develops computational methods for data analysis and applies these methods to two particular application fields: cancer genomics and environmental informatics. Jaakko Hollmén's research interests include theory and practice of machine learning and data mining, especially their applications in bioinformatics and environmental time series analysis. He has served in program committees of conferences, such as SIGKDD, ICDM, ECML/PKDD, PAKDD, UAI, DS, and IDA. In 2011, he was the program chair for the 14th International Conference on Discovery Science (DS 2011) in Porto, Portugal and the Tenth International Symposium on Intelligent Data Analysis (IDA 2011), held in Espoo, Finland. In 2012, he was General Chair of the Eleventh International Symposium on Intelligent Data Analysis (IDA 2012), held in Helsinki, Finland. He is an author of over 30 journal articles and 50 conference contributions. He is the volume editor of 3 conference proceedings and holds editorial positions in three journals in his areas of interest. He is an inventor on two patents. Jaakko Hollmén is a Senior Member of IEEE.
Before Big Data 1000+ years: Data Analysis 100+ years: Statistics 50 years: Transistor, Computers, Artificial Intelligence 40 years: Internet 30 years: microcomputers 20 years: World Wide Web 20 years: Search engines for the Web Sensor technology, massive deployment 10 years: Social networks 2-3 years: Big Data
Of photographs, large and small Think of a simple photo and think of questions you can easily answer: how many persons? etc. Then, take the largest photograph in the world and see how the situation changes: how to answer most questions becomes non-obvious!
Of photographs, large and small World's largest photograph, size 320 Gigapixels
Of photographs, large and small Largest photograph of the world, taken from BT Telecom Tower in London in 2012 by 360Cities A single photograph: 320 Gigapixels The printed photo would be about 25 m by 100 m Photo consists of 46000 high-resolution photos, stitched together with computational techniques Web application with 1 million image patches, see the site: http://btlondon2012.co.uk/pano.html Posing and answering questions is non-obvious and may be a huge amount of work!
Big Data definitions First approach to defining Big Data: Processing of data becomes non-obvious and we end up in difficulties with normal tools and analytic processes Wikipedia definition: Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. A lasting definition: there will always be problems that are difficult to solve with standard tools; however, what counts as difficult will change over time
3V definition of Big Data Most often, Big Data is defined in terms of the 3 V's: Volume (amount of the data) Velocity (speed of information collection, processing, analysis) Variety (types of measurements, structured data vs. unstructured data) Other V's added (minor twists of lesser importance)
Big Data: Volume Unit of information: one bit, one byte A4 paper with written typewriter text: 5 kilobytes Pile of papers printed on both sides: 1 kilometer of papers is 100 Gigabytes, 10 km of papers is 1 Terabyte Facebook: 2.7 billion likes a day and 300 million photos in one day, resulting in 105 Terabytes (NY Times, December 2012) 105 Terabytes: a pile of papers equal to the length of Finland (about 1000 km) Storage and retrieval problem, resource allocation
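The paper-pile comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides: the sheet thickness of 0.1 mm (10,000 sheets per meter) and the use of decimal units (1 Gigabyte = 10^9 bytes) are assumptions chosen here, and the function names are hypothetical.

```python
# Figures from the slide: 5 kilobytes per typewritten A4 side, printed on both sides.
BYTES_PER_SIDE = 5_000
SIDES_PER_SHEET = 2

# Assumption (not from the slide): a sheet is 0.1 mm thick, i.e. 10,000 sheets per meter.
SHEETS_PER_METER = 10_000

GIGABYTE = 10**9   # decimal units assumed
TERABYTE = 10**12

BYTES_PER_METER = SHEETS_PER_METER * SIDES_PER_SHEET * BYTES_PER_SIDE


def pile_bytes(height_m: int) -> int:
    """Bytes stored in a pile of double-sided pages of the given height."""
    return height_m * BYTES_PER_METER


def pile_height_m(n_bytes: int) -> float:
    """Height in meters of the pile needed to hold n_bytes."""
    return n_bytes / BYTES_PER_METER


if __name__ == "__main__":
    print(pile_bytes(1_000) / GIGABYTE, "GB in a 1 km pile")
    print(pile_bytes(10_000) / TERABYTE, "TB in a 10 km pile")
    print(pile_height_m(105 * TERABYTE) / 1_000, "km for 105 TB")
```

Under these assumptions the slide's figures are mutually consistent: 105 Terabytes corresponds to a pile of roughly 1000 km.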
Big Data: Velocity The speed with which the data is created High sampling rates: mobile phone 30-50 Hz, that is, 30-50 samples per second, multiple sensors Extremely high sampling rates in scientific measurements settings Smart Cities proposal by IBM: world is intelligent, instrumented, and interconnected Data needs to be analyzed immediately, predictions provided within milliseconds
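To make the sampling rates above concrete, the sketch below counts how many samples accumulate in one day. It is an illustration only: the function name and the example of three sensors are assumptions made here, not figures from the presentation.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(rate_hz: int, n_sensors: int = 1) -> int:
    """Number of samples produced in one day at a constant sampling rate."""
    return rate_hz * n_sensors * SECONDS_PER_DAY


if __name__ == "__main__":
    # 30-50 Hz is the mobile phone range quoted on the slide.
    for rate in (30, 50):
        print(rate, "Hz, one sensor:", samples_per_day(rate), "samples per day")
    # Hypothetical device with three sensors sampled at 50 Hz.
    print("50 Hz, three sensors:", samples_per_day(50, n_sensors=3), "samples per day")
```

Even a single phone sensor yields millions of samples per day, which is why velocity matters independently of the stored volume.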
Big Data: Variety Many measurement modalities: video, numbers, text Structured data vs. non-structured data Images, unstructured text, likes, places, names, product names How to structure the unstructured data? Areas: Text mining, Sentiment analysis This may be the most difficult dimension of the 3V definition: how to combine different sources of data with differing modalities?
Big Data definitions, remarks Big Data has more to it than mere size; the name is a misnomer and oversimplifies what is important for the topic Volume is not the most difficult to handle, seen from the analytics point of view Variety may be the most challenging: how to combine the different modes of measurement? Prediction: the name Big Data will be obsolete by 2014 (or 2015), the topic will still be important Maturity of Big Data? Over-expectations in the past
Maturity of Big Data Gartner Hype Cycle: Big Data is currently in the trough of disillusionment (blogs.gartner.com)
Big Data: Need for analytics Twitter feed @IBMSPSS, August, 2012: #BigData without #analytics is like a flashlight without batteries. Analytics shines the light on where to go next. Analytics serve situations best when there is a lot of data and little understanding (J.H.) Data has been compared to being the new oil, as a new kind of raw material: what you do with the data and how you refine it makes the difference!
Data analysis problems in general Prediction Classification Pointing out the relevant variables Profiling and finding natural groups These problems illustrated through case studies from the personal research of the presenter
Time series prediction Data: daily electricity consumption Model: predict the future with the knowledge of the past Benefits: capacity planning, pricing
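One very simple way to "predict the future with the knowledge of the past" is a first-order autoregressive model, fitted by least squares. The sketch below is an assumption-laden illustration, not the model used in the presenter's case study: the function names are hypothetical and the consumption figures are made-up numbers.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = a * x[t-1] + b; returns (a, b)."""
    xs = series[:-1]   # past values
    ys = series[1:]    # values one step later
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def forecast(series, steps, a, b):
    """Iterate the fitted model forward from the last observed value."""
    predictions = []
    last = series[-1]
    for _ in range(steps):
        last = a * last + b
        predictions.append(last)
    return predictions


if __name__ == "__main__":
    # Hypothetical daily consumption values, for illustration only.
    consumption = [100.0, 104.0, 101.0, 107.0, 103.0, 109.0, 106.0]
    a, b = fit_ar1(consumption)
    print(forecast(consumption, steps=3, a=a, b=b))
```

Real electricity consumption has weekly and yearly seasonality, so a practical model would need more inputs than the previous day alone.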
Classifying patients Data: patient profile, genetic markers Model: probability (risk) model for disease classification Benefits: diagnostics, prioritizing care, personalized medicine
Selection of important variables Data: monitored nutrients in time Model: prediction model, use only relevant information Benefits: improve understanding, pinpoint important variables
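A simple baseline for pointing out relevant variables is to rank each candidate by the absolute value of its Pearson correlation with the quantity to be predicted. This is a generic filter approach shown as a sketch, not the method of the presenter's study; the variable names and numbers below are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def rank_variables(columns, target):
    """Return variable names sorted from most to least correlated with target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    columns = {
        "nutrient_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "nutrient_b": [2.0, 1.0, 2.0, 1.0, 2.0],
    }
    target = [2.0, 4.0, 6.0, 8.0, 10.0]
    print(rank_variables(columns, target))
```

Correlation ranking looks at one variable at a time and only captures linear relationships, so it can miss variables that matter only in combination.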
Profiling a patient database Data: DNA amplification data from cancer patients Model: profile the patients with probability models Benefits: improve understanding, classify patients Unstructured
Profiling a patient database Data: DNA amplification data from cancer patients Model: profile the patients with probability models Benefits: improve understanding, classify patients Structured
Summary of the illustrated analyses Benefits come from the combination of data and the analysis work Illustrated analyses: prediction, classification, pointing out the relevant variables, profiling and finding natural groups Plenty of challenges when applying to Big Data scenarios
Big Data Application areas Smarter healthcare Finance, trading analysis Telecom Log Analysis Traffic control Search quality Fraud detection, risk analysis Retail, churn detection Process industry Natural sciences
Summary and Conclusions Big Data will inevitably change a lot of business practices Many will take two steps: to Data and to Big Data Inflated expectations, hype, still an important area for a longer period of time, no turning back Non-obvious solutions using standard tools, existing divide-and-conquer solutions for Big Data problems How would Big Data support your business goals? Businesses must know the important questions, analysis and modelling help in providing the answers Benefits from Big Data vs. investments in capabilities and resources
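The divide-and-conquer idea mentioned in the conclusions can be illustrated with a toy example: split the data into chunks, compute a small summary for each chunk independently, and then combine the summaries. This is a minimal single-machine sketch under the assumption that the statistic of interest (here, the mean) can be assembled from per-chunk results; the function names are hypothetical and no particular Big Data framework is implied.

```python
def chunks(values, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def partial_stats(chunk):
    """Summary of one chunk: (sum, count). Each chunk can be processed separately."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Merge per-chunk summaries into the overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    values = list(range(1, 101))
    partials = [partial_stats(chunk) for chunk in chunks(values, 7)]
    print(combine(partials))
```

Because each chunk is summarized independently, the per-chunk step could be distributed over many machines, and only the small summaries need to be brought together.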
Contact information Jaakko Hollmén, D.Sc.(Tech.), Chief Research Scientist Department of Information and Computer Science, Aalto University School of Science, Espoo, Finland Helsinki Institute for Information Technology (HIIT) Web: http://users.ics.aalto.fi/jhollmen/ Twitter: @jhollmen E-mail: Jaakko.Hollmen@aalto.fi Telephone: +358-50-3260110