Big Data Best Practice

Sean Patrick Murphy
sean@pingthings.io
JSIS, Salt Lake City
May 23, 2017
The Value of Data, Circa 2006

"Data is just like crude. It's valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down and analyzed for it to have value."

M. Palmer, "Data is the new oil," http://ana.blogs.com/maestros/2006/11/data_is_the_new.html
"The problem is that just 2 percent of all the terabytes and petabytes of data generated by connected power plants, wind farms, grids, substations and energy management systems is being analyzed and used today."

Ganesh Bell, Chief Digital Officer, GE Power
How Much Data?

2 values × 64 bits/sample × 30 samples/second × 60 seconds/minute × 60 minutes/hour × 24 hours/day × 30 days/month × 12 streams/PMU × 1 byte/8 bits
= 14,929,920,000 bytes/month/PMU, or roughly 15 GB/month/PMU

Assume 2,500 deployed and active PMUs:
2,500 × 15 GB ≈ 37.5 TB/month
Significant Scale

"500,000 PMUs Deployed Today"

Dr. Edmund O. Schweitzer III, President and Chairman of the Board, Schweitzer Engineering Laboratories
How Much Data?

2 values × 64 bits/sample × 30 samples/second × 60 seconds/minute × 60 minutes/hour × 24 hours/day × 30 days/month × 12 streams/PMU × 1 byte/8 bits
= 14,929,920,000 bytes/month/PMU, or roughly 15 GB/month/PMU

Assume 2,500 deployed and active PMUs:
2,500 × 15 GB ≈ 37.5 TB/month

If all 500,000 PMUs came online:
500,000 × 15 GB ≈ 7.5 petabytes/month
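The arithmetic above can be checked with a short script. Every constant below comes from the slide's own assumptions (two 64-bit values per sample, 30 samples/second, 12 streams per PMU, a 30-day month):

```python
# Back-of-envelope PMU data volume, using the slide's assumptions.
BITS_PER_SAMPLE = 2 * 64                     # two 64-bit values per sample
SAMPLES_PER_MONTH = 30 * 60 * 60 * 24 * 30   # 30 Hz over a 30-day month
STREAMS_PER_PMU = 12

bytes_per_pmu_month = BITS_PER_SAMPLE * SAMPLES_PER_MONTH * STREAMS_PER_PMU // 8
print(f"{bytes_per_pmu_month:,} bytes/month/PMU")   # 14,929,920,000

# Fleet totals, using decimal units (1 TB = 1e12 bytes, 1 PB = 1e15 bytes).
print(f"{2_500 * bytes_per_pmu_month / 1e12:.1f} TB/month for 2,500 PMUs")
print(f"{500_000 * bytes_per_pmu_month / 1e15:.1f} PB/month for 500,000 PMUs")
```

Running it confirms roughly 15 GB per PMU per month, about 37 TB/month for today's 2,500 PMUs, and about 7.5 PB/month if all 500,000 came online.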
Just in Case You Were Wondering

Byte (8 bits)
Kilobyte (1,000 bytes)
Megabyte (1,000,000 bytes)
Gigabyte (1,000,000,000 bytes)
Terabyte (1,000,000,000,000 bytes)
Petabyte (1,000,000,000,000,000 bytes)
Exabyte (1,000,000,000,000,000,000 bytes)
Zettabyte (1,000,000,000,000,000,000,000 bytes)
Yottabyte (1,000,000,000,000,000,000,000,000 bytes) (named after Yoda!)
Xenottabyte (1,000,000,000,000,000,000,000,000,000 bytes)
Shilentnobyte (1,000,000,000,000,000,000,000,000,000,000 bytes)
Domegemegrottebyte (1,000,000,000,000,000,000,000,000,000,000,000 bytes)

(The last three are informal coinages, not official SI prefixes.)
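A small helper (a sketch, not from the slides) makes these magnitudes easier to read; it uses the decimal, power-of-1000 prefixes from the list above:

```python
SI_PREFIXES = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_bytes(n: float) -> str:
    """Format a byte count with decimal (power-of-1000) SI prefixes."""
    i = 0
    while n >= 1000 and i < len(SI_PREFIXES) - 1:
        n /= 1000
        i += 1
    return f"{n:.1f} {SI_PREFIXES[i]}"

print(human_bytes(14_929_920_000))          # one PMU-month of data
print(human_bytes(7_464_960_000_000_000))   # 500,000 PMUs for one month
```

Applied to the earlier figures, one PMU-month formats as "14.9 GB" and the full 500,000-PMU fleet as "7.5 PB".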
Perspective: Gorilla (Facebook's in-memory time series database)

- 2 billion unique time series, identified by a string key
- 700 million data points (timestamp and value) added per minute
- Store data for 26 hours
- More than 40,000 queries per second at peak
- Reads succeed in under one millisecond
- Support time series with 15-second granularity (4 points per minute per time series)
- Two in-memory, non-co-located replicas (for disaster recovery capacity)
- Always serve reads, even when a single server crashes
- Ability to quickly scan over all in-memory data
- Support at least 2x growth per year
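The requirements above can be sanity-checked with simple arithmetic. Note the per-point sizes here are assumptions drawn from the Gorilla paper (Pelkonen et al., VLDB 2015), not from this slide: 16 bytes raw (8-byte timestamp plus 8-byte value) and roughly 1.37 bytes after Gorilla's compression:

```python
# Rough memory footprint for Gorilla's 26-hour retention window.
points_per_minute = 700_000_000
retention_minutes = 26 * 60

total_points = points_per_minute * retention_minutes   # ~1.09 trillion points
raw_tb = total_points * 16 / 1e12          # 8-byte timestamp + 8-byte value
compressed_tb = total_points * 1.37 / 1e12 # reported avg bytes/point

print(f"{raw_tb:.1f} TB raw vs {compressed_tb:.1f} TB compressed")
```

Under those assumptions, 26 hours of ingest is about 17.5 TB uncompressed but only about 1.5 TB compressed, which is what makes keeping the whole window in RAM plausible.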
Cramming More Components onto Integrated Circuits

        CPU                RAM                Storage            Software
        Cost per gigaflop  Cost per gigabyte  Cost per gigabyte  Enterprise data systems
1995    $42,000            $32,000            $60,000            Millions of dollars
2017    $0.03              $4                 $0.03              Free, open source

The RAM required to hold a month's worth of PMU data for the entire North American continent costs approximately $10K.
Current Analytics Process

1. Gain access to the data
2. Query data from the historian and pull it down to a laptop
3. Write code in Excel/Matlab/R for a one-time analysis
4. Store data and code in a local folder
5. Move to production?
Microsoft Excel

Various studies over the past few years report that 88 percent of all spreadsheets have "significant" errors in them. Even the most carefully crafted spreadsheets contain errors in 1 percent or more of all formula cells.

JPMorgan Chase lost more than $6 billion in its "London Whale" incident, in part due to Excel spreadsheet errors (including alleged copying and pasting of incorrect information from multiple spreadsheets).

Raymond R. Panko, "What We Know About Spreadsheet Errors," Journal of End User Computing, Special Issue on Scaling Up End User Development, Vol. 10, No. 2, Spring 1998, pp. 15-21.
Analyst or Engineer
Best Practices: Free the Data

1. Data should be easily accessible, and visualization should be free, at full resolution, within seconds rather than days
Best Practices: Free the Data

2. The faster analyses execute, the more use cases emerge.

Platform components:
- Distributed, horizontally scalable message bus designed to handle any and all sensor data input
- BTrDB: distributed, horizontally scalable data store designed for hierarchical time series data
- Distributed analysis and deep learning platform for both real-time and asynchronous data analysis at scale
- PT Work Bench: collaborative data science environment to build and share analytics within the utility
- Web/mobile apps for specific use cases: data quality, GMD/GIC, asset maintenance, anomaly detection
Best Practices: Free the Data

3. Code, results, and ideas should be easily shared within the organization
Best Practices: Free the Data

4. Move ad-hoc analyses to production rapidly
5. Never throw away data
Questions? sean@pingthings.io

Leadership:
- Jerry Schuman, Co-Founder, Chief Executive Officer
- Sean Murphy, Co-Founder, Chief Data Scientist

"PingThings is a great example of the type of disruptive software that industries need to scale up the industrial Internet, and we're delighted to make an investment in the company." - William "Bill" Ruh, CEO, GE Digital

Partners & Customers