The Big Data Imperative
Posted by Brett Sheppard on July 22, 2010
World Data Volumes in the Zettabytes
With enterprise data volumes moving past terabytes to tens of petabytes and more, business and IT leaders face significant opportunities and challenges from Big Data. While IDC estimates that global data volumes have already reached multiple zettabytes, we’re seeing just the beginning of the Big Data era.
What constitutes Big Data is relative, and varies by organization. Big Data presents “significant design and decision factors for implementing a data management and analysis system” (O’Reilly Radar Release 2.0, Roger Magoulas and Ben Lorica, February 2009). For a large enterprise, Big Data may be in the petabytes or more, while for a small or mid-size enterprise, data volumes that grow into tens of terabytes can become challenging to analyze and manage.
We’ve entered the era of the “industrial revolution of data”, with “data factories” ranging from video and audio files to software logs, UPC scanners, RFID, and GPS transceivers, as noted by Joe Hellerstein, Professor of Computer Science at the University of California at Berkeley. “These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide” (O’Reilly Radar, “The Commoditization of Massive Data Analytics”, November 19, 2008).
As The Economist put it in its February 25, 2010 cover story “The Data Deluge”, “Data are becoming the new raw material of business: an economic input almost on par with capital and labour.” The magazine’s cover that week showed an oversized funnel with data pouring through. The challenge is to parse, interpret and communicate huge volumes of real-time and historical data without getting soaked.
Big Data is not just about cost control: more advanced analytics create business opportunities and even new business models. Alon Halevy, Peter Norvig, and Fernando Pereira at Google note in a seminal article, “The Unreasonable Effectiveness of Data”, that simple algorithms on Big Data sets can sometimes trump a previous generation of complex mathematical models that were limited to smaller data sets. For example, Google search activity for flu symptoms can predict flu outbreaks faster than the hospital reports collected by the Centers for Disease Control.
While initial cloud approaches in 2008 and 2009 showed inherent scaling limitations, second-generation technologies are enabling operational databases in the cloud, distributed data stores, and cloud-based analytic databases, with the key combination of central management and distributed use. In-memory caches and data grids are showing significant advances. MapReduce continues to grow in popularity, not as a replacement for other databases but as a complementary technology suited to specific use cases. There continue to be opportunities for innovation in creating an end-to-end analytical platform for data science: as quoted in The Fourth Paradigm, Microsoft Fellow Jim Gray notes that “We have to do better producing tools to support the whole research cycle – from data capture and data curation to data analysis and data visualization.”
As noted by Google Chief Economist Hal Varian in the McKinsey Quarterly, while we have ever larger volumes of data, “… the complementary scarce factor is the ability to understand that data and extract value from it.” We need individuals who not only have a deep understanding of statistics and probability, but who also bring “T-shaped skills”: the ability to access siloed corporate data, summarize and catalog data, understand natural language processing, and present compelling visualizations that drive business actions. Some of the most valuable, market-transforming work under way today within businesses and the public sector is occurring at the intersection of disciplines, where communities of thought leaders from different roles, departments and geographies share and apply knowledge gleaned from Big Data.
Zions Bancorporation provides an example of analytics collaboration in practice. At an EMC/Greenplum and 451 Group event held in February 2010 at the San Jose Museum of Art, Clint Johnson, Vice President of Business Intelligence, described how his bank quickly and effectively answered an SEC regulatory request by enabling multiple departments to coordinate. Previously, answering an SEC request could take many days or weeks. With improved collaboration, each department added value to a central document store, much like a virtual assembly line, marking an important milestone in implementing Bill Inmon’s Corporate Information Factory (CIF) vision.