Posted by Brett Sheppard on July 27, 2010
The Next LAMP Stack: Hadoop Platform for Big Data Analytics
Editor’s note: a shorter version of this article appeared on GigaOM.
Many Fortune 500 and mid-size enterprises are intrigued by Hadoop for Big Data analytics and are funding Hadoop test/dev projects, but would like to see Hadoop evolve into a more fully integrated analytics platform, similar to what the LAMP (Linux, Apache HTTP Server, MySQL and PHP) stack has enabled for web applications. For example, Joe Cunningham, head of technology strategy and innovation at credit card giant Visa, told the audience at last year’s Hadoop World that he would like to see Visa’s use of Hadoop evolve from an alpha/beta environment into mainstream use for transaction analysis, but that he has concerns about integration and operations management.
Trident Capital senior managing director Evangelos Simoudis noted in his Trident Capital Blog that to date there has not been a “LAMP stack equivalent” for Big Data aggregation, processing and analytics. Enterprise business analysts and IT departments don’t want to have to code their own MapReduce functions. As quoted in The Fourth Paradigm, Microsoft Fellow Jim Gray said, “We have to do better producing tools to support the whole research cycle – from data capture and data curation to data analysis and data visualization.”
For Hadoop to realize its potential for widespread enterprise adoption, it needs to be as easy to install and use as Lotus 1-2-3 or its successor Microsoft Excel. When Lotus introduced 1-2-3 in 1983, it chose the name to represent the tight integration of three capabilities: a spreadsheet, charting/graphing and simple database operations. As a high school student in the mid-1980s, I used Lotus 1-2-3 to manage the reseller database for a storage startup, Maynard Electronics, which developed what eventually became Symantec’s Backup Exec software. Even as a 15-year-old, I found Lotus 1-2-3 easy to understand and use. More recently, with Microsoft Excel 2010 and SQL Server 2008 R2, I can click on Excel ribbon buttons to load and prepare PowerPivot data and quickly create charts and graphs using built-in templates.
Unlike Lotus 1-2-3 or its successor Microsoft Excel, Hadoop still requires highly technical staff to understand and install its many components. Pig-based scripting and SQL-like interfaces through Hive help, but should be viewed as pieces of a broader data analytics platform.
I recently attended a small gathering organized by Accel Partners, a Silicon Valley VC firm, to discuss the new data stack that needs to come together, much as the LAMP stack emerged for the Internet in the last decade. At the Accel event, Jeff Hammerbacher, co-founder and vice president of products at Cloudera, discussed the evolution toward a new analytical platform extending Hadoop: like the LAMP stack for web servers, but in this case a platform for analytical data management.
This emerging platform for Big Data analytics includes:
- Hadoop Distributed File System (HDFS) for storage
- MapReduce for distributed processing of large data sets on compute clusters
- HBase for fast read/write access to tabular data
- Hive for SQL-like queries on large data sets as well as a columnar storage layout using RCFile
- Flume and Scribe for log file and streaming data collection, along with Sqoop for database imports
- JDBC and ODBC drivers to allow tools written for relational databases to access data stored in Hive
- Hue for user interfaces (as well as interesting user interfaces from Karmasphere and Datameer)
- Pig for dataflow and parallel computations
- Oozie for workflow
- Avro for serialization
- ZooKeeper as a coordination service for distributed applications
- RHIPE (R and Hadoop Integrated Processing Environment) to code Hadoop in R.
Most of these tools, if not all, are independent of any one Hadoop distribution. You can choose, for example, among the Apache Hadoop, Amazon EMR, Cloudera and Yahoo! distributions. For more information about specific tools, visit the Apache Hadoop page.
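The division of labor among these components is easiest to see with MapReduce itself, the programming model at the heart of the stack. Below is a minimal sketch in plain Python, rather than Hadoop's Java API, of the map, shuffle and reduce phases that Hadoop distributes across a cluster; the function and variable names are purely illustrative and are not part of any Hadoop interface.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (word, 1) pairs for each word in an input line."""
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

# Run the word-count pipeline over a toy "data set" of two lines.
lines = ["big data big analytics", "big data"]
pairs = (pair for line in lines for pair in map_phase(line))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'analytics': 1}
```

In a real Hadoop job the input lines come from HDFS blocks, the map and reduce functions run on different machines, and the shuffle moves data over the network; the logic of the three phases, however, is just this.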
Hadoop Enterprise Adoption
Adobe’s infrastructure services team began working with HBase in mid-2008, when one of their internal clients requested a service that could handle 40 million records with fast aggregation and access. They have since scaled HBase implementations to handle several billion records with access times under 50 milliseconds. Adobe software engineers have developed what they term “Hstack”, in which they have integrated HDFS, HBase and ZooKeeper with the Puppet configuration management tool. Adobe can now automatically deploy a complete analytical data stack across a cluster and implement cluster-wide upgrades.
At Facebook, Hive allows developers to perform analysis against large datasets using SQL. Facebook created a web-based tool, HiPal, for business analysts to work with Hive. According to Bobby Johnson and Mark Rabkin from Facebook engineering, who presented at the Accel event, non-engineers at Facebook do Hive queries all the time. Business analysts can view reports and test hypotheses using familiar web browser interfaces.
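To give a feel for what those analysts are writing, here is the shape of a typical aggregate query. HiveQL is SQL-like rather than standard SQL, so this sketch uses Python's built-in sqlite3 purely to make the query runnable; the table and column names are invented for illustration and have nothing to do with Facebook's actual schema.

```python
import sqlite3

# In-memory stand-in for a Hive table of page-view events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [(1, "US"), (2, "US"), (3, "BR"), (4, "BR"), (5, "BR")],
)

# The kind of query a business analyst might issue through Hive:
# distinct users per country, largest first. In Hive, a statement
# like this is compiled into MapReduce jobs behind the scenes.
rows = conn.execute("""
    SELECT country, COUNT(DISTINCT user_id) AS users
    FROM page_views
    GROUP BY country
    ORDER BY users DESC
""").fetchall()
print(rows)  # [('BR', 3), ('US', 2)]
```

The point is that nothing in the query mentions mappers, reducers or HDFS: the analyst thinks in tables and aggregates, and the platform handles the distribution.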
As evidenced by these Hadoop cases, with the benefit of an analytical platform, data science becomes more integral to businesses, and less a quirky separate function. As an industry, we’ve come a long way since, famously, Jim Gray was once thrown out of the IBM Scientific Center in Los Angeles for failure to adhere to IBM’s dress code. Successful data scientists require “T-shaped skills” that include an understanding not only of SQL and statistics but also of natural language processing, content management, data visualization and enterprise-wide collaboration.
There has been significant progress in enabling an end-to-end “hello world” for analytical data management. If you download Cloudera’s CDH3b2, you can import data with Flume, write it into HDFS, and then run queries using Cloudera’s Beeswax Hive user interface.
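The structure of that "hello world" pipeline can be sketched in a few lines. The snippet below uses toy in-process stand-ins for each stage, a Flume-style source, an HDFS-style store and a Hive-style aggregate query; all of the function names are invented for illustration, and no actual Flume, HDFS or Hive APIs are used.

```python
import collections
import csv
import io

def collect_events():
    """Stand-in for a Flume source tailing an application log."""
    yield "2010-07-27,login,alice"
    yield "2010-07-27,login,bob"
    yield "2010-07-27,purchase,alice"

def write_to_store(events, store):
    """Stand-in for a Flume sink appending events into HDFS."""
    for event in events:
        store.write(event + "\n")

def query_event_counts(store):
    """Stand-in for a Hive/Beeswax query:
    SELECT event, COUNT(*) FROM log GROUP BY event."""
    store.seek(0)
    counts = collections.Counter(row[1] for row in csv.reader(store))
    return dict(counts)

store = io.StringIO()  # in place of a file in HDFS
write_to_store(collect_events(), store)
print(query_event_counts(store))  # {'login': 2, 'purchase': 1}
```

Collection, durable storage and query are cleanly separated stages, which is exactly what lets the CDH components snap together end to end.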
There’s still work to be done, of course. Doug Cutting and his colleagues in the Apache Avro project are developing support for Map/Reduce over Avro for data interchange. Cloudera is working on a training guide to help enterprises that are not specialists in HDFS and MapReduce to benefit from Hadoop-driven analytics and integrate Hadoop with their existing enterprise IT architectures. Cloudera partners Datameer and Karmasphere are creating enterprise solutions on top of the Hadoop stack.
At HP’s business intelligence solutions group, I led a program that developed migration tools and documented implementation best practices to extend HP’s Global Method for BI Implementation and integrate HP BI with enterprise architectures such as TOGAF, the Zachman framework and the U.S. government’s Federal Enterprise Architecture. Similar tools and documented best practices for Hadoop would go a long way toward reducing the time, cost and risk of deploying and managing Hadoop as an enterprise analytical data platform.
As Michael Dell, founder and chief executive of Dell, told the Financial Times, “We are still in the early stages of our industry in terms of how do organizations take advantage of, and tap into, the power of the information that they have.” (“The IT revolution is just beginning”, May 19, 2010). As the Hadoop data stack becomes more LAMP-like, we get closer to realizing Jim Gray’s vision and giving enterprises an end-to-end analytics platform to unlock the power of their Big Data with the ease of use of a Lotus 1-2-3 or Microsoft Excel spreadsheet.