Thoughts on analytics, data management, visualization and collaboration

–Accel Event Recap

Posted by Brett Sheppard on July 22, 2010

Hadoop, Memcached and Solid-state Storage

At an event hosted at Stanford University, Accel Partners brought together executives from four of its portfolio companies to discuss the evolution of a “New Data Stack” incorporating Hadoop, memcached and solid-state storage.

At the event, Jeff Hammerbacher, Co-founder and Chief Scientist at Cloudera, discussed the evolution toward a new analytical platform that extends Hadoop. He compared it to the LAMP stack (Linux, Apache HTTP Server, MySQL and PHP) for web servers, but in this case the platform is for analytical data management. To continue the LAMP analogy, the goal is to enable an end-to-end “hello world” for analytical data management, in which end users can view reports and test hypotheses using familiar web browser interfaces.

This emerging platform includes Hadoop, HBase for read/write access, Hive for SQL-like queries on large data sets, Pig for dataflow, Oozie for workflow, Flume for streaming data collection, and ZooKeeper as a coordination service for distributed applications. Hive is adding JDBC and ODBC interfaces.

A copy of Jeff’s slides is available at SlideShare. You can also find the book Beautiful Data, edited by Jeff and Toby Segaran, on the O’Reilly and Amazon.com websites.

Cloudera is preparing a training package to help enterprises that are not specialists in the Hadoop Distributed File System (HDFS) and MapReduce benefit from Hadoop-driven analytics. Datameer and Karmasphere are two of the Cloudera partners providing data analytics solutions.

NorthScale Co-founder and Senior Vice President of Products James Phillips discussed the company’s Membase Server. Memcached is a critical component of tiered web architectures, where it is deployed either on the web or application servers themselves or on independent servers alongside a traditional database layer. As traffic demands increase, web applications using NorthScale scale out horizontally by adding web servers and caching servers, which alleviates database load and improves web application performance.
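To illustrate how a caching tier takes load off the database, here is a minimal sketch of the common cache-aside pattern. It is not NorthScale’s or memcached’s actual API: a plain Python dictionary stands in for the cache cluster, and `CacheAside` and `slow_profile_lookup` are hypothetical names invented for this example.

```python
from typing import Callable, Dict


class CacheAside:
    """Check the cache first; fall back to the database only on a miss."""

    def __init__(self, load_from_db: Callable[[str], str]) -> None:
        # Assumption: an in-process dict stands in for a memcached cluster.
        self._cache: Dict[str, str] = {}
        self._load_from_db = load_from_db
        self.db_reads = 0  # counts how often the database was actually hit

    def get(self, key: str) -> str:
        value = self._cache.get(key)
        if value is None:
            # Cache miss: read from the database and remember the result.
            value = self._load_from_db(key)
            self.db_reads += 1
            self._cache[key] = value
        return value


def slow_profile_lookup(user_id: str) -> str:
    # Hypothetical stand-in for an expensive database query.
    return f"profile-for-{user_id}"


store = CacheAside(slow_profile_lookup)
first = store.get("42")   # miss: goes to the "database"
second = store.get("42")  # hit: served from the cache
print(first, second, store.db_reads)
```

In this sketch the second request never reaches the database, which is the effect a caching tier is meant to have as traffic grows.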

Fusion-io provides flash-based solid-state storage application accelerators to HP, IBM, Dell and other customers. An example customer, Answers.com, achieved an 8x improvement in disaster recovery backup times. Another customer, Cloudmark, improved database replication performance by five times.

Bobby Johnson and Mark Rabkin from Facebook engineering discussed lessons learned from scaling a leading Internet property. Facebook generates about 60 terabytes of log files a day. Bobby is one of the principal authors of Scribe, a server for aggregating log data streamed in real time from a large number of servers. Facebook’s infrastructure resource reporting integrates multi-level histograms.

Hive allows Facebook developers to analyze large datasets using SQL. Facebook has created a web-based tool, HiPal, for business analysts to work with Hive. Non-engineers run Hive queries all the time.
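For readers who have not seen this style of analysis, the sketch below shows the kind of SQL aggregation an analyst might run. It does not use Hive or HiPal: Python’s built-in `sqlite3` module stands in for the query engine, and the `page_views` table and its rows are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a Hive warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "home"), ("u2", "home"), ("u1", "photos")],  # made-up sample rows
)

# Count views per page, most viewed first (ties broken by page name).
rows = conn.execute(
    "SELECT page, COUNT(*) AS views "
    "FROM page_views "
    "GROUP BY page "
    "ORDER BY views DESC, page"
).fetchall()

print(rows)
conn.close()
```

The appeal of a SQL-like layer is that a query of this shape stays the same whether the table holds three rows or a day of log data; only the engine executing it changes.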

Many thanks to Ping Li and his colleagues at Accel Partners for hosting an excellent event, with a thought-provoking look at how advances in Hadoop, memcached and solid-state storage are enabling a new stack for Big Data analytics.
