Thoughts on analytics, data management, visualization and collaboration

–Parse and Visualize Unstructured Data in Hadoop

Posted by Brett Sheppard on November 5, 2011

In October 2011, I recorded a video chat with Ronen Schwartz, Vice President of B2B Products at Informatica, and Karl Van den Bergh, Vice President, Product and Alliances at Jaspersoft. The video runs a little under 10 minutes, and discusses approaches by Informatica and Jaspersoft to parse and visualize big data in Hadoop. (Disclaimer: Informatica is a Zettaforce client).

These approaches extend Jaspersoft business intelligence and Informatica data integration investments to leverage data stored in Hadoop, and support an Informatica drag-and-drop visual studio approach for parsing data in Hadoop that reduces the need for developers to manually write Map/Reduce scripts. For a broader discussion of data integration with Hadoop beyond the specific topics of parsing and visualization covered in this video and article, please refer to a Hadoop article series I wrote this summer for the Informatica Perspectives Blog.

Video chat with Brett Sheppard (left-hand side, wearing a white shirt), Karl Van den Bergh (center, in a blue shirt), and Ronen Schwartz (right-hand side, in a black shirt).

Challenges with Parsing

Parsing breaks down data into its component parts, with an explanation of the form, function, arrangement or patterns (syntax) of each part. Natural language is among the most difficult kinds of data to parse. While most languages including English have grammar rules for subjects, verbs, pronouns, etc., specific sentences can be difficult to parse in the absence of context. Even with the combination of IBM natural language processing software, a supercomputer and Hadoop, IBM Watson struggled to understand some of the language formulations in questions posed by the Jeopardy! TV game show earlier this year.

Numbers can also require context or industry-specific data formats to understand correctly. For example “451” may refer to: a telephone area code in Ontario (Canada), the city of Lubeck (Germany) or the island of Guam (unincorporated U.S. territory in the western Pacific Ocean); the title of a Ray Bradbury novel (Fahrenheit 451) or a technology research consultancy (The 451 Group), both named after the temperature in degrees Fahrenheit at which paper burns; or the address of the Nebraska Department of Motor Vehicles office at 451 Main Street in the town of Chadron, Nebraska.

Within the context of a call data record (CDR), it might at first seem straightforward to identify that 451 is a telephone area code. However, is it the area code for the phone caller or the phone recipient, part of the IP address for a Voice over IP call, or part of the ID of a cell tower that connected a mobile phone call? Answering that question requires interpreting the structure of the call data record through parsing.
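As a rough illustration of why the record structure matters, here is a minimal Python sketch that parses a single CDR line into named fields. The pipe-delimited layout, the field names and the sample values are all assumptions made up for this example; real CDR formats vary by vendor and are usually far more involved.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical layout (an assumption for illustration only):
# caller_number|recipient_number|cell_tower_id|duration_seconds
@dataclass
class CallRecord:
    caller_area_code: str
    caller_number: str
    recipient_area_code: str
    recipient_number: str
    cell_tower_id: str
    duration_seconds: int


def split_phone(number: str) -> Tuple[str, str]:
    """Split a 10-digit phone number into (area code, local number)."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {number!r}")
    return digits[:3], digits[3:]


def parse_cdr(line: str) -> CallRecord:
    """Interpret one raw line according to the assumed field order."""
    caller, recipient, tower, duration = line.strip().split("|")
    caller_area, caller_local = split_phone(caller)
    recipient_area, recipient_local = split_phone(recipient)
    return CallRecord(
        caller_area_code=caller_area,
        caller_number=caller_local,
        recipient_area_code=recipient_area,
        recipient_number=recipient_local,
        cell_tower_id=tower,
        duration_seconds=int(duration),
    )


# "451" appears twice in this line, but only the field position tells us
# which occurrence is the caller's area code and which is part of a tower ID.
record = parse_cdr("451-555-0100|212-555-0199|TWR-0451-A|73")
print(record.caller_area_code, record.recipient_area_code, record.cell_tower_id)
```

The point of the sketch is that the same three digits mean different things depending on where they sit in the record, which is exactly the knowledge a parser has to encode.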

Parsing Data in Hadoop

Parsing data through Map/Reduce jobs can be an important step in enabling business intelligence software to interpret raw or unstructured data stored in a Hadoop Distributed File System (HDFS) cluster. Previously, parsing data in Hadoop required developers to write custom Map/Reduce scripts based on industry-specific data formats and/or data transformation rules. For example, to parse and “flatten” data for millions or billions of call data records (through the Map and Reduce steps, respectively), it was necessary to understand and define the CDR data format, structure and logic.
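To make the Map and Reduce steps concrete, the following Python sketch simulates them locally on a handful of records. It does not use any Hadoop API; the three-field record layout and the function names (map_cdr, reduce_cdr, run_local) are hypothetical, and the in-memory grouping merely stands in for the shuffle that Hadoop performs between the two steps.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, Tuple


def map_cdr(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: parse one raw record and emit (caller, duration)."""
    fields = line.strip().split("|")
    if len(fields) != 3:
        return  # skip records that do not match the assumed layout
    caller, _recipient, duration = fields
    yield caller, int(duration)


def reduce_cdr(caller: str, durations: Iterable[int]) -> Tuple[str, int, int]:
    """Reduce step: flatten all calls for one caller into a single row."""
    durations = list(durations)
    return caller, len(durations), sum(durations)


def run_local(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Stand-in for the shuffle: group mapped values by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_cdr(line):
            grouped[key].append(value)
    return [reduce_cdr(key, values) for key, values in sorted(grouped.items())]


raw = [
    "4515550100|2125550199|60",
    "4515550100|3105550123|30",
    "bad record",
    "2125550199|4515550100|15",
]
rows = run_local(raw)
for row in rows:
    print(row)  # (caller, number_of_calls, total_seconds)
```

Even in this toy version, the format knowledge (field order, delimiter, what counts as malformed) lives inside hand-written code, which is the maintenance burden the rest of this article is concerned with.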

As defined by David Loshin in a white paper “Parsing and Standardization” available at the TDWI website: “The goal of parsing is to scan the value(s) embedded in an attribute and produce the data components that represent the conceptual tokens in the data stored within the data element.” For a CDR, a token is a chunk of information such as a person’s name, phone number or IP address.

As is often the case with data management, what starts out sounding simple – such as a person’s name – quickly becomes complicated when there are millions or billions of records, which may derive from multiple data sources that have incompatible formats, missing information or duplicate records. The data elements in a name may comprise multiple pieces of information, such as first, middle and last name; title (Ms., Mr., Dr., etc.); academic suffixes (e.g., Jane Ellen Smith, M.D.); or generational suffix (e.g., John Quincy Smith, Jr.). There can be multiple acceptable formats that data can take (e.g., not everyone has a middle name, and some countries list the family name first).

Parsing also requires a means for tagging records that have unidentified data components. For example, there are infrequently used suffixes that a parser may not recognize, such as K.B.E. for “Knight Commander of the Order of the British Empire” – used by male honorary British knights – or the variant for women, “Dame Commander of the Order of the British Empire” (D.B.E.). And just in case that’s not complex enough for you, if you’re a truly special person in Great Britain you can be recognized as a Knight Grand Cross or Dame Grand Cross and add the suffix G.B.E. to your name. Rather than trying, and failing, to define all of these minutiae of name variants, a parser tags the components it cannot identify. In addition to parsing, related steps in an information management process may include data cleansing or data de-duplication.
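Here is a small Python sketch of that tagging idea. The lists of known titles and suffixes are deliberately tiny, the parse_name function is hypothetical, and the sketch assumes a given-name-first ordering, so treat it as an illustration of the approach rather than a usable name parser.

```python
import re
from typing import Dict, List, Optional, Union

# Deliberately small lookup tables (assumptions for this example).
TITLES = {"Ms.", "Mr.", "Mrs.", "Dr."}
SUFFIXES = {"Jr.", "Sr.", "M.D.", "Ph.D."}

# Dotted abbreviations such as "K.B.E." look like suffixes even when unknown.
ABBREVIATION = re.compile(r"(?:[A-Za-z]+\.){2,}")

ParsedName = Dict[str, Union[Optional[str], List[str]]]


def parse_name(raw: str) -> ParsedName:
    """Split a name into components and tag what cannot be identified."""
    tokens = raw.replace(",", " ").split()
    parsed: ParsedName = {
        "title": None, "given": [], "family": None,
        "suffixes": [], "unidentified": [],
    }
    if tokens and tokens[0] in TITLES:
        parsed["title"] = tokens.pop(0)

    # Peel suffix-like tokens off the end, preserving their original order.
    trailing: List[str] = []
    while tokens and (tokens[-1] in SUFFIXES or ABBREVIATION.fullmatch(tokens[-1])):
        trailing.insert(0, tokens.pop())
    for token in trailing:
        bucket = "suffixes" if token in SUFFIXES else "unidentified"
        parsed[bucket].append(token)

    # Assumes given names come first and the family name last.
    if tokens:
        parsed["family"] = tokens.pop()
        parsed["given"] = tokens
    return parsed


print(parse_name("Dr. Jane Ellen Smith, M.D."))
print(parse_name("John Quincy Smith, Jr."))
print(parse_name("Mr. Paul Jones, K.B.E."))  # K.B.E. lands in "unidentified"
```

Tagged components can then be routed to a review queue or a later cleansing step instead of silently corrupting the family-name field.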

Informatica HParser

With an announcement this week, Informatica now offers a visual studio approach to parse raw or unstructured data stored in Hadoop. (Disclaimer: Informatica is a Zettaforce client). With Informatica HParser, developers can create an abstraction layer between the application logic in Map/Reduce and the data sources. Business analysts who use Jaspersoft or other business intelligence (BI) tools can then visualize that data in their standard BI desktop, web or mobile user interfaces.

With HParser, a developer or data analyst can build a parser – or a broader data translation or data quality process, if needed – using an Eclipse-based Data Transformation (DT) Studio (example screenshot pictured below). The developer then “deploys” the parser from the DT Studio to a file-based repository, from which it is distributed to the Hadoop cluster nodes. Either Map/Reduce scripts or Hive/Pig can then invoke and execute the rules defined in the translation on a given input.

According to a pre-announcement briefing last month by Informatica, HParser’s combination of out-of-the-box parsers for industry standards and a visual environment for handling file format definitions can reduce the development time for parsers by an estimated 80%. The Informatica DT Studio also makes it easier to manage new formats and changes to existing formats over time.

For example, within the healthcare and life sciences industry, there is a lot of variation and complexity in the International Organization for Standardization (ISO) Abstract Syntax Notation One (ASN.1) data representation format used for the storage and retrieval of data for genomic research such as nucleotide and protein sequences. Organizations in telecommunications and financial services, among other industries, can also benefit from out-of-the-box support for industry standards.

For one of the most common use cases for Hadoop – storage and analysis of web logs – the Informatica HParser offering goes beyond the relatively straightforward parsing of Apache web logs to interpret proprietary vendor-specific logs such as web analytics data from Adobe Omniture. In the beta testing leading up to this week’s announcement, Informatica was able to relate data items from different parts of the web logs and interpret data through log hierarchies.
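For comparison, the “relatively straightforward” case looks something like the Python sketch below, which parses one line in the Apache Common Log Format with a regular expression. This is hand-written code, not anything produced by HParser, and the function name and returned field names are my own choices; proprietary logs with nested hierarchies are considerably harder than this.

```python
import re
from datetime import datetime
from typing import Any, Dict, Optional

# Apache Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line: str) -> Optional[Dict[str, Any]]:
    """Return the fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry: Dict[str, Any] = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no bytes were returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
entry = parse_log_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["size"])
```

A single regular expression works here because each line is self-contained; relating items across different parts of a log, as described above, needs state that a line-by-line pattern cannot carry.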

The Informatica HParser Commercial Editions support web log files, XML, JSON (JavaScript Object Notation), Microsoft Word documents, Excel spreadsheets, PDFs and industry-standard formats such as HIPAA for healthcare and SWIFT for financial transfers. The Informatica website lists the currently supported industry formats. For the Commercial Editions, Informatica offers a free 30-day evaluation. In addition, Informatica offers a free (but not open source) HParser Community Edition – available for download at Informatica Marketplace – that supports web logs, XML files and JSON objects.

For more details, visit Informatica.com/Hadoop.

Jaspersoft Support for Hadoop

Jaspersoft supports three primary modes of access to data in Hadoop. First, directly through Hive, to accept SQL-like queries through HiveQL. This approach can suit IT staff or developers who want to run batched reports, but current-generation Hive can be a rather slow interface, so it’s not ideal for use cases that require a low-latency response.

Second, Jaspersoft is one of the first – if not the first – business intelligence software vendors to provide connectivity directly to HBase. Jaspersoft feeds this data into its in-memory engine through an HBase connector. This approach can work well for a business analyst to explore data stored in Hadoop without the need to write Map/Reduce tasks. HBase has no native query language, so there’s no filtering language, but there are filtering APIs. Jaspersoft’s HBase connector supports the various HBase filters from simple ones such as StartRow and EndRow to more complex filters such as RowFilter, FamilyFilter, ValueFilter, or SkipValueFilter. The Jaspersoft HBase query can specify exactly the ColumnFamilies and/or Qualifiers that are returned; some HBase users have very wide tables, so this can be important for performance and usability. The Jaspersoft HBase connector also ships with a deserialization engine (SerDe) framework for data entered via HBase’s shell and for data using Java default serialization; users can plug in their existing deserialization .jars so the connector will automatically convert from HBase’s raw bytes into meaningful data types.
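The ideas in that paragraph (row-range scans, column family projection and converting raw bytes into typed values) can be mimicked in plain Python. The sketch below uses an ordinary dictionary in place of an HBase table and does not call the HBase client API or the Jaspersoft connector; the table contents, column names and the scan and deserialize helpers are all hypothetical.

```python
import struct
from typing import Any, Callable, Dict, Iterator, Optional, Set, Tuple

Row = Dict[str, bytes]  # "family:qualifier" -> raw bytes

# In-memory stand-in for an HBase table (sample data is made up).
TABLE: Dict[bytes, Row] = {
    b"call#0001": {"cdr:duration": struct.pack(">q", 73),
                   "cdr:tower": b"TWR-A", "meta:source": b"switch-1"},
    b"call#0002": {"cdr:duration": struct.pack(">q", 15),
                   "cdr:tower": b"TWR-B", "meta:source": b"switch-2"},
    b"call#0003": {"cdr:duration": struct.pack(">q", 240),
                   "cdr:tower": b"TWR-A", "meta:source": b"switch-1"},
}


def scan(table: Dict[bytes, Row], start_row: bytes, stop_row: bytes,
         families: Optional[Set[str]] = None) -> Iterator[Tuple[bytes, Row]]:
    """Yield rows with start_row <= key < stop_row, in sorted key order,
    keeping only columns from the requested families (if any are given)."""
    for key in sorted(table):
        if key < start_row or key >= stop_row:
            continue
        row = table[key]
        if families is not None:
            row = {col: val for col, val in row.items()
                   if col.split(":", 1)[0] in families}
        yield key, row


# Per-column decoders play the role of a pluggable deserializer.
DESERIALIZERS: Dict[str, Callable[[bytes], Any]] = {
    # 8-byte big-endian signed integer
    "cdr:duration": lambda raw: struct.unpack(">q", raw)[0],
}


def deserialize(column: str, raw: bytes) -> Any:
    """Decode raw bytes with a registered decoder, defaulting to UTF-8 text."""
    decoder = DESERIALIZERS.get(column, lambda b: b.decode("utf-8"))
    return decoder(raw)


results = [
    (key.decode(), {col: deserialize(col, val) for col, val in row.items()})
    for key, row in scan(TABLE, b"call#0001", b"call#0003", {"cdr"})
]
for result in results:
    print(result)
```

Restricting the scan to one column family is what keeps very wide tables manageable, and the decoder table shows why a BI tool needs some deserialization hook before raw bytes become reportable values.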

Third, you can use a data integration process through Informatica or other ETL providers into a relational database that Jaspersoft software can then report on directly or perform analysis on through Jaspersoft’s in-memory database or OLAP engine.

To quote Karl Van den Bergh, Vice President, Product and Alliances at Jaspersoft: “you get direct, real-time, interactive access to data in Hadoop without the latency of Hive or the complexity of extracting Hadoop data into another database.” Beyond Hadoop, Jaspersoft provides native reporting to a variety of “NoSQL” / “NotJustSQL” / “NewSQL” key value stores, document databases, graph databases and data grid caches, as well as EMC Greenplum, HP Vertica and IBM Netezza MPP analytic databases. For more details, visit Jaspersoft.com/BigData.

Bottom Line

While some developers and their employers will still prefer to write custom Map/Reduce scripts, for many organizations a visual studio approach such as Informatica’s HParser offers benefits in time savings, re-use and adaptability. As more business intelligence software providers such as Jaspersoft enable native reporting to Hadoop, business analysts and other BI users can benefit from Hadoop without making the time commitment to understand the underlying complexities of Apache Software Foundation (ASF) HDFS, Map/Reduce and the related ASF projects and sub-projects.

