May 16, 2012

Big Data Refinery Fuels Next-Generation Data Architecture

Since joining Hortonworks at the beginning of the year, a question I’ve heard over and over again is “What is Apache Hadoop and what is it used for?”

There’s clearly a lot of hype [and confusion] in this emerging Big Data market, and it feels as if every technology, new and existing alike, is pushing the meme of “all your data are belong to us.” It is great to see the wave of innovation occurring across the landscape of SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just a few), but enterprises and the market in general could use a healthy dose of clarity on just how to use and interconnect these various technologies in ways that benefit the business.

In my post entitled “7 Key Drivers for the Big Data Market,” I asserted that the Big Data movement is not only about the classic world of transactions but also factors in the new(er) worlds of interactions and observations. This new world brings with it a wide range of multi-structured data sources that are forcing a new way of looking at things.

In order to make sense of this emerging space, I’ve created two graphics designed to walk through a vision of a next-generation data architecture. At the highest level, I describe three broad areas of data processing and outline how these areas interconnect.

The three areas are:

  1. Business Transactions & Interactions
  2. Business Intelligence & Analytics
  3. Big Data Refinery

The graphic below illustrates a vision for how these three types of systems can interconnect in ways aimed at deriving maximum value from all forms of data.

[Graphic: Apache Hadoop: Big Data Refinery]

Enterprise IT has been connecting systems via classic ETL processing, as illustrated in Step 1 above, for many years in order to deliver structured and repeatable analysis. In this step, the business determines the questions to ask and IT collects and structures the data needed to answer those questions.

The “Big Data Refinery”, as highlighted in Step 2, is a new system capable of storing, aggregating, and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business. The Big Data Refinery provides a cost-effective platform for unlocking the potential value within data and discovering the business questions worth answering with this data. A popular example of big data refining is processing Web logs, clickstreams, social interactions, social feeds, and other user-generated data sources into more accurate assessments of customer churn or more effective creation of personalized offers.
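To make the refining idea concrete, here is a minimal single-machine sketch in Python of the map-and-aggregate pattern that a Hadoop job would apply at scale. The log format, field names, and sample lines are hypothetical and chosen only for illustration; a real clickstream would be far larger and messier.

```python
from collections import Counter, defaultdict

# Hypothetical raw clickstream lines in the form "timestamp user_id url".
RAW_LOG = """\
2012-05-01T10:00:00 u1 /home
2012-05-01T10:00:05 u1 /products/42
2012-05-01T10:01:00 u2 /home
malformed line
2012-05-02T09:30:00 u1 /cart
"""


def map_line(line):
    """Map step: parse one raw line into (user_id, url), or None if malformed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    _timestamp, user_id, url = parts
    return user_id, url


def refine(raw_text):
    """Reduce step: aggregate page-view counts per user."""
    views = defaultdict(Counter)
    for line in raw_text.splitlines():
        record = map_line(line)
        if record is None:
            continue  # drop lines that do not match the assumed format
        user_id, url = record
        views[user_id][url] += 1
    # Convert to plain dicts so the refined output is easy to ship downstream.
    return {user: dict(counts) for user, counts in views.items()}


refined = refine(RAW_LOG)
```

The refined output is small and structured (one record per user), which is the kind of result that could then feed a churn model or an offer engine.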

More interestingly, there are businesses deriving value from processing large video, audio, and image files. Retail stores, for example, are leveraging in-store video feeds to help them better understand how customers navigate the aisles as they find and purchase products. Retailers that provide optimized shopping paths and intelligent product placement within their stores are able to drive more revenue for the business. In this case, while the video files may be big in size, the refined output of the analysis is typically small in size but potentially big in value.

The Big Data Refinery platform provides fertile ground for new types of tools and data processing workloads to emerge in support of rich multi-level data refinement solutions.

With that as backdrop, Step 3 takes the model further by showing how the Big Data Refinery interacts with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics. Interacting in this way opens up the ability for businesses to get a richer and more informed 360° view of customers, for example.

By directly integrating the Big Data Refinery with existing Business Intelligence & Analytics solutions that contain much of the transactional information for the business, companies can enhance their ability to more accurately understand the customer behaviors that lead to the transactions.

Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery. Complex analytics and calculations of key parameters can be performed in the refinery and flow downstream to fuel runtime models powering business applications with the goal of more accurately targeting customers with the best and most relevant offers, for example.

Since the Big Data Refinery is great at retaining large volumes of data for long periods of time, the model is completed with the feedback loops illustrated in Steps 4 and 5. Retaining the past 10 years of historical “Black Friday” retail data, for example, can benefit the business, especially if it’s blended with other data sources such as 10 years of weather data accessed from a third-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless if you have a platform that can do it cost-effectively and at scale.
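As a rough illustration of blending retained history with an outside source, the Python sketch below joins sales records to weather records on date. All record layouts and values here are invented for the example; they are not real retail or weather data.

```python
# Hypothetical retained sales history, one record per store per Black Friday.
black_friday_sales = [
    {"date": "2010-11-26", "store": "A", "revenue": 120000.0},
    {"date": "2011-11-25", "store": "A", "revenue": 135000.0},
]

# Hypothetical weather observations from a third-party provider.
weather = [
    {"date": "2010-11-26", "temp_c": 2.0, "precip_mm": 0.0},
    {"date": "2011-11-25", "temp_c": -1.0, "precip_mm": 12.5},
]


def blend(sales, weather_rows):
    """Join each sales record with the weather observed on the same date."""
    weather_by_date = {row["date"]: row for row in weather_rows}
    blended = []
    for sale in sales:
        observed = weather_by_date.get(sale["date"])
        if observed is None:
            continue  # no weather data for this date, so skip the record
        blended.append(
            {**sale, "temp_c": observed["temp_c"], "precip_mm": observed["precip_mm"]}
        )
    return blended


blended = blend(black_friday_sales, weather)
```

Each blended record now carries both revenue and weather, so an analyst could start asking whether cold or wet days correlate with store traffic.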

Let me conclude by describing how the various data processing technologies fit within this next-generation data architecture.

[Graphic: Next Generation Enterprise Data Architecture - Hortonworks]

In the graphic above, Apache Hadoop acts as the Big Data Refinery. It’s great at storing, aggregating, and transforming multi-structured data into more useful and valuable formats.

Apache Hive is a Hadoop-related component that fits within the Business Intelligence & Analytics category since it is commonly used for querying and analyzing data within Hadoop in a SQL-like manner. Apache Hadoop can also be integrated with other EDW, MPP, and NewSQL components such as Teradata, Aster Data, HP Vertica, IBM Netezza, EMC Greenplum, SAP HANA, Microsoft SQL Server PDW and many others.

Apache HBase is a Hadoop-related NoSQL Key/Value store that is commonly used for building highly responsive next-generation applications. Apache Hadoop can also be integrated with other SQL, NoSQL, and NewSQL technologies such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2, MongoDB, DynamoDB, MarkLogic, Riak, Redis, Neo4j, Terracotta, GemFire, SQLFire, VoltDB and many others.

Finally, data movement and integration technologies help ensure data flows seamlessly between the systems in the above diagrams; the lines in the graphic are powered by technologies such as WebHDFS, Apache HCatalog, Apache Sqoop, Talend Open Studio for Big Data, Informatica, Pentaho, SnapLogic, Splunk, Attunity and many others.
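To give a feel for one of those lines in the graphic, here is a small Python sketch that builds a WebHDFS REST URL. WebHDFS exposes HDFS over HTTP under the `/webhdfs/v1` path prefix with an `op` query parameter (for example `OPEN` or `LISTSTATUS`). The host name, port, file path, and user name below are assumptions for illustration only, and the helper `webhdfs_url` is a hypothetical name, not part of any Hadoop library.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, op, **params):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    query = urlencode({"op": op, **params})
    # quote() leaves "/" intact, so the HDFS path keeps its directory structure.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical cluster details; substitute the values for a real NameNode.
url = webhdfs_url(
    "namenode.example.com",
    50070,
    "/data/refined/part-00000",
    "OPEN",
    **{"user.name": "etl"},
)
```

A downstream system could issue an HTTP GET against such a URL to pull refined output out of Hadoop without a native HDFS client.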

Key Takeaway

A next-generation data architecture is emerging that connects the classic systems powering Business Transactions & Interactions and Business Intelligence & Analytics with Apache Hadoop, a “Big Data Refinery” capable of storing, aggregating, and transforming multi-structured raw data sources into usable formats that help fuel new insights for the business.

Enterprises that can maximize the value from all of their data (i.e. transactions, interactions, and observations) will put themselves in a position to drive more business, enhance productivity, or discover new and lucrative business opportunities.

~ Shaun Connolly



Tony Baer says:

If you think about it, the Data Refinery concept is a logical follow-up to the original idea of data warehousing, which was to gather lots of data and progressively transform it to make it consumable for reporting & analytics. More to the point, a logical follow-up on steroids, given the volume and variety of data.

Of course what we do with the data afterwards marks the real differentiator, as we get more into explore and learning instead of the known-questions mode. But that’s a whole other can of worms.

Shaun Connolly says:

Thanks for the comments Tony.

One of the goals for my graphic was to provide a bit of a visual “you are here” to the wide variety of use case discussions I’m seeing in/around Hadoop.

For example, I used the graphic at MarkLogic World 2 weeks ago. It helped frame up the 3 use cases the MarkLogic folks are seeing between MarkLogic and Hadoop:

1. Pre-processing for real-time analytics

2. Progressive enhancement (including initial pre-proc and ongoing feedback loop)

(1 & 2 reinforce the point you made)

3. Bulk loading of XML files (parallel in/out of Hadoop gets data into MarkLogic faster than direct ingestion, larger cluster = faster). This has the added benefit of enabling some cleansing and normalization to occur before shipping downstream.

Agree with your “get more into explore and learning” point; sifting thru all of the cans of worms to enable that to happen in a way that’s adoptable by enterprises needs to be the goal.

Amos Ferrari says:

I found this refinery concept very good and have expanded it to ‘Big Data Services’ as part of this refinery system.
Will value your comments on what I have posted.

