Enterprise Hadoop and the Journey to a Data Lake
If there’s one thing my interactions with our customers have taught me, it’s that Apache Hadoop didn’t disrupt the datacenter; the data did. The explosion of new types of data in recent years has put tremendous pressure on the datacenter, both technically and financially, and an architectural shift is underway in which Enterprise Hadoop plays a key role in the resulting modern data architecture.
The successful Hadoop journey typically starts with new analytic applications, which lead to a Data Lake. As more and more applications derive value from new types of data (sensors and machines, server logs, clickstreams, and other sources), the Data Lake forms, with Hadoop acting as a shared service that delivers deep insight across a large, broad, diverse set of data at efficient scale. Existing enterprise systems and tools can integrate with and complement the Data Lake along the way.
Hadoop and Your Existing Data Systems: A Modern Data Architecture
Using Hadoop as a complement to existing data systems is compelling because it provides a low-cost, scale-out approach to data storage and processing, and it is proven to scale to the needs of the very largest web properties in the world.
Hadoop and the Value of “Schema on Read”
Unlike traditional relational databases, which require data to be transformed into a specified structure (or schema) before it can even be loaded, Hadoop focuses on storing data in its raw format. Analysts and developers can then apply structure to suit the needs of their applications at the time they access the data. The traditional “Schema on Write” approach requires a lot more forethought and IT involvement, whereas Hadoop’s “Schema on Read” approach empowers users to quickly store data in any format and apply structure in a flexible and agile way whenever needed.
For example, assume an existing application combines CRM data with clickstream data to obtain a single view of a customer interaction. As new types of data that might be relevant become available (e.g. server log or sentiment data), that data can easily be added to enrich the view of the customer. The key distinction is that at the time the data was stored, it was NOT necessary to declare its structure or its association with any particular application.
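To make the idea concrete, here is a minimal sketch of “Schema on Read” in plain Python. It is an illustration only, not Hadoop code: the in-memory list `RAW_STORE` stands in for raw files in the cluster, and the record fields and the `customer_view` helper are hypothetical names invented for this example.

```python
import json

# Raw records are kept exactly as they arrived. No schema was declared when
# they were stored, and no record is tied to a particular application.
# (Hypothetical sample data standing in for raw files in Hadoop.)
RAW_STORE = [
    '{"customer_id": "c1", "source": "crm", "name": "Ada"}',
    '{"customer_id": "c1", "source": "clickstream", "page": "/pricing"}',
    '{"customer_id": "c1", "source": "server_log", "status": 500}',
    '{"customer_id": "c2", "source": "clickstream", "page": "/home"}',
]


def customer_view(raw_lines, customer_id, sources):
    """Apply structure at read time: parse each raw line and keep only the
    records for one customer from the sources this application cares about."""
    view = {}
    for line in raw_lines:
        record = json.loads(line)  # structure is imposed here, on read
        if record.get("customer_id") != customer_id:
            continue
        if record.get("source") in sources:
            view.setdefault(record["source"], []).append(record)
    return view


# The original application combines CRM and clickstream data.
original_view = customer_view(RAW_STORE, "c1", {"crm", "clickstream"})

# Enriching the view with server logs needs no reload or migration of the
# stored data; only the read-time definition changes.
enriched_view = customer_view(
    RAW_STORE, "c1", {"crm", "clickstream", "server_log"}
)
```

In this sketch the stored data never changes between the two calls. The enriched view comes purely from asking a different question of the same raw records.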
The Hadoop Journey Typically Starts With New Analytic Applications…
Hadoop usage typically begins by creating new analytic applications fueled by data that was not previously being captured. While the applications tend to be unique to specific industries or organizations, there are many similarities across these applications when viewed through the lens of the specific types of data involved.
Examples of analytics applications across industries include:
…Which Lead to the Data Lake
With the continued growth in scope and scale of applications using Hadoop and other data sources, the vision of an enterprise Data Lake starts to materialize. Combining data from multiple silos, including internal and external data sources, helps your organization find answers to complex questions that no one previously knew how to ask.
For example, a large U.S. home improvement retailer stored data on 100 million customer interactions per year across isolated silos, which prevented the company from correlating transactional data with its various marketing campaigns and online customer browsing behavior. What this large retailer needed was a “golden record” that unified customer data across all time periods and across all channels, including point-of-sale transactions, home delivery and website traffic.
By making the golden record a reality, the Data Lake delivers key insights for highly targeted marketing campaigns including customized coupons, promotions and emails. And by supporting multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set, it enables users to transform and view data in multiple ways (across various schemas) and deploy closed-loop analytics applications that bring time-to-insight closer to real time than ever before.
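As a rough illustration of what a “golden record” means, the sketch below folds events from several channels into one record per customer. It is a toy example under stated assumptions, not the retailer’s actual implementation: the `EVENTS` sample data, the field names, and the `build_golden_records` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical events that would otherwise sit in separate silos.
# Dates are ISO-8601 strings, so plain string comparison orders them correctly.
EVENTS = [
    {"customer_id": "c1", "channel": "point_of_sale", "date": "2013-03-02"},
    {"customer_id": "c1", "channel": "website", "date": "2013-01-15"},
    {"customer_id": "c2", "channel": "home_delivery", "date": "2013-02-20"},
    {"customer_id": "c1", "channel": "home_delivery", "date": "2013-04-11"},
]


def build_golden_records(events):
    """Unify events across all channels and time periods into a single
    record per customer."""
    golden = {}
    for event in events:
        record = golden.setdefault(
            event["customer_id"],
            {
                "channels": defaultdict(int),  # interaction count per channel
                "first_seen": event["date"],
                "last_seen": event["date"],
            },
        )
        record["channels"][event["channel"]] += 1
        record["first_seen"] = min(record["first_seen"], event["date"])
        record["last_seen"] = max(record["last_seen"], event["date"])
    return golden


golden = build_golden_records(EVENTS)
```

Here `golden["c1"]` spans three channels and the full date range of that customer’s activity, which is the kind of unified view a targeted campaign would start from.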
In a practical sense, a Data Lake is characterized by three key attributes:
- Collect everything. A Data Lake contains all data: raw sources retained over extended periods of time as well as any processed data.
- Dive in anywhere. A Data Lake enables users across multiple business units to refine, explore and enrich data on their terms.
- Flexible access. A Data Lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
As data continues to grow exponentially, Enterprise Hadoop investments can provide a strategy for both efficiency in a modern data architecture and opportunity in an enterprise Data Lake.
The end result? Maximum scale and insight with the lowest possible friction and cost.
Think Pigabyte, Not Petabyte
Implementing Hadoop as part of a modern data architecture is a substantial decision for any enterprise and, as discussed, its adoption is typically a journey from single-instance applications to a fully fledged Data Lake. My photo with Pigabyte, a porcelain pig I met in Bath, UK a few years ago, reminds me that the journey is NOT about assembling petabytes of data. It’s about encouraging people to combine sufficient new types of data with existing data sources, and not only allowing but enabling them to wallow in that data in ways that creatively unlock the value within. Yep, you heard me right: allow your inner child to come out and don’t be afraid to get dirty!
To learn more about Hadoop, the Data Lake, and the Modern Data Architecture, I encourage you to download our whitepaper.