October 14, 2016

Jumpstart Your Digital Transformation with Hadoop Native SQL Powered by Apache HAWQ

Guest author: Jeff Kelly, Data Strategist, Pivotal

The phrase “digital transformation” gets bandied about a lot these days, but what exactly does it mean? When you strip away the hyperbole, I believe digital transformation is the process by which enterprises move from using traditional information technology merely to support existing business models to adopting modern and emerging technologies and processes that enable new, innovative business models that would not otherwise be possible. Further, these new business models, regardless of market or industry, emphasize delighting customers through personalized software-based experiences.

Any digital transformation, therefore, must include learning to use data to personalize the customer experience. For most enterprises this is a new skill, made up of a number of asynchronous steps and related technologies. Traditionally, enterprises collect data from various transactional systems, transform the data into a common format, and load it into a data warehouse for reporting and analytics. These reports and analytics are largely backward-looking, providing an aggregate view of an enterprise’s operational and financial performance.

Using data to understand past business performance is important, not least for meeting regulatory reporting requirements, but it is not sufficient to support the types of business models required to compete in today’s customer-centric economy. Customers today expect personalized software-based experiences from the companies and organizations they regularly interact with, regardless of industry. To deliver those experiences, enterprises need to adopt new data storage, processing and analytics technologies that support iterative data science and machine learning at scale.

Hadoop Native SQL Powered by Apache HAWQ
Hortonworks HDB, based on Apache HAWQ (incubating), is a Hadoop native SQL database developed precisely to provide enterprise data engineers, data scientists and developers with these capabilities. It combines the merits of Pivotal Greenplum, the leading massively parallel processing (MPP) analytical database with 10+ years of R&D investment behind it, with the strengths of Apache Hadoop, namely cost-effective distributed storage and powerful scale-out data processing.

Like Greenplum, HDB boasts a number of characteristics that make it ideal for data science at scale. These include:

  • Performance. HDB’s MPP, shared-nothing architecture, including dynamic pipelining and cost-based query optimization, returns analytical query results at the “speed of thought.” This level of performance is critical to support iterative, exploratory analytics, the results of which help data scientists identify unforeseen but potentially valuable deep statistical relationships in their data. HDB allows users to fail fast and ask more questions of their data, leading to faster time to insight.
  • Scale. HDB supports “speed of thought” query results even at Big Data scale. With HDB, data scientists are able to run predictive models and machine learning algorithms in-database, taking advantage of its MPP architecture to analyze hundreds of terabytes of data. HDB supports PL/R, PL/Python and Apache MADlib (incubating), a SQL-based open source machine learning library for scalable in-database analytics.
  • Accessibility. We all know data scientists and Hadoop experts are in short supply. With HDB, business analysts and even savvy business users can now interrogate data stored in Hadoop thanks to robust ANSI SQL compliance. HDB complies with the ANSI SQL-92, SQL-99 and SQL-2003 standards and adds OLAP extensions that enable complex queries and joins, including roll-ups and nested queries. Together with the Apache MADlib support noted above, this makes Hadoop accessible to a much wider user base than previously possible.
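
To make the "roll-ups and nested queries" mentioned above concrete, here is a minimal sketch of both query shapes. It is an illustration only: an in-memory SQLite database stands in for an HDB connection, the `sales` table and its rows are invented, and the roll-up is emulated with UNION ALL because SQLite has no GROUP BY ROLLUP. In a dialect with OLAP extensions, the same subtotal-plus-grand-total result would typically be written with `GROUP BY ROLLUP (region)`.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for an HDB connection.
# The table name, columns and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "widget", 100.0),
        ("east", "gadget", 50.0),
        ("west", "widget", 70.0),
        ("west", "gadget", 30.0),
    ],
)

# Nested query: regions whose total exceeds the average regional total.
nested_sql = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
HAVING SUM(amount) > (
    SELECT AVG(region_total)
    FROM (SELECT SUM(amount) AS region_total
          FROM sales
          GROUP BY region) AS t
)
"""
above_average = conn.execute(nested_sql).fetchall()

# Roll-up: per-region subtotals (level 0) followed by a grand total (level 1).
# Emulated with UNION ALL here; SQLite lacks GROUP BY ROLLUP.
rollup_sql = """
SELECT 0 AS level, region, SUM(amount) AS total FROM sales GROUP BY region
UNION ALL
SELECT 1, NULL, SUM(amount) FROM sales
ORDER BY level, region
"""
rollup = conn.execute(rollup_sql).fetchall()

print(above_average)
print(rollup)
```

The point of the sketch is that both questions are expressed in ordinary SQL, which is what lets analysts who already know SQL work against data stored in Hadoop.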

Individually, each of these characteristics is impressive. Taken together, they make HDB best in class, and benchmark data bears this out. In native TPC-DS tests at 15 TB covering data loading and query response times with one, five and 10 concurrent users, HDB performs 143% faster, on average, than Cloudera Impala. In addition, Impala supports only 74 of the 99 standard benchmark queries, while HDB supports all 99. Impala also lacks significant SQL feature support, including something as basic as handling a comment line at the end of a SQL file. In contrast, HDB 2.0, released earlier this year, further expands SQL functionality and adds three-tier resource management and elastic query execution, among other improvements, further solidifying HDB as the clear leader in Hadoop native SQL analytics.

Hortonworks HDB is the result of a strategic partnership between Pivotal, the company accelerating digital transformation for enterprises, and Hortonworks, the leader in Connected Data Platforms. HDB runs seamlessly on the Hortonworks Data Platform, where it is managed via Ambari just like any other native Hadoop service. With Hortonworks HDB, interactive SQL-based machine learning and data science at scale truly is a first-class Hadoop citizen.

HDB’s Role in Supporting Digital Transformation
It is important to take a step back to understand the larger process of developing personalized software-based customer experiences and where HDB and the data science capabilities it enables fit.

The process starts with (1) collecting, storing and processing huge volumes of data with scale out systems. Next comes (2) analyzing the data and building predictive models. Then (3) the predictive models must be wrapped in APIs and (4) data pipelines must be adapted to execute the model scoring process. Finally, (5) the results of the model scoring process – be they personalized suggestions, estimates or recommendations – must be surfaced to applications in forms that are useful for and actionable by users.
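
The five steps above can be sketched end to end in a few lines. This is a toy illustration under loud assumptions, not how HDB or any particular product implements the process: the purchase events are invented, the "model" is a simple item-popularity count rather than a real predictive model, and every function name is hypothetical.

```python
from collections import Counter

# Step 1 (collect): hypothetical purchase events. In practice these would be
# large volumes of data stored in a scale-out system such as Hadoop.
purchases = [
    ("alice", "tent"), ("alice", "stove"),
    ("bob", "tent"), ("bob", "lantern"),
    ("carol", "tent"), ("carol", "stove"), ("carol", "lantern"),
    ("dave", "stove"),
    ("erin", "tent"),
]


def build_model(events):
    """Step 2 (analyze): a trivial stand-in model that counts item popularity."""
    return Counter(item for _, item in events)


def recommend(model, owned, k=1):
    """Step 3 (wrap): a callable interface, standing in for an API endpoint."""
    ranked = [item for item, _ in model.most_common() if item not in owned]
    return ranked[:k]


def score_all(model, events):
    """Step 4 (pipeline): run the scoring function for every customer."""
    owned = {}
    for customer, item in events:
        owned.setdefault(customer, set()).add(item)
    return {customer: recommend(model, items) for customer, items in owned.items()}


# Step 5 (surface): the per-customer suggestions an application would display.
suggestions = score_all(build_model(purchases), purchases)
for customer, items in sorted(suggestions.items()):
    print(customer, items)
```

Each function maps to one numbered step, so a real model or a real serving layer could replace its stand-in without changing the overall shape of the flow.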

There are, of course, a number of technologies and practices that play important roles in this process, including agile software development and microservices-based architectures. HDB’s role is to support steps one and two in the process. As a native Hadoop service, HDB is able to access and analyze huge volumes of data stored in Hadoop, be it in HDFS or in another store such as HBase. And its in-database analytic capabilities and support for machine learning libraries like Apache MADlib enable data scientists to build the predictive models that are ultimately operationalized to create the suggestions, estimates and recommendations at the heart of personalized software-based experiences.

Without iterative data science capabilities and machine learning at scale, true digital transformation wouldn’t be possible. Even the most elegant, intuitive and user-friendly applications are only as useful as the data and insights that feed them. From retailers and banks to healthcare providers and transportation companies, providing customers with personalized software-based experiences is now key to competitive differentiation. That’s why Pivotal and Hortonworks are so excited to bring HDB to the market and to your enterprise.

To see HDB in action, don’t miss this upcoming webinar on Oct. 24 with HDB user Molina Healthcare. Ben Gordon, Molina’s vice president for enterprise infrastructure services, joins Pivotal’s Dormain Drewitz to talk about how HDB and the Hortonworks Data Platform are helping Molina undertake their own digital transformation. Register here.
