Posts by Jim Walker:


Streaming IN Hadoop: Yahoo! Releases Storm-YARN

Over the past year, customers have told us they want to store all their data in one place and interact with it in multiple ways… they want to use Hadoop, but to do so, it needs to extend beyond batch. It also needs to be interactive and real-time, among other things.

This is the entire principle behind YARN, which Arun Murthy and the team at Hortonworks, together with others in the community, have been working on for more than 5 years! The YARN-based architecture of Hadoop 2.0 is hugely significant and we have been working closely with many partners to incorporate it into their applications.

Storm-YARN Released as Open Source

Yahoo! has been testing Hadoop 2 and its YARN-based architecture for quite some time. All the while, they have worked on the convergence of the streaming framework Storm with Hadoop. This work has resulted in a YARN-based version of Storm that will radically improve performance and resource management for streaming.

We borrow from their blog post because they say it best…

Collocating real-time processing with batch processing offers a number of advantages over segregated clusters.

  • It provides a huge potential for elasticity. Real-time processing will rarely produce a constant and predictable load. As such, Storm needs more resources to keep up with spikes in demand. Collocating Storm with batch processing allows Storm to steal resources from batch jobs when needed and give them back when demand subsides. The Storm-YARN effort lays the groundwork to make this possible.
  • Many applications use Storm for low-latency processing and Map/Reduce for batch processing while sharing data between Storm and Map/Reduce. By placing Storm physically closer to the data source and/or other components in the same pipeline we can reduce network transfers and in turn the total cost of acquiring the data.

YARN as the basis of Hadoop 2.0 Architecture

We are excited about this development because it reinforces our approach of enabling the broader ecosystem of Hadoop-based applications, and our belief that an open community is the fastest path to this innovation. It is amazing to watch the pace of innovation that is occurring, and we know we are still in the very early days of this evolution of technologies around Hadoop to meet the needs of the broad enterprise.

We are also excited about Storm-YARN as it is yet another application to move IN Hadoop. We now have SQL-IN-Hadoop for interactive queries with Stinger / Tez, Continuuity and WEAVE, and Storm-IN-Hadoop for streaming! We look forward to a summer full of innovation around YARN.

Hadoop Tooling with Talend Open Studio for Big Data and Hortonworks Data Platform

Talend Open Studio for Big Data provides an intuitive set of tools that make dealing with data in the Hadoop world (and Hortonworks Data Platform in particular) a lot easier. We often use the tools to speed delivery of a proof of concept or to operationalize movement of data from sources like web logs and machine sensors into HDFS. It is simple to use and typically takes only minutes to perform something that once took hours in a script.

Recently, Talend launched Talend Open Studio for Big Data version 5.3. It is a substantial upgrade and provides some pretty cool tools. The component I look forward to playing with is tPigMap, which allows you to graphically create data transforms and have the underlying Apache Pig scripts written for you. Talk about simplicity!

If you’re using the Hortonworks Sandbox to experiment with Hadoop, then we’ve written a How To that shows how you can connect Talend Open Studio for Big Data to Sandbox.

You can download the Hortonworks Sandbox here, and download Talend Open Studio for Big Data here.

Great tools to get more productive with your Hadoop development. Go for it!

Big Data Defined – Part Deux: Value Definition

A few weeks back we posted a definition of “big data”. There was definitely some internal conversation about the term and whether this definition had captured what it means. The upshot: it is a loaded term. It means a lot of different things to a lot of different people.

When I first joined Hortonworks, I bought in to the three V’s (volume, velocity and variety) definition of big data. It works for the most part, but it is more a descriptor of the data. It explains the characteristics of the data. The definition is cold and lacks soul. After all, “big data” represents the promise of “big” business value.

A “Value” Definition of Big Data

Last year, Shaun Connolly, Hortonworks VP of Corporate Strategy, came up with this definition…
Big Data = Transactions + Interactions + Observations.

I gravitate to this because it outlines WHAT the data is, not just its characteristics. It points to areas that we should focus on as businesses. It speaks to the value a bit more. Each of the three components is important.

  • Transactions are pretty simple to understand.  This is our ERP data.  It is the data that we maintain and track in our OLTP systems.  It can be any record of any system-to-system or human-to-system interaction.  It can even be a human-to-human interaction as long as it is captured electronically. We use a lot of this data in our analytics today.
  • Interactions are the points in time when we interact with a system. It could be a tweet or a Facebook post. It could be an electronic or paper customer satisfaction survey. Interactions are web logs and A/B tests. We have a lot of this data but typically no efficient way to understand or extract value from it.
  • Observations are interesting because they represent a world of net new data sources that we once never thought of analyzing.  It is data that was once thought of as low to medium value data or even exhaust data that was too bulky and just too expensive to store. This can be machine-generated data from sensors or web logs and clickstreams or even audio/video or largely unstructured content.  Typically, we never even thought of this data before.

The Intersection Is Where Things Get Interesting

This “value” definition of big data gets interesting when you substitute the plus signs in Shaun’s definition with intersections…
Big Data = Transactions ∩ Interactions ∩ Observations.

With big data technology (one of these being Apache Hadoop) we can now efficiently store and process all of this data. We can refine observation data down to the salient details that may be interesting in the context of our EDW. But even more interesting, we can ask these big data systems new questions. We can combine data across all these types and come up with new value for organizations. There is a world of data in our organizations that is used for an explicit purpose. When we start to combine things, the big data world gets really interesting.
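To make the intersection idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the customer ids, the sample values and the helper names `common_keys` and `combined_view` are invented, and plain dictionaries stand in for what would be much larger data sets in Hadoop.

```python
# All records below are invented sample data keyed by a hypothetical customer id.
transactions = {"c1": 120.0, "c2": 45.5, "c3": 300.0}          # order totals
interactions = {"c1": "support tweet", "c3": "survey reply"}   # latest contact
observations = {"c1": 14, "c2": 3, "c3": 27}                   # clickstream page views


def common_keys(*sources):
    """Return the sorted keys that appear in every source (a set intersection)."""
    keys = set(sources[0])
    for source in sources[1:]:
        keys &= set(source)
    return sorted(keys)


def combined_view(customer):
    """Join the three record types for one customer into a single row."""
    return {
        "customer": customer,
        "order_total": transactions[customer],
        "last_contact": interactions[customer],
        "page_views": observations[customer],
    }


# Only customers present in all three sources can be analyzed across them.
for customer in common_keys(transactions, interactions, observations):
    print(combined_view(customer))
```

The point of the sketch is simply that the combined row answers questions none of the three sources could answer alone, such as how browsing behavior relates to spend for customers who also contacted us.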

If you’re using Hadoop to create value from your big data, why not check out our Hadoop Patterns of Use whitepaper and see how it can work for you.

Field Report: OpenStack Summit – The Hadoop Bizarro World

PORTLAND – The Rose City is a great place and this week it got even more interesting with the OpenStack Summit in town. I am more a data geek and very rarely do I venture down the stack into infrastructure, but wow, there is something cool going on with the OpenStack community. I couldn’t help but get wrapped up in the excitement. Not only was the enthusiasm palpable, it was also very familiar. I don’t know if it was the organic buzz of Portland or not, but I felt a little like I was in Hadoop bizarro world.

Hadoop on OpenStack

Hortonworks was the only “app” vendor on the show floor and our story was well received. When you partner with the leading code contributor (Red Hat) and the leading system integrator (Mirantis) and have existing relationships with the founders (Rackspace) of OpenStack, you get some relative street cred. But honestly, the attendees I spoke with were incredibly happy to see us at the event because they saw that our joining the community was about contributing serious code and Hadoop experience to Project Savanna. This is characteristic of a vibrant community of developers.

It really didn’t take a lot of explaining to open the eyes of the audience to the reality that “Hadoop is the Perfect App for OpenStack”. These guys and gals get it. They are looking for the right application to drive adoption of OpenStack, and Hadoop, with its new workloads for the enterprise, fits the bill. We look forward to seeing some crossover audience at Hadoop Summit when we roll out the first wave of our efforts by demonstrating the ease of deployment of Hadoop on OpenStack via the new Savanna project.

We were pretty busy on the show floor and were also invited by our friends at theCube (@furrier & @jefffrick) to speak about Savanna and how Hadoop is good for OpenStack. The video and corresponding article were great coverage. Also, among a range of other press outlets picking up the story, The Register had a great summary of Project Savanna from the show floor.

Socialism v Capitalism

Being an Apache guy, I was curious as to how the OpenStack community is governed. With all these vendors in the building, it seemed there were a lot of powerful players involved. Who is in charge? I had a few conversations about this and it seemed to me that there is a healthy democracy with some very powerful parties and lobbyists involved. It sounded a bit like capitalism to me, which led me to a comparison with Apache…. Perhaps we are Socialism and OpenStack is Capitalism. ;)

I met and spoke to a few of the committee members for OpenStack, including Devin Carlen (Nebula) and Josh McKenty (@jmckenty & PistonCloud).  Both are founders of OpenStack, founders of companies and have contributed significantly to the project.  They were amused by my theory.

OpenStack Summit Growth: Enter Sales and Marketing into the Community

The show has historically been mostly a “real” summit where developers got together to discuss, design and code.  There is still a lot of that going on, but the influx of “business” was overwhelming.  The growth of the show demonstrates the importance of the project. To quote Rackspace, “Between OpenStack’s Folsom and Grizzly releases, OpenStack experienced a more than 50 percent growth in contributions. According to some of the businesses closest to the project, OpenStack isn’t just about writing code; it’s about creating an infrastructure everyone can use. It’s about creating something amazing.”  Enter business.

With some help from Chris Horne (@fpcguru) at CloudScaling and Fresh Perspective Consulting, I was able to analyze (no data science here, just marketing guy stuff) the attendee list. Out of 3000 registered, I would say close to one third were from the leading vendors in this space. This seems to be a pretty good mix for the show (and the community for that matter) and shows a vibrant range of adoption beyond the large players. There are some big names involved, and we can only expect that the countdown has started and OpenStack is set to take off.

The Third Coming

One of my most interesting conversations this week was with a financial analyst at the show who characterized OpenStack as “The Coming of The Third Generation of IT”. (Oh, I forgot to mention that financial analysts were all over the show as well. It seems everyone wants to know who this helps or hurts and which small company is gonna crush it.) This led me to explore what exactly gen 1 and gen 2 were. Perhaps the old world of mainframes and PCs in the 70s, 80s and early 90s was the first generation of IT. That team was a group of pencil-protected, flannel-shirt-wearing guys with big glasses who walked around with disks and screwdrivers. In the mid-nineties, we shifted into the second generation with client-server and the Internet. Data centers grew up and a shift towards SaaS started.

Today, the third generation is becoming reality. The Cloud hype over the past few years provided us with PaaS and now, with OpenStack, we may really see widespread adoption of IaaS. We know one thing: to fuel adoption of OpenStack and this new infrastructure, an application must come along that drives it. Funny enough, at the same time, Hadoop has established itself as the driver of net new workloads in an organization. This is the exact greenfield opportunity for the OpenStack enthusiast to help drive adoption. Hadoop is the Perfect App for OpenStack in this “Third Generation of IT”.

HP Moonshot: Big Potential for Big Data & Hadoop

While we are still quite a long way from hearing “Houston, tranquility base here… the eagle has landed”, the HP Moonshot is definitely pushing us all toward a new class of infrastructure to run workloads like Apache Hadoop more efficiently. Hortonworks applauds the development of flexible Big Data appliances like Moonshot. We are excited about this development as it signals alignment across development, operations and infrastructure within organizations. For quite some time, our team has been accustomed to the natural balance required across these three constituents, and now the server market is joining in on the game.

We agree with our friend Jeff Kelly at Wikibon, who points to “Big Data as one example of a workload that requires a lot of low level optimization. One of the main reasons is that Hadoop clusters are scaled over time in response to increased usage, and factors like power efficiency and the physical footprint of servers become major considerations as the environment grows in size.”

Wait!  1800 Nodes in a Single Chassis?

Did I just hear that Moonshot can enable up to 1800 nodes in a single chassis? Wow! Sounds like physical resource optimization to go along with the optimizations for compute and storage provided in Apache Hadoop. To quote one of our Hortonworkers, “this is awesomeness on a stick”. Moonshot seems to be forward-looking as well. It will eventually lead to further price/power/utilization optimizations as the price of SSD drops and I/O becomes more widely deployed against flash. The HP Moonshot approach is interesting: sleds for servers and sleds for disk enable completely new server and rack configurations to be optimized for Hadoop. We are looking forward to getting our hands on it.

Ultimately, Hadoop workloads are somewhat unique, and we are intrigued, to say the least, by where the future can go with the HP Moonshot approach.

Hive/HCatalog – Data Geeks & Big Data Glue

Unstructured data, semi-structured data, structured data… it is all very interesting and we are in conversations about big and small versions of each of these data types every day. We love it…  we are data geeks at Hortonworks. We passionately understand that if you want to use any piece of data for some computation, there needs to be some layer of metadata and structure to interact with it.  Within Hadoop, this critical metadata service is provided by HCatalog.

As a key component of Apache Hive, HCatalog is a metadata and table management system for the broader Hadoop platform. It enables the storage of data in any format regardless of structure. Hadoop can then process both structured and unstructured data and it can store and share information about data’s structure in HCatalog. This capability combined with the ‘schema on read’ nature of Hadoop versus traditional EDW ‘schema on write’ reduces cycle time for data scientists seeking insight as it encourages exploration and discovery on a continuous basis.
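The ‘schema on read’ idea can be shown in a few lines of Python. This is only a sketch under stated assumptions: the sample records, the field names and the `read_with_schema` helper are invented for illustration and are not part of Hive or HCatalog. The raw lines are stored untouched, and each reader applies its own structure when it reads.

```python
import json

# Raw records are stored exactly as they arrived; no structure is enforced on write.
raw_store = [
    '{"user": "ann", "clicks": "3"}',
    '{"user": "bob", "clicks": "7", "referrer": "search"}',
]


def read_with_schema(raw_lines, schema):
    """Parse each raw record and project and cast its fields at read time.

    `schema` maps a field name to a callable that casts the stored value.
    Fields missing from a record come back as None.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        })
    return rows


# Two different questions, two different schemas, one unchanged copy of the data.
clicks_schema = {"user": str, "clicks": int}
referrer_schema = {"user": str, "referrer": str}

print(read_with_schema(raw_store, clicks_schema))
print(read_with_schema(raw_store, referrer_schema))
```

With ‘schema on write’, the second question would have required changing the table and reloading the data before anyone could ask it; here the analyst only writes a new schema.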

Similarly, Hive/HCatalog also enables sharing of data structure with external systems including traditional data management tools. It is the glue that enables these systems to interact effectively and efficiently and is a key component in helping Hadoop fit into the enterprise.

SQL Interface for Hadoop? HCatalog as enabler…

Since 2008, Hive has reigned as the de facto SQL interface for Hadoop, as it provides a relational view of data within Hadoop through a SQL-like language. HCatalog publishes this same interface but abstracts it for data beyond Hive. It also publishes a REST interface for external use so that your existing tools can interact with Hadoop in the way you expect… via ODBC and JDBC into SQL!

Good for the ecosystem is good for you

HCatalog is intended to open up more general SQL interaction with Hadoop to the ecosystem. Our partners are building dedicated interfaces on top of this key interaction point to drive a Hadoop strategy within their products. For instance, Teradata has created SQL-H on top of HCatalog as their default interface to Hadoop, enabling their users to query across this big data resource from existing tools. So now, as the performance enhancements to Hive through the Stinger initiative progress, their tools get better and better.

Hadoop Developer productivity and HCatalog

HCatalog also allows developers to share data and metadata across internal Hadoop tools such as Hive, Pig, and MapReduce. It allows them to create applications without being concerned how or where the data is stored, and insulates users from schema and storage format changes.  It is a repository for schema that can be referred to in these programming models so that you don’t have to explicitly type your structures in each program. It provides a command line tool for users who do not use Hive to operate on the metastore with Hive DDL statements.  It also provides a notification service so that workflow tools, such as Oozie, can be notified when new data becomes available in the warehouse.
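The sharing and notification behavior described above can be pictured with a toy model. The sketch below is an assumption-laden illustration in Python, not the HCatalog API: `ToyCatalog`, its methods, the `weblogs` table and the partition name are all hypothetical stand-ins for a shared metastore.

```python
class ToyCatalog:
    """In-memory stand-in for a shared metastore (hypothetical, not the HCatalog API)."""

    def __init__(self):
        self._tables = {}
        self._listeners = []

    def register_table(self, name, columns, location):
        """Record a table's schema and storage location once, for every tool."""
        self._tables[name] = {
            "columns": list(columns),
            "location": location,
            "partitions": [],
        }

    def describe(self, name):
        """Return the stored definition so callers need not redeclare the schema."""
        return self._tables[name]

    def subscribe(self, callback):
        """Register a callback to be told when new data arrives."""
        self._listeners.append(callback)

    def add_partition(self, name, partition):
        """Add new data to a table and notify every subscriber."""
        self._tables[name]["partitions"].append(partition)
        for callback in self._listeners:
            callback(name, partition)


catalog = ToyCatalog()

# An ETL job (think Pig or MapReduce) publishes the table definition once...
catalog.register_table("weblogs", ["ip", "url", "ts"], "/data/weblogs")

# ...a workflow tool (think Oozie) asks to be notified about new data...
events = []
catalog.subscribe(lambda table, partition: events.append((table, partition)))

# ...and a query tool (think Hive) reads the same definition without retyping it.
catalog.add_partition("weblogs", "dt=2013-04-29")
print(catalog.describe("weblogs")["columns"])
print(events)
```

The design choice worth noticing is that neither the query tool nor the workflow tool knows where or how the ETL job stored the data; both depend only on the catalog.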

HCatalog in Use

So how might you use HCatalog? Organizations today are using HCatalog in a variety of ways; the key uses can be summarized as follows:

  • Enabling the Right Tool for the Right Job
    The majority of heavy Hadoop users do not use a single tool for data processing. Often users and teams will begin with a single tool: Hive, Pig, MapReduce, or another tool. As their use of Hadoop deepens, they will discover that the tool they chose is not optimal for the new tasks they are taking on. Users who start with analytics queries using Hive discover they would like to use Pig for ETL processing or constructing their data models. Users who start with Pig discover they would like to use Hive for analytics-type queries. While tools such as Pig and MapReduce do not require metadata, they can benefit from it when it is present. Sharing a metadata store also enables users across tools to share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common. When all these tools share one metastore, users of each tool have immediate access to data created with another tool. No loading or transfer steps are required.
  • Capture Processing States to Enable Sharing
    When Hadoop is used for analytics, users will discover information with it. Again, they will often use Hive, Pig and MapReduce to uncover information. The information is valuable but typically only in the context of a larger analysis. With HCatalog you can publish results so they can be accessed by your analytics platform via REST. In this case, the schema defines the discovery. These discoveries are also useful to other data scientists. Often they will want to build on what others have created or use results as input into a subsequent discovery.
  • Integrate Hadoop with everything
    Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption it must work with and augment existing tools. Hadoop should serve as input into your analytics platform or integrate with your operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. REST services open up the platform to the enterprise with a familiar API and SQL-like language. Enterprise data management systems use HCatalog to more deeply integrate with the Hadoop platform. By tying in more closely they can hide complexity from users and create a better experience. A great example of this is the SQL-H integration from Teradata Aster. SQL-H queries the structure of data stored in HCatalog and exposes that back through to Aster, enabling Aster to access just the relevant data stored within the Hortonworks Data Platform.

HCatalog is just one of many components of Apache Hadoop and the Hortonworks Data Platform. You can find out more here, including further integration points, and how Hortonworks provides the enterprise rigor to Apache Hadoop.

Apache Hadoop Patterns of Use: Refine, Enrich and Explore

“OK, Hadoop is pretty cool, but exactly where does it fit and how are other people using it?”  Here at Hortonworks, this has got to be the most common question we get from the community… well that and “what is the airspeed velocity of an unladen swallow?”

We think about this (where Hadoop fits) a lot and have gathered a fair amount of expertise on the topic.  The core team at Hortonworks includes the original architects, developers and operators of Apache Hadoop and its use at Yahoo, and through this experience and working within the larger community they have been privileged to see Hadoop emerge as the technological underpinning for so many big data projects. That has allowed us to observe certain patterns that we’ve found greatly simplify the concepts associated with Hadoop, and our aim is to share some of those patterns here.

As an organization laser-focused on developing, distributing and supporting Apache Hadoop for enterprise customers, we have been fortunate to have a unique vantage point.

With that, we’re delighted to share with you our new whitepaper ‘Apache Hadoop Patterns of Use’. The patterns discussed in the whitepaper are:

Refine: Collect data and apply a known algorithm to it in a trusted operational process.
Enrich: Collect data, analyze and present salient results for online apps.
Explore: Collect data and perform iterative investigation for value.

You can download it here, and we hope you enjoy it.

Thankful…

Happy Thanksgiving!

Today, like the rest of the U.S., we take a pause from our regular blog schedule to give thanks…

We are thankful for mappers and reducers. We are thankful for namenodes and jobtrackers. We give thanks to speculative execution battling the march of the last reducer. Give thanks to every petabyte, terabyte, gigabyte, file and block of data. We are thankful for the capacity scheduler.

We are very thankful for many things here at Hortonworks and I know many of us are thankful for an extra long weekend. This has been an amazing year at Hortonworks. We have seen our team double and then triple in size and we are thankful for our smart and hard-working Hortonworkers. We are thankful for sushi lunches, an office of candy, snacks, drinks AND paid gym memberships.

We are thankful for everyone in the Apache Hadoop community and to all those who have downloaded HDP. We are thankful for an ecosystem of partners who are second to none. MOST of all, we are thankful to our investors and to all those companies who have chosen to partner with us as customers.

Happy Thanksgiving!

Rackspace and Hortonworks, a Match Made in the Clouds

As we speed towards widespread enterprise adoption of Apache Hadoop, it has become readily apparent that this new data platform must not only capture, process and distribute data, but must also be deployable in a variety of ways, be it on premise, in a VM, as an appliance or, better yet, in the cloud…

Today we announced a new relationship with Rackspace in which we will develop an OpenStack-based Hadoop solution for the public and private cloud. This is not just a paper relationship. It is a joint effort to produce and make available Hortonworks Data Platform for OpenStack in early 2013.

There are customers today who deploy Hadoop clusters using HDP on dedicated hardware at Rackspace, and this is now available as a turn-key, on-demand service running on the Rackspace open cloud and in clusters on private cloud infrastructure in data centers or a customer’s data center.

Why does this make sense?
Well, when we speak of OpenStack, we think of compute, networking and storage as the three main components. OpenStack was created by Rackspace as a collaborative software project designed to create freely available code, badly needed standards, and common ground for the benefit of both cloud providers and cloud customers. In this environment, Hortonworks just makes sense. Our 100% open source approach is freely available, standards-based and, better yet, open to integration with the ecosystem and other stack components. More importantly, core Hadoop is compute and storage, and Hortonworks provides the most stable and reliable distribution for this. For wide-scale adoption, Hadoop must be enterprise ready, and HDP represents this.

Avoid Vendor Lock-in
The point of OpenStack is to provide an open and scalable operating system for building public and private clouds. It provides both large and small organizations an alternative to closed cloud environments, reducing the risks of lock-in associated with proprietary platforms. With Rackspace you simply provision the service and you are “good to go”. With Hortonworks, we add a new service to the stack that is also provisioned via Rackspace, so you can be up and running in minutes, without a license and without vendor lock-in.

The main reason we can do this is we package a fully open Apache Ambari for monitoring and managing a cluster.  With other distributions you need to purchase these same capabilities, which not only locks you in to the vendor for license but also closes the ecosystem, as the open source community can no longer be a source for patches or upgrades.  You need to wait for your vendor to release their proprietary fix, even for the open source bits they built on top of. Not with Hortonworks.

This approach allows customers to invest further in the open cloud future and to invest confidently in a technology for the long term.

Where exactly IS your data?
Many have turned to the cloud to store or process data. Doesn’t it make sense to extend this processing for big data in the cloud, where much data already resides? Well, with this new offer you can do just that, and in only a matter of minutes. You can easily extend your current Rackspace environment by firing up a Hadoop cluster, and there is no need to move data from internal resources to the cloud; the data is already there. While this may not be the case for every Hadoop project, it makes sense for many, and it may make sense for many Rackspace customers.

Rackspace & Hortonworks… seems like a match made in heaven, well, maybe in the clouds

If you would like more information, please contact us or Rackspace.

Hortonworks & Teradata: More Than Just an Elephant in a Box

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on a machine that is ready to plug in and deliver big data value in hours.

There is more to this appliance than meets the eye…  it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months tying our solutions together and optimizing them for an appliance. This solution gives an analyst the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis and all the while use the tools they are already familiar with.  It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this
This is an engineered solution.  Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL.  This is a great approach but it lacks integration of metadata and metadata exchange.  With the appliance we have extended a new approach using HCatalog and the Teradata SQL-H product.  SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata.  Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze.  All of this is enabled by an innovation provided by HCatalog, which enables this metadata exchange.

Shortcut to Big Data Exploration
In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics.  With this package, these valuable functions can now be applied to big data in Hadoop.  This shortens the time it takes for an analyst to explore and discover value in big data.  And if the pre-packaged functions aren’t explicit enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.

Lighting up operations
Often overlooked when an organization considers Hadoop is the impact on IT operations. The operations team is tasked with making sure a cluster is functional. Well, these guys have countless tools to perform their job, and for Teradata they use Viewpoint and Teradata Vital Infrastructure. In this release, we have integrated the management and monitoring communications used by Ambari with these monitoring tools. Now, the ops guy has a true single pane of glass to monitor the Teradata environment AND the Hadoop cluster used to provide the big data analytics.

Some details on the appliance
The Teradata Aster Big Analytics Appliance runs on proven Teradata hardware, leverages the most current Intel® processor chip technology, SUSE® Linux operating system, and market-leading enterprise-class storage. It can be configured to store a maximum of 5 petabytes of uncompressed user data for Aster and up to 10 petabytes of uncompressed user data for Hadoop.

“The Teradata Aster Big Analytics Appliance offers the faster path from diverse big data acquisition to big insights, and seamlessly delivers these insights to the business owners. Unmatched by any other stack in the industry, it enables organizations to overcome the barriers to big data analytics and provides a high-definition view of the business to optimize operations.”– Scott Gnau, president, Teradata Labs.

This is unique and it ushers in a new approach to big data analytics.

Insights from DataWeek: San Francisco

I spent some time at the first ever DataWeek in San Francisco last week.  It is a brand new show and it was very well-run, spread across a few cool spaces with an interesting mix of novice to experienced data professionals.  They had a good blend of labs, speakers, panels and great networking opportunities.  In all, it was great and a big thanks and kudos to the organizers.

I took part in a panel and also presented a three-hour overview of Hadoop.  There were some good questions thrown at the panel but more interesting was the discussion over the three sessions.  Before each presentation, I ran an informal survey of the room to get a sense of audience and there was an even mix of complete novice, those new to Hadoop and experienced practitioners.

Each session had lively discussion and great engagement.  There were three segments to the presentation: Hadoop market overview, Intro to Hadoop, Hadoop usage patterns.  I would also say that, in general there were three key points that the audience really seemed to focus on.

Forest/Trees :: Distribution/Project
There are Hadoop distributions and there is the Apache Hadoop project. When you are new to this world and learning through all the media, you can get lost in this terminology, and the clarification of this point seemed important to some of the DataWeek crowd.

The conversation went a little like this… the Apache Hadoop project comprises MapReduce and HDFS. Sometimes we refer to this as “core Hadoop” as it is the central focus of a Hadoop project. It provides redundant and reliable storage and distributed processing or compute. In order for Hadoop, the project, to become a more complete data platform, we, the community, have created several related projects that make Hadoop more useful and dependable. When we package these projects (Hive, HBase, Pig, HCatalog, Ambari, ZooKeeper, Oozie, etc…) with core Hadoop, this becomes a “distribution”.

Distributions came about because each project has its own release cycle and getting the right versions together is sometimes difficult. Also, a distribution will package the projects and provide an installer to make deployment much easier.

Insatiable Thirst for Use Cases
Design Patterns by Gamma et al. is and always will be one of the best developer books written. I like design patterns because they take a lot of data and boil it down to its naturally occurring state.  They make sense of chaos.

In the third hour of our overview, we presented some reusable patterns of use for Hadoop, namely, Refine, Explore and Enrich.  With refine we apply a known process to a set of big data to extract results and use them in a business process.  With explore, we use Hadoop to discover new information that was not attainable before.  Often with explore, we will operationalize findings to be used in the refine pattern.  Finally with enrich we use big data to supplement and improve a user experience for an online application.

This session was scheduled for 45 minutes and went the full hour and beyond.  There were a LOT of questions and interactions.  The material was well received by the experienced professionals as it made sense of their projects and for those new to Hadoop it provided a good sense of where to start or how to approach this big data thing.

We Face Challenges
It seemed everyone wanted to get started but was presented with challenges.  There were really three areas of focus in this discussion: acquiring skills, managing a cluster and building a business case. The business case and validation of a project was interesting as some said you should just start with a project and run with it, while others advocated careful planning and a formal process. I guess in the end both sides were right.

It depends on your org and what they can stomach really. I will add my two cents however…  Hadoop is open source and available to you today so use it and start addressing all three of the challenges in the immediate future.

As noted, Dataweek was a huge success and I am honored to have taken part in what surely will be a regular event.  Congrats to the organizers on the birth of a new show.

Welcome Hortonworks Data Platform 1.1

Hortonworks Data Platform 1.1 Brings Expanded High Availability and Streaming Data Capture, Easier Integration with Existing Tools to Improve Enterprise Reliability and Performance of Apache Hadoop

It is exactly three months to the day that Hortonworks Data Platform version 1.0 was announced. A lot has happened since that day…

  • Our distribution has been downloaded by thousands and is delivering big value to organizations throughout the world,
  • Hadoop Summit gathered over 2200 Hadoop enthusiasts into the San Jose Convention Center,
  • And, our Hortonworks team grew by leaps and bounds!

In these same three months our growing team of committers, engineers, testers and writers have been busy knocking out our next release, Hortonworks Data Platform 1.1.  We are delighted to announce availability of HDP 1.1 today! With this release, we expand our high availability options with the addition of Red Hat based HA, add streaming capability with Flume, expand monitoring API enhancements and have made significant performance improvements to the core platform.

Ask our sales and support teams, adoption of Apache Hadoop is clearly growing.  In order to accelerate this widespread interest and adoption our customers demand that their Hadoop distribution is both stable and reliable. It is overwhelming… the enterprise needs to have confidence in the platform.  To this end, we are dedicated to meeting these expectations and these key new features in HDP 1.1 represent a step in the right direction.

Highly Available Hadoop
Not only is HDP 1.1 built on the most stable and reliable release of Hadoop, we are the only distribution to provide full stack high availability on this release. With HDP 1.1, we extend our HA options with the ability to include the most current versions of Red Hat Enterprise Linux (RHEL) and the High Availability Add On. So, now our customers have an option to use industry leading solutions from both VMware and Red Hat as well.

Capturing Data Streams
The addition of Apache Flume into the distribution enables expanded streaming data capture for analysis within the Hortonworks Data Platform. Organizations can now easily and reliably collect and analyze real-time data streams, such as high-volume web logs, in Apache Hadoop, driving additional insights from data that was previously too bulky to capture and process.

Empowering Ops
Operations is a key player in a Hadoop implementation as they are tasked with monitoring and managing the Hadoop infrastructure.  HDP 1.1 delivers easier and deeper integration into third-party management tools and systems so that operations can more easily manage a cluster alongside other resources… through a single pane of glass.

Faster, Faster
Hadoop is fast, but why not make it faster?  With this release, we have measured a performance improvement of more than 10% on MapReduce jobs over our previous release.  Faster reads and writes speed data capture and delivery within the platform. Improved MapReduce execution performance means that jobs process data more quickly.

To get started with HDP 1.1, please visit our downloads page.

There is also a wealth of useful technical resources available, including online documentation, community forums and a Hortonworks knowledge base. Please visit the Community section of our website for these resources and more.

Finally, please join us for our next “What’s New” webinar this week where we will talk more about the new 1.1 features.

UC Irvine Medical Center: Improving Quality of Care with Apache Hadoop

This is the first part of a series written by Charles Boicey from the UC Irvine Medical Center.  The series will demonstrate a real case study for Apache Hadoop in healthcare and also journal the architecture and technical considerations presented during implementation.

With a single observation in early 2011, the Hadoop strategy at UC Irvine Medical Center started. While using Twitter, Facebook, LinkedIn and Yahoo we came to the conclusion that healthcare data although domain specific is structurally not much different than a tweet, Facebook posting or LinkedIn profile and that the environment powering these applications should be able to do the same with healthcare data.

In healthcare, data shares many of the same qualities as that found in the large web properties.  Each has a seemingly infinite volume of data to ingest and it is all types and formats across structured, unstructured, video and audio. We also noticed that the near zero latency with which data was not only ingested but also rendered back to users was important. Intelligence was also apparent in that algorithms were employed to make suggestions such as people you may know.

We started to draw parallels to the challenges we were having with the typical characteristics of Big Data: volume, velocity and variety.

In the beginning, our first project was to build an environment capable of ingesting Continuity of Care Documents (CCD) via a JSON pipeline, store them in MongoDB and then render them via a web user interface that had search capabilities. From that initial success project Saritor was launched.

Saritor is the Roman god for cultivation, in this case the cultivation of healthcare data for the purposes of rapidly progressing through the data to information, to knowledge, to wisdom continuum. We saw this project as a vehicle for demonstrating the value of Applied Clinical Informatics and promoting the translational effects of rapidly moving from “code side to bedside”.

Why Saritor? The Electronic Medical Record (EMR) cannot handle complex operations such as anomaly detection, machine learning, building complex algorithms or pattern set recognition and the Enterprise Data Warehouse (EDW) supports quality, operations, clinicians & researchers. We, like many organizations with data warehouses, run ETL processes at night to minimize the load on the production systems. We have some real time interfaces with the data warehouse, but not all data is ingested in real time. In turn, our data suffers from a latency factor of up to 24 hours in many cases, making this environment suboptimal. An adjunctive environment is needed to fill in the gaps.

Why Apache Hadoop?

Hadoop has a very attractive scale to cost ratio because it is A) open source and B) the server requirements are minimal and VM is an option. We currently deploy eight nodes, which is a far cry from the multiple 4000+ node clusters that Yahoo employs but our small environment is providing us big value.

Hadoop is uniquely capable of storing a wide range of healthcare environment data no matter the type or amount of structure.  For us, this includes:

  • all ancillary HL7 feeds (without the need for modification),
  • EMR generated data,
  • genomic data,
  • financial data,
  • RTLS data from assets,
  • patient and caregiver data,
  • smart pump data,
  • incremental physiological monitoring measurements (across one minute increments),
  • ventilator data in increments of one minute or less,
  • and temperature and humidity data.

Any electronically generated data in a healthcare environment can be ingested and stored in Hadoop and most importantly on commodity hardware.

But wait, that’s not all. The Hadoop ecosystem is modular and within those modules lies the functionality to build algorithms for surveillance, detection and notification of conditions such as sepsis or the prediction of potential 30 day readmits. Other use cases we are working on include monitoring “Sink Time”, that is how much time caregivers spend washing their hands; patient throughput with the ability to capture actual hand off times; patient scorecards pushed to the patient portal and the ability to discover the unknown unknowns in our data.

Hadoop has also answered the problem of legacy data. UC Irvine Healthcare, like many healthcare organizations, has a legacy system, and clinicians and researchers needed access to its data. Data conversion from the legacy system to the new EMR or data warehouse was not feasible. Our legacy system, like others, has the ability to print the patient record to text. For UCI that meant 1.2 million patients and over 3 million records. Those records are now in Saritor and are searchable. Solving this use case was our first deliverable with a demonstrable ROI.

We believe that Hadoop is the right environment for developing an analytic ecosystem to aid in the delivery of quality care at the lowest possible cost and an environment to enable clinical researchers to examine healthcare data in its entirety.

Next time we’ll dive deeper into the Saritor Hadoop ecosystem, ongoing and future development as well as collaborations with our partners.

Apache Hadoop, the Energy Softgrid and my Imaginary Tesla

This week, I spent some time and enjoyed speaking at the Softgrid 2012 conference in San Francisco. It was a great collection of speakers and attendees and opened my eyes to some Hadoop driven possibilities that not only differentiate utilities companies but will also transform our day-to-day lives.

The conference focused on software (in this case intelligent analytics) as a competitive advantage to enable value and growth for utilities.  These often large and historically conservative organizations have moved beyond the notion that their sole business is to distribute electric power efficiently, reliably, and cost-effectively to consumers. They now rely on analysis of massive amounts of data they already collect from smart meters and existing networks about distribution and consumption, and are taking progressive action on that data.

As we have seen in other markets, such as Financial Services and Retail, data is becoming the currency for an energy market transformation.

While I am not a Prius, Volt or Tesla (unfortunately) driver, I am sensitive to eco-friendly causes that have a large and immediate impact on the way we consume our natural resources.  I feel I am like many consumers in that saving five to ten or twenty dollars on my monthly bill is important but honestly I am more interested in knowledge and insight into usage and just how green I am. Call me an armchair activist I guess.

This conference opened my eyes to a broad range of possibilities for the utilities to really change the way we live and increase their bottom line through green tech. Here are two possible uses of big data in Energy.

Generation vs. consumption

My friend and Hortonworker, Rikin Shah, walked me through one potential use case of Hadoop in energy before I even left for San Francisco.  There is no such thing as a big battery that will hold any excess energy that is generated by the utility companies.  That means if we burn the coal or split the atoms we have to use all the energy produced or it gets wasted.  The challenge is that the consumption curve is erratic and this leads to waste, as we have to produce more than necessary to avoid a brownout when consumption extends beyond generation.  It is difficult at best to predict consumption.  However we can get a lot better through data.

In some companies they use smart meter technology that can automatically read meters at any desired interval. For many organizations this is once or twice a month, however they are moving to collect readings every four hours. That’s 6/day x 30 = 180 readings a month, up to 180x growth in data points collected per month per house! Why shouldn’t this be even more frequent?  Well the amount of data is massive.  What if we could extend this to near continuous meter reads and analyze in near real time?  It could get us to better predictability of spikes and reduce the padding between production and consumption.  Further, new technologies (such as Nest thermostats) bring this direct touch to the point of consumption.  As we evolve, certainly smart light switches and wall outlets could all be tied into the grid to provide real touch with real consumption.  Perhaps we combine usage data with detailed weather data that drills down to a square meter.  This kind of profound analysis could help us conserve through near real time production.
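The arithmetic behind that growth figure is easy to check. Here is a minimal sketch in Python; the 30-day month follows the rounding used above, and the function name and the one-minute interval are my own illustrative assumptions, not anything a utility actually runs.

```python
# Meter reads per house per month at a fixed collection interval.
# DAYS_PER_MONTH = 30 is the same rounding used in the post.
DAYS_PER_MONTH = 30
MINUTES_PER_DAY = 24 * 60


def readings_per_month(interval_minutes: int) -> int:
    """Number of reads in a 30-day month when reading every interval_minutes."""
    return (MINUTES_PER_DAY // interval_minutes) * DAYS_PER_MONTH


four_hourly = readings_per_month(4 * 60)   # 6 reads a day * 30 days
per_minute = readings_per_month(1)         # hypothetical near-continuous reads

print(four_hourly)        # 180
print(four_hourly // 1)   # growth factor versus one read a month: 180
print(four_hourly // 2)   # growth factor versus two reads a month: 90
print(per_minute)         # 43200
```

So the 180x figure holds against a single monthly read, and moving to one-minute reads would multiply the four-hourly volume by another 240.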

Individually provisioned consumers

My phone goes everywhere I go.  I use it… a lot and I am often found borrowing a charger or asking someone if I could plug in and give it a charge.  They pay the bill.  This is ok when it is just a few kilowatts but what if I was out of electricity and I was at your house with my (pretend) Tesla?  I will presume I would consume much more than a few kilowatts to give my car a jump.  Currently, there is no way to track and provision usage per device and through to the owner.    Why not?  In part it is a data problem.

A more intelligent provisioning system would require a massive registry of devices and locations and the ratings engine…  well, that would be a heck of an algorithm and would require some pretty heavy computation.  If you think the telecom call data record analysis is complex, this would be insane.  Devices come and go and there are an order of magnitude more options for plugging in.

Enter Hadoop

Apache Hadoop offers massively parallel processing and widespread storage.  Many utilities companies are already enjoying the benefits of this open source data platform.  There are probably a few innovations and some fairly substantial capital expense necessary to make a fully connected grid a reality but it is not that far off and definitely a possibility.  When I was a kid, my dad used to yell at me when I left lights on in the house.  Maybe if the light switch registered my presence, I could ask my pop to take the cost out of my allowance.

Back to our green Earth

It was an interesting day full of great speakers and a roomful of people very interested in how technology and software in general can aid in creating a more efficient grid.  Sure, a more intelligent grid will reduce costs through more intelligent production, but it can also change the way we all think about consumption.

I’m still waiting to get that Tesla!

Data Integration Services & Hortonworks Data Platform

What’s possible with all this data?

Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. Ok, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target? Well, yes…

Sure you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), which is a tool to push bulk data from relational stores into HDFS. While effective and great for basic loads, there is work to be done on the connections and transforms necessary in these types of flows. While custom scripts and Sqoop are both viable alternatives, they won’t cover everything and you still need to be a bit technical to be successful.

For wide scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us.  Enter Talend Open Studio for Big Data. We have worked with Talend in order to deeply integrate their graphical data integration tools with HDP as well as extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (metadata service) and Oozie (workflow and job scheduler).

Talend addresses four key concerns for those using HDP:

  • Bridge the skills gap – Not everyone has a PhD in computer science…  Talend presents a graphical tool where you drag and drop pre-built components on to a canvas, configure them and then all the underlying code is created for you.  This is Java code that can be executed anywhere Java runs and even packaged as a service.  You can also customize the code however you see fit or use it within another IDE.  This radically simplifies the data load process.  All you need to know is the basic configurations and voila!… your data is loaded.
      
  • HCatalog Integration – Hortonworks and Talend engineering teams have partnered to bring HCatalog specific components and functions deeply integrated with the Talend connectors.  Components allow you to easily create, drop and modify tables and databases and check for existence, etc. Also, when storing data you can choose HCatalog as a storage option.  This provides the developer with options within the specific tools for Hive and Pig to integrate with HCatalog and share data and its structure much more easily. HCatalog then provides the metadata services for the underlying data and opens up the platform.
  • Connect to the entire enterprise – The enterprise is full of different sources and targets for data.  These can be databases, applications, files, services and even data warehouses and cubes.  Integration with these resources is not always simple.  We could take the top ten and provide connectors and call it a day, but enterprise data centers are not so homogeneous. With Talend we are able to present a palette full of options, in fact they have over 400 connectors available.  In this video, you can see us grab and parse an Apache log file in seconds using a component.  These pre-tested components save integration time by providing proven and tested APIs and schemas to make these connections.  Want to pull data from Salesforce.com?  …drop a component, configure your login credentials and your Salesforce metadata and data are at your fingertips.
  • Graphic Pig Script Creation – Talend also provides components to deliver Pig Scripts without writing a line of code.  Components for join, aggregate, filtering, cross and others are all included.  Again you drop a component, connect schema, configure the function, and then all the underlying code is written for you…making your time to delivery all the faster.
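
For a sense of what those components save you from writing, here is a rough hand-written sketch in Python of the parse, filter and aggregate steps mentioned above. It is not the code Talend generates (that is Java, or a Pig script) and it is not a Talend API. The function names and sample lines are invented for illustration, and the pattern assumes the Apache Common Log Format.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def count_errors_by_path(lines):
    """Filter to 4xx/5xx responses, then aggregate a count per request path."""
    records = (parse_line(line) for line in lines)
    errors = (r for r in records if r and r["status"][0] in "45")
    return Counter(r["path"] for r in errors)


# Hypothetical sample lines, made up for this example.
sample = [
    '10.0.0.1 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2012:13:55:40 -0700] "GET /missing HTTP/1.1" 404 512',
    '10.0.0.3 - - [10/Oct/2012:13:56:01 -0700] "GET /missing HTTP/1.1" 404 512',
]
print(count_errors_by_path(sample))
```

Even this toy version needs a regular expression and some care with malformed lines, which is exactly the sort of detail a pre-tested component hides.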

This approach can help all of your Hadoop-related projects move a lot faster so you can quickly move past the “where do I start?” question to the more interesting “what’s possible with all this data?”.

Related links: