Category Archives: Apache Hadoop


Cascading for Hadoop and Hortonworks Data Platform

Today Concurrent announced that we have certified the Hortonworks Data Platform against the Cascading application framework. As Hadoop adoption continues to grow, more organizations are looking to take advantage of new data types and build new applications for the enterprise. By combining our enterprise-grade data platform and unparalleled, growing ecosystem with the power, maturity and broad platform support of Concurrent’s Cascading application framework, we have now closed the modeling, development and production loop for all data-oriented applications.

Cascading and Big Data Applications

For those who aren’t familiar, Cascading is the most widely used and deployed application framework for building robust, enterprise Big Data applications on Hadoop. Recognized companies, including The Climate Corporation, eBay, Etsy, FlightCaster, iCrossing, Razorfish, Trulia, TeleNav and Twitter, are using Cascading to streamline data processing, data filtering and workflow optimization for large volumes of unstructured and semi-structured data. Cascading is also at the core of popular language extensions including PyCascading (Python + Cascading), Scalding (Scala + Cascading) and Cascalog (Clojure + Cascading), open source projects sponsored by Twitter. Cascading has become the most reliable and repeatable way of building and deploying Big Data applications.

Cascading and Hortonworks Data Platform

HDP is the only 100-percent open source Apache™ Hadoop®-based data management platform. HDP allows users to capture, process and share data in any format and at scale. Built and packaged by the core architects, builders and operators of Hadoop, HDP includes all of the necessary components to manage a cluster at scale and uncover business insights from existing and new big data sources.

Together, with the simplicity and flexibility of Cascading and the reliability and stability of the HDP, companies can rapidly build, test and deploy new data transformation and refinement, data processing, analytics and machine-learning applications. Enterprises can now leverage existing skill sets, core competencies and product investments by carrying them over to HDP via the standards-based technology – Java, ANSI SQL and machine-learning standards. Analysts and data scientists familiar with these can now easily run predictive data models at scale and integrate ETL, data preparation and predictive analytics in the same application, greatly reducing time to production and unlocking access to large Hadoop data sets.

You can read more about Modern Data Architecture with Hadoop here.

Streaming IN Hadoop: Yahoo! Releases Storm-YARN

Over the past year, customers have told us they want to store all their data in one place and interact with it in multiple ways. They want to use Hadoop, but to do so it needs to extend beyond batch: it also needs to be interactive and real-time, among other things.

This is the entire principle behind YARN, which Arun Murthy and the team at Hortonworks, together with others in the community, have been working on for more than 5 years! The YARN-based architecture of Hadoop 2.0 is hugely significant and we have been working closely with many partners to incorporate it into their applications.

Storm-YARN Released as Open Source

Yahoo! has been testing Hadoop 2 and its YARN-based architecture for quite some time. All the while they have worked on the convergence of the streaming framework Storm with Hadoop. This work has resulted in a YARN-based version of Storm that will radically improve performance and resource management for streaming.

We borrow from their blog post because they say it best…

Collocating real-time processing with batch processing offers a number of advantages over segregated clusters.

  • It provides a huge potential for elasticity. Real-time processing will rarely produce a constant and predictable load. As such, Storm needs more resources to keep up with spikes in demand. Collocating Storm with batch processing allows Storm to steal resources from batch jobs when needed and give them back when demand subsides. The Storm-YARN effort lays the groundwork to make this possible.
  • Many applications use Storm for low-latency processing and Map/Reduce for batch processing while sharing data between Storm and Map/Reduce. By placing Storm physically closer to the data source and/or other components in the same pipeline we can reduce network transfers and in turn the total cost of acquiring the data.
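The elasticity argument in the first bullet can be illustrated with a toy model. This is only a sketch: the `SharedCluster` class, its slot counts and the demand trace are hypothetical, and it does not reflect how YARN or Storm-YARN actually schedule resources. It simply shows a streaming workload taking slots from batch work during a spike and handing them back afterwards.

```python
from dataclasses import dataclass


@dataclass
class SharedCluster:
    """Toy model of one cluster shared by a streaming and a batch workload."""

    total_slots: int    # all slots in the (hypothetical) cluster
    streaming_min: int  # slots the streaming workload always keeps

    def allocate(self, streaming_demand):
        """Return (streaming_slots, batch_slots) for the current demand."""
        # Streaming gets what it asks for, bounded by its floor and the cluster size.
        streaming = max(self.streaming_min, min(streaming_demand, self.total_slots))
        # Batch jobs run on whatever is left over.
        batch = self.total_slots - streaming
        return streaming, batch


cluster = SharedCluster(total_slots=100, streaming_min=10)

# An uneven, made-up demand trace: quiet, a spike, then quiet again.
for demand in [10, 25, 80, 120, 30, 5]:
    streaming, batch = cluster.allocate(demand)
    print(f"demand={demand:3d} streaming={streaming:3d} batch={batch:3d}")
```

With segregated clusters, the streaming side would have to be sized for the peak and would sit idle the rest of the time; in the shared model the idle capacity goes to batch jobs.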

YARN as the basis of Hadoop 2.0 Architecture

We are excited about this development because it reinforces our approach of enabling the broader ecosystem of Hadoop-based applications, and our belief that an open community is the fastest path to this innovation. It is amazing to watch the pace of innovation that is occurring and we know we are still in the very early days of this evolution of technologies around Hadoop to meet the needs of the broad enterprise.

We are also excited about Storm-YARN as it is yet another application to move IN Hadoop. We have SQL-IN-Hadoop for interactive queries with Stinger / Tez, Continuuity and WEAVE, and now Storm-IN-Hadoop for streaming! We look forward to a summer full of innovation around YARN.

Optimizing Hadoop for Microservers

There are plenty of server and storage options for the wave of data that is being collected and analyzed. New platforms such as Apache™ Hadoop® provide the opportunity to make all the new data types being collected useful. However, like any other platform, performance varies depending on the underlying servers being used. There is great promise in what Hadoop can deliver in terms of business value, and the ecosystem is continuously growing with companies making strides to make Hadoop easier to deploy and manage.

One area that has experienced huge advancements is the data center server. The power and cooling requirements of data centers have become an important issue, and the major vendors are all focused on helping the industry become cleaner and greener. AMD SeaMicro has been a leader in this area: it reimagined the server and pioneered the dense, fabric-based microserver, with technology that interconnects pools of resources over a supercompute fabric with an unprecedented 1.28 Tbps bisectional bandwidth that can access more than five petabytes of direct attached storage. The SeaMicro Freedom™ Fabric removes the constraints of the traditional server and allows data centers to expand in multiple dimensions without adding unneeded hardware and costs. Hadoop does not need the fastest processor, but it does need to be affordable and easily scaled out as the amount of data that is collected and analyzed increases.

The data center server is the key underlying infrastructure that enables all of these new innovative services.  Though the amount of data being collected is unlimited, data center capacity clearly is not.  The industry is realizing that data center servers need real innovation that extends beyond the individual server components and takes into account the end-to-end perspective encompassing compute, storage and networking. It’s time to re-imagine the data center server, and deliver what the industry needs.  Companies are experiencing problems that just cannot be solved with traditional servers.

To hear more about how microservers can improve your Hadoop performance and minimize operations, join AMD SeaMicro and Hortonworks for a provocative discussion in the June 18 webinar: How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers, hosted by the Linux Journal.

To learn more about AMD SeaMicro visit: www.seamicro.com

For more on delivering a modern data architecture for your business, click here.

Hortonworks and Red Hat engineering collaboration to increase enterprise adoption of Hadoop and Red Hat Storage

This week we’re at the Red Hat Summit along with many others enjoying the great discussions within the community. As part of the summit, we are delighted to announce extended collaboration with Red Hat to continue to advance open source big data community projects.

Some details on the three areas of collaboration forming the announcement:

  • Enhancing Apache Ambari to support the management of Hadoop-compatible file systems, such as GlusterFS. With this integration, users will be able to provision, deploy, monitor and manage alternative file systems with Ambari, further cementing Ambari’s position as the standard for Hadoop management.
  • Creating generic test suites to validate compatibility between Hadoop and alternative file systems. Hortonworks and Red Hat will contribute these extensive testing blueprints to the open source community for use by any developer looking to test file system compatibility with Hadoop.
  • Working to integrate Hortonworks Data Platform with Red Hat Storage so that enterprise customers will be able to process stored data on Red Hat Storage. Since Red Hat Storage is POSIX-compliant, it makes it easy to connect to the enterprise applications and run Hadoop analytics on enterprise data to reduce duplication of data and save costs.

We’re excited about our engineering-level relationship with Red Hat and other leading technology vendors as they help ensure the integration of the Hortonworks Data Platform with existing enterprise datacenter investments, a crucial requirement of a modern data architecture.

Hadoop Use Case: Harnessing Big Data in the Social Advertising Industry

Successful social advertising campaigns today take a special blend of data intelligence and automation, enabling businesses to link fluctuations in media and tactics to sales and revenues. Those with better data relative to their competitors will be positioned to outperform their peers tactically and, if the data is used effectively, strategically. At one of the fastest growing Advertising Technology startups, harnessing Big Data made big sense in a highly competitive business environment.

The Advertising Technology startup sells Social Ad Campaign management software and wanted its in-house engineering team to focus on its core product and to outsource certain areas of its non-core technology needs. The non-core portion of its technology stack required cutting edge computing skills and entailed creating a Big Data Analytics infrastructure built on the Hortonworks Data Platform (HDP), and hosted on the Amazon Cloud.

A key component of the system development was a scalable crawler to aggregate social data while meeting demanding latency requirements. The crawl infrastructure had to meet two requirements: (1) support the timely refresh of data (e.g. existing social profile data), and (2) keep up with the exponential growth of data collection requirements in a timely manner. Both of these requirements led the startup logically to a Hadoop-based framework and Hortonworks Data Platform (HDP) as the platform of choice.
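To make the "timely refresh" requirement concrete, here is a minimal single-process sketch of a refresh queue. It is an illustration only, not the startup's or Serendio's actual design: the `RefreshScheduler` class, the profile IDs and the 60-second interval are all hypothetical, and a real crawler on Hadoop would distribute this work across many nodes.

```python
import heapq


class RefreshScheduler:
    """Tracks when each (hypothetical) social profile is next due for a re-crawl."""

    def __init__(self, refresh_interval):
        self.refresh_interval = refresh_interval  # seconds between refreshes
        self._queue = []  # min-heap of (next_due_time, profile_id)

    def add(self, profile_id, now):
        """Register a new profile; it is due for its first crawl immediately."""
        heapq.heappush(self._queue, (now, profile_id))

    def due(self, now):
        """Return every profile due at `now` and schedule its next refresh."""
        ready = []
        while self._queue and self._queue[0][0] <= now:
            _, profile_id = heapq.heappop(self._queue)
            ready.append(profile_id)
        for profile_id in ready:
            heapq.heappush(self._queue, (now + self.refresh_interval, profile_id))
        return ready


scheduler = RefreshScheduler(refresh_interval=60)
scheduler.add("profile-a", now=0)
scheduler.add("profile-b", now=10)
print(scheduler.due(now=5))   # only profile-a has been registered and is due
print(scheduler.due(now=10))  # profile-b becomes due
```

The second requirement, keeping up with growth, is the part a single queue like this cannot address, which is where a scale-out platform comes in.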

The Advertising Technology startup also needed to reduce the cost of designing and maintaining a highly available and scalable HDP infrastructure with robust analytics and a predictive modeling backend to meet evolving business initiatives. To accomplish this, the startup enlisted the services of Serendio, provider of the DisKoveror Big Data Science platform, which is designed to enable enterprises to aggregate, discover, analyze, visualize, and predict business outcomes from seemingly unrelated facts and relationships buried in all forms of digital assets for holistic intelligence and insights.

Harnessing Big Data proved to make big sense for the Advertising Technology startup: it created new revenue streams and reduced operational costs by 60%.

“We chose Hortonworks Data Platform (HDP) and Serendio for our implementation because collectively they had the right technology, skills and expertise to scale our infrastructure rapidly to keep up with our fast growing business.”

Thank you to our partner Serendio for this HDP (Hadoop) use case. For more use cases, visit Serendio’s case studies

Serendio’s Big Data Science solutions help in driving Decisions and Actions for a wide variety of businesses in Retail, Insurance, Media, Education, and Healthcare. Visit Serendio at http://www.serendio.com

For more on delivering a modern data architecture for your business, click here.

Hadoop Tooling with Talend Open Studio for Big Data and Hortonworks Data Platform

Talend Open Studio for Big Data provides an intuitive set of tools that make dealing with data in the Hadoop world (and Hortonworks Data Platform in particular) a lot easier. We often use the tools to speed delivery of a proof of concept or to operationalize movement of data from sources like web logs and machine sensors into HDFS. They are simple to use, and it typically takes only minutes to perform something that once took hours in a script.

Recently, Talend launched Talend Open Studio for Big Data version 5.3. It is a substantial upgrade and provides some pretty cool tools. The component I look forward to playing with is tPigMap, which allows you to graphically create data transforms and have the underlying Apache Pig scripts written for you. Talk about simplicity!

Talend Studio

If you’re using the Hortonworks Sandbox to experiment with Hadoop, then we’ve written a How To that shows how you can connect Talend Open Studio for Big Data to Sandbox.

You can download the Hortonworks Sandbox here, and download Talend Open Studio for Big Data here.

Great tools to get more productive with your Hadoop development. Go for it!

Week in Review: HDP 1.3, Hadoop on Windows, More Hadoop Tutorials

The Hadoop goodness just keeps on flowing as we’ve delivered new releases and new content in the past 10 days. Let’s recap.

HDP 1.3 Release. This milestone release takes advantage of improved performance in Hive 0.11 along with delivery on a series of enterprise requirements including NFS access to HDFS, improved MTTR for HBase, business continuity through HDFS and HBase snapshots, optimized connectors to Oracle and Netezza and the latest release of Ambari for management and operations. All of this represents the wicked fast pace of community-driven open source.

The full set of components is: Hadoop 1.2.0, Hive 0.11, Pig 0.11, HBase 0.94.6, Sqoop 1.4.3, Oozie 3.3.2, ZK 3.4.5, Mahout 0.7.0, Ambari 1.2.3. Go get it!

Hadoop on Windows. The imaginatively named HDP for Windows which delivers, well, HDP for Windows became generally available on May 21st. HDP for Windows is particularly exciting as it unlocks the power of Hadoop for ‘the other half’ of development teams and enterprises everywhere – same teams, same skills, same powerful standards such as Hive. You can get it here, and you can get started here.

Hadoop Tutorials in the Sandbox. Meanwhile, we released a series of new tutorials for the Hortonworks Sandbox so you can continue to build out those skills. This time we included real-life use cases and specifically analyzing website user behavior through clickstream analysis covering loading the logs, working with ODBC drivers, and analyzing and visualizing in Excel. Take that @BigDataBorat :)

Hive 0.11 and SQL-Compatibility. Sneaking in a little too late for our last review, Alan outlined the progress of SQL compatibility with the Hive 0.11 release which is neatly laid out below, all in the context of delivering SQL-IN-Hadoop through the Stinger Initiative.

An ever-expanding ecosystem. Did we mention that we’ve partnered with Splunk to continue to enable next generation enterprise architecture? Sure we did. And we’ve worked with Alteryx to release a whitepaper on getting started implementing Hadoop-based analytics in your organization: The Business Analyst’s Guide to Hadoop.

Hadoop Summit. Don’t forget to register; it’s going to be exciting. There is still time to try and squeeze into one of the Meetups: Hive, Pig, HBase, YARN, Accumulo, Ambari, Oozie, Data Science and Architecture, or maybe attend Big Data Camp or Machine Learning Evening on 25th June.

Just enough time to complete those tutorials. Have a great weekend.

Hortonworks Data Platform 1.3 Release: The community continues to power innovation in Hadoop

HDP 1.3 release delivers on community-driven innovation in Hadoop with SQL-IN-Hadoop, and continued ease of enterprise integration and business continuity features.

Almost one year ago (50 weeks to be exact) we released Hortonworks Data Platform 1.0, the first 100% open source Hadoop platform into the marketplace.  The past year has been dynamic to say the least!  However, one thing has remained constant: the steady, predictable cadence of HDP releases.  In September 2012 we released 1.1, this February gave us 1.2 and today we’re delighted to release HDP 1.3.

HDP 1.3 represents yet another significant step forward and allows customers to harness the latest innovation around Apache Hadoop and its related projects in the open source community. In addition to providing a tested, integrated distribution of these projects, HDP 1.3 includes a primary focus on enhancements to Apache Hive, the de facto standard for SQL access in Hadoop, as well as numerous improvements that simplify ease of use.

The Relentless March of Community Driven Innovation

Consistent with our approach and together with many others in the community, Hortonworks has been working hard to progress the Hadoop projects at the Apache Software Foundation.  We believe that identifying enterprise requirements, introducing them into the community and working within those projects at the ASF is the fastest path to innovation and HDP 1.3 represents that philosophy realized.

Hortonworks Data Platform Releases

By incorporating all of the latest relevant and stable Apache project releases in HDP 1.3, we are able to provide our customers with the most up-to-date Hadoop platform available. And because it is 100% open source, it eliminates any notion of vendor lock-in.

In fact, the graphic above illustrates the progress we have made in a very short time.

By applying our consistent approach to innovation and maintaining a cadence of releases we believe that we can greatly accelerate Hadoop adoption and enable an ever-larger number of customers to adopt Apache Hadoop as a core component of their enterprise data architecture.

HDP 1.3, SQL-IN-Hadoop: Phase 1 of the Stinger Initiative

Apache Hive is the de facto standard for SQL access in Hadoop, and the Stinger Initiative is a coordinated effort by Hortonworks and many others to enhance Hive for the emerging requirement for interactive queries in Hadoop.

HDP 1.3 is the first distribution to include Apache Hive 0.11 which delivers a 50x improvement in performance for queries and broadens the range of SQL semantics supported in Hadoop as part of the Stinger Initiative.  Incorporating over 350 enhancements contributed by a broad community of over 55 developers from more than 10 organizations, Hive 0.11 is a phenomenal demonstration of the power of the community!

Ease of Use and Business Continuity

As the user base for Hadoop expands quickly, HDP 1.3 continues the focus on ease of use to include the following set of capabilities:

Ease Of Use

  • This release provides NFS v3 standards-based access to HDFS so that the file system can be accessed as a mounted drive on the network, simplifying movement of data in and out of Hadoop.
  • HDP 1.3 provides more access to enterprise data from Hadoop with optimized Oracle and Netezza connectors, enhanced HCatalog support for Sqoop and the ability to transfer Sqoop direct loads to/from RCFile and ORCFile.
  • Apache Ambari, the open source management and provisioning solution for Apache Hadoop was upgraded to include job diagnostic improvements, more customization options, new heatmaps and broader support for existing enterprise platforms.
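The practical effect of the NFS item above is that programs can use ordinary file operations instead of a Hadoop-specific client. The sketch below assumes HDFS has already been mounted somewhere on the local machine; the mount point `/mnt/hdfs` and the file names in the usage note are hypothetical, and the helper functions are illustrative rather than part of HDP.

```python
import shutil
from pathlib import Path


def upload(local_file, mount_root, relative_dest):
    """Copy a local file into the mounted file system using plain file I/O."""
    dest = Path(mount_root) / relative_dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(local_file, dest)  # contents only, no metadata copy
    return dest


def line_count(mount_root, relative_path):
    """Read a file back through the mount and count its lines."""
    with open(Path(mount_root) / relative_path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)
```

Assuming the mount exists, a call such as `upload("weblogs.tsv", "/mnt/hdfs", "data/weblogs.tsv")` followed by `line_count("/mnt/hdfs", "data/weblogs.tsv")` would move a file in and read it back with no Hadoop libraries involved. Any directory can stand in for the mount when trying the functions out.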

Business Continuity

  • HDP 1.3 delivers file dataset (HDFS) and HBase snapshots for point-in-time disaster recovery functionality.
  • An upgrade to HBase 0.94.6.1 provides multi-master high availability, table snapshots and shortened recovery times for online applications built on Hadoop.

We are very pleased to bring you HDP 1.3, and encourage you to download it today.

Hadoop Tutorials: Real Life Use Cases in the Sandbox

One of the goals of the Hortonworks Sandbox is to showcase end-to-end use cases for Hadoop. In the most current release of Hadoop tutorials, you’ll find 2 specific use cases highlighted, both built around clickstream data. There are 6 new tutorials for you to walk through: Tutorials 6 through 11.

(Update: if your version of Sandbox does not have “Enable Ambari” on the introductory page, you will need to download the latest version of the Sandbox in order to have access to these tutorials.)

Clickstream Analysis – Website User Behavior

Hadoop Tutorials in Hortonworks Sandbox

Tutorials 6-10 are extensive, step-by-step lessons to walk you through the process to connect the Sandbox to Excel 2013 via the Hortonworks ODBC driver to access and analyze semi-structured data (like Omniture logs). Here are some highlights of the new tutorials:

Tutorial 6 – Loading Data into the Hortonworks Sandbox

This covers the basics of bringing data into the Sandbox. In this example, we’ve provided access to anonymized Omniture logs. But you can bring your own data into the Sandbox: your own log data, Twitter feeds, etc. The Sandbox is a fully functional personal Hadoop environment where you can add your own datasets to validate the Hadoop use cases in your environment.
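Before loading your own logs, it can help to check that they parse cleanly. The sketch below assumes a tab-separated log with three columns; the column names and the sample lines are invented for illustration and are not the actual layout of the Omniture logs used in the tutorial.

```python
import csv
import io

# Hypothetical column layout; real logs will have their own (usually longer) schema.
COLUMNS = ["timestamp", "ip", "url"]


def parse_log(text):
    """Parse tab-separated log text into dicts, skipping malformed lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, row)) for row in reader if len(row) == len(COLUMNS)]


sample = (
    "2012-03-01 10:00:00\t10.0.0.1\thttp://example.com/a\n"
    "this line is malformed\n"
    "2012-03-01 10:00:05\t10.0.0.2\thttp://example.com/b\n"
)
for record in parse_log(sample):
    print(record["timestamp"], record["url"])
```

Rows with the wrong number of fields are dropped here for simplicity; in practice you may prefer to count or log them so data quality problems are visible before the data reaches Hadoop.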

Tutorials 7 & 11 – Installing the ODBC Driver in the Hortonworks Sandbox (Windows and Mac)

You can download the Hortonworks ODBC driver, connect it to the Sandbox and then use that connection with your favorite visualization or business intelligence tool. This tutorial will help you with the setup and connection. Once it’s set up, connect to Excel, Tableau, Alteryx, or any other business intelligence tool that supports ODBC.

Tutorials 8 & 9 – Accessing and Analyzing Data in Excel

Imagine being able to take that semi-structured data from Tutorial 6 and surface it in Excel. You’ll be able to do that on your own laptop when you follow the step-by-step lessons in Tutorials 8 & 9.

Hadoop Tutorials with Excel

Data visualization in Excel

Tutorial 10 – Visualizing Clickstream Data

Hadoop Tutorials

Combining CRM and weblog data

Here you will see another end-to-end example of visualizing clickstream data – but in this case weblog data is combined with CRM data to visualize actual customer behavior. This tutorial assumes that you’ve got the ODBC driver and Excel 2013 installed. Even if you don’t have Excel 2013, you can use your favorite visualization tool to play with the dataset.
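The core of combining weblog and CRM data is a join on a shared customer key. The sketch below shows that idea in a few lines; the field names (`customer_id`, `segment`, `page`) and the sample records are hypothetical and do not come from the tutorial's dataset, where the same kind of join would typically be expressed as a Hive query.

```python
from collections import Counter


def page_views_by_segment(weblog_rows, crm_rows):
    """Count page views per (customer segment, page) by joining on customer_id."""
    segment_of = {row["customer_id"]: row["segment"] for row in crm_rows}
    counts = Counter()
    for hit in weblog_rows:
        # Visitors with no CRM record are kept, but grouped under "unknown".
        segment = segment_of.get(hit["customer_id"], "unknown")
        counts[(segment, hit["page"])] += 1
    return counts


crm = [
    {"customer_id": "c1", "segment": "loyal"},
    {"customer_id": "c2", "segment": "new"},
]
weblog = [
    {"customer_id": "c1", "page": "/home"},
    {"customer_id": "c1", "page": "/checkout"},
    {"customer_id": "c2", "page": "/home"},
    {"customer_id": "c3", "page": "/home"},
]
for (segment, page), views in sorted(page_views_by_segment(weblog, crm).items()):
    print(segment, page, views)
```

The resulting counts are the kind of small, aggregated table that is convenient to hand to Excel or another visualization tool.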

Datasets

With these new tutorials, you can easily work with your own data within the Sandbox to start seeing where you can use the Hortonworks Data Platform within your organization to find insights into your own business. If you are looking for publicly available data to use with the Sandbox to apply these Hadoop tutorials against, here are some suggestions:

Ready to do work on your own real-life example? Download the Sandbox now.

Get Started with Hadoop on Hortonworks Data Platform 1.1 for Windows

We are excited to release the Hortonworks Data Platform 1.1 for Windows as a Generally Available product. In this blog post, I’m going to outline how to get started with HDP 1.1 for Windows.

With HDP for Windows, you can deploy Apache Hadoop and the HDP stack of components natively on a Windows Server cluster. The HDP for Windows download includes an MSI and remote installation scripts. With these artifacts, you can set up a multi-node Hadoop cluster in either a Workgroup or Active Directory Domain networking configuration. This enables HDP for Windows to be deployed for production use in Windows data centers.

The best way to get started and evaluate HDP is to set up a single node cluster. We’ve written a quick start guide that walks you through all the prerequisites and install steps needed to get going. With a single node cluster, you can experience the full functionality of the product: load data into HDFS, execute Hive, Pig and MapReduce jobs, and schedule processing workflows through Oozie.

HDP enables seamless integration with the Microsoft BI tool ecosystem. You can explore data in HDFS through the Data Explorer  in Excel. You can query and analyze Hive data in Excel by using the ODBC driver to connect to Hive Server 2. You can import/export data from and to SQL Server through Apache Sqoop.

These integrations enable HDP to become an integral part of your Enterprise Data Architecture, and allow you to utilize the same tools that you are familiar with to interact with HDP.

Learn More. Please take a look at the Hortonworks Documentation to learn more about installing and using HDP 1.1 for Windows.

Tell Us About It. Please visit the HDP 1.1 for Windows Forum to ask questions, get help, provide feedback and hear what others are doing with HDP.

Mobile Telco Dials In and Harnesses Big Data with Hadoop

Smartphones have transformed our daily lives. A key indicator of this trend is our increased spend on data plans versus voice. We are a new generation of people who are in a constant state of activity, communication, and community building wherever we go, including the couch in front of the television where we can multi-screen and multi-task!

What does this mean for the Mobile Telecom industry? Consider one of the top five mobile phone service providers in the world, responsible for developing and managing advanced data services for European countries, including mobile internet access for various devices, mobile email, instant messaging, news, weather updates and traffic reports. For this provider, it means that as mobile data services grow in revenue, so does the need to monitor that contribution easily and accurately. While that sounds obvious, mobile telecom has grown so rapidly that the company’s existing systems could not keep up. And once the business leaders had the data, they wouldn’t trust it. Making accurate business decisions at the right time would be essential for their success and growth.

Big Data Challenge

The customer, a Mobile Telecom giant, had an existing method for determining business performance that was ad hoc and decentralized. There was no single system to extract the information in a reliable and consistent manner. “We had a mix of systems and information which needed lots of cross-checking – if indeed this was even possible. Getting access to data took a long time and, even then, the business users in marketing had no real confidence in the information they were getting.” This in turn compromised their ability to develop and manage these services.

In order to gain market share and stay competitive, the customer had to be able to:

  • Leverage the data from mobile usage to get accurate information about real customer activity to provide improved levels of customer satisfaction.
  • Spot upcoming trends in mobile use to drive intelligent marketing.
  • Improve the information on customer usage, which drives the changes needed to their service offerings, such as the ability to offer the latest mobile phone technologies.
  • Handle large volumes of data, be easily configurable by in-house business users, and provide graphical representations of the results.

Hadoop Solution

The strategy included harnessing Hadoop to handle the large volumes of data (36 terabytes) that had to be consolidated into a single environment. Our Mobile Telecom customer decided to use Actuate, a Hortonworks partner whose open source based Business Intelligence and Reporting Tools (BIRT) technology connects analytics capabilities directly to Hadoop. Actuate’s ability to report directly against the Hadoop big data source allows business users to generate on-demand analytics and reports consisting of thousands of pages in a matter of seconds through an easy-to-use web portal, with negligible training.

The Mobile Telecom giant now has a single source of clean data they can stand behind with absolute confidence in making the right decisions to stay competitive, and keep customer satisfaction levels high.  In addition, the consumer data services division is now in a position where it can replace several of its older systems, dropping extra licenses and hardware, because of the ability to do all of its business analytics in one place.

A Business Intelligence Analyst at the company stated: “It’s all automatic. Before, business users would be sending emails and calls to chase the data. Anyone across the whole business can have access to the information they need, and find it on their own. I particularly like the ability to drill down into the figures. You can now see at a glance what’s happening right across our activities.”

Customers want accurate and fast analytics reporting without a lot of training, so a partnership between Hortonworks and Actuate just makes big data sense.

Thank you to our partner Actuate for this Hadoop use case. Find more partners here.

 Actuate founded and co-leads the BIRT (Business Intelligence and Reporting Tools) open source project with the Eclipse Foundation, the home of the open source Eclipse Development Framework, the leading IDE worldwide. The BIRT project’s goal was to bring the web design metaphor to creating visualizations of data. 

Boosting Big Data and the Hadoop Ecosystem with Splunk Alliance

Today we announced a strategic alliance with operational intelligence leader Splunk. We are excited to be strengthening our relationship with Splunk and expanding the Apache Hadoop ecosystem, and we expect this to further drive open source innovation. Additionally, this alliance is further proof of Hadoop’s maturation as a key component of the next generation enterprise architecture.

One of the key benefits of the partnership is that it enables organizations to easily take advantage of the massive scale out storage and processing capabilities of Apache Hadoop with Splunk Enterprise via Splunk Hadoop Connect, which easily and reliably moves data between Splunk Enterprise and Hadoop.

This capability means the enterprise can easily use Splunk Enterprise to collect machine data from across the enterprise and deliver it to Hadoop for batch analytics. Likewise, the output of Hadoop jobs can be imported into Splunk Enterprise for rapid analysis and visualization.

Visit the Splunk website to learn more about Splunk Enterprise and Splunk Hadoop Connect.

Find out more about how Hadoop and the Hortonworks Data Platform enable next-generation data architecture.

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA!

Today we are very excited to announce that Hortonworks Data Platform for Windows (HDP for Windows) is now generally available and ready to support the most demanding production workloads.

We have been blown away by the number and size of organizations that have downloaded the beta bits of this 100% open source, native-to-Windows distribution of Hadoop and engaged Hortonworks and Microsoft around evolving their data architecture to respond to the challenges of enterprise big data.

With this key milestone, HDP for Windows offers the millions of customers running their business on Microsoft technologies an ecosystem-friendly Hadoop-based solution that is built for the enterprise and purpose-built for Windows. This release cements Apache Hadoop’s role as a key component of the next generation enterprise data architecture across the broadest set of datacenter configurations, as HDP becomes the first production-ready Apache Hadoop distribution to run on both Windows and Linux.

Additionally, customers now have complete portability of their Hadoop applications between on-premise and cloud deployments via HDP for Windows and Microsoft’s HDInsight Service.

Enterprise Hadoop Momentum

Since its beta availability, we’ve been working with customers across a wide range of industries including automotive, manufacturing, financial services, retail and government. Here are just a few examples of the tremendous opportunity those customers are seeing:

  • Automotive – a major automotive company wants to use HDP on Windows to create a centralized repository for all of the sensor data collected from their cars. The refinement and exploration of the data trends and patterns found through driving habits, maintenance and repair data and myriad other signals will be used to further improve the quality of their cars.
  • Healthcare – a major healthcare applications provider is looking to build the next generation of healthcare apps that integrate patient health record data with clinical study and FDA data, enriching the customer experience and providing a higher level of health care services at a lower cost.
  • Financial services – multiple major financial services organizations are looking to create centralized repositories across different divisions enabling them to explore and gain deeper insight into customer risk patterns.
  • Manufacturing – a major manufacturer of electronics will create a centralized repository of machine-generated data coming from the production lines and compare that data with part failure and return data, enabling them to identify and predict problems in production and increase the quality of their products.

This is just a small sample of the emerging use cases for HDP on Windows. You can explore how Hadoop fits into your data architecture here.

Availability & Training

Hortonworks Data Platform for Windows is now available for download at: http://hortonworks.com/download/.

We also have training specifically designed for HDP on Windows. You can get more information here: http://hortonworks.com/hadoop-training/hadoop-on-windows-for-developers/

Hive 0.11, Stinger and SQL-Compatibility

The release of Hive 0.11 is exciting and represents a big step toward delivery of Project Stinger and SQL-IN-Hadoop. There is still some work to be done, however. We look forward to the delivery of Hadoop 2 with YARN and the Apache Tez project, which should bring huge increases in Hive performance, but performance is not the only goal of Stinger.

SQL-In-Hadoop simply can’t be SQL without SQL compatibility

Today, HiveQL provides a fairly good set of SQL data types and semantics and while this (or a subset thereof) may be good enough for some of the “on” Hadoop solutions, we feel there needs to be more, especially if Hadoop and Hive are to meet the stringent requirements of enterprise class business analytics. To this end, we have set a goal of compatibility with most of SQL-92 and beyond with some SQL-2003 extensions.

The release of Apache Hive 0.11 pushes us further towards SQL-compatibility with the decimal data type becoming more usable (JIRA HIVE-4271) and the addition of analytic functions for windowing and aggregates.  It also vastly improves joins and all the while improves performance.  Awesome.
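
To make the windowing additions concrete, here is a minimal sketch of analytic functions used with an OVER clause. It is an illustration rather than Hive output: the query uses standard SQL window syntax but runs on SQLite (3.25 or later) through Python's built-in sqlite3 module, and the sales table, its columns and its rows are all hypothetical. Details of the HiveQL syntax may differ.

```python
import sqlite3

# Hypothetical sample data (region, rep, amount); not taken from the post.
ROWS = [
    ("east", "ann", 300),
    ("east", "bob", 500),
    ("east", "cat", 500),
    ("west", "dan", 200),
    ("west", "eve", 100),
]

# Standard SQL window functions, executed on SQLite purely for illustration.
QUERY = """
SELECT region, rep, amount,
       RANK()       OVER (PARTITION BY region ORDER BY amount DESC)      AS rnk,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC, rep) AS row_num,
       LAG(amount)  OVER (PARTITION BY region ORDER BY amount DESC, rep) AS prev_amount,
       SUM(amount)  OVER (PARTITION BY region)                           AS region_total
FROM sales
ORDER BY region, row_num
"""


def run_window_query():
    """Load the sample rows into an in-memory table and run the window query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ROWS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_window_query():
        print(row)
```

Each function is evaluated within its region partition, so ranks and LAG comparisons restart for every region. Unlike a GROUP BY, the individual rows are preserved and the aggregate (region_total) is simply attached to each of them.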

What else?

There is a lot more work to be done, however, and we’ll work with the community to get it done. Hive 0.11 had contributions from over 50 community members to close over 380 JIRA tickets. That is astounding and a huge proof point of the open community and its unrivaled capability to innovate faster than any proprietary solution.

We will reach our goal soon.  Here is what’s left to be done:

[Image: sqlcompat (remaining SQL compatibility work)]

We look forward to providing updates to Hive all summer long!

Apache Hive 0.11: Stinger Phase 1 Delivered

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL query to Hadoop. Simply put, our choice was to double down on Hive and extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community, we established a fairly audacious goal of a 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community-led effort, we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, including 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them. See below for a full list.

Delivering on the promise of Stinger Phase 1

As promised, we have delivered phase 1 of the Stinger Initiative in late spring. This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor. As part of phase 1 we promised windowing, new data types, the optimized RC (ORC) file and base optimizations to the Hive query engine, and the community has delivered these key features.

Key features in Hive 0.11

  • ORCFile.  It’s Optimized.
    The ORC File (Optimized RC File) presents key new features that speed access to data in Apache Hive. It adds meta information at the file and block level so that queries can be more intelligent and use that metadata to optimize access. Further, with the ORC file, only the bytes from the required columns are read from HDFS, which minimizes I/O and speeds the query chain. These are major advances for improved performance in Hive.
  • Improved Data Types
    As Apache Hive marches towards full SQL-compatibility, the decimal data type was updated to make it more usable.
  • Analytic Functions
    Hive 0.11 introduces windowing functions for RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE and more. It also introduces aggregate OVER functions with PARTITION BY and ORDER BY.
  • Joins improved in Hive 0.11
    Both the broadcast join and the SMB join were improved considerably in Hive 0.11.  Both joins work without user hints, so that the Hive optimizer now picks the correct join rather than depending on the user to do so. More broadcast joins are now packed into a single MapReduce job, making star join queries much more efficient.
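
The two ORCFile ideas described above, per-block metadata and reading only the required columns, can be sketched with a toy in-memory model. This is not the actual ORC on-disk format or any Hive API: the Block class, build_blocks, scan and the sample rows are all hypothetical names and data chosen for illustration, and min/max statistics are only one kind of metadata a columnar format might keep.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Block:
    """A group of rows stored column by column, plus per-column (min, max) metadata."""
    columns: Dict[str, List[int]]
    stats: Dict[str, Tuple[int, int]]


def build_blocks(rows: List[Dict[str, int]], block_size: int) -> List[Block]:
    """Split row-oriented data into column-oriented blocks and record min/max stats."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        blocks.append(Block(columns, stats))
    return blocks


def scan(blocks: List[Block], select: str, filter_col: str,
         min_value: int) -> Tuple[List[int], int]:
    """Return values of `select` where filter_col >= min_value, and blocks actually read."""
    results: List[int] = []
    blocks_read = 0
    for block in blocks:
        _, col_max = block.stats[filter_col]
        if col_max < min_value:
            continue  # metadata shows no row in this block can match, so skip it
        blocks_read += 1
        # Only the two columns the query needs are touched; others are never read.
        filter_vals = block.columns[filter_col]
        wanted_vals = block.columns[select]
        results.extend(v for f, v in zip(filter_vals, wanted_vals) if f >= min_value)
    return results, blocks_read


if __name__ == "__main__":
    sample = [{"id": i, "price": i * 10, "qty": i % 3} for i in range(1, 10)]
    ids, read = scan(build_blocks(sample, 3), select="id",
                     filter_col="price", min_value=65)
    print(ids, read)
```

In this sketch the query for prices of at least 65 reads one block out of three and never touches the qty column, which is the kind of I/O saving the ORCFile description refers to.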

Towards YARN and the Power of SQL-IN-Hadoop

Hadoop 2.0, and specifically YARN, turns Hadoop from a single-application system into a multi-application operating system. The next generation of Apache Hive, built on YARN, becomes part of the platform itself and can be managed by YARN to ensure that multiple use cases can be addressed beyond interactive query. It is the delivery of a multi-application data system. In this new world, Hive is a first-class citizen alongside a variety of workloads within a cluster, and resources can be managed more discretely.

Ultimately, this leads to further performance enhancements for Hive, and with the inclusion of Tez we will be able to demonstrate even more significant improvements as service startup times are removed and a newly optimized execution chain within core Hadoop is delivered. The near future is exciting!

Apache Hive is empowering an ecosystem of SQL Based Applications

This release represents significant enhancements to Hive that will improve direct SQL interaction with Hive and light up the hundreds of applications that already rely on Hive as the de facto SQL interface for Hadoop. If you are one of the hundreds of software companies using Hive already, we hope you test out this new release and are happy with the results. We look forward to supporting it in HDP 1.3 in the very near future. ;)

Thank You to the Community

Thanks to 55 developers who contributed time and effort on this release: Alan Gates, Amareshwari Sriramadasu, Andrew Chalfant, Arup Malakar, Ashish Singh, Ashish Vaidya, Ashutosh Chauhan, Bennie Schut, Bhushan Mandhani, Billie Rinaldi, Brock Noland, Carl Steinbach, Chen Chun, Chris Drome, Dilip Joseph, Edward Capriolo, Gang Tim Liu, Gopal V, Gunther Hagleitner, Harish Butani, Ivan Gorbachev, Jarek Jarcec Cecho, Jean Xu, Jingwei Lu, Johnny Zhang, Jonathan Chang, Kevin Wilfong, Lars Francke, Li Yang, Mark Grover, Mayank Garg, Mikhail Bautin, Namit Jain, Navis, Nick Collins, Owen O’Malley, Pamela Vagata, Prajakta Kalmegh, Prasad Mujumdar, Roshan Naik, Sam Tunnicliffe, Samuel Yuan, Sean Busbey, Shreepadma Venugopalan, Sushanth Sowmyan, Teddy Choi, Thejas M Nair, Thiruvel Thirumoolan, Travis Crawford, Vikram Dixit K, Vinod Kumar Vavilapalli, Wonho Kim, Xiao Jiang, Zhenxiao Luo
