Hortonworks on Apache Hadoop


Hive 0.11, Stinger and SQL-Compatibility

The release of Hive 0.11 is exciting and represents a big step forward in delivering Project Stinger and SQL-in-Hadoop. There is still some work to be done, however. We expect Hadoop 2 with YARN and the Apache Tez project to bring huge increases in Hive performance, but performance is not the only goal of Stinger.

SQL-In-Hadoop simply can’t be SQL without SQL compatibility

Today, HiveQL provides a fairly good set of SQL data types and semantics and while this (or a subset thereof) may be good enough for some of the “on” Hadoop solutions, we feel there needs to be more, especially if Hadoop and Hive are to meet the stringent requirements of enterprise class business analytics. To this end, we have set a goal of compatibility with most of SQL-92 and beyond with some SQL-2003 extensions.…

Read More

Week in Review: SQL IN Hadoop and Hive, Beyond Batch with YARN, NFS access to HDFS and HBase MTTR

Or as it’s more commonly being called: Week-ish in Review. Let’s recap on the latest – there’s some juicy technology goodness here.

Delivering on Stinger: Phase 1. Just this week, Hive 0.11 has been released. Owen (@owen_omalley) brought us the news that 55 – yes, fifty-five – developers from across the community have addressed 386 JIRA tickets and have delivered significant improvements to Hive along with an awesome demonstration of the power of community open-source development. Thanks to everyone! This release of Hive means that we’ve delivered on the first phase of the Stinger Initiative too – aiming to deliver 100x performance increases to Hive.

Taking Hadoop Beyond Batch with YARN. All of which means we step closer to delivering SQL-in-Hadoop and respond to the needs of enterprises for multi-application operating systems for their big data. Arun (@arunmurthy) gives a terrific update on Hadoop 2.0 and YARN and how that development will move Hadoop Beyond Batch.…

Read More

Apache Hive 0.11: Stinger Phase 1 Delivered

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL-query into Hadoop.  Simply put, our choice was to double down on Hive to extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community-led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, including 28 new features and 276 bug fixes.…

Read More

Advanced Analytics: Making Decisions at the Speed of Business

Retailers today are faced with addressing the new behaviors of an evolving customer base by leveraging the changing landscape and its new dynamics. Retail consumers online are sharing, friend validating, researching, learning and developing a point of view – offline they are touching, brand comparing and brand associating. Retailers now more than ever before have to think in terms of “integrated commerce” and leverage Big Data for big results in the marketplace.

Forward-thinking organizations are discovering the possibilities of unconstrained analytics and quickly realizing the potential of accelerating the spread of analytics across the company – ultimately driving the speed of acquiring new customers, responding to consumer and market change, and increasing their “share of wallet”. Retail analysts want to spend more time in the analytic discovery process, and less time acquiring and preparing data, so they can uncover new market opportunities and reduce risks.…

Read More

Moving Hadoop Beyond Batch with Apache YARN

Apache Hadoop 2.0 continues to make its way through the open source community process at the Apache Software Foundation and is getting closer to being declared “ready” from a community development perspective.  Once ready, our team at Hortonworks will apply our usual enterprise rigor in providing a tested and integrated distribution that includes Hadoop 2.0 along with the other enterprise-focused services our customers and partners require.

In my roles both at Hortonworks and in the open-source Apache Hadoop community, I’m asked a lot of questions regarding the key aspects and motivations behind Hadoop 2.0. Here is some information to sate the curious mind.

First-generation success inspires second-generation focus

In the early days of Hadoop at Yahoo!, we had a very particular objective: store and process very large amounts of data to support our internet search efforts.  And so the first generation of Hadoop was a purpose-built system for web-scale data processing that was embraced by Yahoo!…

Read More

Hadoop SDK and Tutorials for Microsoft .NET Developers

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some PowerShell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is really interesting as it builds on the established technology for .NET developers to access most data sources to deliver the capabilities of the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on GitHub, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel.

Read More

Meetups at Hadoop Summit

UPDATED: To include the Oozie meetup.

The main Hadoop Summit agenda is looking awesome – go take a look here, and register here - but there’s also a series of meetups planned for the day before the general sessions. If you want to get up close and personal on topics of interest to you with other like-minded folk then take a look at these options. We’ll be providing refreshments along the way.

Meetups

You should go ahead and register at the links below. Note that space will be limited, and remember that all meetups are in San Jose!

Morning Sessions: 25th June, 10:00am – 12:30pm at San Jose Convention Center

Afternoon Sessions: 25th June, 1:30pm – 4:00pm at San Jose Convention Center

Camps

Additionally, there are two camps in the evening:

All this Hadoop-y goodness should get you nicely in the mood for the next two days of general and track sessions.…

Read More

Simplifying data management: NFS access to HDFS

We are excited that another critical Enterprise Hadoop integration requirement – NFS Gateway access to HDFS – is making progress through the main Apache Hadoop trunk.  This effort is architected and designed by Brandon Li and Suresh Srinivas, and is being delivered by the community. You can track progress in Apache JIRA HDFS-4750.

With NFS access to HDFS, you can mount the HDFS cluster as a volume on client machines and use native command-line tools, scripts or a file explorer UI to view HDFS files and load data into HDFS. NFS thus enables file-based applications to perform file read and write operations directly to Hadoop. This greatly simplifies data management in Hadoop and expands the integration of Hadoop into existing toolsets.
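To make the idea concrete, here is a minimal sketch of what "file-based access" could look like from a client script once HDFS is mounted over NFS. The mount point `/mnt/hdfs`, the relative path and both helper functions are hypothetical names chosen for illustration; nothing here is an API of the NFS gateway itself, only ordinary Python file I/O pointed at a mounted directory.

```python
import os


def load_into_hdfs(mount_point, relative_path, lines):
    """Write text lines sequentially to a file under an (assumed) NFS-mounted HDFS directory."""
    target = os.path.join(mount_point, relative_path)
    # Create the parent directory through the mount, just like on a local disk.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return target


def read_from_hdfs(mount_point, relative_path):
    """Read a text file back through the same mount and return its lines."""
    target = os.path.join(mount_point, relative_path)
    with open(target, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


if __name__ == "__main__":
    # Hypothetical mount point; substitute wherever the HDFS volume is mounted.
    mount = "/mnt/hdfs"
    load_into_hdfs(mount, "staging/events.txt", ["click,42", "view,7"])
    print(read_from_hdfs(mount, "staging/events.txt"))
```

The point of the sketch is that the script contains no Hadoop client code at all: any tool that can read and write files can, in principle, work against the mounted volume.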

NFS and HDFS

Network File System (NFS) is a distributed file system protocol that allows access to files on a remote computer in a manner similar to how a local file system is accessed.  …

Read More

Introduction to HBase Mean Time to Recover (MTTR)

The following post is from Nicolas Liochon and Devaraj Das with thanks to all members of the HBase team.

HBase is an always-available service and remains available in the face of machine failures and rack failures. Machines in the cluster run RegionServer daemons. When a RegionServer crashes or the machine goes offline, the regions it was hosting go offline as well. The focus of the MTTR work in HBase is to detect abnormalities and to restore access to (failed) offlined regions as early as possible.

In talking with customers and users, it turned out that MTTR for HBase regions is one of their significant concerns. A lot of improvements were implemented recently. In this blog post and a couple after this one, we will go over the work that the HBase team at Hortonworks, and the community at large, has done in the area of MTTR.…

Read More

Week in Review: Hadoop Summit, Value of Big Data, and more Ambari

And we are just about done with this week. But not quite – dig into the conversation from the past few days.

Hadoop Summit. We published the vast majority of sessions (70 so far) for the Hadoop Summit in San Jose, 26-27 June. The sessions stretch across 7 tracks from Architecture to Economics and we hope you can join us for THE Hadoop community event of the year. You can register here, and the schedule is here.

Big Data Defined Part Deux: Value Definition. Jim picked up from the last Big Data definition and talked about it here. Regardless of your views on volume, variety and velocity there is one V to rule them all: Value.

Enterprise Data Analytics with Hortonworks and Datameer. I’ve been having a ton of fun with Datameer visualizations this week. If you want to learn a little more about enterprise analytics and how to better unlock the insights in your own data (with cool graphics) then take a look here.…

Read More

Enterprise Big Data Analytics with Hortonworks and Datameer

Today, 94% of Hadoop users perform analytics on large volumes of data that were not possible before. How do they do it? Cool applications, that’s how.

You have seen various stats that indicate enterprises need better ways of making use of data but they bear repeating: The volume of business data worldwide, across all companies, doubles every 1.2 years, according to a study published by eBay in May, 2012. And market research firm IDC released a forecast showing the big data market may grow from $3.2 billion in 2010 to $16.9 billion in 2015. Clearly, enterprises need better ways of making use of all of this data, which contains innumerable insights for improving business processes and profitability.

Hortonworks partner Datameer has a horizontal application for big data discovery that includes self-service data integration, analytics and visualization on top of Hadoop, including pre-built analytic applications.…

Read More

Hortonworks at Yahoo! Hack Europe

Some news from the UK as Yahoo! Hack Europe welcomed Hortonworks this past weekend in central London. This two-day event sponsored by Yahoo! was focused on celebrating collaboration, learning and innovation using the world’s leading technologies. Chris Harris, our local EMEA Solution Engineer, was on hand to add to the discussions. Partnering with Microsoft, we were able to showcase our HDP on the Azure platform. This was a fantastic opportunity for the 350 delegates to be exposed to both Azure and enterprise-ready Hadoop provided as the HDInsight Service.

After an appearance by Yahoo’s larger-than-life Hack Robot (seriously, check it out…), who made sure that everyone was entertained, the hack started with a vengeance. Hyped up on the sweetie cart full of everyone’s favorites, most delegates were now officially up for the challenge. Inspired by the passion, Chris led a thought-provoking workshop, where a number of the hackers were able to try out real-life scenarios on how Hadoop, as part of the HDInsight service, can and will impact business decisions.  …

Read More

Hadoop Summit Schedule is now available!

Now is the time to get registered for the Hadoop Summit in San Jose, 26-27 June, 2013 – we’d love to see you there. A few weeks ago, we revealed the selectees from the community choice voting, and we’re now delighted to announce the full schedule of sessions is available here.

Session Schedule

Our thanks to the track selection committees and track chairs for the work on building a great schedule for an awesome event. There are 70 sessions on the schedule so far with more to come later.

This year, the tracks are as follows:

  • Enterprise Data Architecture. This track focuses on Hadoop as a data platform and how it fits within broader enterprise data architectures.
  • Applications and Data Science. Sessions in this track focus on the practice of data science using Hadoop.
  • Deployment and Operations. This track focuses on the deployment, operation and administration of Hadoop clusters at scale.

Read More

Big Data Defined – Part Deux: Value Definition

A few weeks back we posted a definition of “big data”.  There was definitely some internal conversation about the term and if this definition had captured what the term means.  Sum finding: it is a loaded term.  It means a lot of different things to a lot of different people.

When I first joined Hortonworks, I bought in to the three V’s (volume, velocity and variety) definition of big data. It works for the most part, but is more a descriptor of the data. It explains the characteristics of the data. The definition is cold and lacks soul. After all, “big data” represents the promise of “big” business value.

A “Value” Definition of Big Data

Last year, Shaun Connolly, Hortonworks VP of Corporate Strategy, came up with this definition…
Big Data = Transactions + Interactions + Observations.

I gravitate to this because it outlines WHAT the data is, not just the characteristics. …

Read More
