Hortonworks on Apache Hadoop


Mobile Telco Dials In and Harnesses Big Data with Hadoop

Smartphones have transformed our daily lives. A key indicator of this trend is our increased spend on data plans versus voice. We are a new generation of people who are in a constant state of activity, communication, and community building wherever we go, including the couch in front of the television, where we can multi-screen and multi-task!

What does this mean for the mobile telecom industry? Consider one of the top five mobile phone service providers in the world, which develops and manages advanced data services for European countries, including mobile internet access for various devices, mobile email, instant messaging, news, weather updates and traffic reports. For this company, as mobile data services grow in revenue, so does the need to monitor that contribution easily and accurately. While that sounds obvious, mobile telecom has grown so rapidly that the company’s existing systems could not keep up.…

Boosting Big Data and the Hadoop Ecosystem with Splunk Alliance

Today we announced a strategic alliance with operational intelligence leader Splunk. We are excited to be strengthening our relationship with Splunk and expanding the Apache Hadoop ecosystem, and we expect this to further drive open source innovation. Additionally, this alliance is further proof of Hadoop’s maturation as a key component of the next-generation enterprise architecture.

One of the key benefits of the partnership is that it enables organizations to take advantage of the massive scale-out storage and processing capabilities of Apache Hadoop with Splunk Enterprise via Splunk Hadoop Connect, which easily and reliably moves data between Splunk Enterprise and Hadoop.

This capability means the enterprise can easily use Splunk Enterprise to collect machine data from across the enterprise and deliver it to Hadoop for batch analytics. Likewise, the output of Hadoop jobs can be imported into Splunk Enterprise for rapid analysis and visualization.…

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA!

Today we are very excited to announce that Hortonworks Data Platform for Windows (HDP for Windows) is now generally available and ready to support the most demanding production workloads.

We have been blown away by the number and size of organizations that have downloaded the beta bits of this 100% open source, native-to-Windows distribution of Hadoop and engaged Hortonworks and Microsoft around evolving their data architecture to respond to the challenges of enterprise big data.

With this key milestone, HDP for Windows offers the millions of customers running their business on Microsoft technologies an ecosystem-friendly Hadoop-based solution that is built for the enterprise and purpose-built for Windows. This release cements Apache Hadoop’s role as a key component of the next-generation enterprise data architecture, across the broadest set of datacenter configurations, as HDP becomes the first production-ready Apache Hadoop distribution to run on both Windows and Linux.…

Hive 0.11, Stinger and SQL-Compatibility

The release of Hive 0.11 is exciting and represents a big step forward in the delivery of Project Stinger and SQL-in-Hadoop. There is still some work to be done, however. We look forward to Hadoop 2 with YARN and the Apache Tez project delivering huge increases in Hive performance, but performance is not the only goal of Stinger.

SQL-In-Hadoop simply can’t be SQL without SQL compatibility

Today, HiveQL provides a fairly good set of SQL data types and semantics, and while this (or a subset thereof) may be good enough for some of the “on” Hadoop solutions, we feel there needs to be more, especially if Hadoop and Hive are to meet the stringent requirements of enterprise-class business analytics. To this end, we have set a goal of compatibility with most of SQL-92 and beyond, with some SQL-2003 extensions.…

Week in Review: SQL IN Hadoop and Hive, Beyond Batch with YARN, NFS access to HDFS and HBase MTTR

Or as it’s more commonly being called: Week-ish in Review. Let’s recap on the latest – there’s some juicy technology goodness here.

Delivering on Stinger: Phase 1. Just this week, Hive 0.11 has been released. Owen (@owen_omalley) brought us the news that 55 – yes, fifty-five – developers from across the community have addressed 386 JIRA tickets and have delivered significant improvements to Hive along with an awesome demonstration of the power of community open-source development. Thanks to everyone! This release of Hive means that we’ve delivered on the first phase of the Stinger Initiative too – aiming to deliver 100x performance increases to Hive.

Taking Hadoop Beyond Batch with YARN. All of which means we step closer to delivering SQL-in-Hadoop and respond to the needs of enterprises for multi-application operating systems for their big data. Arun (@arunmurthy) gives a terrific update on Hadoop 2.0 and YARN and how that development will move Hadoop Beyond Batch.…

Apache Hive 0.11: Stinger Phase 1 Delivered

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL queries to Hadoop. Simply put, our choice was to double down on Hive and extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community, we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community-led effort, we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, including 28 new features and 276 bug fixes.…

Advanced Analytics: Making Decisions at the Speed of Business

Retailers today are faced with addressing the new behaviors of an evolving customer base by leveraging the changing landscape and its new dynamics. Online, retail consumers are sharing, friend validating, researching, learning and developing a point of view; offline, they are touching, brand comparing and brand associating. Retailers now more than ever before have to think in terms of “integrated commerce” and leverage Big Data for big results in the marketplace.

Forward-thinking organizations are discovering the possibilities of unconstrained analytics and quickly realizing the potential of accelerating the spread of analytics across the company, ultimately driving the speed of acquiring new customers, responding to consumer and market change, and increasing their “share of wallet”. Retail analysts want to spend more time in the analytic discovery process, and less time acquiring and preparing data, so they can uncover new market opportunities and reduce risks.…

Moving Hadoop Beyond Batch with Apache YARN

Apache Hadoop 2.0 continues to make its way through the open source community process at the Apache Software Foundation and is getting closer to being declared “ready” from a community development perspective.  Once ready, our team at Hortonworks will apply our usual enterprise rigor in providing a tested and integrated distribution that includes Hadoop 2.0 along with the other enterprise-focused services our customers and partners require.

In my roles both at Hortonworks and in the open-source Apache Hadoop community, I’m asked a lot of questions regarding the key aspects and motivations behind Hadoop 2.0. Here is some information to sate the curious mind.

First-generation success inspires second-generation focus

In the early days of Hadoop at Yahoo!, we had a very particular objective: store and process very large amounts of data to support our internet search efforts.  And so the first generation of Hadoop was a purpose-built system for web-scale data processing that was embraced by Yahoo!…

Hadoop SDK and Tutorials for Microsoft .NET Developers

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some PowerShell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is really interesting, as it builds on the established technology .NET developers use to access most data sources and brings it to Hive, the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on GitHub, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel.

Meetups at Hadoop Summit

UPDATED: To include the Oozie meetup.

The main Hadoop Summit agenda is looking awesome – go take a look here, and register here - but there’s also a series of meetups planned for the day before the general sessions. If you want to get up close and personal on topics of interest to you with other like-minded folk then take a look at these options. We’ll be providing refreshments along the way.

Meetups

You should go ahead and register at the links below. Note that space will be limited, and remember that all meetups are in San Jose!

Morning Sessions: 25th June, 10:00am – 12:30pm at San Jose Convention Center

Afternoon Sessions: 25th June, 1:30pm – 4:00pm at San Jose Convention Center

Camps

Additionally, there are two camps in the evening:

All this Hadoop-y goodness should get you nicely in the mood for the next two days of general and track sessions.…

Simplifying data management: NFS access to HDFS

We are excited that another critical Enterprise Hadoop integration requirement – NFS Gateway access to HDFS – is making progress through the main Apache Hadoop trunk.  This effort is architected and designed by Brandon Li and Suresh Srinivas, and is being delivered by the community. You can track progress in Apache JIRA HDFS-4750.

With NFS access to HDFS, you can mount the HDFS cluster as a volume on client machines and use the native command line, scripts or a file explorer UI to view HDFS files and load data into HDFS. NFS thus enables file-based applications to perform file read and write operations directly against Hadoop. This greatly simplifies data management in Hadoop and expands the integration of Hadoop into existing toolsets.
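To make the idea concrete, here is a minimal sketch of what "file-based" access could look like from a script once the cluster is mounted. It assumes the HDFS volume is already mounted on the client; the mount point (for example `/mnt/hdfs`), the `incoming` directory and both function names are hypothetical and chosen only for illustration. Nothing here is Hadoop-specific, which is the point: ordinary file system calls do the work.

```python
import shutil
from pathlib import Path


def load_into_hdfs(local_file, mount_point, target_dir="incoming"):
    """Copy a local file into a directory under the mounted volume.

    mount_point is assumed to be where the HDFS cluster is mounted over NFS
    (a hypothetical path such as /mnt/hdfs); target_dir is illustrative.
    """
    destination_dir = Path(mount_point) / target_dir
    destination_dir.mkdir(parents=True, exist_ok=True)
    destination = destination_dir / Path(local_file).name
    # A plain sequential copy: no Hadoop client library is involved.
    shutil.copyfile(local_file, destination)
    return destination


def list_hdfs_files(mount_point, target_dir="incoming"):
    """Return the sorted names of the files in a directory under the mount."""
    directory = Path(mount_point) / target_dir
    return sorted(p.name for p in directory.iterdir() if p.is_file())
```

A script would call `load_into_hdfs("events.log", "/mnt/hdfs")` and then `list_hdfs_files("/mnt/hdfs")`, again assuming that mount point exists on the client machine.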

NFS and HDFS

Network File System (NFS) is a distributed file system protocol that allows access to files on a remote computer in a manner similar to how a local file system is accessed.  …

Introduction to HBase Mean Time to Recover (MTTR)

The following post is from Nicolas Liochon and Devaraj Das with thanks to all members of the HBase team.

HBase is an always-available service and remains available in the face of machine failures and rack failures. Machines in the cluster run RegionServer daemons. When a RegionServer crashes or the machine goes offline, the regions it was hosting go offline as well. The focus of the MTTR work in HBase is to detect abnormalities and restore access to (failed) offlined regions as early as possible.

In talking with customers and users, it turned out that MTTR for HBase regions is one of their significant concerns. A lot of improvements have been implemented recently. In this blog post and a couple after this one, we will go over the work that the HBase team at Hortonworks and the community at large have done in the area of MTTR.…

Week in Review: Hadoop Summit, Value of Big Data, and more Ambari

And we are just about done with this week. But not quite – dig into the conversation from the past few days.

Hadoop Summit. We published the vast majority of sessions (70 so far) for the Hadoop Summit in San Jose, 26-27 June. The sessions stretch across 7 tracks from Architecture to Economics and we hope you can join us for THE Hadoop community event of the year. You can register here, and the schedule is here.

Big Data Defined Part Deux: Value Definition. Jim picked up from the last Big Data definition and talked about it here. Regardless of your views on volume, variety and velocity there is one V to rule them all: Value.

Enterprise Data Analytics with Hortonworks and Datameer. I’ve been having a ton of fun with Datameer visualizations this week. If you want to learn a little more about enterprise analytics and how to better unlock the insights in your own data (with cool graphics) then take a look here.…

Enterprise Big Data Analytics with Hortonworks and Datameer

Today, 94% of Hadoop users perform analytics on large volumes of data, something that was not possible before. How do they do it? Cool applications, that’s how.

You have seen various stats that indicate enterprises need better ways of making use of data, but they bear repeating: The volume of business data worldwide, across all companies, doubles every 1.2 years, according to a study published by eBay in May 2012. And market research firm IDC released a forecast showing the big data market may grow from $3.2 billion in 2010 to $16.9 billion in 2015. Clearly, enterprises need better ways of making use of all of this data, which contains innumerable insights for improving business processes and profitability.

Hortonworks partner Datameer has a horizontal application for big data discovery that includes self-service data integration, analytics and visualization on top of Hadoop, including pre-built analytic applications.…
