The Hortonworks Blog

Posts categorized by: Apache Hadoop

This guest post is from Simon Elliston Ball, Head of Big Data at Red Gate and all-round top bloke.

Hadoop is a great place to keep a lot of data. The data lake, the data hub and the data platform: it’s all about the data. So how do you manage that data? How do you get data in? How do you get results out? How do you get at the logs buried somewhere deep in HDFS?…

Today, we are excited to announce the agenda for Hadoop Summit Europe 2014. We welcome you to check it out now and hopefully start planning your trip to Amsterdam!

The call for abstracts for Hadoop Summit Europe was open for just over two months and we received an unbelievable 354 submissions. Wow! Further, as we read through them, the quality was amazing. We quickly surmised that the show was going to be great, but the selection process was going to be rough.…

We are excited to announce that the Hortonworks Data Platform 2.0 for Windows is publicly available for download. HDP 2 for Windows is the only Apache Hadoop 2.0-based platform that is certified for production usage on Windows Server 2008 R2 and Windows Server 2012 R2.

With this release, the latest in community innovation on Apache Hadoop is now available across all major Operating Systems. HDP 2.0 provides Hadoop coverage for more than 99% of the enterprises in the world, offering the most flexible deployment options from On-Premise to a variety of cloud solutions.…

This guest post is from Eric Hanson, Principal Software Development Engineer on Microsoft HDInsight and Apache Hive committer.

Hive has a substantial community of developers behind it, including a few from the Microsoft HDInsight team. We’ve been contributing to the Stinger initiative since it was started early in 2013, and have been contributing to Hadoop since October of 2011. It’s a good time to step back and see the progress that’s been made on Apache Hive since fall of 2012, and ponder what’s ahead.…

I recently sat down with Owen O’Malley and Carter Shanklin to discuss the dramatic improvements delivered by the Stinger Initiative to version 0.12 of Apache Hive, which is well on its way to being 100x faster than pre-Stinger versions of Hive. That means interactive queries on petabytes of data.

Owen is one of the original architects of Apache Hadoop and Carter is the Hortonworks product manager focused on Hive. Together, they explain the speed, scale and SQL semantics delivered in Apache Hive v0.12, which is included in Hortonworks Data Platform v2.0.…

One aspect of community development of Apache Hadoop is the way that everyone working on Hadoop (full time, part time, vendors, users and even some researchers) collaborates in the open. This development is based on publicly accessible project tools: Apache Subversion for revision control, Apache Maven for the builds, and Jenkins for automating those builds and tests. Central to a lot of the work is the Apache JIRA server, an instance of Atlassian’s issue management tool.…

This is the third in our series on modern data architectures across industry verticals. Others in the series are:

Many of the world’s largest telecommunications companies use Hortonworks Data Platform (HDP) to manage their data. Through partnership with these companies, we have learned how our customers use HDP to improve customer satisfaction, make better infrastructure investments and develop new products.…

Whether you were busy finishing up last minute Christmas shopping or just taking time off for the holidays, you might have missed that Hortonworks released the Stinger Phase 3 Technical Preview back in December. The Stinger Initiative is Hortonworks’ open roadmap to making Hive 100x faster while adding standard SQL. Here we’ll discuss 3 great reasons to give Stinger Phase 3 Preview a try to start off the new year.

Reason 1: It’s The Fastest Hive Yet

Whether you want to process more data or lower your time-to-insight, the benefits of a faster Hive speak for themselves.…

Hadoop has traditionally been used for batch processing data at large scale. Batch processing applications care more about raw sequential throughput than low latency, and hence the existing HDFS model, where all attached storage is assumed to be spinning disks, has worked well.

There is increasing interest in using Hadoop for interactive query processing, e.g. via Hive. Another class of applications makes use of random IO patterns, e.g. HBase. Both classes of application benefit from lower-latency storage media.…

The year is coming to its end. Maybe you’re reading this as you race to check a few more 2013 items off of your to-do list (at work or at home). Or maybe you’ve already got a hot toddy in your hand and your feet kicked up, with slippers warming your toes.

I have been fortunate enough to spend 2013 speaking with our customers, and I learned how so many important organizations are using Apache Hadoop and Hortonworks Data Platform (HDP) to solve real problems.…

The network and security teams at your company do not allow internet access from the machines where you plan to install Hadoop. What do you do? How do you install your Hadoop cluster without having access to the public software packages? Apache Ambari supports local repositories and in this post we’ll look at the configuration needed for that support.

When installing Hadoop with Ambari, there are three repositories at play: one for Ambari (which primarily hosts the Ambari Server and Ambari Agent packages) and two for the Hortonworks Data Platform (which host the HDP Hadoop Stack packages and other related utilities).…

Update! – The final phase of improvements from the Stinger Initiative was released as part of Hive 0.13 on Apr 21, 2014 – Read the announcement

While just a preview by moniker, the release marks a significant milestone in the transformation of Hadoop from a batch-oriented system to a data platform capable of interactive data processing at scale and delivering on the aims of the Stinger Initiative.

Apache Tez and SQL: Interactive Query-IN-Hadoop

Tez is a low-level runtime engine not aimed directly at data analysts or data scientists.…

Encryption is applied to electronic information in order to ensure its privacy and confidentiality. Typically, we think of protecting data at rest or in motion. Wire Encryption protects the latter as data moves through Hadoop over RPC, HTTP, Data Transfer Protocol (DTP), and JDBC.

Let’s cover the configuration required to encrypt each of these protocols. For step-by-step instructions, please see the HDP 2.0 documentation.

RPC Encryption

The most common way for a client to interact with a Hadoop cluster is through RPC.  …

Last week was a busy week for shipping code, so here’s a quick recap on the new stuff to keep you busy over the holiday season.

Apache Hadoop has always been very fussy about Java versions. It’s a big application running as tens of thousands of processes across thousands of machines in a single datacenter. This makes it almost inevitable that any race conditions and deadlock bugs in the code will eventually surface, be it in the JVM and its libraries, in Hadoop itself, or in one of the libraries on which it depends.

Hence the phrase “there are no corner cases in a datacenter”.…
