The Hortonworks Blog

Posts categorized by : Apache Hadoop

When the term scientific computing comes up in a conversation, it’s usually just the occasional science geek who shows signs of recognition. But although most people have little or no knowledge of the field’s existence, it has been around since the second half of the twentieth century and has played an increasingly important role in many technological and scientific developments. Internet search engines, DNA analysis, weather forecasting, seismic analysis, renewable energy, and aircraft modeling are just a few of the areas where scientific computing is now indispensable.…

We are pleased to announce the release of Apache Hive version 0.10.0. More than 350 JIRA issues have been fixed with this release. A few of the most important fixes include:

Cube and Rollup: Hive now has support for creating cubes with rollups. Thanks to Namit!

List Bucketing: This is an optimization that lets you better handle skew in your tables. Thanks to Gang!

Better Windows Support: Several Hive 0.10.0 fixes support running Hive natively on Windows.…

We are pleased to announce that Apache Pig 0.10.1 was recently released. This is primarily a maintenance release focused on stability and bug fixes. In fact, Pig 0.10.1 includes 42 new JIRA fixes since the Pig 0.10.0 release.

Some of the notable changes include:

  • Source code-only distribution

In the download section for Pig 0.10.1, you will now find a source-only tarball (pig-0.10.1-src.tar.gz) alongside the traditional full tarball, rpm and deb distributions.…

Introduction

This is part three of a Big Data Security blog series. You can read the previous two posts here: Part One / Part Two.

When Russell Jurney and I first teamed up to write these posts we wanted to do something that no one had done before to demonstrate the power of Big Data, the simplicity of Pig and the kind of Big Data Security Analytics we perform at Packetloop.…

In a recent blog post, Hortonworks’ Steve Loughran discussed Apache Hadoop’s preference for JBOD-configured storage vs. the allure of RAID-0. As more enterprises move beyond the science experiment stage and begin deploying Hadoop into their production environments, they are learning that Hadoop is quite different from other services in their data centers, such as web, mail, and database servers. They are also learning that achieving optimal performance requires particular attention to how the underlying hardware is configured.…

Hadoop Summit North America 2013, the premier Apache Hadoop community event, will take place at the San Jose Convention Center, June 26-27, 2013. Hosted by Hortonworks, a leading contributor to Apache Hadoop, and Yahoo!, Hadoop Summit brings together the community of developers, architects, administrators, data analysts, data scientists and vendors interested in advancing, extending and implementing Apache Hadoop as the next-generation enterprise data platform.

This 6th Annual Hadoop Summit North America will feature seven tracks and more than 80 sessions focused on building, managing and operating Apache Hadoop from some of the most influential speakers in the industry.…

At Thanksgiving we took a moment to reflect on the past and give thanks for all that has happened to Hortonworks over the past year. With the New Year approaching, we now take time to look forward and provide our predictions for the Hadoop community in 2013. To compile this list, we queried and collected big data from our team of Hadoop committers and members of the community.

We also asked a few luminaries and surfaced many expert opinions. While we had our hearts set on five predictions, we ended up with SEVEN. …

This blog is a follow-up to our previous blog, “Snapshots for HDFS”.

In June we posted an early prototype of snapshots that allowed us to experiment with a few ideas in HDFS-2802. Since then we have added more details to the design document and made significant progress on a brand new implementation (over 40 subtasks in HDFS-2802).

Some of the highlights of this new design include:

  • Read-Only Copy-on-Write (COW) snapshots (but can be extended to RW later)
  • Snapshots for entire namespace or sub directories
  • Snapshots are managed by Admin, but users are allowed to take snapshots
  • Snapshots are efficient
  • Creation is instantaneous with O(1) cost.
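
To make the copy-on-write idea in the list above concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the HDFS-2802 implementation: the class name `SnapshottableDir`, its methods, and the per-snapshot "diff" dictionaries are all hypothetical, chosen only to show why taking a snapshot can cost O(1) when the old data is saved lazily on the first later write. Deletes and sub-directories are not modeled.

```python
# Toy model of read-only copy-on-write snapshots (hypothetical names,
# not HDFS code). A snapshot starts as an empty diff; old values are
# copied into it only when a file is first modified afterwards.

_MISSING = object()  # marks "file did not exist when the snapshot was taken"


class SnapshottableDir:
    def __init__(self):
        self.current = {}   # file name -> current contents
        self.diffs = []     # one dict per snapshot: name -> value at snapshot time

    def create_snapshot(self):
        """O(1): append an empty diff, copy nothing."""
        self.diffs.append({})
        return len(self.diffs) - 1

    def write(self, name, data):
        """Save the old value into the latest snapshot's diff before overwriting."""
        if self.diffs:
            latest = self.diffs[-1]
            if name not in latest:
                latest[name] = self.current.get(name, _MISSING)
        self.current[name] = data

    def read_snapshot(self, snap_id, name):
        """Return the contents as of snapshot snap_id, or None if absent then."""
        # The first diff at or after snap_id that mentions the file holds
        # the value the file had when snap_id was taken.
        for diff in self.diffs[snap_id:]:
            if name in diff:
                old = diff[name]
                return None if old is _MISSING else old
        return self.current.get(name)


d = SnapshottableDir()
d.write("a", "v1")
s0 = d.create_snapshot()
d.write("a", "v2")
d.write("b", "x")
s1 = d.create_snapshot()
d.write("a", "v3")
```

In this sketch, reading `a` through `s0` still yields `"v1"` and through `s1` yields `"v2"`, while the live directory holds `"v3"`. The cost of preserving old data is paid at write time, which is what keeps snapshot creation itself constant-time.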

Over the course of 2012, through Hortonworks’ leadership within the Apache Ambari community, we have seen the rapid creation of the enterprise-class management platform that Apache Hadoop needs to be an enterprise-viable data platform. Hortonworks engineers and the broader Ambari community have been working hard on their latest release, and we’d like to highlight the exciting progress that’s been made on Ambari, a 100% open and free solution that delivers the features required from an enterprise-class management platform for Apache Hadoop.…

The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.

In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures make global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple.…
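
As a rough illustration of the two data shapes described above, the following Python sketch models key/value pairs and wide rows with plain dictionaries. It does not use the Hadoop or HBase APIs; the function names (`reduce_by_key`, `put`, `get`) and the column names such as `info:name` are assumptions made for the example.

```python
from collections import defaultdict

# MapReduce-style data: a flat stream of (key, value) pairs.
pairs = [("apple", 1), ("pear", 1), ("apple", 1)]


def reduce_by_key(pairs):
    """Group pairs by key and sum the values, as a reduce step might."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# HBase-style data: each row key maps to its own set of columns,
# and different rows may carry different columns ("wide rows").
table = {}


def put(table, row_key, column, value):
    """Write one cell into the row identified by row_key."""
    table.setdefault(row_key, {})[column] = value


def get(table, row_key):
    """Read every column of a single row; empty dict if the row is absent."""
    return table.get(row_key, {})


put(table, "user1", "info:name", "Ada")
put(table, "user1", "info:city", "London")
put(table, "user2", "info:name", "Linus")

counts = reduce_by_key(pairs)
row = get(table, "user1")
```

The first half processes the whole dataset at once, the second half touches a single row by key, which mirrors the split between global processing and row-specific access.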

The Hadoop Summit Europe official call for papers ends this Friday, November 30th – so be sure to get your session submissions in this week!

Hadoop Summit Europe takes place March 20-21 at the Beurs van Berlage in Amsterdam, Netherlands. You still have time to submit an abstract!

The four content tracks are:

Applied Hadoop

Sessions in this track focus on applications, tools, algorithms and data science as well as areas of advanced research and emerging applications that use and extend the Hadoop platform.…

Track Chairs have been named for Hadoop Summit Europe. Track Chairs will, in turn, select their track committees who, as a team, will decide which sessions are to be presented at Hadoop Summit Europe. They are as follows:

Operating Hadoop – Evert Lammerts, SARA

I joined SARA as a technical consultant in October 2008. In 2009 I started experimenting with non-traditional distributed processing and storage platforms, mainly Hadoop. I’m currently the lead for Hadoop and related big data services.…

Introduction

Packetpig is the tool behind Packetloop. In Part One of the Introduction to Packetpig I discussed the background and motivation behind the Packetpig project and problems Big Data Security Analytics can solve. In this post I want to focus on the code and teach you how to use our building blocks to start writing your own jobs.

The ‘building blocks’ are the Packetpig custom loaders that allow you to access specific information in packet captures.…

Apache ZooKeeper™ release 3.4.5 is now available. This is a bug fix release including 3 bug fixes. Following is a summary of the critical issues fixed in the release.

ZOOKEEPER-1550: ZooKeeperSaslClient does not finish anonymous login on OpenJDK

ZOOKEEPER-1376: zkServer.sh does not correctly check for $SERVER_JVMFLAGS

ZOOKEEPER-1560: Zookeeper client hangs on creation of large nodes.

Stability of 3.4.5

Note that Apache ZooKeeper™ 3.4.5 is marked as the current stable release.…

A recurrent question on the various Hadoop mailing lists is “why does Hadoop prefer a set of separate disks to the same set managed as a RAID-0 disk array?”

It’s about time and snowflakes.

JBOD and the Allure of RAID-0

In Hadoop clusters, we recommend treating each disk separately, in a configuration that is known, somewhat disparagingly, as “JBOD”: Just a Box of Disks.

In comparison, RAID-0, which is a bit of a misnomer, there being no redundancy, stripes data across all the disks in the array.…
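
For readers unfamiliar with striping, here is a small Python sketch of how a striped array might map a logical byte offset to a physical disk. It is an illustration only: the function name `locate`, the 64-byte stripe size, and the four-disk array are assumptions for the example, not values from any particular RAID controller or from Hadoop.

```python
def locate(offset, stripe_size, num_disks):
    """Map a logical byte offset to (disk index, byte offset on that disk)
    under simple round-robin striping."""
    stripe_index = offset // stripe_size       # which stripe unit holds the byte
    disk = stripe_index % num_disks            # stripe units rotate across disks
    row = stripe_index // num_disks            # how many full rotations came before
    disk_offset = row * stripe_size + offset % stripe_size
    return disk, disk_offset


# Hypothetical array: 4 disks, 64-byte stripe units.
first = locate(0, 64, 4)
second = locate(64, 64, 4)
wrapped = locate(300, 64, 4)
```

Because consecutive stripe units land on different disks, a single large read or write involves every disk in the array, whereas in a JBOD layout each file block lives entirely on one disk.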
