The Hortonworks Blog

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.…

As the release manager for the Apache Hadoop 2.0 release, it gives me great pleasure to share that the Apache Hadoop community has just released Apache Hadoop 2.0.0 (alpha)! While only an alpha release (read: not ready to run in production), it is still an important step forward as it represents the very first release that delivers new and important capabilities, including:

Series Introduction

This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.…

The latest video in the Hortonworks Executive Video Series features Sanjay Radia, Hortonworks co-founder and Apache Hadoop PMC member. Sanjay is well known in the HDFS circles, having contributed to Hadoop for the past 4+ years. In this video, Sanjay gives a good overview of HDFS, the primary storage system for Hadoop, and provides some insight into both the 0.23 release as well as what can be expected from HDFS over the rest of 2012.…

In case you didn’t see the news, I wanted to share the announcement that HCatalog 0.4.0 is now available.

For those of you that are new to the project, HCatalog provides a metadata and table management system that simplifies data sharing between Apache Hadoop and other enterprise data systems. You can learn more about the project on the Apache project site.

The highlights of the 0.4.0 release include:

- Full support for reading from and writing to Hive.…

Since joining Hortonworks at the beginning of the year, a question I’ve heard over and over again is “What is Apache Hadoop and what is it used for?”

There’s clearly a lot of hype [and confusion] in this emerging Big Data market, and it feels as if each new technology, as well as existing technologies, are pushing the meme of “all your data are belong to us”. It is great to see the wave of innovation occurring across the landscape of SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just a few), but enterprises and the market in general can use a healthy dose of clarity on just how to use and interconnect these various technologies in ways that benefit the business.…

I attended the Goldman Sachs Cloud Conference and participated on a panel focused on “Data: The New Competitive Advantage”. The panel covered a wide range of questions, but kicked off covering two basic questions:

“What is Big Data?” and “What are the drivers behind the Big Data market?”

While most definitions of Big Data focus on the new forms of unstructured data flowing through businesses with new levels of “volume, velocity, variety, and complexity”, I tend to answer the question using a simple equation:

Big Data = Transactions + Interactions + Observations

The following graphic illustrates what I mean:

We just added a video to the Hortonworks Executive Video library that features Alan Gates, Hortonworks co-founder and Apache PMC member. In this video, Alan discusses HCatalog, one of the most compelling projects in the Apache Hadoop ecosystem.

HCatalog is a metadata and table management system that provides a consistent data model and schema for users of tools such as MapReduce, Hive and Pig. When you consider that there are often users accessing Hadoop clusters using different tools that independently don’t agree on schema, data types, how and where data is stored, etc., then you can understand the value of having a tool such as HCatalog.…

In case you didn’t see the news today, Hadoop Summit announced record ecosystem support for this year’s conference. The original and world’s largest Apache Hadoop conference, now in its fifth year, is being sponsored this year by more than 40 traditional and open source software and services companies.

Hortonworks and our co-host Yahoo! would like to thank the following companies for helping to make Hadoop Summit possible:

The third installment of the Hortonworks executive video series features Arun C. Murthy, co-founder of Hortonworks and VP of Apache Hadoop for the Apache Software Foundation. In this video, Arun shares his view of the power of Apache Hadoop and provides some insight into the future direction of MapReduce, including the ability to support alternate programming paradigms.

As part of Big Data Week, Dan Harvey of the London Hadoop User Group organised an afternoon session for the usergroup, which we were glad to sponsor, along with Canonical and Facegroup. I had the pleasure of presenting my view of the current and future status of Apache Hadoop to an audience that ranged from those curious about Hadoop to heavy users.

Every talk of the day was excellent, from the use cases by Datasift, Mendeley and MusicMetric, to the talk by Francine Bennett of MastodonC on the CO2 footprint of different cloud computing infrastructures, including a live dashboard on the current CO2/hour of many cloud infrastructure sites.…

Another important milestone for Apache Pig was reached this week with the release of Pig 0.10. The purpose of this blog is to summarize the new features in Pig 0.10.

Boolean Data Type

Pig 0.10 introduces boolean data type as a first-class Pig data type. Users can use the keyword “boolean” anywhere where a data type is expected, such as load-as clause, type cast clause, etc.

Here are some sample use cases:

a = load ‘input’ as (a0:boolean, a1:tuple(a10:boolean, a11:int), a2);

b = foreach a generate a0, a1, (boolean)a2;

c = group b by a2; — group by a boolean field

When loading boolean data using PigStorage, Pig expects the text “true” (ignore case) for a true value, and “false” (ignore case) for a false value; while other values map to null.…

This blog covers our on-going work on Snapshots in Apache Hadoop HDFS. In this blog, I will cover the motivations for the work, a high level design and some of the design choices we made. Having seen snapshots in use with various filesystems, I believe that adding snapshots to Apache Hadoop will be hugely valuable to the Hadoop community. With luck this work will be available to Hadoop users in late 2012 or 2013.…

We just released the second video in the Hortonworks Executive Series. This one features Matt Foley, Test and Release Engineering Manager for Hortonworks.

In this video, Matt provides an overview of Hortonworks Data Platform (HDP), including a summary of the Apache Hadoop components included in the distribution and the testing processes involved in the release process. Matt also provides an overview of Apache Ambari, an open source project that is adding monitoring and management capabilities to Apache Hadoop.…

We are pleased to support today’s announcement from Citrix that they have contributed CloudStack to the Apache community. For those new to CloudStack, it is an open source cloud computing software that helps organizations build and manage cloud infrastructures. It is similar to Amazon Web Services EC2 environment except that it enables organizations to build public, private or hybrid cloud environments using their own pooled computing resources.

Citrix announced today that they were reaffirming their commitment to open source by working with the Apache Software Foundation to make CloudStack 3 an Apache project, released under Apache Software License 2.0.…

Go to page:« First...102030...3435363738...Last »