We believe the fastest path to innovation is the open community, and we work hard to deliver that innovation from the community to the enterprise. However, this is a two-way street: we are also hearing distinct requirements voiced by enterprises as they integrate Hadoop into their data architectures.
Over the past year, a set of enterprise requirements has emerged for dataset management. Organizations need to process and move datasets (whether HDFS files or Hive tables) in, around, and between clusters. This task can start innocently enough but usually (and quickly) becomes very complex. Dataset locations, cluster interfaces, and replication and retention policies can all change over time, and hand-coding this logic into your applications, along with general retry and late-data-arrival handling, can become a slippery slope of complexity. Getting it right the first time is a challenge; maintaining the end result can be downright impossible.
To meet these requirements, we will, as always, work within the community to deliver them, and we have introduced a Hortonworks Labs initiative to make dataset management easier. This initiative outlines a public roadmap of features that will help Hadoop users avoid the complexity of processing and managing datasets. Much of the work is captured in Apache Falcon, which provides a declarative framework for describing data pipelines and simplifies the development of processing solutions. With Falcon, users can describe dataset processing pipelines in a way that maximizes reuse and consistency while insulating them from implementation details across datasets and clusters.
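To give a flavor of the declarative approach, here is a sketch of a Falcon feed entity that pairs a retention policy with replication from a source cluster to a target cluster. The entity names, paths, dates, and policy values below are illustrative placeholders, not taken from the Technical Preview:

    <feed name="rawInputFeed" description="Hourly raw input dataset" xmlns="uri:falcon:feed:0.1">
      <!-- How often a new instance of this dataset materializes -->
      <frequency>hours(1)</frequency>
      <timezone>UTC</timezone>
      <!-- Declare tolerance for late-arriving data instead of hand-coding it -->
      <late-arrival cut-off="hours(4)"/>
      <clusters>
        <!-- Listing a source and a target cluster asks Falcon to replicate the feed -->
        <cluster name="primaryCluster" type="source">
          <validity start="2013-09-01T00:00Z" end="2099-12-31T00:00Z"/>
          <!-- Retention policy: evict instances older than 90 days -->
          <retention limit="days(90)" action="delete"/>
        </cluster>
        <cluster name="backupCluster" type="target">
          <validity start="2013-09-01T00:00Z" end="2099-12-31T00:00Z"/>
          <retention limit="months(12)" action="delete"/>
        </cluster>
      </clusters>
      <!-- Where instances of the dataset live on HDFS, parameterized by time -->
      <locations>
        <location type="data" path="/data/raw/input/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
      <ACL owner="falcon" group="users" permission="0755"/>
      <schema location="/none" provider="none"/>
    </feed>

Once an entity like this is submitted and scheduled (for example, falcon entity -type feed -submit -file rawInputFeed.xml, then falcon entity -type feed -schedule -name rawInputFeed), Falcon takes over the retention, replication, and late-data handling that would otherwise be hand-coded into each application.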
We invite you to review and follow the roadmap in our Labs area, and we also encourage you to get involved in the community.
If you want to get started today with some of these tools, we have made a Falcon Technical Preview available.