December 09, 2013

Data Management with Revelytix Loom and Hortonworks Data Platform

This is a guest post from our partner Revelytix, who recently created a step-by-step tutorial on using Loom with the Hortonworks Sandbox.

Enterprises are excited about the Hortonworks Data Platform (HDP) for many reasons, such as low cost, scalability, and flexibility. The latter in particular opens new possibilities for data science. The Hadoop Distributed File System (HDFS) accepts files of any type and format, unlike traditional data warehouses, which require a schema up front. With this flexibility, HDP lends itself to a potentially revolutionary use case known as the data lake. The question is: how do the enterprise and the analyst actually make sense of the files pouring into the data lake and manage the data effectively? The same flexible file system that makes the data lake possible can create a hard-to-manage proliferation of files and directories.

Data Management

Loom’s extensible registry and Activescan service provide part of the solution with metadata management capabilities found nowhere else in the Hadoop ecosystem. The Loom framework of sources, datasets, transforms, and jobs gives the enterprise and data scientist an integrated view of the workflow. Custom metadata enables enterprises to tailor the registry to meet business requirements.

[Figure: Loom Architecture]

Data science often calls for the application of a variety of tools, such as HDP, Hive, and R. As data scientists work in HDFS, Loom provides an integrated workflow from one tool to another, capturing and storing metadata in its extensible registry. Loom’s Activescan service automatically calculates basic statistics for new tables, and the lineage graph provides a record of inputs and outputs for Hive queries. All of the data, metadata, and functionality in Loom is also exposed through Loom’s RESTful API, and the RLoom package provides convenient functions for accessing Loom from the R statistical programming environment.
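As a rough illustration of working against a REST-exposed metadata registry like Loom's, the sketch below fetches dataset entries over HTTP and reduces them to a name-to-row-count summary. The endpoint path, port, and JSON field names here are illustrative assumptions, not Loom's documented API; in practice an R user would reach the same data through the RLoom package.

```python
import json
from urllib.request import urlopen  # standard-library HTTP client

# Hypothetical registry endpoint -- host, port, and path are assumptions.
LOOM_URL = "http://sandbox:8080/api/datasets"

def fetch_datasets(url=LOOM_URL):
    """Fetch dataset metadata from a Loom-style registry endpoint."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def summarize(datasets):
    """Reduce registry entries to {dataset name: row count} pairs.

    Assumes each entry carries 'name' and Activescan-style 'stats'
    (an assumed response shape, for illustration only).
    """
    return {d["name"]: d["stats"]["rowCount"] for d in datasets}

# Example payload in the assumed shape:
sample = [
    {"name": "weather_raw", "stats": {"rowCount": 120000}},
    {"name": "airports", "stats": {"rowCount": 3400}},
]
print(summarize(sample))  # {'weather_raw': 120000, 'airports': 3400}
```

The same JSON documents returned by the API back every Loom Workbench view, so a script like this and the UI stay in sync by construction.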

Data Preparation

For the analyst and data scientist, Loom allows for faster discovery and understanding. Once an analyst has the right data for the task, much of the remaining time in the data science workflow is spent on data preparation. Practitioners testify that getting the data in the right form often takes up seventy, eighty, or even ninety percent of their time. In addition to exploring the data and developing an approach, it can also be time-consuming just to find the right tool for the job.

Having established a strong foundation in data management, Loom will soon provide a new approach for data preparation with a feature called Weaver: an interactive method for preparing big data incrementally and iteratively. Loom Weaver is a power tool for transformations, including built-in functions for column- and row-based operations. To create new tables from multiple tables through join or union operations, Loom leverages Hive. Loom automatically tracks and displays the lineage of these transforms.
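To make the Hive hand-off concrete, the sketch below shows the kind of CREATE TABLE AS SELECT statement a Loom-style join transform might submit to Hive. The helper and all table and column names are hypothetical, chosen only to illustrate the pattern; Loom's actual generated SQL is not documented here.

```python
def join_transform_sql(left, right, key, output):
    """Build an illustrative Hive CTAS join statement of the sort a
    Loom-style transform might submit (names are hypothetical)."""
    return (
        f"CREATE TABLE {output} AS "
        f"SELECT l.*, r.airport_name "
        f"FROM {left} l JOIN {right} r ON l.{key} = r.{key}"
    )

sql = join_transform_sql(
    "weather_raw", "airports", "airport_id", "weather_by_airport"
)
print(sql)
```

Because the transform is expressed as a single Hive statement with known inputs and one output table, recording lineage reduces to logging the statement's source and target tables.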

With the addition of Weaver, Loom provides the first complete data management solution for Hadoop. Loom enables data workers to find, structure, explore, and transform data faster while maintaining clear records of provenance, lineage, and other metadata. As a result, enterprises receive better and faster insights from a continuous data science workflow. Hadoop has never been more enterprise-ready.


In this tutorial, learn how to install and get started with Loom, register and transform data in HDFS through the Loom Workbench, and import transformed data into R for analysis. By the end of the tutorial, we will see which airports saw the most rain during the sample period. This tutorial is only one example of what can be done with this data using Loom, Hadoop, and R. Check out the accompanying video for an extended demonstration.

