Data Management with Revelytix Loom and Hortonworks Data Platform

This is a guest post from our partner Revelytix, who recently created a step-by-step tutorial on using Loom with the Hortonworks Sandbox.

Enterprises are excited about the Hortonworks Data Platform (HDP) for many reasons, such as low cost, scalability, and flexibility. The latter in particular holds out new possibilities for data science. The Hadoop Distributed File System (HDFS) accepts files of any type and format, unlike traditional data warehouses, which require a schema up front. With this flexibility, HDP lends itself to a potentially revolutionary use case known as the data lake. The question is, how do the enterprise and the analyst actually make sense of the files pouring into the data lake and manage the data effectively? The same flexible file system that makes the data lake possible can create a hard-to-manage proliferation of files and directories.

Data Management

Loom’s extensible registry and Activescan service provide part of the solution, with metadata management capabilities found nowhere else in the Hadoop ecosystem. The Loom framework of sources, datasets, transforms, and jobs gives the enterprise and the data scientist an integrated view of the workflow. Custom metadata enables enterprises to tailor the registry to meet business requirements.

(Figure: Loom Architecture)

Data science often calls for the application of a variety of tools, such as HDP, Hive, and R. As data scientists work in HDFS, Loom provides an integrated workflow from one tool to another, capturing and storing metadata in its extensible registry. Loom’s Activescan service automatically calculates basic statistics for new tables, and the lineage graph provides a record of inputs and outputs for Hive queries. All of the data, metadata, and functionality in Loom is also exposed through Loom’s RESTful API, and the RLoom package provides convenient functions for accessing Loom from the R statistical programming environment.
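Since all of Loom’s data and metadata are exposed over a RESTful API, any HTTP client can reach the registry. As a minimal sketch of what such a call might look like — the server address, resource path, and query parameter below are hypothetical, for illustration only; consult the Loom API documentation for the actual endpoints — a client could build a metadata request like this:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical Loom server and endpoint -- illustrative only.
LOOM_BASE = "http://localhost:8080/api"

def dataset_metadata_request(dataset_name):
    """Build (but do not send) a GET request for a dataset's metadata."""
    query = urlencode({"name": dataset_name})
    return Request(f"{LOOM_BASE}/datasets?{query}", method="GET")

req = dataset_metadata_request("weather_stations")
print(req.full_url)  # http://localhost:8080/api/datasets?name=weather_stations
```

The RLoom package wraps this same API in convenience functions, so R users never need to construct requests by hand.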

Data Preparation

For the analyst and data scientist, Loom allows for faster discovery and understanding. Once an analyst has the right data for the task, much of the remaining time in the data science workflow is spent on data preparation. Practitioners testify that getting the data into the right form often takes up seventy, eighty, or even ninety percent of their time. Beyond exploring the data and developing an approach, even finding the right tool for the job can be time-consuming.

Having established a strong foundation in data management, Loom will soon provide a new approach for data preparation with a feature called Weaver: an interactive method for preparing big data incrementally and iteratively. Loom Weaver is a power tool for transformations, including built-in functions for column- and row-based operations. To create new tables from multiple tables through join or union operations, Loom leverages Hive. Loom automatically tracks and displays the lineage of these transforms.
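A join or union transform of this kind ultimately compiles down to HiveQL. As a rough sketch of the idea — the table names and generated SQL here are made up for illustration, not Loom’s actual output — a transform that unions several monthly tables into a new table could be generated like this:

```python
# Illustrative only: a union transform as described above compiles down to
# HiveQL. Table names here are hypothetical.
def union_transform_sql(target, sources):
    """Emit a CREATE TABLE AS SELECT that unions several source tables."""
    selects = " UNION ALL ".join(f"SELECT * FROM {t}" for t in sources)
    return f"CREATE TABLE {target} AS {selects}"

sql = union_transform_sql("rainfall_q1", ["rainfall_jan", "rainfall_feb", "rainfall_mar"])
print(sql)
```

Because each such statement names its input and output tables explicitly, recording lineage for the transform is a matter of capturing those inputs and outputs, which is exactly what Loom’s lineage graph displays.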

With the addition of Weaver, Loom provides the first complete data management solution for Hadoop. Loom enables data workers to find, structure, explore, and transform data faster while maintaining clear records of provenance, lineage, and other metadata. As a result, enterprises receive better and faster insights from a continuous data science workflow. Hadoop has never been more enterprise-ready.

Tutorial

In this tutorial, learn how to install and get started with Loom, register and transform data in HDFS through the Loom Workbench, and import transformed data into R for analysis. By the end of the tutorial, you will see which airports received the most rain during the sample period. This tutorial is only an example of what can be done with this data using Loom, Hadoop, and R. Check out the accompanying video for an extended demonstration.
