Enterprises are excited about the Hortonworks Data Platform (HDP) for many reasons, such as low cost, scalability, and flexibility. Flexibility in particular holds out new possibilities for data science. The Hadoop Distributed File System (HDFS) accepts files of any type and format, unlike traditional data warehouses, which require a schema up front. With this flexibility, HDP lends itself to a potentially revolutionary use case known as the data lake. The question is, how do the enterprise and the analyst actually make sense of the files pouring into the data lake and manage the data effectively? The same flexible file system that makes the data lake possible can create a hard-to-manage proliferation of files and directories.
Loom’s extensible registry and Activescan service provide part of the solution with metadata management capabilities found nowhere else in the Hadoop ecosystem. The Loom framework of sources, datasets, transforms, and jobs gives the enterprise and data scientist an integrated view of the workflow. Custom metadata enables enterprises to tailor the registry to meet business requirements.
Data science often calls for the application of a variety of tools, such as HDP, Hive, and R. As data scientists work in HDFS, Loom provides an integrated workflow from one tool to another, capturing and storing metadata in its extensible registry. Loom’s Activescan service automatically calculates basic statistics for new tables, and the lineage graph provides a record of inputs and outputs for Hive queries. All of the data, metadata, and functionality in Loom are also exposed through Loom’s RESTful API, and the RLoom package provides convenient functions for accessing Loom from the R statistical programming environment.
For the analyst and data scientist, Loom allows for faster discovery and understanding. Once an analyst has the right data for the task, much of the remaining time in the data science workflow is spent on data preparation. Practitioners testify that getting the data in the right form often takes up seventy, eighty, or even ninety percent of their time. Beyond exploring the data and developing an approach, simply finding the right tool for the job can be time-consuming.
Having established a strong foundation in data management, Loom will soon provide a new approach for data preparation with a feature called Weaver: an interactive method for preparing big data incrementally and iteratively. Loom Weaver is a power tool for transformations, with built-in functions for column- and row-based operations. To create new tables from multiple tables through join or union operations, Loom leverages Hive. Loom automatically tracks and displays the lineage of these transforms.
With the addition of Weaver, Loom provides the first complete data management solution for Hadoop. Loom enables data workers to find, structure, explore, and transform data faster while maintaining clear records of provenance, lineage, and other metadata. As a result, enterprises receive better and faster insights from a continuous data science workflow. Hadoop has never been more enterprise-ready.
In this tutorial, you will learn how to install and get started with Loom, register and transform data in HDFS through the Loom Workbench, and import transformed data into R for analysis. By the end of the tutorial, you will see which airports saw the most rain during the sample period. This tutorial is only one example of what can be done with this data using Loom, Hadoop, and R. Check out the accompanying video for an extended demonstration.