One of the most attractive qualities of Hadoop is its flexibility: it requires a schema on read, not on write. Much of its promised ability to excel at analyzing unstructured content is rooted in this key characteristic. HCatalog helps Hadoop deliver on this promise. It is a metadata and table management system for Hadoop.
HCatalog, based on the metadata layer found in Hive, provides a relational view of data within Hadoop through an SQL-like language. HCatalog allows users to share data and metadata across Hive, Pig, and MapReduce. It also allows users to write their applications without concern for how or where the data is stored, and insulates them from schema and storage format changes.
This flexibility ultimately decouples data producers, consumers, and administrators. Data producers can add a new column to the data without breaking their consumers’ data reading applications. Administrators can relocate data or change the format it is stored in without requiring changes on the part of the producers or consumers.
HCatalog makes the Hive metastore available to users of other tools on Hadoop. It provides connectors for MapReduce and Pig so that users of those tools can read data from and write data to Hive's warehouse. It offers a command-line tool that lets users who do not use Hive operate on the metastore with Hive DDL statements. It also provides a notification service so that workflow tools, such as Oozie, can be notified when new data becomes available in the warehouse.
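For applications outside Hive, the same DDL statements can also be submitted over HTTP through the Templeton (WebHCat) REST layer mentioned later in this piece. A minimal sketch: the `ddl` endpoint and the `exec` and `user.name` form parameters follow WebHCat's REST conventions, but the host, port, and user name below are illustrative placeholders, not a real cluster.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Illustrative WebHCat (Templeton) server address -- adjust for your cluster.
WEBHCAT_BASE = "http://webhcat.example.com:50111/templeton/v1"

def build_ddl_request(statement, user="hcat_user"):
    """Build a POST request for WebHCat's ddl resource, which runs a
    Hive DDL statement against the shared metastore."""
    data = urlencode({"exec": statement, "user.name": user}).encode("utf-8")
    return Request(WEBHCAT_BASE + "/ddl", data=data, method="POST")

req = build_ddl_request("SHOW TABLES;")
print(req.full_url)  # http://webhcat.example.com:50111/templeton/v1/ddl
# urllib.request.urlopen(req) would submit the statement to a live server.
```

Because the request is ordinary HTTP, any tool that can POST a form can drive the metastore this way, with no Hive client installed.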
Organizations today are using HCatalog in a variety of ways; however, the uses can be summarized as follows:
Complex Data Processing
The majority of heavy Hadoop users do not use a single tool for data processing. Often users and teams will begin with a single tool: Hive, Pig, MapReduce, or something else. As their use of Hadoop deepens, they discover that the tool they chose is not optimal for the new tasks they are taking on. Users who start with analytics queries in Hive find they would like to use Pig for ETL processing or for constructing their data models. Users who start with Pig find they would like to use Hive for analytics-style queries. While tools such as Pig and MapReduce do not require metadata, they can benefit from it when it is present. Sharing a metadata store also lets users of different tools share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common. When all these tools share one metastore, users of each tool have immediate access to data created with another tool; no loading or transfer steps are required.
Data Discovery Checkpoints
When used for analytics, Hadoop helps users discover information; again, they will often use Hive, Pig, and MapReduce to uncover it. The information is valuable, but typically only in the context of a larger analysis. With HCatalog you can publish results so they can be accessed by your analytics platform via REST. In this case, the schema defines the discovery. These discoveries are also useful to other data scientists, who will often want to build on what others have created or use results as input to a subsequent discovery. In this case, the schema defines a checkpoint and can be reused.
Integrate Hadoop with Everything
Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption it must work with and augment existing tools. Hadoop should serve as input into your analytics platform and integrate with your operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. The REST services provided by Templeton open up the platform to the enterprise through a familiar API and SQL-like language.
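This kind of integration can be exercised from any language with an HTTP client. A minimal Python sketch of how an external application might discover what lives in the Hive warehouse over REST: the URL shape follows WebHCat's table-listing resource, but the host, port, user name, and the sample JSON below are hand-written illustrations, not output captured from a real cluster.

```python
import json
from urllib.parse import urlencode

# Illustrative WebHCat (Templeton) server address -- adjust for your cluster.
WEBHCAT_BASE = "http://webhcat.example.com:50111/templeton/v1"

def tables_url(database="default", user="analyst"):
    """Build the URL for WebHCat's table-listing resource."""
    return "{}/ddl/database/{}/table?{}".format(
        WEBHCAT_BASE, database, urlencode({"user.name": user}))

def table_names(response_body):
    """Extract table names from a JSON response body."""
    return json.loads(response_body).get("tables", [])

# A response of the general shape such a call returns (illustrative):
sample = '{"tables": ["clicks", "users"], "database": "default"}'
print(table_names(sample))  # ['clicks', 'users']
```

An operational data store or web application can poll an endpoint like this, or react to HCatalog's notification service, without any Hadoop client libraries on its side.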