Unstructured data, semi-structured data, structured data… it is all very interesting and we are in conversations about big and small versions of each of these data types every day. We love it… we are data geeks at Hortonworks. We passionately understand that if you want to use any piece of data for some computation, there needs to be some layer of metadata and structure to interact with it. Within Hadoop, this critical metadata service is provided by HCatalog.
As a key component of Apache Hive, HCatalog is a metadata and table management system for the broader Hadoop platform. It enables the storage of data in any format regardless of structure. Hadoop can then process both structured and unstructured data and it can store and share information about data’s structure in HCatalog. This capability combined with the ‘schema on read’ nature of Hadoop versus traditional EDW ‘schema on write’ reduces cycle time for data scientists seeking insight as it encourages exploration and discovery on a continuous basis.
Similarly, Hive/HCatalog also enables sharing of data structure with external systems including traditional data management tools. It is the glue that enables these systems to interact effectively and efficiently and is a key component in helping Hadoop fit into the enterprise.
Since 2008, Hive has reigned as the defacto SQL interface for Hadoop as it provides a relational view through SQL like language to data within Hadoop. HCatalog publishes this same interface but abstracts it for data beyond Hive. It also publishes a REST interface for external use so that your existing tools can interact with Hadoop in the way you expect… via ODBC and JDBC into SQL!
HCatalog intends to enable the ecosystem to more general SQL interaction to Hadoop. Our partners are building dedicated interfaces on top of this key interaction point to drive a Hadoop strategy within their products. For instance, Teradata has created SQL-H on top of HCatalog as their default interface to Hadoop, enabling their users to query across this big data resource from existing tools. So now, as performance enhancements of Hive through the Stinger initiative progresses, their tools get better and better.
HCatalog also allows developers to share data and metadata across internal Hadoop tools such as Hive, Pig, and MapReduce. It allows them to create applications without being concerned how or where the data is stored, and insulates users from schema and storage format changes. It is a repository for schema that can be referred to in these programming models so that you don’t have to explicitly type your structures in each program. It provides a command line tool for users who do not use Hive to operate on the metastore with Hive DDL statements. It also provides a notification service so that workflow tools, such as Oozie, can be notified when new data becomes available in the warehouse.
So how might you use HCatalog? Organizations today are using HCatalog in a variety of different ways, however, the key uses could be summarized as the following:
HCatalog is just one of many components of Apache Hadoop and the Hortonworks Data Platform. You can find out more here, including further integration points, and how Hortonworks provides the enterprise rigor to Apache Hadoop.