Hive/HCatalog – Data Geeks & Big Data Glue
Unstructured data, semi-structured data, structured data… it is all fascinating, and we find ourselves in conversations about big and small versions of each of these data types every day. We love it… we are data geeks at Hortonworks. We also passionately believe that before any piece of data can be used in a computation, there must be some layer of metadata and structure through which to interact with it. Within Hadoop, this critical metadata service is provided by HCatalog.
As a key component of Apache Hive, HCatalog is a metadata and table management system for the broader Hadoop platform. It lets data be stored in any format, with or without structure. Hadoop can then process both structured and unstructured data, and any known structure can be recorded and shared through HCatalog. This capability, combined with Hadoop's 'schema on read' approach (versus the 'schema on write' of a traditional EDW), shortens the cycle time for data scientists seeking insight, because it encourages continuous exploration and discovery.
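'Schema on read' can be sketched with a line of Hive DDL. This is a minimal, hypothetical example: it assumes raw tab-delimited log files already sit in HDFS under /data/raw/weblogs, and the table and column names are invented for illustration.

```sql
-- Hypothetical example: raw tab-delimited log files already exist in HDFS.
-- Registering this schema in HCatalog does not move or rewrite the files;
-- the structure is applied only when the data is read ("schema on read").
CREATE EXTERNAL TABLE weblogs (
  ts       STRING,
  user_id  STRING,
  url      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/weblogs';
```

Because the table is EXTERNAL, dropping it later removes only the metadata, not the underlying files — exploration stays cheap.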
Hive/HCatalog also enables sharing of data structure with external systems, including traditional data management tools. It is the glue that enables these systems to interact effectively and efficiently, and it is a key component in helping Hadoop fit into the enterprise.
SQL Interface for Hadoop? HCatalog as enabler…
Since 2008, Hive has reigned as the de facto SQL interface for Hadoop, providing a relational view of data within Hadoop through a SQL-like language. HCatalog publishes this same interface but abstracts it to cover data beyond Hive. It also publishes a REST interface for external use, so that your existing tools can interact with Hadoop the way you expect: via ODBC and JDBC into SQL!
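The REST interface (WebHCat, formerly Templeton) can be exercised with nothing more than curl. The sketch below is hypothetical — the hostname and user name are assumptions — but the endpoints and the default port 50111 are WebHCat's:

```shell
# Hypothetical host and user; WebHCat listens on port 50111 by default.
# List the tables HCatalog knows about in the 'default' database:
curl -s 'http://hadoop-master:50111/templeton/v1/ddl/database/default/table?user.name=analyst'

# Describe one table's columns and storage format:
curl -s 'http://hadoop-master:50111/templeton/v1/ddl/database/default/table/weblogs?user.name=analyst'
```

Any tool that can speak HTTP and parse JSON can discover what data Hadoop holds, without a Hadoop client installed.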
Good for the ecosystem is good for you
HCatalog aims to give the ecosystem a more general point of SQL interaction with Hadoop. Our partners are building dedicated interfaces on top of this key interaction point to drive a Hadoop strategy within their products. For instance, Teradata has created SQL-H on top of HCatalog as their default interface to Hadoop, enabling their users to query across this big data resource from existing tools. So as Hive performance improves through the Stinger initiative, their tools get better and better along with it.
Hadoop Developer productivity and HCatalog
HCatalog also allows developers to share data and metadata across Hadoop tools such as Hive, Pig, and MapReduce. It allows them to create applications without being concerned with how or where the data is stored, and it insulates users from schema and storage-format changes. It is a repository for schemas that can be referenced from these programming models, so that you don't have to explicitly type your structures in each program. It provides a command-line tool so that users who do not use Hive can operate on the metastore with Hive DDL statements. It also provides a notification service, so that workflow tools such as Oozie can be notified when new data becomes available in the warehouse.
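The command-line tool is `hcat`, which runs Hive DDL against the metastore without starting a full Hive session. A small sketch (the table name and columns here are hypothetical):

```shell
# Register a partitioned table in the metastore from the command line.
hcat -e "CREATE TABLE processed_events (user_id STRING, score DOUBLE)
         PARTITIONED BY (dt STRING)
         STORED AS RCFILE;"

# Inspect what was registered:
hcat -e "DESCRIBE processed_events;"
```

Once registered this way, the same table definition is visible to Pig and MapReduce jobs, not just to Hive.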
HCatalog in Use
So how might you use HCatalog? Organizations today are using HCatalog in a variety of ways; the key uses can be summarized as follows:
- Enabling the Right Tool for the Right Job
The majority of heavy Hadoop users do not use a single tool for data processing. Often users and teams begin with one tool: Hive, Pig, MapReduce, or something else. As their use of Hadoop deepens, they discover that the tool they chose is not optimal for the new tasks they are taking on. Users who start with analytics queries in Hive find they would like to use Pig for ETL processing or for constructing their data models; users who start with Pig find they would like to use Hive for analytics-style queries. While tools such as Pig and MapReduce do not require metadata, they can benefit from it when it is present. Sharing a metadata store also makes it easier for users to share data across tools. A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common. When all of these tools share one metastore, users of each tool have immediate access to data created with any other tool; no loading or transfer steps are required.
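That Pig-then-Hive workflow can be sketched in a few lines of Pig Latin. This is a hypothetical example: the table names are invented, the target table must already exist in the metastore, and the loader's package name (`org.apache.hive.hcatalog` vs. the older `org.apache.hcatalog`) depends on your HCatalog version.

```pig
-- Hypothetical workflow: Pig normalizes raw data and writes it back
-- through HCatalog. No schema is declared in the script; it comes
-- from the shared metastore.
raw   = LOAD 'default.weblogs' USING org.apache.hive.hcatalog.pig.HCatLoader();
clean = FILTER raw BY url IS NOT NULL;
STORE clean INTO 'default.clean_weblogs' USING org.apache.hive.hcatalog.pig.HCatStorer();
```

The moment the STORE completes, a Hive user can run `SELECT … FROM clean_weblogs` — no export, reload, or schema re-declaration in between.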
- Capture Processing States to Enable Sharing
When Hadoop is used for analytics, users discover information, often with Hive, Pig, and MapReduce. The information is valuable, but typically only in the context of a larger analysis. With HCatalog you can publish results so they can be accessed by your analytics platform via REST; the schema recorded in HCatalog describes what was discovered. These discoveries are also useful to other data scientists, who will often want to build on what others have created or use the results as input into a subsequent discovery.
- Integrate Hadoop with everything
Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption it must work with and augment existing tools. Hadoop should serve as input into your analytics platform, or integrate with your operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. REST services open up the platform to the enterprise with a familiar API and SQL-like language. Enterprise data management systems use HCatalog to integrate more deeply with the Hadoop platform; by tying in more closely, they can hide complexity from users and create a better experience. A great example of this is the SQL-H integration from Teradata Aster: SQL-H reads the structure of data registered in HCatalog and exposes it back through Aster, enabling Aster to access just the relevant data stored within the Hortonworks Data Platform.
HCatalog is just one of many components of Apache Hadoop and the Hortonworks Data Platform. You can find out more here, including further integration points and how Hortonworks brings enterprise rigor to Apache Hadoop.
Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.