Cheat Sheet: How To Work with Hive Functions in Hadoop
Just a couple of weeks ago we published our simple SQL to Hive Cheat Sheet. That has proven immensely popular with a lot of folk to understand the basics of querying with Hive. Our friends at Qubole were kind enough to work with us to extend and enhance the original cheat sheet with more advanced features of Hive: User Defined Functions (UDF). In this post, Gil Allouche of Qubole takes us from the basics of Hive through to getting started with more advanced uses, which we’ve compiled into another cheat sheet you can download here.
The original SQL to Hive cheat sheet is a great resource for anyone with SQL (Structured Query Language) skills to get started with Hive. It illustrates how similar HiveQL (Hive Query Language) is to the SQL-92 standard. Consequently, everyone with SQL experience can use Hive for big data analytics and ETL (Extract, Transform, Load) immediately.
Of course, some differences between SQL and HiveQL remain and our friends at Hortonworks and the wider community are working on closing the last gaps between SQL and HiveQL. This work is part of the Stinger Initiative to speedup Hive 100x to make big data analytics interactive and easy to use. For example, the most recent Hive version, 0.11, introduced further improvements supporting windowing functions (OVER, RANK, …), the DECIMAL data type, and the ORC data format.
The great extent to which SQL is supported by Hive makes most queries exchangeable with none or only minor adjustments. This includes the important string, mathematical, date, and aggregation functions. Hive furthermore adds some non-SQL functions especially around string operations and is also highly extendable.
Useful functions to get started with and oriented yourself are around metadata, i.e. describing the data and functions before accessing and using them. They are very much alike commands used in SQL:
SHOW DATABASES; and
USE database_name; lists all databases and changes to database_name.
SHOW TABLES; prints all tables in the database and
DESCRIBE table_name; shows basic information about a table. Hive supports extensions to the last command with
DESCRIBE EXTENDED table_name; and
DESCRIBE FORMATTED table_name; displaying detailed information about a table with latter nicely arranged to make it more readable.
A list of all functions can be obtained with
SHOW FUNCTIONS;. Each can be described with
DESCRIBE FUNCTION function_name; or more detailed with
DESCRIBE FUNCTION EXTENDED function_name.
The following examples are simple, common data access queries that work on Hive as well as most SQL compatible RDBMS.
SELECT * FROM table; retrieves all columns from all rows from a table.
SELECT col1 FROM table WHERE (col1=”value1” AND col2 > 2) OR (col3 LIKE “%value%” AND col4 IS NOT NULL); retrieves col1 from all rows of table, which satisfy a combination of conditions.
The two examples should be unsurprising to a SQL user. Where clauses and conditions, and logical operators are used in Hive as usual.
Similarly aggregations with
GROUP BY and ordering of results are supported as expected, i.e.
SELECT col1, col2 FROM table GROUP BY col1, col2 ORDER BY col2 DESC;
These examples show how easy it is for anyone with SQL knowledge to log into a Hive command line interface or a Beeswax web site and start querying data. Importantly, transparent to the user, without any additional effort, the queries are transformed into one or a series of map-reduce jobs. They are executed by Hadoop against potentially petabytes of data stored on HDFS or on S3.
Data is commonly stored de-normalised in Hive to reduce the need for slow joins. Nevertheless, one of the most powerful aspects of Hive is its ability to join data of arbitrary size. HiveQL supports inner, outer, left, and right joins. The join queries are straightforward and very similar to most common SQL dialects.
SELECT * FROM table1 JOIN table2 ON table1.col1 = table2.col1; results in an inner join between table1 and table2 combining the two on matches on col1.
SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.col1 = table2.col1; creates a full outer join, which returns all rows of both tables with null values added for either table where a match was not found.
SELECT * FROM table1 LEFT OUTER JOIN table2 ON table1.col1 = table2.col1; and
SELECT * FROM table1 RIGHT OUTER JOIN table2 ON table1.col1 = table2.col1; are left and right outer joins, which return either all rows from the left or right table in the join. Similarly to the full outer join missing matches on the right or left side of the join are filled with null values.
Hive supports a vast variety of string manipulation functions covering SQL functionality and beyond. This makes Hive a powerful tool when it comes to regular tasks of extracting, loading, and transforming text and text encoded data. This can be as simple as loading data or as complex as analysing text and pre-processing it for advanced tasks like machine learning. The following two examples illustrate Hive’s benefits in these cases.
A common operation performed with Hive is linking tables to folders. They generally contain logs or data files, e.g. database exports. Sometimes loading and transforming the data is as simple as defining the field delimiters in the table creation statement. In other cases, e.g. logs, regular expressions are needed to extract data like IP addresses, date and time.
Another popular file format to exchange data between systems, especially with the emergence of NoSQL, is JSON. Hive can easily ingest these files. For example, if each line in a text file is a JSON object then a table mapped to the file(s) in a directory reads each line as a single text column. It can be parsed with Hive’s get_json_object function to extract data.
SELECT get_json_object(json_column, '$.product_id) AS product_id, get_json_object(json_column, '$.price) AS price FROM json_table; reads the json_column from json_table and extracts the fields product_id and price from each JSON object.
Another example is processing text for analysis or pre-processing it for machine learning. Hive provides simple yet powerful functions to process any size of dataset. Imagine you would want to extract the most prominent trigrams from the whole of Wikipedia. This is a complex task and made more difficult by the scale of data. However, with Hive it takes only one query.
SELECT ngrams(sentences(lower(article_text)), 3, 5) AS trigrams FROM wikipedia; is all that is needed. The query applies three functions; it lowercases the string column article_text in the table wikipedia; splits each article into sentences and tokenizes them into words; the resulting array of arrays is analysed by the ngrams function and returns the top 5 tri-grams. Importantly, for Hive it is irrelevant if the data to process in this manner is 100MB, 100GB, or 100TB.
User Defined Functions
Beyond the multitude of string functions, the usual mathematical and date functions are available as well as conditionals and collections in form of arrays and maps. The most powerful functionality though is the ability for users to write their own functions to extend Hive.
Traditionally, many map-reduce programs had little original logic. They emerged from the need to execute a small piece of logic on a huge amount of data. The resulting Java programs were a lot of scaffolding to access and move data around. Today, in many cases Hive can do away with trivial map-reduce programs for ETL and analytics in favour of simple queries as we have seen.
There are a few cases where additional functionality beyond what Hive offers is needed. Reading and writing data in an exclusive format is one case. Hive can be extended to support any format by writing a SerDe, which is short for Serializer/Deserializer.
Other cases are; transformation of one row value into another one, which can be added with UDFs (User Defined Function); transformation of multiple row values into one, which can be added with UDAFs (User Defined Aggregate Functions); transformation of one row value into many, which can be added with UDTFs (User Defined Table Functions).
The method to add these is to use the same programming model Hive uses to implement the functions it ships with. These functions are in fact themselves just UDF/UDAF/UDTFs and the only difference to user functions are that they are shipped with Hive. A user can add them by writing the functions in Java, compiling them into a JAR and loading the JAR in Hive. From that point on the user’s functions are accessible in Hive queries just like any other function, e.g. LENGTH (UDF), COUNT (UDAF), or EXPLODE (UDTF).
Let’s explore the LENGTH example, an UDF, which is available in Hive. The query
SELECT LENGTH(col1) FROM table; applies the function LENGTH on strings like ‘abcdef’ and returns ‘6’, for example. The code is open source and can be found online.
The implementation of an UDF extends the UDF class and overwrites the evaluate method with its own functionality. UDAFs are a little more complex and require a basic understanding of the map-reduce framework. You can learn more about creating your own UDF/UDAF/UDTF in this free Qubole webinar.
One of the key benefits of Hive is using existing SQL knowledge, which is a common skill found across business analysts, data analysts, software engineers, data scientist and others. Hive has nearly no barriers for new users to start exploring and analyzing data.
Hive pushes the complexity of big data processing and map-reduce down the stack. At the same time Hive exposes an API for programmers to extend it for domain specific functionality with UDFs. Its beauty is that the encapsulation of logic in functions makes them accessible to non-developers across a business and reusable across the data of the company.
Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.