Simple Hive ‘Cheat Sheet’ for SQL Users

UPDATE: This cheat sheet was so popular, we’ve created a PDF of the content below so you can print it and use it more easily. Download here.

 

If you’re already familiar with SQL then you may well be thinking about how to add Hadoop skills to your toolbelt as an option for data processing.

From a querying perspective, using Apache Hive provides a familiar interface to data held in a Hadoop cluster and is a great way to get started. Apache Hive is data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Naturally, there are a bunch of differences between SQL and HiveQL, but on the other hand there are a lot of similarities too, and recent releases of Hive bring that SQL-92 compatibility closer still.

To highlight that – and as a bit of fun to get started – below is a simple ‘cheat sheet’ (based on a simple MySQL reference such as this one) for getting started with basic querying for Hive. Here, we’ve done a direct comparison to MySQL, but given the simplicity of these particular functions, then it should be the same in essentially any SQL dialect.

Of course, if you really want to get to grips with Hive, then take a look at the full language manual.

Retrieving Information

Function MySQL Hive
Retrieving Information (General) SELECT from_columns FROM table WHERE conditions; SELECT from_columns FROM table WHERE conditions;
Retrieving All Values SELECT * FROM table; SELECT * FROM table;
Retrieving Some Values SELECT * FROM table WHERE rec_name = "value"; SELECT * FROM table WHERE rec_name = "value";
Retrieving With Multiple Criteria SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2"; SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2";
Retrieving Specific Columns SELECT column_name FROM table; SELECT column_name FROM table;
Retrieving Unique Output SELECT DISTINCT column_name FROM table; SELECT DISTINCT column_name FROM table;
Sorting SELECT col1, col2 FROM table ORDER BY col2; SELECT col1, col2 FROM table ORDER BY col2;
Sorting Reverse SELECT col1, col2 FROM table ORDER BY col2 DESC; SELECT col1, col2 FROM table ORDER BY col2 DESC;
Counting Rows SELECT COUNT(*) FROM table; SELECT COUNT(*) FROM table;
Grouping With Counting SELECT owner, COUNT(*) FROM table GROUP BY owner; SELECT owner, COUNT(*) FROM table GROUP BY owner;
Maximum Value SELECT MAX(col_name) AS label FROM table; SELECT MAX(col_name) AS label FROM table;
Selecting from multiple tables (Join same table using alias w/”AS”) SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name; SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name)

Metadata

Function MySQL Hive
Selecting a database USE database; USE database;
Listing databases SHOW DATABASES; SHOW DATABASES;
Listing tables in a database SHOW TABLES; SHOW TABLES;
Describing the format of a table DESCRIBE table; DESCRIBE (FORMATTED|EXTENDED) table;
Creating a database CREATE DATABASE db_name; CREATE DATABASE db_name;
Dropping a database DROP DATABASE db_name; DROP DATABASE db_name (CASCADE);

Current SQL Compatibility

Command Line

Function Hive
Run Query hive -e 'select a.col from tab1 a'
Run Query Silent Mode hive -S -e 'select a.col from tab1 a'
Set Hive Config Variables hive -e 'select a.col from tab1 a' -hiveconf hive.root.logger=DEBUG,console
Use Initialization Script hive -i initialize.sql
Run Non-Interactive Script hive -f script.sql

Categorized by :
Hive

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.

Get Sandbox

Recently in the Blog

Get started with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.
Contact Us
Hortonworks provides enterprise-grade support, services and training. Discuss how to leverage Hadoop in your business with our sales team.
Hortonworks Data Platform
The Hortonworks Data Platform is a 100% open source distribution of Apache Hadoop that is truly enterprise grade having been built, tested and hardened with enterprise rigor.

Thank you for subscribing!