The Stinger.next initiative, with its focus on transactions, sub-second queries and SQL:2011 Analytics evolves Apache Hive to allow it to run most of the analytical workloads that are typical within a data warehouse, but now at petabyte scale. The first phase of Stinger.Next, delivered in Apache Hive 0.14 and in HDP 2.2, delivers transactions with ACID semantics a critical step in the evolution of the Hive as the defacto standard for SQL in Hadoop.
The JIRA (HIVE-5317) ticket to add transactions with ACID semantics was the most watched JIRA in the history of Hive, and for good reason: it adds the ability to modify data in Hive with the guarantee that your data will be safe.
The traditional model in Hive is to have partitions for high volume data tables and then periodically add more partitions. A common example is to have data partitioned by date and adding data monthly into a new partition. The data in the partition is then read-only. To be able to update the data (done as an entire partition), users were forced to have many small partitions affecting performance. Transactions allow users to have fewer partitions and faster queries.
With transactions, the SQL commands insert, update and delete are now available to modify the data. Some example statements are:
But merely allowing the users to edit data is not sufficient. The guarantees provided by ACID semantics (Atomicity, Consistency, Isolation and Durability) is what Enterprises need to ensure they can rely on their data being correct and consistent even when multiple users are editing the database at the same time.
The transaction implementation is designed for an analytic workload as opposed to an operational workload. This means that it is designed for low concurrency – such as 10 – 50 users updating the system at the same time. While operational databases use fine grained row-level locking for single row edits, for Hive the right choice is to lock a table or partition. Multiple writers to same partition of the same table will be serialized and wait behind each other. In other words, Hive is not meant for low latency updates and deletes, a solution like Apache HBase is more appropriate.
The transactions implementation is designed for very high throughput. While the numbers will depend on the cluster size and other workload on the cluster, we expect that millions of records can be added every hour. The key here is that a single transaction is expected to do bulk editing with a high volume of data modified per transaction. This is the case that performs well with throughput. You need to combine your inserts into different sets and then apply each set as a separate transaction.
INSERT INTO TABLE T VALUES (1, 2, 3), (2, 3, 4), (3, 4, 5) …..;
You can also do bulk deletes or updates if the where clause is the same using a single statement. Examples would be deleting records older than a certain date or updating all occurrences of a value such as customer_id, if the customer_id changes. UPDATE and DELETE statements are meant for bulk updates rather than point updates since they always scan full partitions.
For running Analytics queries while inserts are happening, the impact will be minimal. We use multi-version concurrency control that maintains multiple versions of the data (temporarily). This means that writes and reads do not slow down each other. While a read is reading a particular version, the write can come and write a newer version of the data (without needing to wait on locks). This is important to ensure analytics do not slow down when data is being edited.
However, there are overheads, including a very minor overhead due to extra work of reading delta files (modifications) when you read the main ORCFile. When delta files accumulate, a compaction job is run to merge them with the base ORCFile and that will take some resources in the cluster. The impact depends on the number of updates, but is generally expected to be limited.
With HDP 2.2, users can edit their data after loading it. This is often needed as Enterprises often have end of the day corrections to their fact tables. Also, sometimes dimension tables change (such as a store relocating to different zip code). For these use cases, the ability to edit your data safely is critical. These use cases require few updates daily and it is what Hive transactions support right now.
Over the next coming months, the Hive community plans to support the following enhancements to transactions and its use cases next: