Hortonworks is always pleased to see new contributions come into the open-source community. We worked with our customer, Hotels.com, to help them develop libraries and utilities around Apache Hive, the Apache ORC format and Cascading. It’s great to see the results released for the community. In this guest blog, Adrian Woodhead, Big Data Engineering Team Lead at Hotels.com, discusses the Corc project.
Hotels.com is pleased to announce the open source release of Corc, a library for reading and writing files in the Apache ORC file format using Cascading. Corc provides a Cascading Scheme and various other classes that allow developers to access the full range of unique features provided by ORC from within Cascading applications. Corc is freely available on GitHub under the Apache 2.0 license.
The ability to read only the columns required by a job (as opposed to reading in all data and subsequently filtering out any unneeded columns) is a key feature of the ORC file format. Corc exposes this functionality to Cascading jobs so that a subset of Fields can be passed into a Tap and only the corresponding columns will be read from HDFS. This can lead to significant performance improvements in Cascading applications, as the amount of data read from HDFS is reduced.
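To make the idea of column projection concrete, here is a minimal, self-contained sketch in plain Java. It does not use Corc, ORC or Cascading classes. The class ColumnProjectionSketch and its methods are hypothetical names invented for illustration, and the in-memory map merely stands in for ORC's per-column storage. The point it shows is that a reader given a list of requested columns never touches the others.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy column store illustrating column projection. Hypothetical; not Corc's or ORC's API. */
class ColumnProjectionSketch {
    // Column name -> values, standing in for columnar storage on disk.
    private final Map<String, List<?>> columns = new LinkedHashMap<>();
    // Counts how many columns have actually been read.
    private int columnsRead = 0;

    void addColumn(String name, List<?> values) {
        columns.put(name, values);
    }

    /** Reads only the requested columns and returns one map per record. */
    List<Map<String, Object>> read(List<String> requested) {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (String name : requested) {
            List<?> values = columns.get(name);
            if (values == null) {
                throw new IllegalArgumentException("Unknown column: " + name);
            }
            columnsRead++;
            for (int i = 0; i < values.size(); i++) {
                if (rows.size() <= i) {
                    rows.add(new LinkedHashMap<>());
                }
                rows.get(i).put(name, values.get(i));
            }
        }
        return rows;
    }

    int columnsRead() {
        return columnsRead;
    }

    public static void main(String[] args) {
        ColumnProjectionSketch store = new ColumnProjectionSketch();
        store.addColumn("id", Arrays.asList(1, 2));
        store.addColumn("name", Arrays.asList("a", "b"));
        store.addColumn("price", Arrays.asList(9.5, 7.0));
        // Only "id" and "price" are read; "name" is never touched.
        System.out.println(store.read(Arrays.asList("id", "price")));
        System.out.println("columns read: " + store.columnsRead());
    }
}
```

In a real job the saving comes from bytes that are never fetched from HDFS rather than from a counter, but the shape is the same: the set of requested fields determines which columns are read.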
Corc provides the ability to read and write the full set of types supported by ORC and maps them to the standard Java types used by Cascading. Types supported include: STRING, BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, TIMESTAMP, DATE, BINARY, CHAR, VARCHAR, DECIMAL, ARRAY, MAP, STRUCT, and UNION. This allows Cascading applications to take advantage of ORC’s self-describing nature, indexes, and column encoding optimisations. Corc also provides an extension point so that these mappings can be customised.
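As a rough illustration of what a customisable type mapping can look like, the sketch below keeps a registry from ORC type names to Java classes and lets callers override individual entries. Everything in it is an assumption made for illustration: the class TypeMappingSketch is hypothetical, the default Java classes chosen here are plausible guesses rather than Corc's documented mappings, and only a handful of the types listed above are covered.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry of ORC type name -> Java class. Not Corc's actual extension point. */
class TypeMappingSketch {
    private final Map<String, Class<?>> javaTypes = new HashMap<>();

    TypeMappingSketch() {
        // Assumed defaults for illustration only; Corc's real mappings may differ.
        javaTypes.put("STRING", String.class);
        javaTypes.put("BOOLEAN", Boolean.class);
        javaTypes.put("INT", Integer.class);
        javaTypes.put("BIGINT", Long.class);
        javaTypes.put("DOUBLE", Double.class);
        javaTypes.put("DECIMAL", BigDecimal.class);
        javaTypes.put("TIMESTAMP", java.sql.Timestamp.class);
        javaTypes.put("ARRAY", List.class);
        javaTypes.put("MAP", Map.class);
    }

    /** Replaces the Java class used for one ORC type, mimicking a customisation hook. */
    void override(String orcType, Class<?> javaType) {
        javaTypes.put(orcType, javaType);
    }

    /** Looks up the Java class for an ORC type name, failing loudly on unmapped types. */
    Class<?> javaTypeFor(String orcType) {
        Class<?> javaType = javaTypes.get(orcType);
        if (javaType == null) {
            throw new IllegalArgumentException("No mapping for ORC type: " + orcType);
        }
        return javaType;
    }

    public static void main(String[] args) {
        TypeMappingSketch mapping = new TypeMappingSketch();
        System.out.println(mapping.javaTypeFor("DECIMAL"));
        // A job that prefers epoch milliseconds could swap the TIMESTAMP mapping.
        mapping.override("TIMESTAMP", Long.class);
        System.out.println(mapping.javaTypeFor("TIMESTAMP"));
    }
}
```

Consult the Corc project on GitHub for the mappings it actually applies and for the real shape of its extension point.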
Corc also exposes ORC’s underlying predicate pushdown functionality. By supplying criteria that describe the values a job is interested in, a Cascading application can skip stripes of data that do not contain pertinent values. Because skipped stripes are never read, this in turn can lead to performance gains.
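The mechanism behind stripe skipping can be sketched in a few lines of plain Java. ORC records statistics such as minimum and maximum values for each column, and a reader can use them to rule out data that cannot satisfy a predicate. The class StripeSkippingSketch below is a hypothetical toy model of that idea for a single numeric column and an equality predicate. It is not Corc's or ORC's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of min/max based stripe skipping. Hypothetical; not Corc's or ORC's API. */
class StripeSkippingSketch {

    /** One stripe of a single numeric column, with its min/max statistics. */
    static final class Stripe {
        final long[] values;
        final long min;
        final long max;

        Stripe(long... values) {
            if (values.length == 0) {
                throw new IllegalArgumentException("A stripe needs at least one value");
            }
            this.values = values;
            this.min = Arrays.stream(values).min().getAsLong();
            this.max = Arrays.stream(values).max().getAsLong();
        }
    }

    // Counts stripes whose values were actually scanned.
    private int stripesScanned = 0;

    /** Returns the values equal to target, scanning only stripes whose range can contain it. */
    List<Long> findEqual(List<Stripe> stripes, long target) {
        List<Long> matches = new ArrayList<>();
        for (Stripe stripe : stripes) {
            if (target < stripe.min || target > stripe.max) {
                continue; // The statistics prove this stripe holds no match, so skip it.
            }
            stripesScanned++;
            for (long value : stripe.values) {
                if (value == target) {
                    matches.add(value);
                }
            }
        }
        return matches;
    }

    int stripesScanned() {
        return stripesScanned;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = Arrays.asList(
                new Stripe(1, 5, 9), new Stripe(20, 25, 30), new Stripe(40, 42, 42));
        StripeSkippingSketch reader = new StripeSkippingSketch();
        System.out.println(reader.findEqual(stripes, 42));
        System.out.println("stripes scanned: " + reader.stripesScanned());
    }
}
```

Skipping of this kind pays off most when the data is sorted or clustered on the filtered column, since that keeps each stripe's value range narrow.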
Corc supports the reading of ACID datasets that underpin transactional Hive tables. For this to work effectively you must provide your own lock management and coordinate with Hive’s metastore. We intend to make this functionality available via changes to the cascading-hive project in the near future.
We aim to closely follow future developments in the ORC file format and expose new features as they are released. We will also closely monitor the upcoming 3.0.0 release of Cascading and ensure that Corc works with it soon after it is released. Finally, we intend to continue work on adding ACID support to Corc and related Cascading projects so that Cascading applications can seamlessly read and write data using Hive transactions.