Two weeks ago, Apache ORC became a top-level project within the Apache Software Foundation (ASF). This is a major step forward for the project, and it reflects the momentum built by a broad community of developers.
Back in January 2013, we created ORC files as part of the Stinger initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop, and we added ORC as a feature of Hive.
In the last two years, many of the features that we have added to Hive, such as vectorization, ACID, predicate pushdown, and LLAP, have supported ORC first, with support for other storage formats following later.
ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written.
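To make the benefit of the columnar layout concrete, here is a small conceptual sketch (plain Python, not the ORC implementation or API) contrasting row-oriented and column-oriented storage. The table layout and column names are illustrative assumptions.

```python
# Conceptual sketch: with row-oriented storage, a query that touches one
# column still decodes every field of every row; with columnar storage,
# the reader fetches and decodes only that column's values.

rows = [(i, f"name{i}", i * 1.5) for i in range(5)]   # (id, name, score)

# Row-oriented layout: all fields of one record stored together.
row_store = rows

# Column-oriented layout: all values of one column stored together.
col_store = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "score": [r[2] for r in rows],
}

# For "SELECT sum(score)", a columnar reader touches only the score column,
# skipping the id and name bytes entirely.
print(sum(col_store["score"]))  # 0.0 + 1.5 + 3.0 + 4.5 + 6.0 = 15.0
```

In a real ORC file each column is additionally encoded and compressed with a scheme chosen for its type, which is why the type-aware writer matters.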
Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.
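The row-index mechanism can be sketched in a few lines. This is a simplified illustration of the idea (min/max statistics kept per group of 10,000 rows), not ORC's actual index format; the function and variable names are assumptions for the example.

```python
# Sketch of predicate pushdown over row-group statistics: the writer
# records min/max per column for each group of 10,000 rows, and the
# reader skips any group whose [min, max] range cannot match the query.

ROW_GROUP_SIZE = 10_000

def build_row_group_stats(column_values):
    """Record (start_offset, min, max) for each group of 10,000 values."""
    stats = []
    for start in range(0, len(column_values), ROW_GROUP_SIZE):
        group = column_values[start:start + ROW_GROUP_SIZE]
        stats.append((start, min(group), max(group)))
    return stats

def row_groups_to_read(stats, predicate_value):
    """Return offsets of row groups whose range could contain the value."""
    return [start for start, lo, hi in stats if lo <= predicate_value <= hi]

values = list(range(50_000))              # five row groups of 10,000 rows
stats = build_row_group_stats(values)
print(row_groups_to_read(stats, 25_123))  # [20000] -- only one group to scan
```

The same comparison applies at stripe granularity first, so a selective predicate can eliminate most of a file's bytes before any row is decoded.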
The growing use and acceptance of ORC has encouraged additional Hadoop execution engines, such as Apache Pig, MapReduce, Cascading, and Apache Spark, to support reading and writing ORC. However, depending on the large Hive jar that contains ORC pulls in many other projects that Hive depends on. To better support these non-Hive users, we decided to split ORC off from Hive into a separate project. This will not only let us continue to support Hive, but also provide a much more streamlined jar, documentation, and help for users outside of Hive.
Although Hadoop and its ecosystem are largely written in Java, there are many applications in other languages that would like to natively access ORC files in HDFS. Hortonworks, HP, and Microsoft are developing a pure C++ ORC reader and writer that enables C++ applications to read and write ORC files efficiently without Java. That code will also be moved into Apache ORC and released together with the Java implementation.
Next steps for ORC include more powerful indexes, such as bloom filters that let the ORC reader quickly narrow the search when looking for specific values in an unsorted column. We are also working on column encryption for ORC files that will let users encrypt sensitive columns while leaving other columns in the clear.
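A Bloom filter answers "might this value be present?" with no false negatives, which is exactly what lets a reader skip row groups for an unsorted column. The following is a minimal sketch of the technique with assumed parameters (bit count, hash count, SHA-256 hashing); ORC's actual implementation differs in its hashing and encoding details.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit set plus k hash positions per value."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored as one big integer

    def _positions(self, value):
        # Derive num_hashes positions by salting the value with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # True may be a false positive; False is always definitive,
        # so the reader can safely skip that row group.
        return all(self.bits & (1 << pos) for pos in self._positions(value))

bf = BloomFilter()
for name in ["alice", "bob", "carol"]:
    bf.add(name)

print(bf.might_contain("bob"))      # True: the value was added
print(bf.might_contain("mallory"))  # almost certainly False: group can be skipped
```

Because a negative answer is definitive, attaching a filter like this to each row group lets a point-lookup query discard most of the data without sorting it first.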
Finally, look for more performance optimizations as we make reading and writing ORC files even faster.