May 07, 2015

Apache ORC Launches as a Top-Level Project

Two weeks ago, Apache ORC became a top-level project within the Apache Software Foundation (ASF). This is a major step forward for the project, and it reflects the momentum that a broad community of developers has built.

What is ORC and why is it useful?

Back in January 2013, we created ORC files as part of the Stinger initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. We added it as a feature of Hive for two reasons:

  1. To ensure that it would be well integrated with Hive.
  2. To ensure that storing data in ORC format would be as simple as adding “stored as ORC” to your table definition, as shown in the example below.
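
For example, a minimal sketch of a table definition that keeps its data in ORC; the table and column names here are made up:

    CREATE TABLE orders (
      order_id BIGINT,
      customer STRING,
      amount   DOUBLE
    )
    STORED AS ORC;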

Over the last two years, many of the features we’ve added to Hive, such as vectorization, ACID, predicate pushdown, and LLAP, have supported ORC first and followed with other storage formats later.
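
As a rough illustration of how those features meet ORC in practice (setting and property names as documented for Hive; the table is made up), vectorized execution is turned on with one setting, and Hive ACID tables are required to be stored as ORC:

    SET hive.vectorized.execution.enabled = true;

    CREATE TABLE events (
      id      BIGINT,
      payload STRING
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ("transactional" = "true");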

ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written.
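
Those writer behaviors can be tuned per table through ORC table properties; a sketch using commonly documented properties (the values shown are illustrative):

    CREATE TABLE clicks (
      user_id BIGINT,
      url     STRING,
      ts      TIMESTAMP
    )
    STORED AS ORC
    TBLPROPERTIES (
      "orc.compress"         = "ZLIB",      -- codec layered on top of the type-specific encodings
      "orc.stripe.size"      = "67108864",  -- stripe size in bytes
      "orc.row.index.stride" = "10000"      -- rows between entries in the internal row index
    );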

Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.
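
As a sketch of how that is used from Hive (hive.optimize.index.filter lets the ORC reader apply the pushed-down filter against its indexes; the table and query are made up), a selective predicate allows whole stripes and row groups to be skipped:

    SET hive.optimize.index.filter = true;

    SELECT user_id, url
    FROM clicks
    WHERE ts >= '2015-01-01 00:00:00';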


What does this mean for current and new ORC users?

Many large Hadoop users have adopted ORC. For instance, Facebook uses ORC to save tens of petabytes in its data warehouse and has demonstrated that ORC is significantly faster than RCFile or Parquet.

The growing use and acceptance of ORC has encouraged additional Hadoop execution engines, such as Apache Pig, MapReduce, Cascading, and Apache Spark, to support reading and writing ORC. However, depending on the large Hive jar that contains ORC pulls in many other projects that Hive itself depends on. To better support these non-Hive users, we decided to split ORC off from Hive into a separate project. This will not only let us keep supporting Hive, but also provide a much more streamlined jar, along with documentation and help for users outside of Hive.

Although Hadoop and its ecosystem are largely written in Java, there are a lot of applications in other languages that would like to natively access ORC files in HDFS. Hortonworks, HP, and Microsoft are developing a pure C++ ORC reader and writer that enables C++ applications to read and write ORC files efficiently without Java. That code will also be moved into Apache ORC and released together with the Java implementation.

What’s next for ORC?

Next steps for ORC include more powerful indexes, such as Bloom filters, which let the ORC reader quickly narrow the search when looking for specific values in an unsorted column. We are also working on column encryption for ORC files, which will let users encrypt sensitive columns while leaving other columns in the clear.
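
From Hive, the expected way to request a Bloom filter is through ORC table properties; a sketch assuming the orc.bloom.filter.columns and orc.bloom.filter.fpp property names (the table is made up):

    CREATE TABLE users (
      id    BIGINT,
      email STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
      "orc.bloom.filter.columns" = "email",  -- build a Bloom filter for this column
      "orc.bloom.filter.fpp"     = "0.05"    -- target false-positive probability
    );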

Finally, look for more performance optimizations as we make reading and writing ORC files even faster.
