September 15, 2014

Using ORCFile in Cascading and Apache Crunch

Summary

This blog covers recent developments that make it easy to use ORCFile from Cascading or Apache Crunch, and shows how doing so can accelerate data processing by more than 5x. Code samples are provided so that you can start integrating ORCFile into your Cascading or Crunch projects today.

What are Cascading and Apache Crunch?

Cascading and Apache Crunch are high-level frameworks that make it easy to process large amounts of data in distributed clusters. Both began as high-level abstractions over MapReduce, but we are now beginning to see these frameworks ported to other scalable processing engines that offer higher performance, such as Apache Tez and Apache Spark. The two frameworks offer different models that suit different use cases. Cascading gives users tuples and operators such as filtering, sorting, and merging. Crunch users deal in Java objects rather than tuples, so Crunch is often a better fit for very complex data schemas. Cascading also has even higher-level interfaces built on top of it: Scalding for Scala users and Cascalog for Clojure users.

What is an ORCFile?

The Optimized Row Columnar (ORC) file format is a columnar storage format that offers good read performance and high compression, and it is tightly integrated with HDFS and Hive. ORCFile uses a wide variety of compression schemes, allowing data stored in ORCFile to achieve compression ratios of 10x or more in some cases. Because ORCFile is columnar, related columns are stored together on disk. This benefits analytical workloads, which tend to access only certain columns rather than the entire dataset.

Why combine ORCFile and Cascading/Crunch?

There are many reasons to use ORCFile alongside Cascading/Crunch, including:

  1. You want an easy way to convert a lot of data into ORCFile format to reduce storage footprint.
  2. You want to process Hive data with Cascading or Crunch and the Hive data is stored using ORCFile.
  3. You need to join data in ORCFile against data in some other format.

And so on.

Using ORCFile with Cascading

Cascading/ORCFile support has existed since Hive 12 but was recently improved to support Hive 13, schema inference, and column pruning. These features provide a few new ways of opening ORC files:

Regardless of which approach you choose, it all starts with an import:


import cascading.hive.ORCFile;

Create an ORC Scheme, fully specifying the schema (available since Hive 12):


Scheme orcScheme = new ORCFile("col1 int, col2 string, col3 string, col4 long");

Create an ORC Scheme, letting the schema be inferred from the file (schema inference):


Scheme orcScheme = new ORCFile();

Fully specify the schema, but prune so that only columns col1 and col4 are read (column pruning):


Scheme orcScheme = new ORCFile("col1 int, col2 string, col3 string, col4 long", "0,3");

Read only col1 and col4, but this time open the file without specifying the schema (both new features combined):


Scheme orcScheme = new ORCFile(null, "0,3");

Any of these Schemes can be fed into a Cascading Tap, which is where the path to the ORC file is supplied:


Tap myTap = new Hfs(orcScheme, pathToORCFile);
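
To see how these pieces fit together, here is a minimal sketch of a flow that copies an existing ORC file (schema inferred) to a new ORC file with an explicit schema. It assumes Cascading 2.x running on Hadoop; the class name, command-line paths, and schema are hypothetical.


import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.hive.ORCFile;
import cascading.pipe.Pipe;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class OrcCopy {
  public static void main(String[] args) {
    // Source: infer the schema from the ORC file itself.
    Tap source = new Hfs(new ORCFile(), args[0]);
    // Sink: write with an explicit schema (it must match the source columns).
    Tap sink = new Hfs(new ORCFile("col1 int, col2 string, col3 string, col4 long"), args[1]);
    // A straight-through pipe: no transformation, just a copy into ORC format.
    Pipe copy = new Pipe("orc-copy");
    Flow flow = new HadoopFlowConnector().connect(source, sink, copy);
    flow.complete();
  }
}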

Cascading provides a mechanism for third parties to publish extensions to a repository called conjars.org, which functions as a Maven repository. In your build tool, just add conjars as a repository and add a dependency on com.ebay.cascading-hive.
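
For example, in Maven the entries might look like the following sketch; the version is hypothetical, so check conjars.org for the latest release.


<repositories>
  <repository>
    <id>conjars</id>
    <url>http://conjars.org/repo</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>com.ebay</groupId>
    <artifactId>cascading-hive</artifactId>
    <!-- hypothetical version; use the latest from conjars.org -->
    <version>1.0</version>
  </dependency>
</dependencies>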

For a complete example of using ORCFile with Cascading, including setting up dependencies, see these Cascading and ORCFile Examples.

Using ORCFile with Apache Crunch

Similar to the above, you need a couple of imports to start working with ORCFile in Crunch:


import org.apache.crunch.io.orc.OrcFileSource;
import org.apache.crunch.io.orc.OrcFileTarget;
import org.apache.crunch.types.orc.Orcs;

Here are the various ways of reading ORCFile data:

// Read an ORCFile using reflection-based serialization (slowest):
OrcFileSource<Person> source = new OrcFileSource<Person>(new Path(inputPath),
    Orcs.reflection(Person.class));
PCollection<Person> persons = pipeline.read(source);

// Read an ORCFile using tuple-based serialization:
OrcFileSource<TupleN> source = new OrcFileSource<TupleN>(new Path(inputPath),
    Orcs.tuples(Writables.strings(), Writables.ints(),
        Writables.collections(Writables.strings())));
PCollection<TupleN> tuples = pipeline.read(source);

// Read an ORCFile using struct-based serialization (fastest):
String typeStr = "struct<name:string,age:int,number:array<string>>";
TypeInfo typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(typeStr);
OrcFileSource<OrcStruct> source = new OrcFileSource<OrcStruct>(new Path(inputPath),
    Orcs.orcs(typeInfo));
PCollection<OrcStruct> structs = pipeline.read(source);
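
For reference, the reflection-based example above assumes Person is a plain Java object. A minimal sketch might look like this; the field names are assumed to mirror the struct type in the struct-based example, and the no-argument constructor is assumed to be required by reflection-based serialization.


import java.util.List;

// Hypothetical POJO for the reflection-based example; fields mirror
// struct<name:string,age:int,number:array<string>>.
public class Person {
  private String name;
  private int age;
  private List<String> number;

  // No-arg constructor, assumed required for reflection-based serialization.
  public Person() {}

  public Person(String name, int age, List<String> number) {
    this.name = name;
    this.age = age;
    this.number = number;
  }
}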

And writing:


// Write a file using OrcFileTarget.
OrcFileTarget target = new OrcFileTarget(new Path(outputPath));
PCollection<TupleN> result = pipeline.read(source);
pipeline.write(result, target);
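
Putting reading and writing together, here is a minimal end-to-end sketch that reads ORC tuples, filters them, and writes the result back out as ORC. The class name, paths, and filter predicate are hypothetical; the tuple schema matches the tuple-based read example above.


import org.apache.crunch.FilterFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.TupleN;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.orc.OrcFileSource;
import org.apache.crunch.io.orc.OrcFileTarget;
import org.apache.crunch.types.orc.Orcs;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class OrcFilterJob {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(OrcFilterJob.class, new Configuration());

    // Read ORC rows as tuples of (string, int, collection<string>).
    OrcFileSource<TupleN> source = new OrcFileSource<TupleN>(new Path(args[0]),
        Orcs.tuples(Writables.strings(), Writables.ints(),
            Writables.collections(Writables.strings())));
    PCollection<TupleN> tuples = pipeline.read(source);

    // Keep only rows whose int field (index 1) is at least 18.
    PCollection<TupleN> result = tuples.filter(new FilterFn<TupleN>() {
      @Override
      public boolean accept(TupleN t) {
        return (Integer) t.get(1) >= 18;
      }
    });

    // Write the filtered rows back out as ORC.
    pipeline.write(result, new OrcFileTarget(new Path(args[1])));
    pipeline.done();
  }
}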

For a complete example of using ORCFile with Crunch, see these Crunch and ORCFile Examples.

Benchmarks

Performance when using ORCFile is substantially better than with row-oriented formats such as delimited text or Avro, which is not surprising since the columnar layout allows us to avoid a large amount of I/O.

The graphics below compare execution times of the Cascading/Crunch equivalents of a SQL query inspired by TPC-DS. The query used was:


SELECT s_state, count(*)
FROM store_sales
JOIN store ON ss_store_sk = s_store_sk
WHERE ss_sold_date_sk = 2452474
GROUP BY s_state;

The jobs were executed on a TPC-DS dataset of scale factor 10,000 (10 TB); the raw data involved in the query is greater than 3.5 TB. They ran on a 20-node cluster, with each node having 6 disks and 64 GB of RAM.

First are the results for Cascading. We see that ORCFile easily outperforms text and Avro due to more efficient deserialization. The orc-opt result is the execution time when we use column pruning to read only the columns we need. In this configuration, ORCFile is more than 3x faster than Avro and about 3x faster than text.

[Figure orc_1: Cascading execution times by format]

The results are similar for Crunch, but using Crunch's ability to deserialize directly to structures, we see an even more impressive speedup: close to 8x better than using Avro.

[Figure orc_2: Crunch execution times by format]

These results show that ORCFile offers a considerable performance boost to your Cascading and Crunch jobs.

About the Authors

Zhong Wang is an intern at Hortonworks, working on open source columnar storage solutions for Big Data systems. He is also a graduate student at Duke University, where he is advised by Shivnath Babu. Zhong has worked with Hadoop since 2009 and helped develop the petabyte-scale data infrastructure at Baidu, the leading Chinese search engine.

Carter Shanklin does product stuff at Hortonworks.


