This blog covers how recent developments have made it easy to use ORCFile from Cascading or Apache Crunch, and how doing so can accelerate data processing by more than 5x. Code samples are provided so that you can start integrating ORCFile into your Cascading or Crunch projects today.
Cascading and Apache Crunch are high-level frameworks that make it easy to process large amounts of data in distributed clusters. Both began as high-level abstractions over MapReduce, but both are now being ported to other scalable processing engines that offer higher performance, such as Apache Tez and Apache Spark. The two frameworks offer different models that suit different use cases. Cascading gives users tuples and operators such as filtering, sorting, and merging. Crunch users deal in Java objects rather than tuples, so Crunch is often a better fit for very complex data schemas. Even higher-level interfaces are built on top of Cascading: Scalding for Scala users and Cascalog for Clojure users.
The Optimized Row Columnar (ORC) file format is a columnar storage format that offers good read performance and high compression, and is tightly integrated with HDFS and Hive. ORCFile uses a wide variety of compression schemes, allowing data stored in ORCFile to achieve compression ratios of 10x or more in some cases. Because ORCFile is columnar, the values of each column are stored together on disk. This benefits analytical workloads, which tend to access only certain columns rather than the entire dataset.
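To make the columnar benefit concrete, here is a minimal plain-Java sketch (not ORC's actual on-disk format) contrasting a row-oriented layout with a column-oriented one. Summing a single column over a row-major array strides past every other column's cells, while the columnar layout scans only the bytes the query needs:

```java
import java.util.Arrays;

public class ColumnarDemo {
    public static void main(String[] args) {
        int rows = 1_000_000, cols = 3;
        long[] rowMajor = new long[rows * cols]; // row layout: r0c0, r0c1, r0c2, r1c0, ...
        long[] colOne = new long[rows];          // column layout: column 1 stored contiguously
        for (int r = 0; r < rows; r++) {
            rowMajor[r * cols + 1] = r;
            colOne[r] = r;
        }

        // Row layout: the value we need appears every `cols` cells, so the disk
        // pages and cache lines holding the other columns get read anyway.
        long sumRow = 0;
        for (int r = 0; r < rows; r++) {
            sumRow += rowMajor[r * cols + 1];
        }

        // Column layout: one contiguous scan over exactly the data we need.
        long sumCol = Arrays.stream(colOne).sum();

        // Same answer either way, but the columnar scan touched ~1/cols of the data.
        System.out.println(sumRow == sumCol); // prints true
    }
}
```

This is the essence of why analytical queries that read a few columns out of a wide table run much faster against columnar files.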
There are many reasons to use ORCFile alongside Cascading or Crunch, including its high compression ratios, the I/O savings of its columnar layout and column pruning, and its tight integration with HDFS and Hive.
Cascading support for ORCFile has existed since Hive 12, but it was recently improved to support Hive 13, schema inference, and column pruning. These features provide a few new ways of opening an ORCFile:
Regardless of which you choose, it all starts by importing the ORCFile Scheme class.
Create an ORC Scheme fully specifying schema (existed since Hive 12):
Scheme orcScheme = new ORCFile("col1 int, col2 string, col3 string, col4 long");
Create an ORC Scheme without specifying the schema (schema inference):
Scheme orcScheme = new ORCFile();
Fully specify the schema, but prune so that only columns col1 and col4 are read (column pruning):
Scheme orcScheme = new ORCFile("col1 int, col2 string, col3 string, col4 long", "0,3");
Read only col1 and col4, but this time open the file without specifying the schema (both new features combined):
Scheme orcScheme = new ORCFile(null, "0,3");
Any of these Schemes can then be passed to a Cascading Tap, which supplies the path to the ORC file:
Tap myTap = new Hfs(orcScheme, pathToORCFile);
Cascading provides a mechanism for third parties to publish extensions to a repository called conjars.org, which functions as a Maven repository. In your build tool, just add conjars as a repository and add a dependency on com.ebay.cascading-hive.
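For Maven users, that amounts to a pom.xml fragment along the following lines. Treat the dependency coordinates as placeholders: check conjars.org for the exact groupId, artifactId, and current version before using them.

```xml
<repositories>
  <repository>
    <id>conjars</id>
    <url>http://conjars.org/repo</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <!-- placeholder coordinates; confirm the exact GAV on conjars.org -->
    <groupId>com.ebay</groupId>
    <artifactId>cascading-hive</artifactId>
    <version>...</version>
  </dependency>
</dependencies>
```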
For a complete example of using ORCFile with Cascading, including setting up dependencies, see these Cascading and ORCFile Examples.
Similar to the above, you need a couple of imports to start working with ORCFile in Crunch:
// Read an ORCFile using reflection-based serialization (slowest):
OrcFileSource<Person> source = new OrcFileSource<Person>(new Path(inputPath),
        Orcs.reflects(Person.class));
PCollection<Person> persons = pipeline.read(source);
// Read an ORCFile using tuple-based serialization:
// the PTypes passed to Orcs.tuples must match the file's column types
OrcFileSource<TupleN> source = new OrcFileSource<TupleN>(new Path(inputPath),
        Orcs.tuples(Writables.strings(), Writables.ints()));
PCollection<TupleN> tuples = pipeline.read(source);
// Read an ORCFile using struct-based serialization (fastest):
String typeStr = "struct<name:string,age:int,number:array<string>>";
TypeInfo typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(typeStr);
OrcFileSource<OrcStruct> source = new OrcFileSource<OrcStruct>(new Path(inputPath),
        Orcs.orcs(typeInfo));
PCollection<OrcStruct> structs = pipeline.read(source);
// Write a file using OrcFileTarget.
OrcFileTarget target = new OrcFileTarget(new Path(outputPath));
pipeline.write(tuples, target);
Performance with ORCFile is substantially better than with row-oriented formats like plain text or Avro, which is not surprising: the columnar layout allows a large amount of I/O to be avoided.
The graphics below compare execution times of the Cascading/Crunch equivalents of a SQL query inspired by TPC-DS. The query used was:
SELECT s_state, count(*)
FROM store_sales
JOIN store ON ss_store_sk = s_store_sk
WHERE ss_sold_date_sk = 2452474
GROUP BY s_state;
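For readers unfamiliar with the query, here is a minimal in-memory Java sketch of what it computes: a hash join of sales against stores on the store key, a filter on the sold-date key, and a count per state. The record and method names are illustrative only; the benchmarked jobs do this with Cascading and Crunch operators over ORC, text, and Avro files.

```java
import java.util.*;
import java.util.stream.*;

public class QuerySketch {
    record Store(int storeSk, String state) {}
    record Sale(int storeSk, int soldDateSk) {}

    static Map<String, Long> stateCounts(List<Sale> sales, List<Store> stores, int dateSk) {
        // Build side of the join: store key -> state.
        Map<Integer, String> stateBySk = stores.stream()
                .collect(Collectors.toMap(Store::storeSk, Store::state));
        return sales.stream()
                .filter(s -> s.soldDateSk() == dateSk)        // WHERE ss_sold_date_sk = ...
                .map(s -> stateBySk.get(s.storeSk()))         // JOIN store ON ss_store_sk = s_store_sk
                .filter(Objects::nonNull)
                .collect(Collectors.groupingBy(st -> st,
                        Collectors.counting()));              // GROUP BY s_state, count(*)
    }

    public static void main(String[] args) {
        List<Store> stores = List.of(new Store(1, "CA"), new Store(2, "TX"));
        List<Sale> sales = List.of(new Sale(1, 2452474), new Sale(1, 2452474),
                                   new Sale(2, 2452474), new Sale(2, 2452000));
        // Counts per state, e.g. CA=2, TX=1 (map iteration order may vary).
        System.out.println(stateCounts(sales, stores, 2452474));
    }
}
```

Note that only `s_state`, `ss_store_sk`, `s_store_sk`, and `ss_sold_date_sk` are touched, which is exactly why column pruning pays off for this query.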
First are the results for Cascading. We see that ORCFile easily outperforms text and Avro thanks to more efficient deserialization. The orc-opt column shows the execution time when column pruning is used to read only the columns we need. With pruning, ORCFile is more than 3x faster than Avro and about 3x faster than text.
These results show that ORCFile offers a considerable performance boost to your Cascading and Crunch jobs.
Zhong Wang is an intern at Hortonworks, working on open source columnar storage solutions for Big Data systems. He is also a graduate student at Duke University, where he is advised by Shivnath Babu. Zhong has worked with Hadoop since 2009, and helped develop the petabyte-scale data infrastructure at Baidu, the leading Chinese search engine.
Carter Shanklin does product stuff at Hortonworks.