MapReduce Forum

How to specify ORC as the input format in a MapReduce job

  • #40799

    If I use OrcInputFormat.class in either

    job.setInputFormatClass(OrcInputFormat.class);

    or

    MultipleInputs.addInputPath(job, path, OrcInputFormat.class);

    I get an error saying that OrcInputFormat.class does not extend InputFormat. My question is: what is the correct way to specify OrcInputFormat (if there is one) in these cases?

    Here’s the SO thread


  • #40854
    Koelli Mungee

    Hi Marko,

    What version of HDP are you using? Can you paste in an extract of your code along with the import statements you are using?



    Marko

    We are using HDP 2.0.5.

    I have:

    import java.util.ArrayList;
    import java.util.Date;
    import java.text.DateFormat;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobStatus;
    import org.apache.hadoop.mapred.TaskStatus;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    at the top.

    I don’t feel comfortable posting the code that surrounds those statements – it contains proprietary material. That said, what other statements could affect this?

    Koelli Mungee

    Hi Marko,

    Within Hive itself, this can be set through the Hive property hive.input.format.


    As the error suggests, you can only pass classes that extend org.apache.hadoop.mapreduce.InputFormat, per the API for org.apache.hadoop.mapreduce.Job:

    setInputFormatClass(Class<? extends InputFormat> cls)
    Set the InputFormat for the job.
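
    To make the type constraint concrete, here is a minimal sketch (assuming Hadoop 2.x and the Hive ORC classes of that era on the classpath) of what compiles and what does not:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputFormatBoundDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "demo");

            // Compiles: TextInputFormat extends org.apache.hadoop.mapreduce.InputFormat.
            job.setInputFormatClass(TextInputFormat.class);

            // Does not compile: Hive's OrcInputFormat implements the old
            // org.apache.hadoop.mapred.InputFormat interface, which does not
            // satisfy the Class<? extends InputFormat> bound on this method.
            // job.setInputFormatClass(org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.class);
        }
    }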



    Marko

    Can you please clarify? I don’t understand what you mean by hive.input.format.


    Koelli Mungee

    Hi Marko,

    OrcInputFormat is not meant to be used with the mapreduce package classes; it implements only the older mapred package API. Have you tried using the HCatalog API to get at OrcInputFormat instead?
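
    A minimal sketch of what that could look like, assuming an HCatalog version that provides HCatInputFormat.setInput(Job, dbName, tableName) (the class lived in org.apache.hcatalog.mapreduce in older releases and moved to org.apache.hive.hcatalog.mapreduce later); the database and table names below are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    public class OrcViaHCat {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "read-orc-via-hcatalog");
            job.setJarByClass(OrcViaHCat.class);

            // HCatInputFormat reads through the table's metadata in the
            // metastore, so the job never names OrcInputFormat directly.
            // "default" and "my_orc_table" are placeholder names.
            HCatInputFormat.setInput(job, "default", "my_orc_table");
            job.setInputFormatClass(HCatInputFormat.class);

            // ... set mapper, reducer, and output key/value types as usual ...
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }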



    Marko

    We started using the HCatalog API. The problem we are encountering now is that HCatInputFormat seems to be null. I checked some source code online, and I think it’s a singleton class. It doesn’t make sense to me why setInput would not work, since it’s a static method.


    Scratch that. I now have the error:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/DefaultStorageHandler


    Koelli Mungee

    Hi Marko,

    Do you still need help with the error?



    Marko

    What happened in the end is that we were getting a “table not found” message. If we ran a simple “select * from <table> limit 10” in the Hive console, it returned rows, so either something was wrong with the HCat API or we were calling it improperly. At that point we switched our schemas to plaintext and decided to deal with ORC and other non-plaintext formats later. If we could get help on this issue, it would be awesome.

    Also of note: in order to get our jar running, we had to copy over the entire contents of the Hive library directory along with it to make a string of “ClassNotFoundException”s go away, so there’s that too. We’re most likely doing something wrong, badly, but I’m not sure what that is.
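
    One possible cause of the “table not found” behaviour, sketched below as a guess rather than a confirmed diagnosis: if hive.metastore.uris is unset in the job configuration, HCatalog may consult a local, empty embedded metastore instead of the shared one the Hive console uses. The host and table names here are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    public class MetastorePointerDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the job at the shared metastore service; without this, the
            // client can fall back to a local embedded metastore with no tables.
            // "metastore-host" is a placeholder for the actual metastore host.
            conf.set("hive.metastore.uris", "thrift://metastore-host:9083");

            Job job = Job.getInstance(conf, "hcat-metastore-demo");
            HCatInputFormat.setInput(job, "default", "my_orc_table"); // placeholder names
        }
    }

    As for the jar copying, shipping the required Hive/HCatalog jars with the standard -libjars option (assuming the driver parses arguments via ToolRunner/GenericOptionsParser) is the usual alternative to copying the entire Hive lib directory.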

    Koelli Mungee

    Hi Marko,

    It’s good to hear you are making progress. Can you provide us with a stack trace of the problem you are encountering now so that we can take a look?


