Home Forums MapReduce How to specify Orc as the input format in a MapReduce Job

This topic contains 10 replies, has 3 voices, and was last updated by Koelli Mungee 5 months, 3 weeks ago.

  • Creator
    Topic
  • #40799

    If I use OrcInputFormat.class in either

    job.setInputFormatClass(OrcInputFormat.class);
    or

    MultipleInputs.addInputPath(job, path, OrcInputFormat.class);
    I get an error saying that OrcInputFormat.class does not extend InputFormat. My question is: what is the correct way to specify OrcInputFormat for these cases, if there is one?

    Here’s the SO thread


  • Author
    Replies
  • #42586

    Koelli Mungee
    Moderator

    Hi Marko,

    It’s good to hear you are making progress. Can you provide us with a stack trace of the problem you are encountering now, so that we can take a look?

    Regards
    Koelli

    #41878

    What happened in the end is that we were getting a “table not found” message. A simple “select * from <table> limit 10” in the Hive console returned results, so either something was wrong with the HCat API or we were calling it improperly. At that point we switched our schemas to plaintext and decided to deal with ORC and other non-plaintext formats later. If we could get help on this issue, it would be awesome.

    Also worth noting: in order to get our jar running at all, we had to copy the entire contents of the Hive library directory along with it to make a string of ClassNotFoundExceptions go away. We are most likely doing something badly wrong there, but I’m not sure what.

    #41419

    abdelrahman
    Moderator

    Hi Marko,

    Do you still need help with the error?

    Thanks
    -Rahman

    #41357

    Scratch that. I now have the error:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/DefaultStorageHandler
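
    For what it’s worth, DefaultStorageHandler ships in hive-exec.jar, so this error usually means hive-exec is missing from the job’s classpath. Assuming the driver parses generic options (e.g. via ToolRunner), one common fix is shipping the jar with the job; the jar name, driver class, and path below are illustrative:

    hadoop jar myjob.jar MyDriver -libjars /usr/lib/hive/lib/hive-exec.jar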

    #41350

    We started using the HCatalog API. The problem we are encountering now is that HCatInputFormat seems to be null. I checked some source code online, and I think it’s a singleton class. It doesn’t make sense to me why setInput would not work, since it’s a static method.
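
    For reference, the usual wiring looks roughly like the sketch below. It assumes the newer org.apache.hive.hcatalog package and a hypothetical table default.my_table; older HCatalog releases pass an InputJobInfo instead of the database/table strings.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    public class HCatReadDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hcat-read-sketch");
        job.setJarByClass(HCatReadDriver.class);

        // setInput is static: it looks the table up in the metastore and
        // serializes the schema/partition info into the job configuration.
        // It fails if hive.metastore.uris is unset or the names do not resolve.
        HCatInputFormat.setInput(job, "default", "my_table"); // placeholder names

        job.setInputFormatClass(HCatInputFormat.class);
        // Mapper/reducer setup elided; map values arrive as HCatRecord.
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }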

    #41170

    abdelrahman
    Moderator

    Hi Marko,

    OrcInputFormat is not meant to be used with the newer mapreduce package classes; it implements only the older mapred package InputFormat. Have you tried using the HCatalog API to read the ORC data instead?
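
    For reference, it can be driven directly through the old API; a minimal sketch (paths come from the command line, and the pass-through mapper is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
    import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class OrcMapredDriver {

      // ORC rows arrive as OrcStruct values; toString() is enough for a smoke test.
      public static class OrcToTextMapper extends MapReduceBase
          implements Mapper<NullWritable, OrcStruct, Text, NullWritable> {
        public void map(NullWritable key, OrcStruct value,
            OutputCollector<Text, NullWritable> out, Reporter reporter) throws IOException {
          out.collect(new Text(value.toString()), NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OrcMapredDriver.class);
        conf.setJobName("orc-mapred-sketch");
        // OrcInputFormat implements the old mapred InputFormat, so it is
        // configured on a JobConf rather than on a mapreduce.Job.
        conf.setInputFormat(OrcInputFormat.class);
        conf.setMapperClass(OrcToTextMapper.class);
        conf.setNumReduceTasks(0); // map-only: dump rows as text
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }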

    Thanks
    -Abdelrahman

    #40878

    Can you please clarify? I don’t understand what you mean by hive.input.format.

    #40877

    Koelli Mungee
    Moderator

    Hi Marko,

    This can be passed through the Hive property hive.input.format:

    hive.input.format = org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
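
    For a per-session test, the same property can also be set at the Hive prompt:

    set hive.input.format=org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;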

    As the error suggests, you can only pass classes that extend InputFormat, per the API of org.apache.hadoop.mapreduce.Job:

    setInputFormatClass(Class<? extends InputFormat> cls)
    Set the InputFormat for the job.

    Thanks,
    Koelli

    #40867

    We are using HDP 2.0.5.

    I have:

    import java.io.*;
    import java.util.ArrayList;
    import java.util.Date;
    import java.text.DateFormat;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobStatus;
    import org.apache.hadoop.mapred.TaskStatus;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    at the top.

    I don’t feel comfortable posting the code that surrounds those statements, since it is full of proprietary names. That being said, what other statements could affect this?

    #40854

    Koelli Mungee
    Moderator

    Hi Marko,

    What version of HDP are you using? Can you paste in an extract of your code along with the import statements you are using?

    Thanks,
    Koelli
