Home Forums HBase HBase + MultithreadedMapper.class

This topic contains 9 replies, has 4 voices, and was last updated by  Larry Liu 1 year, 8 months ago.

  • Creator
    Topic
  • #16998

    petri koski
    Member

    Hello! I have a problem. I am using MultithreadedMapper.class with a time-range scan. I know the table has over 20K rows, but all I get is 500 rows. If I use the HBase shell and give it the same timestamps (max and min) I get the right result, over 20K rows, but when I run the M/R job (no Pig etc., just Java directly, with the job set up in the main class) I get 500 rows. Could it be that MultithreadedMapper does not work well, or what is it? It is not the nodes being out of time sync either, because the HBase shell gives correct results with the same timestamps. This one beats me! Any help is welcome!

Viewing 9 replies - 1 through 9 (of 9 total)


  • Author
    Replies
  • #17522

    Larry Liu
    Moderator

    Hi, Petri,

    Thanks for your clarification.

    Let’s take it offline. I will email you and let’s continue in email or phone or other collaboration.

    Thanks
    Larry

    #17521

    petri koski
    Member

    Hello!

    Larry: Uhm, no. I am using a MapReduce job to load webpages, millions of them. The URLs come from an HBase table, and the reducer just saves each newly loaded webpage to another table. (Another M/R job searches the loaded webpages for more links, etc., so we are talking about a web crawler here.) I just want to make a fast webpage loader and am trying to do it with M/R, because that way I can distribute the loading between nodes. I only used that HBase shell scan to compare how many rows (URLs) I would get with the same timestamps I use in the M/R job config; those results don't match for some reason. One guess is that MultithreadedMapper.class is not reporting all the exceptions thrown by my web-loader mapper (named Mapperseeder in my code).

    #17467

    Larry Liu
    Moderator

    Hi, Petri

    Thanks for getting code for us.

    From your original post, you are trying to create a MapReduce job to compare the result of an HBase shell scan with the MapReduce output? You don't need a MapReduce job just to scan the table from Java. If that is the purpose, you can use plain Java client code to scan the table rather than a MapReduce job.
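A plain client-side scan along these lines would give the row count for the same time range without MapReduce. This is only a sketch against the 0.94-era HBase client API used elsewhere in this thread; the table name matches the job config posted later, but the timestamps are placeholders that would need to be replaced with the real min/max values:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class TimeRangeCount {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "fetchlist");
        try {
            Scan scan = new Scan();
            // Same min/max timestamps (epoch milliseconds) as in the M/R job config.
            scan.setTimeRange(1350000000000L, 1360000000000L); // placeholders
            ResultScanner scanner = table.getScanner(scan);
            long count = 0;
            for (Result r : scanner) {
                count++;
            }
            scanner.close();
            System.out.println("rows in range: " + count);
        } finally {
            table.close();
        }
    }
}
```

If this count matches the shell but the M/R job still returns 500, that would point at the job setup rather than the cluster.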

    If you think you are getting timeouts while running the MapReduce job, that could be a different issue. We can look at that as well.

    Thanks for the clarification

    Larry

    #17319

    tedr
    Member

    Hi Petri,

    Ok, we’re looking into your code to see if we can spot anything.

    Thanks,
    Ted.

    #17300

    petri koski
    Member

    And here comes the mapper part ..

    static class Mapperseeder extends TableMapper<Text, Text> {

        private int numRecords = 0;
        private Text sivu = new Text();   // page content (value)
        private Text linkki = new Text(); // page URL (key)

        // Fields below are presumably used by loadpage() and other helpers not shown in this post.
        String url = null;
        String base = null;
        Vector baseadress = new Vector();
        Vector baselink = new Vector();
        Vector ur = new Vector();
        Vector stopurl = new Vector();
        int counter = 0;
        int check = -1;
        Configuration config = null;
        HTable table = null;
        StringBuilder stop = new StringBuilder();
        String[] sw;
        String notfile;
        StringBuilder botpagetosearch = new StringBuilder();
        URL yahoo = null;
        URLConnection yc = null;
        String baseosoite = null;
        BufferedReader in = null;
        String inputLine = null;
        String p = null;

        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
            url = Bytes.toString(values.getRow());
            p = loadpage(url); // fetches the page; returns null on failure
            if (p != null) {
                sivu.set(p);
                linkki.set(url);
                try {
                    context.write(linkki, sivu);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    So it's just a plain webpage loader. It checks that the String p actually contains a page, then uses the page as the value and its URL as the key; in the reducer part it is saved to an HBase table.

    Like I said, I got those scanner timeouts. My loadpage routine had a 10-second read timeout and the HBase scan caching was the default 100, so I understand that if I hit many dead URLs and 10 seconds are spent on each one, I will exceed the scanner lease. Now the read timeout is 2 seconds and the HBase scanner timeout is set to 6 minutes.
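The caching/timeout trade-off described above can also be set explicitly in the code. A sketch of the relevant settings, assuming the `scan` and `config` objects from the job setup posted in this thread; the property name is the 0.94-era one and the values are examples only:

```java
// Fewer rows per scanner RPC means each next() call covers fewer slow URLs
// before the scanner lease expires; values below are examples only.
scan.setCaching(10);        // rows fetched per RPC (the default was 100)
scan.setCacheBlocks(false); // recommended for full-table MapReduce scans
// Scanner lease on the region servers (0.94-era property name), in milliseconds:
config.setLong("hbase.regionserver.lease.period", 360000L); // 6 minutes
```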

    I just need to know HOW to load those pages as fast as possible. Multithreading seems to be the best way to utilize the whole CPU and the other nodes once the fetchlist gets big enough, but I am open to all suggestions.

    #17298

    petri koski
    Member

    Hello! Here comes JobConfig:

    fetcherstart = new Date().getTime();
    Configuration config = HBaseConfiguration.create();
    config.addResource(new Path("/public_ftp/hbase-site.xml"));

    Job job = null;
    try {
        job = new Job(config, "mptesti");
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    job.setJarByClass(mptesti.class);

    Scan scan = new Scan();
    if (fetchlistmakerstart != 0 && fetchlistmakerstop != 0) {
        try {
            // pad the window on both sides (note: setTimeRange() is in milliseconds)
            fetchlistmakerstop = fetchlistmakerstop + 90;
            fetchlistmakerstart = fetchlistmakerstart - 90;
            scan.setTimeRange(fetchlistmakerstart, fetchlistmakerstop);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    job.setJobName("Loading links from fetchlist Phase #1, Loop " + i + " / " + level + " " + blogmode
            + " Start: " + fetchlistmakerstart + " Stop: " + fetchlistmakerstop);
    job.setNumReduceTasks(4);

    try {
        TableMapReduceUtil.initTableMapperJob(
                "fetchlist",                    // input HBase table name
                scan,                           // Scan instance to control CF and attribute selection
                MultithreadedTableMapper.class, // multithreaded wrapper; the real mapper is set below
                Text.class,                     // mapper output key
                Text.class,                     // mapper output value
                job);
    } catch (IOException e) {
        e.printStackTrace();
    }

    MultithreadedTableMapper.setMapperClass(job, Mapperseeder.class);
    MultithreadedTableMapper.setNumberOfThreads(job, 50);

    try {
        TableMapReduceUtil.initTableReducerJob("url", Reducer1.class, job);
    } catch (IOException e) {
        e.printStackTrace();
    }

    boolean b = false;
    try {
        b = job.waitForCompletion(true);
    } catch (IOException e) {
        e.printStackTrace();
    } catch (InterruptedException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }

    if (!b) {
        throw new IOException("error with job!"); // surface the failure instead of swallowing it
    }

    And in my next post comes the mapper itself ..
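One thing worth double-checking in the job config above: `Scan.setTimeRange()` takes epoch milliseconds, so padding the window by 90 widens it by only 90 ms on each side. If the intent was 90 seconds, the padding is off by a factor of 1000. A quick self-contained check of the arithmetic (pure Java, no HBase needed; the timestamps are made-up examples):

```java
public class TimeRangePadding {
    // Widen a [start, stop) time range by pad on both sides.
    static long[] pad(long start, long stop, long pad) {
        return new long[] { start - pad, stop + pad };
    }

    public static void main(String[] args) {
        long start = 1350000000000L; // example epoch-millisecond timestamp
        long stop  = 1350000600000L; // 10 minutes later

        long[] asPosted  = pad(start, stop, 90L);    // 90 ms each side, as in the code above
        long[] asSeconds = pad(start, stop, 90000L); // 90 s each side

        // Extra width added to the original window, in milliseconds:
        System.out.println("posted padding adds: " + ((asPosted[1] - asPosted[0]) - (stop - start)));
        System.out.println("90s padding adds:    " + ((asSeconds[1] - asSeconds[0]) - (stop - start)));
    }
}
```

A 180 ms widening would not explain a 20K-vs-500 row gap by itself, but it is an easy unit mix-up to rule out.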

    #17244

    tedr
    Member

    Hi Petri,

    Could you post the code of your MultithreadedMapper here? It could be that there is something in it that is limiting the results. Also, if possible, could you post the MapReduce job configs?

    Thanks,
    Ted.

    #17242

    petri koski
    Member

    Hello and Thanks for replying ..

    I have one mapper that takes its input from an HBase table (URLs); inside that mapper the URLs are loaded, and the pages are saved in the reduce phase. I tried to run the code without MultithreadedMapper.class, using my own mapper directly, single-threaded, and I got ScannerTimeoutExceptions (I have the default 60 seconds). So I think that when I use MultithreadedMapper it doesn't surface those exceptions the same way, and you don't know something is wrong behind the scenes.
    The reason I am using a Hadoop mapper for loading URLs is to distribute that job between the nodes.
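One way to make those hidden failures visible, whichever mapper wrapper is used, is to count outcomes with Hadoop counters instead of only printing stack traces; the counts show up in the job history even when per-thread exceptions do not. A sketch of how Mapperseeder's map() could do this (the counter names are invented here, and loadpage() is the poster's helper from this thread, not a standard API):

```java
// Fragment of the Mapperseeder class: count outcomes so the job report
// shows how many rows were seen, loaded, or failed.
public enum FetchCounter { ROWS_SEEN, PAGES_LOADED, LOAD_FAILED }

@Override
public void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException, InterruptedException {
    context.getCounter(FetchCounter.ROWS_SEEN).increment(1);
    String url = Bytes.toString(values.getRow());
    String p = loadpage(url); // poster's helper; assumed to return null on failure
    if (p == null) {
        context.getCounter(FetchCounter.LOAD_FAILED).increment(1);
        return;
    }
    context.getCounter(FetchCounter.PAGES_LOADED).increment(1);
    context.write(new Text(url), new Text(p));
}
```

Comparing ROWS_SEEN against the shell's row count would show directly whether the scan delivered 500 rows or the mapper silently dropped the rest.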

    #17011

    Seth Lyubich
    Keymaster

    Hi Petri,

    Can you please provide a code example of what you are trying to do in Java?

    Thanks,
    Seth
