
The legacy Hortonworks Forum is now closed. You can view a read-only version of the former site by clicking here. The site will be taken offline on January 31, 2016.

MapReduce Forum

HDP map/reduce fast performance

  • #48728
    Dharanikumar Bodla

    Hi to all,
    Good morning,
    I have a set of 22 documents in text form loaded into HDFS. Running a MapReduce job from the command line takes 4 minutes 31 seconds to stream the 22 text files. How do I make the MapReduce process as fast as possible, so that these text files complete in 5-10 seconds?
    What changes do I need to make in Ambari Hadoop?
    Allocated 2 GB for YARN, and 400 GB for HDFS
    default virtual memory for a job's map task = 341 MB
    default virtual memory for a job's reduce task = 683 MB
    map-side sort buffer memory = 136 MB
    Also, when running a job, HBase errors out with the Region Server going down, and the Hive Metastore status service check times out.

    Thanks & regards,
    Bodla Dharani Kumar,

  • Author
  • #50717
    Rupert Bailey

    You might need to advise:
    how big these files are
    how many nodes are in your cluster
    how many processors per node
    RAM per node
    details of the source machine.

    This will indicate a good block size; you could consider (size of file) / (number of nodes * number of processors).
    It will be a map-only process without a sort, so make sure the maximum number of mappers is increased to at least: number of nodes * number of processors.
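As a back-of-envelope illustration of that block-size formula (all numbers below are made up, not taken from this thread): with a 12 GB input on a 4-node cluster with 8 cores per node, the suggestion works out like this:

```shell
# Hypothetical numbers illustrating (size of file) / (nodes * processors).
FILE_MB=12288   # total input size: 12 GB, expressed in MB
NODES=4         # nodes in the cluster
CORES=8         # processors per node
BLOCK_MB=$(( FILE_MB / (NODES * CORES) ))
echo "${BLOCK_MB} MB per block"   # one block per available map slot
```

This would suggest a 384 MB block size, i.e. one block for each of the 32 map slots, so every core gets work at once.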
    You may be trying to execute these sequentially; consider spawning child processes (in Unix, add an "&" at the end of the command) and looping through the files. This might mean your speed is increased at the source by several processors reading each file. If the source has multiple disks, consider a file on each disk and spawning a process per disk, as you'll be speed-bound pulling from disk.
    Reduce your replication factor to 1
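The "spawn child processes with &" suggestion above can be sketched as follows. The word count here is only a stand-in for whatever per-file command you actually run (for example an `hdfs dfs -put` or a streaming job), and the `/tmp/parallel_demo` paths are made up for the demo:

```shell
#!/bin/sh
# Sketch: background one child process per file with "&", then wait for
# all of them. Swap the wc stand-in for your real per-file command.
mkdir -p /tmp/parallel_demo
printf 'a b c\n' > /tmp/parallel_demo/f1.txt
printf 'd e\n'   > /tmp/parallel_demo/f2.txt
for f in /tmp/parallel_demo/f1.txt /tmp/parallel_demo/f2.txt; do
  wc -w < "$f" > "$f.count" &   # one background child per file
done
wait                            # block until every child has exited
```

The replication change is similarly a one-liner, `hdfs dfs -setrep 1 /path/to/data` (path is a placeholder); on a small cluster this cuts write traffic at the cost of redundancy.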

The forum ‘MapReduce’ is closed to new topics and replies.
