HDP map/reduce fast performance


This topic contains 1 reply, has 2 voices, and was last updated by  Rupert Bailey 1 year ago.

  • Creator
  • #48728

    Dharanikumar Bodla

    Hi all,
    Good morning,
    I have a set of 22 documents in text form loaded into HDFS. When I run a map/reduce job from the command line, streaming the 22 text files takes 4 minutes 31 seconds. How do I make the map/reduce process as fast as possible, so that these text files finish in 5-10 seconds?
    What changes do I need to make in Ambari for Hadoop?
    I have allocated 2 GB of memory for YARN and 400 GB for HDFS.
    Default virtual memory for a job's map task = 341 MB
    Default virtual memory for a job's reduce task = 683 MB
    Map-side sort buffer memory = 136 MB
    Also, when running a job, HBase errors out with a Region Server going down, and the Hive Metastore status service check times out.

    Thanks & regards,
    Bodla Dharani Kumar,

Viewing 1 reply (of 1 total)


  • Author
  • #50717

    Rupert Bailey

    You might need to advise:
    how big these files are
    how many nodes are in your cluster
    how many processors per node
    how much RAM per node.

    Details of the source machine.

    This will indicate a good block size. You could consider (size of file) / (number of nodes × number of processors).
    It will be a map-only process without a sort, so make sure the maximum number of mappers is increased to at least: number of nodes × number of processors.
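    As a hedged sketch of that tuning on a YARN-managed cluster (these are stock Hadoop 2.x property names; the values below are only illustrative and should be sized to your node RAM): concurrent map containers per node is roughly the node's YARN memory divided by the per-map container size, so shrinking the map container lets more mappers run at once.

```xml
<!-- yarn-site.xml: total memory YARN may hand out per node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value> <!-- matches the 2 GB allocated to YARN above -->
</property>

<!-- mapred-site.xml: smaller per-map container => more concurrent mappers -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>512</value> <!-- ~2048/512 = 4 concurrent mappers per node -->
</property>
```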
    You may be trying to execute these sequentially; consider spawning child processes (in Unix, use an “&” at the end of the command) and looping through the files. This might mean your speed is increased at the source by several processors reading each file. If the source has multiple disks, consider placing a file on each disk and spawning a process per disk, as you'll be speed-bound pulling from disk.
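    A minimal sketch of that spawning pattern: background one job per file with “&”, then `wait` for all of them. A local word count stands in here for the actual hadoop streaming command, and the paths are hypothetical; substitute your own input files and job.

```shell
# Create two stand-in input files (in place of the 22 HDFS documents).
mkdir -p /tmp/par_demo
printf 'a b c\n' > /tmp/par_demo/doc1.txt
printf 'd e\n'   > /tmp/par_demo/doc2.txt

# Launch one background child process per file instead of running
# them one after another.
for f in /tmp/par_demo/*.txt; do
  wc -w < "$f" > "$f.count" &   # "&" backgrounds this job
done
wait                            # block until every child has finished

cat /tmp/par_demo/doc1.txt.count /tmp/par_demo/doc2.txt.count
```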
    Reduce your replication factor to 1.
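    For reference, the default replication factor for newly written blocks can be set cluster-wide via the stock HDFS property (the per-path alternative is `hdfs dfs -setrep`):

```xml
<!-- hdfs-site.xml: replicate each block once (no extra copies) -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```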
