To deploy, configure, manage and scale Hadoop clusters in a way that optimizes performance and resource utilization there is a lot to consider. Here are 6 key things to think about as part of your planning:
- Operating system: Using a 64-bit operating system helps to avoid constraining the amount of memory that can be used on worker nodes. For example, 64-bit Red Hat Enterprise Linux 6.1 or greater is often preferred, due to better ecosystem support, more comprehensive functionality for components such as RAID controllers.
- Computation: Computational (or processing) capacity is determined by the aggregate number of Map/Reduce slots available across all nodes in a cluster. Map/Reduce slots are configured on a per-server basis. I/O performance issues can arise from sub-optimal disk-to-core ratios (too many slots and too few disks). HyperThreading improves process scheduling, allowing you to configure more Map/Reduce slots.
- Memory: Depending on the application, your system’s memory requirements will vary. They differ between the management services and the worker services. For the worker services, sufficient memory is needed to manage the TaskTracker and FileServer services in addition to the sum of all the memory assigned to each of the Map/Reduce slots. If you have a memory-bound Map/Reduce Job, you may need to increase the amount of memory on all the nodes running worker services. When increasing memory, you should always populate all the memory channels available to ensure optimum performance.
- Storage: A Hadoop platform that’s designed to achieve performance and scalability by moving the compute activity to the data is preferable. Using this approach, jobs are distributed to nodes close to the associated data, and tasks are run against data on local disks. Data storage requirements for the worker nodes may be best met by direct attached storage (DAS) in a Just a Bunch of Disks (JBOD) configuration and not as DAS with RAID or Network Attached Storage (NAS).
- Capacity: The number of disks and their corresponding storage capacity determines the total amount of the FileServer storage capacity for your cluster. Large Form Factor (3.5”) disks cost less and store more, compared to Small Form Factor disks. A number of block copies should be available to provide redundancy. The more disks you have, the less likely it is that you will have multiple tasks accessing a given disk at the same time. More tasks will be able to run against node-local data, as well.
- Network: Configuring only a single Top of Rack (TOR) switch per rack introduces a single point of failure for each rack. In a multi-rack system, such a failure will result in a flood of network traffic as Hadoop rebalances storage. In a single-rack system, this type of failure can bring down the whole cluster. Configuring two TOR switches per rack provides better redundancy, especially if link aggregation is configured between the switches. This way, if either switch fails, the servers will still have full network functionality. Not all switches have the ability to do link aggregation from individual servers to multiple switches. Incorporating dual power supplies for the switches can also help mitigate failures.
Thanks to HP for pulling this information together and testing Hortonworks Data Platform on HP hardware. For the full report, download the “HP Reference Architecture for Hortonworks Data Platform” whitepaper.