Thank you for reading our Data Lake 3.0 series! We are encouraged by the positive responses to our blogs (part 1, part 2, part 3, part 4, part 5). In Data Lake 3.0, we envision a large data lake shared between multiple tenants and dockerized applications (ranging from real-time to batch). In a shared data lake, it is important to isolate faulty devices and network links in order to meet SLA guarantees. In this blog, we discuss “Device Behavior Analytics”, introduced in the recently released HDP 2.6, which helps proactively diagnose the data lake.
HDFS is architected to automatically handle datanode and disk failures by taking corrective actions such as moving blocks to other datanodes or disks. Slow datanodes or disks, however, can degrade the overall performance of the cluster because the namenode continues to assign work to them.
Device behavior analytics improves the diagnosability of datanode performance problems by flagging slow datanodes in the cluster. This feature is available in HDP 2.6.1.
This feature incorporates code, designs and reviews from various Apache Hadoop community members including Xiaoyu Yao, Arpit Agarwal, Xiaobing Zhou, Jitendra Pandey and Anu Engineer.
A healthy DataNode does all of the following:
The complete failure of a DataNode process means that it stops exhibiting all of the above signs of life. Examples include process crash, node restart, process hang due to a long stop-the-world JVM Garbage Collection, or a network link failure which partitions the node from the rest of the cluster.
HDFS is designed to recover from datanode failures. Recovery from a complete datanode failure takes just a few minutes in most clusters since block re-replication is massively parallelized.
There is a large class of partial failures that are harder to detect, especially disk failures. Examples of such failures from real HDFS clusters include:
… and many more.
In each of these instances, the service continues to generate heartbeats and respond to health check messages, albeit with higher latency. Hence, it is assumed to be working. However, the degraded performance affects tasks that do I/O on that node.
Device behavior analytics allows administrators to detect straggler nodes that reduce cluster performance. Finding and fixing these nodes is the single most important action for improving cluster throughput: it improves the performance of the datanodes themselves, and services such as HBase and Hive that depend on them become faster and more responsive.
To enable Device Behavior Analytics, the following configurations must be set.
Set to true to enable Slow Datanode Detection.
Set to a value between 1 and 100 to enable Slow Disk detection. This setting controls the fraction of file IO events which will be sampled for profiling disk statistics. Setting this value to 100 will sample all disk IO. Sampling a large fraction of disk IO events might have a small performance impact.
This setting allows you to control how frequently datanodes will report their peer latencies to the Namenode via heartbeat and the frequency of disk outlier detection by the datanode. The default value for this setting is 30 minutes. The default time unit is milliseconds but multiple time unit suffixes are supported.
These settings should be added to hdfs-site.xml. On an Ambari-installed cluster, the settings can be added via the custom hdfs-site.xml section.
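As a sketch, the three settings described above appear to correspond to the Apache Hadoop properties dfs.datanode.peer.stats.enabled, dfs.datanode.fileio.profiling.sampling.percentage and dfs.datanode.outliers.report.interval. These names are an assumption drawn from Apache Hadoop's hdfs-default.xml, so verify them against the documentation for your HDP release. The Python snippet below renders them as hdfs-site.xml properties; the helper name and the values are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Property names are assumptions taken from Apache Hadoop's hdfs-default.xml.
# Values are illustrative, not recommendations.
DEVICE_ANALYTICS_SETTINGS = {
    # Slow Datanode Detection: true enables peer latency statistics.
    "dfs.datanode.peer.stats.enabled": "true",
    # Slow Disk Detection: percentage (1-100) of file IO events to sample.
    "dfs.datanode.fileio.profiling.sampling.percentage": "25",
    # Reporting interval; "30m" uses a time unit suffix (30 minutes).
    "dfs.datanode.outliers.report.interval": "30m",
}


def render_hdfs_site_properties(settings):
    """Render a dict of settings as an hdfs-site.xml style document."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    raw = ET.tostring(root, encoding="unicode")
    # Pretty-print so the result can be pasted into hdfs-site.xml.
    return minidom.parseString(raw).toprettyxml(indent="  ")


if __name__ == "__main__":
    print(render_hdfs_site_properties(DEVICE_ANALYTICS_SETTINGS))
```

On an Ambari-managed cluster you would enter the same name and value pairs through the UI rather than editing the file by hand.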
Our latest release (HDP 2.6.1) has device behavior analytics, which for the first time allows administrators to quickly identify straggler nodes and disks. This feature can be separated into two parts – Slow Datanode Detection and Slow Disk Detection.
Datanodes optionally collect latency statistics about their peers during the normal operation of the datanode write pipeline.
Figure 1 shows the datanode write pipeline. Applications write directly to a datanode, and the data is then replicated to other datanodes. Each datanode is constantly aware of the write performance of its peers. For example, DN2 measures how long it takes for DN3 to respond to write requests.
These latency stats are used to detect outliers among the peers. Hence slow peer detection may not be available in very small or very lightly loaded clusters where datanodes are not actively communicating with enough peers.
In Figure 2, DN 5 is a slow node. DN 2, 6 and 8 detect that DN 5 is a slow peer based on the mean write latencies they measure from DN 5 and from other datanodes. Each datanode builds a report listing the outliers among its peers and periodically sends it to the namenode via heartbeats. DN 2, 6 and 8 will report DN 5 as a slow peer to the namenode. The reporting interval is configurable.
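To make the idea concrete, here is a minimal Python sketch of how a single datanode might pick outliers from the mean write latencies it has measured for its peers. The rule used here (median plus a multiple of the median absolute deviation) and the names find_slow_peers and DEVIATION_MULTIPLIER are assumptions for illustration, not the actual HDFS implementation; only the 10-peer minimum comes from this post.

```python
from statistics import median

MIN_PEERS = 10              # this post: no detection with stats for fewer than 10 peers
DEVIATION_MULTIPLIER = 3.0  # assumed threshold, chosen only for illustration


def find_slow_peers(mean_latency_ms):
    """Return the peers whose mean write latency is an outlier.

    mean_latency_ms maps peer name -> mean write latency in milliseconds,
    as measured by this datanode in the write pipeline.
    """
    if len(mean_latency_ms) < MIN_PEERS:
        return {}  # not enough peers to judge what "normal" looks like
    values = list(mean_latency_ms.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that one slow peer cannot skew.
    mad = median(abs(v - med) for v in values)
    threshold = med + DEVIATION_MULTIPLIER * mad
    return {peer: lat for peer, lat in mean_latency_ms.items() if lat > threshold}
```

A median-based rule is used in the sketch because a single very slow peer would inflate a plain mean and standard deviation, which could hide the very node being looked for.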
Slow peer detection is not performed unless the datanode has statistics for at least 10 peers. The namenode aggregates the statistics received from all datanodes and computes the top 5 slowest datanodes in the cluster. The namenode maintains the list of slow peers, and administrators can read it via JMX. Datanodes also expose the average write latency of their peers through Datanode JMX.
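The namenode-side aggregation can be sketched in the same spirit. The Python below assumes that peers are ranked by how many distinct datanodes reported them as slow; that ranking rule and the function name top_slow_peers are assumptions for illustration, while the limit of 5 comes from this post.

```python
from collections import defaultdict

MAX_SLOW_PEERS = 5  # this post: the namenode computes the top 5 slowest datanodes


def top_slow_peers(reports, limit=MAX_SLOW_PEERS):
    """Aggregate per-datanode slow peer reports into a cluster-wide list.

    reports maps a reporting datanode -> iterable of peers it flagged as slow.
    Returns up to `limit` (peer, sorted list of reporters) pairs, with the
    most widely reported peer first.
    """
    reporters = defaultdict(set)
    for reporter, slow_peers in reports.items():
        for peer in slow_peers:
            if peer != reporter:  # ignore self-reports
                reporters[peer].add(reporter)
    # Sort by number of reporters (descending), then by name for stable output.
    ranked = sorted(reporters.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(peer, sorted(who)) for peer, who in ranked[:limit]]
```

With the Figure 2 scenario, reports from DN 2, 6 and 8 that each name DN 5 would place DN 5 at the top of the list.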
Each datanode collects I/O statistics from all of its disks. Device behavior analytics uses sampling to reduce the performance impact of I/O profiling; the percentage of file I/O events to be sampled is configurable.
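The effect of the sampling percentage can be illustrated with a small Python sketch. The class SampledIoProfiler and its methods are hypothetical names; this is not how the datanode is implemented, only a way to show why sampling a fraction of events keeps the profiling overhead low while still producing a usable mean latency.

```python
import random


class SampledIoProfiler:
    """Record latency for only a configured percentage of file I/O events."""

    def __init__(self, sampling_percentage, rng=None):
        if not 0 <= sampling_percentage <= 100:
            raise ValueError("sampling_percentage must be between 0 and 100")
        self.sampling_percentage = sampling_percentage
        self._rng = rng if rng is not None else random.Random()
        self.sampled_ops = 0
        self.total_latency_ms = 0.0

    def should_sample(self):
        # random() is in [0, 1), so 100 samples every event and 0 samples none.
        return self._rng.random() * 100 < self.sampling_percentage

    def record(self, latency_ms):
        """Account for one I/O event; it is kept only if it is sampled."""
        if self.should_sample():
            self.sampled_ops += 1
            self.total_latency_ms += latency_ms

    def mean_latency_ms(self):
        """Mean latency over the sampled events, or None if nothing was sampled."""
        if self.sampled_ops == 0:
            return None
        return self.total_latency_ms / self.sampled_ops
```

A lower percentage means fewer events are timed and aggregated, at the cost of a noisier latency estimate on lightly used disks.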
Datanodes detect slow disks using I/O latencies. Slow disk detection is not performed unless the datanode has at least 5 disks. Slow disk information is available via Datanode JMX.
The namenode maintains a list of the slowest disks in the cluster. This list is updated periodically and is available via Namenode JMX.
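For administrators who prefer to script this, the following Python sketch reads an attribute from the JSON that Hadoop daemons serve on their /jmx HTTP endpoint. The bean name Hadoop:service=NameNode,name=NameNodeStatus and the attribute names SlowPeersReport and SlowDisksReport are assumptions to verify on your own cluster (browsing /jmx on the namenode web port lists the available beans); the helper function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed bean name; confirm it by browsing /jmx on your namenode.
NAMENODE_STATUS_BEAN = "Hadoop:service=NameNode,name=NameNodeStatus"


def jmx_url(host, port, bean):
    """Build a /jmx URL that queries a single bean by name."""
    return f"http://{host}:{port}/jmx?{urlencode({'qry': bean})}"


def extract_attribute(payload, attribute):
    """Return an attribute of the first bean in a /jmx response, or None."""
    beans = payload.get("beans", [])
    if not beans:
        return None
    return beans[0].get(attribute)


def fetch_attribute(host, port, bean, attribute, timeout=10):
    """Fetch one bean over HTTP and return the requested attribute."""
    with urlopen(jmx_url(host, port, bean), timeout=timeout) as response:
        payload = json.load(response)
    return extract_attribute(payload, attribute)


# Example (host and port are placeholders for your namenode web address):
# report = fetch_attribute("nn.example.com", 50070,
#                          NAMENODE_STATUS_BEAN, "SlowDisksReport")
```

The sketch assumes an unsecured HTTP endpoint; a Kerberos or HTTPS-enabled cluster would need the matching authentication and URL scheme.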
The Datanode JMX output above reports disk3 and disk4 as outliers for the Datanode. The JMX also reports the latencies and number of operations per volume in another metric. The sample JMX output below for DataNodeVolume information for disk3 shows high average latencies for metadata and write IO operations.