June 06, 2017

Part 6 of Data Lake 3.0: A Self-Diagnosing Data Lake

Thank you for reading our Data Lake 3.0 series! We are encouraged by the positive responses to our blogs (part 1, part 2, part 3, part 4, part 5). In Data Lake 3.0, we envision a large data lake shared by multiple tenants and dockerized applications (ranging from real-time to batch). In a shared data lake, it is important to isolate faulty devices and network links in order to provide SLA guarantees. In this blog, we discuss “Device Behavior Analytics”, introduced in the recently released HDP 2.6, which proactively diagnoses the data lake.

Introduction of Device Behavior Analytics

HDFS is architected to automatically handle datanode and disk failures by taking corrective actions such as moving blocks to other datanodes or disks. Slow datanodes or disks can nevertheless affect the overall performance of the cluster, because the namenode continues to assign work to them.

Device behavior analytics improves the diagnosability of datanode performance problems by flagging slow datanodes in the cluster. This feature will be available in HDP-2.6.1. 

This feature incorporates code, designs and reviews from various Apache Hadoop community members including Xiaoyu Yao, Arpit Agarwal, Xiaobing Zhou, Jitendra Pandey and Anu Engineer.

Complete and Partial Failures

A healthy DataNode does all of the following:

  1. Sends regular heartbeats to all NameNodes and processes NameNode commands sent in response to heartbeats.
  2. Sends regular incremental and full block reports to each NameNode.
  3. Responds to application read and write requests in a timely manner, usually within a few milliseconds.
  4. Responds to HTTP requests on its web UI port.

The complete failure of a DataNode process means that it stops exhibiting all of the above signs of life. Examples include a process crash, a node restart, a process hang due to a long stop-the-world JVM garbage collection pause, or a network link failure that partitions the node from the rest of the cluster.

HDFS is designed to recover from datanode failures. Recovery from a complete datanode failure takes just a few minutes in most clusters since block re-replication is massively parallelized.

There is a large class of partial failures that are harder to detect, especially disk failures. Examples of such failures from real HDFS clusters include:

  1. Failure of a single disk on a datanode that is not detected by existing disk checks.
  2. Slow disk – a disk that completes IO operations successfully but with long latencies.
  3. The network connection to a specific node performs poorly, resulting in low throughput but not a complete loss of connectivity.
  4. Misconfigured DN heap settings on a node resulting in high CPU usage and poor performance.

… and many more.

In each of these instances the service continues to generate heartbeats and respond to health check messages, albeit with higher latency, and hence is assumed to be working. However, the degraded performance affects tasks that do I/O on that node.

Device behavior analytics allows administrators to detect straggler nodes that reduce cluster performance. Finding and fixing these nodes is the single most important action for improving cluster throughput: not only do the datanodes themselves perform better, but services running on top of them, such as HBase and Hive, also become faster and more responsive.

There was work done previously to improve diagnosability of slow nodes and disks via better logging, e.g. HBASE-6110, HDFS-11363 and HDFS-11603.

Enabling Device Behavior Analytics

To enable Device Behavior Analytics, the following configurations must be set.

  • dfs.datanode.peer.stats.enabled

Set to true to enable Slow Datanode Detection.

  • dfs.datanode.fileio.profiling.sampling.percentage

Set to a value between 1 and 100 to enable Slow Disk detection. This setting controls the percentage of file I/O events that are sampled for profiling disk statistics; setting it to 100 samples all disk I/O. Sampling a large fraction of disk I/O events might have a small performance impact.

  • dfs.datanode.outliers.report.interval

This setting controls how frequently datanodes report their peer latencies to the namenode via heartbeats, and how often each datanode runs disk outlier detection. The default value for this setting is 30 minutes. The default time unit is milliseconds, but multiple time unit suffixes are supported.

These settings should be added to hdfs-site.xml. On an Ambari-installed cluster, they can be added via the custom hdfs-site configuration.
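
For reference, a minimal hdfs-site.xml snippet that enables both detectors might look like the following. The property names are the ones listed above; the sample values (25 percent sampling and a 30 minute reporting interval) are only illustrative and should be tuned for your cluster.

    <!-- Illustrative hdfs-site.xml snippet; values are examples only. -->
    <property>
      <name>dfs.datanode.peer.stats.enabled</name>
      <value>true</value>       <!-- enables slow datanode (peer) detection -->
    </property>
    <property>
      <name>dfs.datanode.fileio.profiling.sampling.percentage</name>
      <value>25</value>         <!-- sample 25% of file I/O events for disk statistics -->
    </property>
    <property>
      <name>dfs.datanode.outliers.report.interval</name>
      <value>1800000</value>    <!-- 30 minutes expressed in milliseconds, the default -->
    </property>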

Device Behavior Analytics

Our latest release (HDP 2.6.1) includes device behavior analytics, which for the first time allows administrators to quickly identify straggler nodes and disks. This feature has two parts: Slow Datanode Detection and Slow Disk Detection.

Slow Datanode (Peer) Detection

Each datanode optionally collects latency statistics about its peers during normal operation of the datanode write pipeline.

Figure 1. Datanode write pipeline.

Figure 1 shows the datanode write pipeline. Applications write to one datanode directly, and the data is replicated through the pipeline to other datanodes. Each datanode continuously observes the write performance of its downstream peer; for example, DN2 measures how long DN3 takes to respond to write requests.

Figure 2. Slow datanode detection.

These latency stats are used to detect outliers among the peers. Hence slow peer detection may not be available in very small or very lightly loaded clusters where datanodes are not actively communicating with enough peers.

In Figure 2, DN 5 is a slow node. DN 2, 6 and 8 detect that DN 5 is a slow peer based on the mean write latencies they measure from DN 5 and from other datanodes. Each datanode builds a report listing the outliers among its peers and periodically sends it to the namenode via heartbeats, so DN 2, 6 and 8 will report DN 5 as a slow peer. The reporting interval is configurable.

Slow peer detection is not performed unless a datanode has statistics for at least 10 peers. The namenode aggregates the statistics received from all datanodes and computes the top 5 slowest datanodes in the cluster. The namenode maintains the list of slow peers, which administrators can read via JMX. Each datanode also exposes the average write latencies of its peers through its own JMX.
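
To make “outliers among the peers” concrete, the sketch below shows one common way such a rule can be implemented: flag any peer whose mean write latency sits well above the median across all observed peers, using the median absolute deviation as the yardstick. This is only an illustration of the general approach; the exact statistics, thresholds and class names used inside HDFS may differ.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative outlier rule: flag peers whose latency is far above the median. */
    public class PeerOutlierSketch {

        // Hypothetical threshold: how far above the median counts as "slow".
        private static final double DEVIATION_MULTIPLIER = 3.0;
        private static final int MIN_PEERS = 10;  // mirrors the "at least 10 peers" rule

        public static Map<String, Double> findSlowPeers(Map<String, Double> avgLatencyMs) {
            Map<String, Double> slow = new HashMap<>();
            if (avgLatencyMs.size() < MIN_PEERS) {
                return slow;                      // not enough peers to judge outliers
            }
            double[] values = avgLatencyMs.values().stream()
                .mapToDouble(Double::doubleValue).toArray();
            double median = median(values);
            double[] deviations = Arrays.stream(values)
                .map(v -> Math.abs(v - median)).toArray();
            double mad = median(deviations);      // median absolute deviation
            // Guard against a zero deviation so tiny jitter is not flagged.
            double threshold = median + DEVIATION_MULTIPLIER * Math.max(mad, 1.0);
            for (Map.Entry<String, Double> e : avgLatencyMs.entrySet()) {
                if (e.getValue() > threshold) {
                    slow.put(e.getKey(), e.getValue());
                }
            }
            return slow;
        }

        private static double median(double[] values) {
            double[] sorted = values.clone();
            Arrays.sort(sorted);
            int mid = sorted.length / 2;
            return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
        }
    }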

Slow Disk Detection

Each datanode collects I/O statistics from all of its disks. Device behavior analytics uses sampling to reduce the performance impact of I/O profiling; the percentage of file I/O events to sample is configurable.

Figure 3. Slow disk detection.

The datanode detects slow disks using I/O latencies. Slow disk detection is not performed unless the datanode has at least 5 disks. Slow disk information is available via datanode JMX.
The namenode maintains a list of the slowest disks in the cluster. This list is updated periodically, and the information is available via namenode JMX.
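
Because both reports are exposed through JMX, the easiest way to look at them is the HTTP /jmx endpoint that Hadoop daemons serve on their web UI port. The sketch below simply dumps the NameNode's NameNodeInfo MBean as JSON; the host name is hypothetical, 50070 is the default NameNode HTTP port in Hadoop 2.x, and the exact attribute names (for example SlowPeersReport and SlowDisksReport in recent Apache Hadoop releases) may differ in your version. The same endpoint on a datanode's web UI port can be queried for the per-peer write latencies and slow disk information mentioned above.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    /** Dumps NameNode JMX JSON so the slow peer / slow disk reports can be inspected. */
    public class DumpNameNodeJmx {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode address; adjust host and port for your cluster.
            String url = "http://namenode.example.com:50070/jmx"
                + "?qry=Hadoop:service=NameNode,name=NameNodeInfo";
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Look for slow peer / slow disk report attributes in the returned JSON
                    // (attribute names may vary by release).
                    System.out.println(line);
                }
            }
        }
    }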

Appendix

  • Sample Namenode JMX output reporting slow nodes

  • Sample Datanode JMX output showing average write latency of peers

  • Sample Datanode JMX output showing slow disks

The Datanode JMX output above reports disk3 and disk4 as outliers for that datanode. The JMX also reports the latencies and number of operations per volume in another metric. The sample JMX output below, showing DataNodeVolume information for disk3, indicates high average latencies for metadata and write I/O operations.

  • Sample Namenode JMX output reporting slow disks and their latencies

Comments

  • Informative article and I am looking forward to leveraging the new diagnostics to get more reliable performance from my cluster. However I take issue with a line in the 2nd paragraph:
    ‘HDFS is designed to recover from datanode failures. Recovery from a complete datanode failure takes just a few minutes in most clusters since block re-replication is massively parallelized.’

    By default each DN only replicates at most 2 blocks[1] per interval (3s by default). With DNs that have a number of large disks it can take a very long time (hours) to resolve the missing blocks from a DN failure. I would be interested to know if there is work underway to make this better[2] or more tunable[3] given the circumstances.

    [1] https://github.com/apache/hadoop/blob/branch-2.8.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java#L211

    [2] HDFS-2537 https://issues.apache.org/jira/browse/HDFS-2537

    [3] HDFS-9943 https://issues.apache.org/jira/browse/HDFS-9943
