May 06, 2015

Improving Operations with Ambari User Views

Apache Ambari 2.0 User Views introduce two functional tools to help you understand and optimize your cluster resources to get the best performance in a multitenant Hadoop environment.

Tez View: Understand and Optimize Jobs in your Cluster

The Tez View gives you visibility into all the jobs on your cluster, allowing you to quickly identify which jobs consume the most resources and which are the best candidates to optimize.

With the Tez View you can quickly spot the Hive or Pig jobs that take the longest, write the most data, or consume the most CPU. Once you’ve identified these big jobs, the Tez View lets you drill in to see exactly how each job is running and helps you identify ways to optimize it.

Optimizing Hive SQL Queries or Pig Jobs

Important job running slow? You need to drill down and see what’s happening in the job. The Tez View lets you see exactly how the job is executed and the resources it uses at every step of the way.

One common performance bottleneck in SQL is doing a reduce-side join when a map-side join would work instead. A reduce-side join requires large amounts of data to move over the network and lots of temporary data to be written. With a map-side join, only small amounts of data move over the network and the SQL processing happens in place. Map-side joins can be more than 10 times faster than reduce-side joins, so you want to use them whenever you can, even if it means making a few special configurations for that big job.
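To make the difference concrete, here is a minimal in-memory sketch of the two join strategies. It is only an illustration of the idea, not how Hive or Tez implement joins; the function names, the `partitions` parameter and the row format (lists of dictionaries) are assumptions made for this example.

```python
from collections import defaultdict


def map_side_join(fact_rows, dim_rows, key):
    """Join by loading the small table into memory and streaming the big one."""
    # Build an in-memory lookup from the small (dimension) table once.
    lookup = defaultdict(list)
    for dim in dim_rows:
        lookup[dim[key]].append(dim)

    # Stream the large (fact) table; no fact rows are repartitioned or spilled.
    for fact in fact_rows:
        for dim in lookup.get(fact[key], []):
            yield {**dim, **fact}


def reduce_side_join(fact_rows, dim_rows, key, partitions=4):
    """Join by shuffling every row of both tables to a partition chosen by key."""
    buckets = [{"fact": [], "dim": []} for _ in range(partitions)]
    shuffled = 0
    for side, rows in (("fact", fact_rows), ("dim", dim_rows)):
        for row in rows:
            buckets[hash(row[key]) % partitions][side].append(row)
            shuffled += 1  # every row of both tables is moved once

    # Each partition joins only the rows that were shuffled to it.
    joined = []
    for bucket in buckets:
        for fact in bucket["fact"]:
            for dim in bucket["dim"]:
                if fact[key] == dim[key]:
                    joined.append({**dim, **fact})
    return joined, shuffled
```

Both functions produce the same joined rows. The difference is that `reduce_side_join` has to move every row of both inputs before it can join anything, while `map_side_join` only needs the small table to fit in memory, which is why the container size matters for this optimization.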

With the Tez View you can spot this problem easily and correct it all within Ambari. Let’s look at an example.

am_tabl_4

Using the Tez View, we quickly spot a shuffle join (a reduce-side join), which we want to avoid if possible. Hive tries to convert joins to map-side joins automatically, but this is constrained by the size of a Tez container. If you have some extremely large dimension tables, it may make sense to use custom settings for the job and increase both the container size and the variable that controls Hive’s map-side join threshold (see Hive’s Join Optimization page for more info). When we do that, the plan looks quite different:

am_tabl_5

Why does this help? A map join minimizes the need to write massive amounts of meaningless temporary data. In this case, less than 1% as much temporary data is written after the switch.

It’s not uncommon for a conversion to map join to accelerate a large job 10x or more.
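The custom settings mentioned above (a larger container and a higher map-side join threshold) are typically applied as session-level `SET` statements before the query. The sketch below renders such statements. The property names are standard Hive settings, but the helper name `map_join_settings`, its parameters and the one-third ratio between threshold and container size are assumptions for illustration (the ratio is a commonly quoted rule of thumb, not something this article states), so check the values against your own cluster.

```python
def map_join_settings(container_mb, threshold_fraction=1 / 3):
    """Render Hive session settings that raise the map-side join threshold.

    container_mb: Tez container size in megabytes (hive.tez.container.size).
    threshold_fraction: assumed share of the container that small tables may
        occupy; the resulting threshold is expressed in bytes.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    if not 0 < threshold_fraction < 1:
        raise ValueError("threshold_fraction must be between 0 and 1")

    threshold_bytes = int(container_mb * 1024 * 1024 * threshold_fraction)
    return [
        "SET hive.auto.convert.join=true;",
        "SET hive.auto.convert.join.noconditionaltask=true;",
        f"SET hive.tez.container.size={container_mb};",
        f"SET hive.auto.convert.join.noconditionaltask.size={threshold_bytes};",
    ]


if __name__ == "__main__":
    # Print the statements for a hypothetical 4 GB container.
    print("\n".join(map_join_settings(4096)))
```

Keeping these as per-job settings rather than cluster-wide defaults matches the advice above: only the big job that needs the larger container pays for it.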

Another common wasteful scenario is a query that tries to join two fact tables together. Queries like this should be optimized either by enabling Hive’s Cost-Based Optimizer or by manually changing the join order. The Tez View makes it easy to find these big queries and fix them.

Fine-Grained Tenant-Level Controls with Capacity Scheduler View

The YARN Capacity Scheduler allows Hadoop to be shared among multiple independent tenants while providing guaranteed capacity and predictable SLAs. The Capacity Scheduler divides resources through use of YARN queues, which are sized based on the relative allocations given to various tenants.

Until now, configuring queues has required hand-editing XML files, so the process was error-prone and it was difficult to get an overall view of how the Capacity Scheduler was dividing resources. Configuring a queue also comes with a lot of rules: the queues at a given level must together use all of the capacity, a queue’s max capacity cannot be less than its capacity, removing a queue requires a ResourceManager restart, and the ACL syntax for job submission and queue administration must be formatted exactly right. Follow all that? No? Install the View!

The Capacity Scheduler View solves these problems by providing a simple UI that lets you create and modify YARN queues and see their distribution at a glance. The UI enforces configuration rules, highlights invalid conditions and hides the complex syntax of setting ACLs. The View is also smart enough to know if a disruptive ResourceManager restart is needed or if you can simply refresh the configuration with no downtime.
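To give a feel for the kind of rules involved, here is a rough sketch of two of the checks described above: sibling capacities must add up to 100%, and max capacity cannot be lower than capacity. This is not the View's actual code; the nested-dictionary layout, the key names `capacity`, `max_capacity` and `queues`, and the function name are all assumptions for this example.

```python
def validate_queues(children, path="root"):
    """Return a list of rule violations for the sibling queues under `path`.

    `children` maps a queue name to a dict with a "capacity" percentage and,
    optionally, a "max_capacity" percentage and nested "queues".
    """
    errors = []
    if not children:
        return errors

    # Rule 1: queues at the same level must use all of the parent's capacity.
    total = sum(queue["capacity"] for queue in children.values())
    if abs(total - 100) > 1e-6:
        errors.append(f"{path}: child capacities sum to {total}, expected 100")

    for name, queue in children.items():
        queue_path = f"{path}.{name}"
        # Rule 2: max capacity cannot be less than capacity.
        max_capacity = queue.get("max_capacity", 100)
        if max_capacity < queue["capacity"]:
            errors.append(
                f"{queue_path}: max capacity {max_capacity} "
                f"is less than capacity {queue['capacity']}"
            )
        # Apply the same rules to nested queues.
        errors.extend(validate_queues(queue.get("queues", {}), queue_path))
    return errors
```

An empty list means the hierarchy passes both checks. Hand-edited XML gets no such feedback until the ResourceManager rejects the configuration.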

am_tabl_8

For instance, here we see that 60% of cluster resources are dedicated to Engineering and that, within Engineering, QE gets the majority of resources. Despite this, Development has a max capacity of 100%, meaning that if QE is not using its resources, Development is free to take advantage of them.
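Queue capacities are percentages of the parent queue, so a queue's guaranteed share of the whole cluster is the product of the percentages along its path. The sketch below works that out for the example above. The 80/20 split between QE and Development is an assumed figure for illustration (the text only says QE gets the majority), and the function name is hypothetical.

```python
def absolute_capacity(percentages):
    """Multiply per-level capacity percentages into a share of the whole cluster."""
    share = 100.0
    for pct in percentages:
        share = share * pct / 100
    return share


if __name__ == "__main__":
    # Engineering holds 60% of the cluster; the 20/80 split below is assumed.
    print("Development guaranteed %:", absolute_capacity([60, 20]))
    print("QE guaranteed %:", absolute_capacity([60, 80]))
```

With a max capacity of 100% of its parent, Development can grow from its guaranteed share up to whatever Engineering as a whole is allowed to use when QE is idle.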

With the Capacity Scheduler View you can easily:

  • Partition Hadoop resources among tenants.
  • Define, view and modify queue definitions.
  • Establish fine-grained control on who can run jobs in queues.
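The ACL syntax that the View hides follows the general Hadoop convention: a comma-separated list of users, a single space, then a comma-separated list of groups. A value of `*` allows everyone and a single space allows no one. The helper below is a small sketch of building such a string; the function name is hypothetical, and the result would go into queue properties such as `yarn.scheduler.capacity.<queue-path>.acl_submit_applications` or `acl_administer_queue`.

```python
def format_acl(users=(), groups=()):
    """Build a Hadoop-style ACL string of the form 'user1,user2 group1,group2'."""
    for name in (*users, *groups):
        # Commas and spaces are separators, so they cannot appear in a name.
        if not name or "," in name or " " in name:
            raise ValueError(f"invalid ACL entry: {name!r}")

    if not users and not groups:
        return " "  # a single space means nobody is allowed

    acl = f"{','.join(users)} {','.join(groups)}"
    # Keep the leading space for a groups-only ACL; drop a trailing one.
    return acl if groups else acl.rstrip()
```

The groups-only case is the one that is easy to get wrong by hand: the string must start with a space, otherwise the group names are read as user names.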

Try Ambari User View Technical Preview!

Ambari User Views are designed to provide capabilities that assist with the operational aspects of data application development and workload management. All the new Ambari Views have been pre-installed in the newly updated Hortonworks Sandbox, so just download and you’re ready to go. Want to try these on an existing cluster? To download and configure the Ambari User Views Technical Preview use this document. If you have questions or feedback on the User Views please post them to the Ambari User View Forum.

Tech Preview User Views

  • Hive: Hive View allows the user to write and execute SQL queries on the cluster. It shows the history of all Hive queries executed on the cluster, whether run from Hive View or from another source such as JDBC/ODBC or the CLI. It also provides a graphical view of the query execution plan, which helps the user debug the query for correctness and tune its performance. It integrates the Tez View, which allows the user to debug any Tez job, including monitoring the progress of a job (whether from Hive or Pig) while it is running. This View contribution can be found here.
  • Pig: Pig View is similar to the Hive View. It allows writing and running a Pig script. It has support for saving scripts, and for loading and using existing UDFs in scripts. This View contribution can be found here.
  • Capacity Scheduler: Capacity Scheduler View helps a Hadoop operator set up YARN workload management easily to enable multi-tenant and multi-workload processing. This View provisions cluster resources by creating and managing YARN queues. This View contribution can be found here.
  • Files: Files View allows the user to manage, browse and upload files and folders in HDFS. This View contribution can be found here.

 

Comments

  • Finally – Ambari Views can’t come soon enough to replace Hue and also to give the kind of insights into Tez job processing that we used to get in MRv1.

    The immediate questions that follow on from delivering Ambari Views are around HA:

    – does the Files View follow Active/Standby NN failovers?
    – does the Hive View follow recent HiveServer2 HA failover? (I’m assuming it’s using HiveServer2 like Hue)
    – when is Ambari itself going to get stateful HA failover, given that it now becomes a user-facing UI with the Ambari Views providing the development interface to the cluster for users?
    – given that the metastore DB is still a single point of failure, can all Hive metadata be migrated to HBase (or better, Cassandra), since doing HA DBs just to back the MetaStore is non-trivial and HBase is going to be used for metrics in future HDP releases anyway?

  • Thanks for the detailed article. I have a question about accessing the Views. Is there a way for a user to access a View (to which he already has access) without logging in to the Ambari server? Or is there a way to create a View that can be accessed by everyone?

    Thanks
    Guru
