New in HDP 2: More Powerful Scheduling Options in Oozie

In this post we’ll cover some new scheduling options available via Apache Oozie in HDP 2. You can try out these capabilities in HDP 2 Beta and HDP 2 Beta Sandbox.

What Is Oozie Again?

Apache Oozie is a workflow engine and scheduler for Hadoop. Oozie allows you to run jobs in Hadoop at pre-defined intervals. The jobs can be simple ones that execute single Hive or Pig commands or can be full directed acyclic graphs representing complex workflows. Workflows and schedules are expressed using XML specification files.

Scheduling in Oozie

Oozie has historically allowed only very basic forms of scheduling: you could choose to run jobs separated by a fixed number of minutes, hours, days or weeks. That’s all. This works fine for processes that need to run continuously all year, like building a search index to power a website.

However, there are a lot of cases that don’t fit this model. For example, maybe you want to export data to a reporting system used during the day by business analysts. It would be wasteful to run the jobs at times when no analyst is going to take advantage of the new information, such as overnight. You might want a policy that says “only run these jobs on weekdays between 6AM and 8PM”. Oozie didn’t support this kind of scheduling policy without resorting to unnatural acts.

Better Scheduling for Oozie

If you’ve spent time with Linux or other UNIX-like systems, you’ve likely used cron at some point. Bowen Zhang proposed OOZIE-1306 to introduce cron-like scheduling into Oozie, and it received a very positive reaction from the community. With this feature, far more sophisticated scheduling is possible: you can have jobs run only on certain days of the week, only in certain time ranges during the day, only on certain days of the month, and much more.

HDP 2.0 Beta includes this feature and is ready for testing now. To help you get started, this post walks through the new scheduling in action.

Example: Running A Job Every Weekday at 2AM

Oozie separates specifications for workflow and schedule into a workflow specification and a coordinator specification, respectively. Coordinator specifications are optional; they are only required if you want to run a job repeatedly on a schedule. By convention, the workflow specification lives in a file called workflow.xml and the coordinator specification in a file called coordinator.xml. The new cron-like scheduling affects the coordinator specification. Let’s take a look at a coordinator specification that will cause a workflow to be run every weekday at 2 AM.

<coordinator-app name="weekdays-at-two-am"
frequency="0 2 * * 2-6"
start="${start}" end="${end}" timezone="UTC"

The key thing here is the frequency attribute in the coordinator-app element. It holds a cron-like specification that tells Oozie when to run the workflow. The value for <app-path> is specified in a separate properties file. The specification is “cron-like”, and you might notice one important difference: days of the week are numbered 1-7 (1 being Sunday) as opposed to the 0-6 numbering (0 being Sunday) used in standard cron. That is why 2-6 in the example means Monday through Friday.
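
To tie this back to the reporting policy mentioned earlier (“only run these jobs on weekdays between 6AM and 8PM”), here is a minimal sketch of what a complete coordinator.xml for that policy could look like. It is an illustration rather than one of the published examples: the coordinator name, the schema version in xmlns and the ${workflowAppUri} property are assumptions, so adjust them to match your cluster and your properties file.

```xml
<!-- Sketch only: the name, the xmlns version and ${workflowAppUri} are assumptions. -->
<coordinator-app name="weekdays-six-to-eight"
                 frequency="0 6-20 * * 2-6"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <!-- frequency fields: minute, hour, day of month, month, day of week.
       "0 6-20 * * 2-6" fires on the hour from 6AM through 8PM,
       Monday (2) through Friday (6), using Oozie's 1-7 day numbering. -->
  <action>
    <workflow>
      <!-- HDFS path of the workflow application, supplied by the properties file. -->
      <app-path>${workflowAppUri}</app-path>
    </workflow>
  </action>
</coordinator-app>
```

With this schedule the last run of each weekday starts at 8PM. Change the hour range to 6-19 if the 8PM run should be excluded.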

Try It For Yourself With These Examples

We’ve made several fully functional examples available that you can try out on the HDP 2 Beta or the HDP 2 Beta Sandbox. The samples are set up to run out of the box on the Sandbox. The examples create directories within HDFS based on the specified schedules, and they include a README detailing how to get the files into your cluster and the specific commands to run. After you run them, navigate to the Oozie console at your Sandbox IP, port 11000, and click the “Coordinator Jobs” tab to check the status of the coordinator job.


If you’re looking to roll your own schedules, I personally found Robert Plank’s Crontab Builder quite useful. Just remember to shift the day-of-week values to Oozie’s 1-7 numbering.



Diwakar Dhanuskodi
July 7, 2014 at 1:50 am

Is there a tutorial for Oozie with YARN? Please share!
