Oozie Forum

Restrict materialisation in Oozie

  • #49020
    Neal Kerry
    Participant

    I’m having a problem in Oozie: I need to restrict the materialisation of jobs, but cannot figure out a way of doing it.

    I have a job that processes input from a data series. Each day it takes the current day’s data and the output from the previous day, and produces output for the next day to use.
    Some days there is no data, so each task uses coord:latest to request the latest prior output that is available.
    The problem occurs when we run a backlog; tasks can then be run more frequently than once a day, and the coordinator will queue up a series of jobs to run all at once. Say we have a week’s worth of backlog, 7 jobs are materialized, each requests the input for its date, and the ‘latest’ available output folder from a previous task. As the previous task hasn’t run for 6 of these jobs, all 7 jobs see the same ‘latest’ output, and 6 are given the wrong inputs.
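    To make the failure concrete, here is a small Python sketch of the behaviour described above. It is not Oozie code and does not call any Oozie API; it is only a toy model that assumes inputs are fixed at materialisation, as I observe. The names `latest_output` and `run_backlog`, and the dates, are made up for illustration.

```python
from datetime import date, timedelta


def latest_output(available, before):
    """Return the most recent output date strictly before `before`, or None."""
    prior = [d for d in available if d < before]
    return max(prior) if prior else None


def run_backlog(start, days, resolve_at_materialisation):
    """Toy model of a backlog run (hypothetical, not Oozie itself).

    Returns {nominal date: date of the prior output the job was given}.
    """
    # Assume one output already exists for the day before the backlog starts.
    available = {start - timedelta(days=1)}
    nominal = [start + timedelta(days=i) for i in range(days)]

    chosen = {}
    if resolve_at_materialisation:
        # All jobs pick their 'latest' input up front, before any job has run.
        chosen = {d: latest_output(available, d) for d in nominal}

    for d in nominal:
        if not resolve_at_materialisation:
            # Input is picked just before the job executes.
            chosen[d] = latest_output(available, d)
        # The job runs and publishes its own output.
        available.add(d)
    return chosen


if __name__ == "__main__":
    start = date(2014, 3, 1)  # arbitrary example date
    for eager in (True, False):
        result = run_backlog(start, 7, resolve_at_materialisation=eager)
        wrong = sum(1 for d, src in result.items() if src != d - timedelta(days=1))
        print("eager" if eager else "lazy", "-> jobs with wrong input:", wrong)
```

    With inputs resolved up front, every job in the week is handed the same prior output, so only the first one is correct. With inputs resolved just before execution, each job picks up its predecessor's output, which is the behaviour I am after.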

    I have tried using the ‘concurrency’ and ‘throttle’ settings to deal with this, but throttle only restricts waiting jobs, and as these jobs are finding valid input, they are not classed as waiting. The concurrency setting does not help either; it holds the materialised jobs in the ‘ready’ state and prevents them from running, but as their inputs are chosen upon materialisation, they still run with the wrong inputs, even though their predecessors will have run and produced the correct output.

    What I need is a way to restrict jobs from even materialising before another job has run, or a way of forcing a job to re-evaluate its inputs before it actually begins execution. So far I have been unable to find a way of doing this – is it possible? I’m also open to suggestions for a totally different approach to achieve the same thing if anyone has any bright ideas!

    Thanks for reading

