Restrict materialisation in Oozie
I’m having a problem in Oozie: I need to restrict the materialisation of jobs, but cannot figure out a way of doing it.
I have a job that processes input from a data series. Each day it takes the current day’s data and the output from the previous day, and produces output for the next day to use.
Some days there is no data, so each task uses coord:latest to request the latest prior output that is available.
The problem occurs when we run a backlog; tasks can then be run more frequently than once a day, and the coordinator will queue up a series of jobs to run all at once. Say we have a week’s worth of backlog, 7 jobs are materialized, each requests the input for its date, and the ‘latest’ available output folder from a previous task. As the previous task hasn’t run for 6 of these jobs, all 7 jobs see the same ‘latest’ output, and 6 are given the wrong inputs.
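To make the failure concrete, here is a small Python simulation of the scenario. It is not Oozie code, only a toy model that assumes every job in the backlog resolves its ‘latest’ input at the moment it is materialised. The function name resolve_latest and the day numbers are made up for illustration, and the model assumes every day in the backlog has data.

```python
def resolve_latest(available_outputs, nominal_day):
    """Return the newest output day strictly before nominal_day, or None."""
    earlier = [d for d in available_outputs if d < nominal_day]
    return max(earlier) if earlier else None

# Hypothetical state: output exists only for day 0; days 1..7 are the backlog.
available_outputs = {0}
backlog = range(1, 8)

# All seven jobs are materialised together, so each one resolves 'latest'
# against the same set of outputs, before any of them has run.
resolved = {day: resolve_latest(available_outputs, day) for day in backlog}

for day, prev in resolved.items():
    # In this toy model the correct input is always the previous day's output.
    status = "ok" if prev == day - 1 else "wrong"
    print(f"job for day {day}: uses output of day {prev} ({status})")
```

In this model only the job for day 1 picks up the right predecessor; the other six all resolve to the output of day 0.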
I have tried using the ‘concurrency’ and ‘throttle’ settings to deal with this, but throttle only restricts waiting jobs, and as these jobs are finding valid input, they are not classed as waiting. The concurrency setting does not help either; it holds the materialised jobs in the ‘ready’ state and prevents them from running, but as their inputs are chosen upon materialisation, they still run with the wrong inputs, even though their predecessors will have run and produced the correct output.
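The behaviour I’m seeing with concurrency can be sketched as a toy Python model. This is not Oozie code and has not been checked against Oozie’s source; run_backlog and its bind_at_start flag are hypothetical names used only to contrast inputs that are fixed at materialisation with inputs that are looked up again just before each job runs.

```python
def latest_before(outputs, day):
    """Newest output day strictly before `day`, or None if there is none."""
    earlier = [d for d in outputs if d < day]
    return max(earlier) if earlier else None

def run_backlog(days, initial_outputs, bind_at_start):
    """Run jobs one at a time (concurrency of 1) and record each job's input.

    bind_at_start=False: inputs are fixed when the whole batch is materialised.
    bind_at_start=True:  inputs are looked up again just before each job runs.
    """
    outputs = set(initial_outputs)
    # Early binding: every job resolves 'latest' before any job has run.
    early = {day: latest_before(outputs, day) for day in days}
    used = {}
    for day in days:
        used[day] = latest_before(outputs, day) if bind_at_start else early[day]
        outputs.add(day)  # the job finishes and publishes its output
    return used

days = list(range(1, 8))
early_bound = run_backlog(days, {0}, bind_at_start=False)
late_bound = run_backlog(days, {0}, bind_at_start=True)
print("bound at materialisation:", early_bound)
print("bound at execution:      ", late_bound)
```

Even though the jobs run strictly one after another, the early-bound run feeds every job the output of day 0, while the late-bound run gives each job its predecessor’s output.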
What I need is a way to restrict jobs from even materialising before another job has run, or a way of forcing a job to re-evaluate its inputs before it actually begins execution. So far I have been unable to find a way of doing this – is it possible? I’m also open to suggestions for a totally different approach to achieve the same thing if anyone has any bright ideas!
Thanks for reading