How to limit the number of concurrent jobs that a Pig script starts

This topic contains 1 reply, has 2 voices, and was last updated by  Jianyong Dai 11 months, 1 week ago.


    Hi,

    I am trying to merge a few files, remove duplicates, and store the result using the following macro:

    DEFINE mergeDateDimension(validDataSet, dimensionFieldName, previousDimensionFile) RETURNS merged {
        dates = FOREACH $validDataSet GENERATE $dimensionFieldName;
        oldDimensions = LOAD '$previousDimensionFile' USING PigStorage('|') AS (
            id:LONG,
            monthName:CHARARRAY,
            monthId:INT,
            year:INT,
            fiscalYear:INT,
            originalDate:CHARARRAY);
        oldOriginalDates = FOREACH oldDimensions GENERATE originalDate;
        allDates = UNION dates, oldOriginalDates;
        uniqueDates = DISTINCT allDates;
        $merged = FOREACH uniqueDates GENERATE toDateDimension($0);
    };

    I call this macro four times in my script:

    billDateDim = mergeDateDimension(validData, BillDate, '$atbPrevOutputBase/dimensions/$billDateDimensionName');
    STORE billDateDim INTO '$atbOutputBase/dimensions/$billDateDimensionName';

    admissionDateDim = mergeDateDimension(validData, AdmissionDate, '$atbPrevOutputBase/dimensions/$admissionDateDimensionName');
    STORE admissionDateDim INTO '$atbOutputBase/dimensions/$admissionDateDimensionName';

    dischDateDim = mergeDateDimension(validData, DischargeDate, '$atbPrevOutputBase/dimensions/$dischargeDateDimensionName');
    STORE dischDateDim INTO '$atbOutputBase/dimensions/$dischargeDateDimensionName';

    arPostDateDim = mergeDateDimension(validData, PeriodDate, '$atbPrevOutputBase/dimensions/$arPostDateDimensionName');
    STORE arPostDateDim INTO '$atbOutputBase/dimensions/$arPostDateDimensionName';

    When I run the script in the sandbox, it starts four parallel MapReduce jobs and they get stuck.
    But if I remove the last two lines and run the script, everything works fine (i.e. three jobs complete successfully).

    So I am wondering whether it is possible to limit the number of concurrent jobs (not map/reduce tasks)?


    Jianyong Dai
    Participant

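    You can put the "exec" keyword into the Pig script to manually create an execution boundary. Statements before each "exec" are run before Pig continues with the statements that follow it: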

    billDateDim = mergeDateDimension(validData, BillDate, '$atbPrevOutputBase/dimensions/$billDateDimensionName');
    STORE billDateDim INTO '$atbOutputBase/dimensions/$billDateDimensionName';

    admissionDateDim = mergeDateDimension(validData, AdmissionDate, '$atbPrevOutputBase/dimensions/$admissionDateDimensionName');
    STORE admissionDateDim INTO '$atbOutputBase/dimensions/$admissionDateDimensionName';

    exec

    dischDateDim = mergeDateDimension(validData, DischargeDate, '$atbPrevOutputBase/dimensions/$dischargeDateDimensionName');
    STORE dischDateDim INTO '$atbOutputBase/dimensions/$dischargeDateDimensionName';

    arPostDateDim = mergeDateDimension(validData, PeriodDate, '$atbPrevOutputBase/dimensions/$arPostDateDimensionName');
    STORE arPostDateDim INTO '$atbOutputBase/dimensions/$arPostDateDimensionName';

    exec
    ……
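
    For readers who want to try the idea without the macro, the same pattern can be reduced to a minimal sketch. The input path, output paths, and schema below are hypothetical placeholders, not taken from the original script. The sketch assumes the default multi-query mode, in which Pig collects the STORE statements of a batch and plans them together, so a bare exec between two STOREs splits them into separate batches.

```pig
-- Hypothetical input path and schema, for illustration only.
raw = LOAD 'input/events' USING PigStorage('|')
      AS (billDate:chararray, admissionDate:chararray);

-- First batch: one DISTINCT and one STORE.
billDates = FOREACH raw GENERATE billDate;
uniqueBillDates = DISTINCT billDates;
STORE uniqueBillDates INTO 'output/bill_dates';

-- Execution boundary: run everything queued so far
-- before Pig reads the statements below.
exec

-- Second batch: starts only after the first batch has finished.
admissionDates = FOREACH raw GENERATE admissionDate;
uniqueAdmissionDates = DISTINCT admissionDates;
STORE uniqueAdmissionDates INTO 'output/admission_dates';
```

    Placing exec after every STORE should make the jobs run one batch at a time; placing it after every second STORE, as in the reply above, allows two to run together. Pig's command line also documents a -M (-no_multiquery) switch that turns multi-query execution off for the whole script, which may be a simpler alternative when no batching is wanted at all.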
