Grokbase Groups Pig user June 2011
Hi,

My pig query is roughly the following:

register some_lib.jar
a = load 'somefile' using CustomUDF();
b = foreach a generate CustomProjectionUDF();
c = foreach b generate var1, var2, var3;
d = group b by (var1, var2);
e = foreach d generate flatten(group), SUM(c.var1), SUM(c.var2),
SUM(c.var3);
store e into 'file';

I expected the combiner to be used, but the optimizer did not use one. The following is the output I see (version 0.8.1):
INFO executionengine.HExecutionEngine: pig.usenewlogicalplan is set to true.
New logical plan will be used.
INFO executionengine.HExecutionEngine: (Name: agg:
Store(hdfs://machine:9000/SomeFile:PigStorage('|')) - scope-4353 Operator
Key: scope-4353)
INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100
optimistic? false
INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
INFO mapReduceLayer.AccumulatorOptimizer: Reducer is to run in accumulative
mode.
INFO pigstats.ScriptState: Pig script settings are added to the job
INFO mapReduceLayer.JobControlCompiler: BytesPerReducer=1000000000
maxReducers=999 totalInputFileSize=611579950
INFO mapReduceLayer.JobControlCompiler: Neither PARALLEL nor default
parallelism is set for this job. Setting number of reducers to 1
INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for
submission.

How can I enforce the use of combiner here?

Thanks,
Shubham.


  • Daniel Dai at Jun 16, 2011 at 6:43 pm
    Do you mean "d = group c by (var1, var2); "? If so, I can see the
    combiner being used. Which version of Pig are you using?

    Daniel
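
    [Editorial sketch, not part of the original message.] If the grouping Daniel suggests was indeed the intent, the script would read as follows (aliases as in the original post; CustomUDF and CustomProjectionUDF are the poster's own):

    ```pig
    register some_lib.jar
    a = load 'somefile' using CustomUDF();
    b = foreach a generate CustomProjectionUDF();
    c = foreach b generate var1, var2, var3;
    -- group c (the projected relation), not b, so the SUMs over c.* line up
    d = group c by (var1, var2);
    e = foreach d generate flatten(group), SUM(c.var1), SUM(c.var2), SUM(c.var3);
    store e into 'file';
    ```

    Since SUM implements the Algebraic interface, this shape allows Pig to plan a combiner.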
  • Shubham Chopra at Jun 16, 2011 at 7:01 pm
    Hi Daniel,

    I am seeing this behaviour with 0.8.1.

    Consider an input file named 'a' containing the following:
    1|2|3
    3||4

    I start pig in the local mode and then use the following script:
    a = load 'a' using PigStorage('|');
    b = group a by $0;
    c = foreach b generate 'Test' as name, flatten(group), SUM(a.$0) as s0,
    SUM(a.$1) as s1, SUM(a.$2) as s2;
    dump c;

    The above script does not use the combiner.

    However, the following script does:
    a = load 'a' using PigStorage('|');
    b = group a by $0;
    c = foreach b generate flatten(group), SUM(a.$0) as s0, SUM(a.$1) as s1,
    SUM(a.$2) as s2;
    dump c;

    This script uses the combiner.

    I pinpointed the difference to whether a constant appears in the foreach
    statement. Is this expected behavior? I thought the decision to use a
    combiner depended only on the UDFs implementing the Algebraic interface.
    Why does the constant projection stop the combiner from being used?

    Thanks,
    Shubham.
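
    [Editorial sketch, not part of the original message.] One possible workaround on 0.8.1 is to keep the aggregating foreach free of constants, so it stays combiner-eligible, and attach the literal in a second foreach:

    ```pig
    a = load 'a' using PigStorage('|');
    b = group a by $0;
    -- only group projections and algebraic SUMs here, so the combiner can be planned
    c = foreach b generate flatten(group), SUM(a.$0) as s0, SUM(a.$1) as s1, SUM(a.$2) as s2;
    -- add the constant after the aggregation; this step is cheap
    d = foreach c generate 'Test' as name, group, s0, s1, s2;
    dump d;
    ```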
  • Dmitriy Ryaboy at Jun 16, 2011 at 7:14 pm
    I've confirmed this behavior in 0.8.1, and that it's fixed in
    trunk (didn't check 0.9).
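
    [Editorial sketch, not part of the original message.] Whether Pig plans a combiner can be checked without running the job: `explain` prints the plans for an alias, and a Combine Plan section appears in the MapReduce plan when the combiner will be used (alias `c` as in the scripts above):

    ```pig
    -- prints the logical, physical, and MapReduce plans for alias c;
    -- a "Combine Plan" section in the MapReduce plan indicates combiner use
    explain c;
    ```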


  • Shubham Chopra at Jun 16, 2011 at 7:48 pm
    Thanks! I guess it's time to move to trunk then!

    ~Shubham.

Discussion Overview
group: user
categories: pig, hadoop
posted: Jun 16, '11 at 6:13 pm
active: Jun 16, '11 at 7:48 pm
posts: 5
users: 3
website: pig.apache.org