Grokbase Groups Pig dev November 2007
[ https://issues.apache.org/jira/browse/PIG-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539597 ]

Benjamin Reed commented on PIG-16:
----------------------------------

"set parallelism" would probably be better than "set reduce_parallelism" because parallelism can be used in other places besides reduce. For example, we will hopefully soon detect that a dataset is already sorted and do the group by in the map rather than in the reduce.

It would probably also be better to take a number rather than TRUE or FALSE so that you can deviate from the default when needed. I would think it should be:

set parallelism number|DEFAULT

In reality, the Hadoop configuration file should have the optimal number of reducer tasks for a given configuration. (That is the default, not false.) If you want to override it, you could provide a specific number.
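
As a rough illustration of the "number|DEFAULT" semantics described above, here is a minimal Python sketch. It is not Pig code: the function names and the validation rule (rejecting values below 1) are assumptions made for the example, and the configured default stands in for whatever the Hadoop configuration file provides.

```python
from typing import Optional


def parse_parallelism(value: str) -> Optional[int]:
    """Parse the argument of a hypothetical `set parallelism number|DEFAULT`.

    Returns None for DEFAULT (meaning "use the configured default"),
    otherwise the requested number of tasks.
    """
    token = value.strip()
    if token.upper() == "DEFAULT":
        return None
    count = int(token)  # raises ValueError for non-numeric input
    if count < 1:
        # Assumption for this sketch: at least one task is required.
        raise ValueError("parallelism must be at least 1")
    return count


def effective_parallelism(setting: Optional[int], configured_default: int) -> int:
    """Use the explicit setting if given, else the configured default."""
    return configured_default if setting is None else setting
```

With a configured default of 4, "DEFAULT" resolves to 4, while "8" overrides it and resolves to 8.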

I also don't think we should assume that the use of an AlgebraicFunction means the parallelism should be set to 1. In general I would think that is not the case. The only case I can think of for automatically setting it to 1 would be a group all.
setting parallel from grunt via set command
-------------------------------------------

Key: PIG-16
URL: https://issues.apache.org/jira/browse/PIG-16
Project: Pig
Issue Type: Improvement
Components: grunt
Reporter: Olga Natkovich
Priority: Minor

I'd like to propose a different model which uses the grunt "set" option and/or a command line option which sets reduce
parallelism to be true and automatic:
set reduce_parallelism TRUE
set reduce_parallelism FALSE [Default - BTW, why is this the default?]
This way I won't have to update my script every single time I try playing with -D"hod=-m N"; parallelism for reduce
statements will default, appropriately, to 2*(N-1).
Alternatively, could I just specify PARALLEL with no value, or PARALLEL DEFAULT? Any time I needed to force reduce
to be a single job, I could write PARALLEL 1.
Basically, this whole thing tripped me up for a long time, and I still haven't understood whether there is a really good
reason not to make parallelism the default.
I guess there might be if you have aggregation functions that do not parallelize.
If this is the case, then it seems to me that this should be detectable automagically based on whether the function is
a vanilla EvalFunction or an AlgebraicFunction.
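
The default proposed in the description can be worked through with a small Python sketch. Only the 2*(N-1) formula and the idea of an explicit PARALLEL value come from the issue text; the function names and the rejection of N below 2 are assumptions made for the illustration.

```python
from typing import Optional


def default_reducers(nodes: int) -> int:
    """Default reduce parallelism from the issue text: 2*(N-1),
    where N is the node count passed via -D"hod=-m N"."""
    if nodes < 2:
        # Assumption for this sketch: N=1 would yield zero reducers.
        raise ValueError("need at least 2 nodes for a non-zero default")
    return 2 * (nodes - 1)


def reducers_for_statement(nodes: int, explicit_parallel: Optional[int] = None) -> int:
    """An explicit PARALLEL value wins; otherwise fall back to 2*(N-1)."""
    if explicit_parallel is not None:
        return explicit_parallel
    return default_reducers(nodes)
```

For N = 10 the default works out to 2*(10-1) = 18 reducers, while a statement written with PARALLEL 1 would still run with a single reducer.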

Posted: Nov 2, '07 at 3:16a
Active: Nov 2, '07 at 2:41p