already been discussed previously
(here <http://wiki.apache.org/pig/PigTypesFunctionalSpec>). There seems to
be a question of what the right thing to do is. It seems clear to me,
though. When an operator like '*' is applied, it is clearly an item-wise
operation that is to be applied to each member of the bag. If a function is
an aggregate (such as SUM), then it operates across the entire bag.
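To make the distinction concrete, here is a minimal sketch in Python rather than Pig Latin. It assumes a bag can be modelled as a plain list of tuples; the names `itemwise_mul` and `bag_sum` and the sample rows are hypothetical and only illustrate the two kinds of operation.

```python
# A bag is modelled as a list of tuples (an assumption of this sketch).
bag = [(100, 2), (200, 3), (300, 1)]  # hypothetical (salary, multiplier) rows


def itemwise_mul(rows):
    """Item-wise: apply '*' to each member; the result is again a bag."""
    return [salary * multiplier for salary, multiplier in rows]


def bag_sum(values):
    """Aggregate: consume the entire bag and return a single value."""
    return sum(values)


products = itemwise_mul(bag)  # one value per member of the bag
total = bag_sum(products)     # one value for the whole bag
```

The item-wise operation preserves the shape of the bag, while the aggregate collapses it, which is why the two can be told apart without extra syntax.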

When a COGROUP occurs, just do what SQL does: perform a cross join if an
aggregate has been applied across several bags. And do so automatically, so
we don't have to type out the separate FLATTENs, as in:

grouped = COGROUP employee BY name, bonuses BY name;
flattened = FOREACH grouped GENERATE group, FLATTEN(employee),
    FLATTEN(bonuses);
grouped_again = GROUP flattened BY group;
total_compensation = FOREACH grouped_again GENERATE group,
    SUM(employee::salary * bonuses::multiplier);

So this should do the same:

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group,
    SUM(employee::salary * bonuses::multiplier);

automatically, because that can only have one meaning.
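As a rough check of that claim, the two forms can be modelled in Python. This is only a sketch under assumptions: relations are lists of (name, value) tuples, the sample data is made up, and `cogroup`, `explicit_total` and `implicit_total` are hypothetical helpers, not anything in Pig.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample relations: (name, salary) and (name, multiplier).
employee = [("alice", 100), ("bob", 200)]
bonuses = [("alice", 2), ("alice", 3), ("bob", 1)]


def cogroup(left, right):
    """Group both relations by their first field: key -> (left bag, right bag)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[0]][0].append(row)
    for row in right:
        groups[row[0]][1].append(row)
    return dict(groups)


def explicit_total(left, right):
    """COGROUP, FLATTEN both bags (a cross product per key), GROUP, then SUM."""
    flattened = [
        (key, e, b)
        for key, (es, bs) in cogroup(left, right).items()
        for e, b in product(es, bs)
    ]
    totals = defaultdict(int)
    for key, e, b in flattened:
        totals[key] += e[1] * b[1]
    return dict(totals)


def implicit_total(left, right):
    """The proposed shorthand: the cross product happens inside the aggregate."""
    return {
        key: sum(e[1] * b[1] for e, b in product(es, bs))
        for key, (es, bs) in cogroup(left, right).items()
        if es and bs  # keys with an empty bag drop out, as they do when flattened
    }
```

On this data both functions give 500 for alice (100*2 + 100*3) and 200 for bob.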

Alternatively, if it is desired to stay with a low-level language, all of
this confusion between UDFs that take bags and UDFs that operate on members
of bags can be resolved if we do two things.

1.) Allow UDFs to actually become first-class citizens. This way we can
pass UDFs to other UDFs.

2.) Introduce the concept of map() and reduce() operators over bags.

These two things allow us more freedom and follow the map-reduce paradigm
more closely.

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group,
    reduce(SUM, map(*, employee::salary, bonuses::multiplier));

Actually, this may deserve separate keywords. Map and reduce operate on
single bags, whereas Pig introduces the concept of co-grouping, so we should
have *comap* and *coreduce* that take functions and operate on the multiple
bags that result from a *cogroup*.

grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group,
    REDUCE(SUM, COMAP(*, employee::salary, bonuses::multiplier));

This allows us to write efficiently, on one line, what would otherwise be
several aliases and unnecessary FLATTENed cross products.
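For what it is worth, the intended semantics can be sketched in a few lines of Python. The names `comap` and `reduce_bag` are hypothetical stand-ins for the proposed keywords, bags are modelled as lists of field values, and I am assuming SUM can be treated as a binary "add" folded over the bag.

```python
from functools import reduce
from itertools import product
import operator


def comap(fn, *bags):
    """Apply fn to every combination of one member taken from each bag."""
    return [fn(*combo) for combo in product(*bags)]


def reduce_bag(fn, bag, initial=0):
    """Fold a binary function over a single bag, starting from `initial`."""
    return reduce(fn, bag, initial)


# Hypothetical bags produced by a cogroup for a single key.
salaries = [100]
multipliers = [2, 3]

total = reduce_bag(operator.add, comap(operator.mul, salaries, multipliers))
```

Because the functions are passed as values, the same two helpers cover any item-wise operator and any fold, which is the point of making UDFs first class.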

A second thing that I see is the recommendation to implement looping
constructs. I wonder if I may suggest, as a follow-up to the above, that we
beef up UDFs as first-class citizens and add the ability to define UDFs in
Pig Latin itself, with the ability to recurse.

The reason why I think this is a better way to loop than *for(;;)*,
*while(){}* and *do{}while()* statements is that recursive calls are
functional and are more easily optimizable than imperative constructs. The
PigJournal <http://wiki.apache.org/pig/PigJournal> has an entry for all of
these constructs and functions under the heading "Extending Pig to Include
Branching, Looping, and Functions", but because the map-reduce paradigm is
inherently functional, I think that staying functional would be a better
way to approach this improvement. So the minimal set of additional features
needed is functions and branching, and we would get loops as a side effect
of those improvements.
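To illustrate how a loop falls out of functions plus branching, here is a small Python sketch. The helper `iterate` and its arguments are hypothetical and say nothing about how Pig would implement this.

```python
def iterate(step, done, state):
    """A loop built from only a function call and a branch."""
    if done(state):                           # branching
        return state
    return iterate(step, done, step(state))   # recursion replaces the loop


# Hypothetical usage: keep doubling until the value reaches at least 100.
result = iterate(lambda x: x * 2, lambda x: x >= 100, 1)
```

Python itself does not eliminate tail calls, so this form is bounded by the recursion limit there. An implementation inside Pig would be free to turn such tail recursion into iteration.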

In order for the optimizations to be available to the Pig Latin
interpreter, the functions and branching *must* be implemented within the
Pig system. If they are externalized, or implemented as UDL of some other
language, then the opportunities for optimizing the execution vanish.

Anyways, a couple of cents on a rainy day.

On Wed, Mar 10, 2010 at 10:15 AM, hc busy wrote:

On Wed, Mar 10, 2010 at 9:31 AM, hc busy wrote:

