[ https://issues.apache.org/jira/browse/PIG-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shravan Matthur Narayanamurthy updated PIG-425:
-----------------------------------------------
Status: Patch Available (was: Open)
The MRCompiler currently tries to pack as many operators as possible into a single phase. So when we have two cogroups one after the other, the LR (local rearrange) in the second cogroup gets pushed into the reducer. Since the store simply writes out the LR's output, we should be fine if we load that output and pass it to the GR (global rearrange).
However, since IndexedTuple isn't implemented as a new kind of Tuple with a Factory, the Load on the other side tries to load a DefaultTuple from an IndexedTuple and incidentally succeeds due to the way IndexedTuple is serialized. However, this can't be carried any further and when the mapper tries to collect the IndexedTuple, it fails.
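The failure mode described above can be reproduced in miniature. The sketch below does not use Pig's classes: BaseTuple, KeyedTuple, and the byte layout are hypothetical stand-ins for DefaultTuple, IndexedTuple, and their serialization. It only shows how a subclass written without any type tag is read back as the base class without complaint, and how the later downcast is what fails.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class UntaggedReadSketch {
    // Hypothetical stand-ins, not Pig's DefaultTuple / IndexedTuple.
    static class BaseTuple {
        final String[] fields;
        BaseTuple(String... fields) { this.fields = fields; }
    }

    static class KeyedTuple extends BaseTuple {
        KeyedTuple(String... fields) { super(fields); }
    }

    // Writes only the field data; nothing records which class was written.
    static byte[] write(BaseTuple t) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(t.fields.length);
        for (String f : t.fields) {
            out.writeUTF(f);
        }
        return buf.toByteArray();
    }

    // The reader has no type information, so it always builds a BaseTuple.
    static BaseTuple read(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        String[] fields = new String[in.readInt()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = in.readUTF();
        }
        return new BaseTuple(fields);
    }

    // Returns true when downcasting to KeyedTuple throws ClassCastException.
    static boolean castFails(BaseTuple t) {
        try {
            KeyedTuple k = (KeyedTuple) t;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        BaseTuple loaded = read(write(new KeyedTuple("alice", "52")));
        System.out.println("read succeeded, cast fails: " + castFails(loaded));
    }
}
```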
The fix I have is threefold. First, I have modified IndexedTuple's serialization to suit the solution. Second, I have made IndexedTuple a type of tuple by writing a different value to the marker byte, indicating that this is an IndexedTuple (like we identify null and non-null tuples). Third, I have modified DataReaderWriter's readDatum method to check whether we have an IndexedTuple and process it according to IndexedTuple's serialization format.
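As a rough illustration of the marker-byte idea, here is a minimal sketch. It is not the code in 425.patch: the class names, the marker values, and the byte layout are all assumptions made for the example, and Pig's real DataReaderWriter handles many more types. The point is only that the writer emits a distinct leading byte per tuple kind and the reader dispatches on it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class MarkerByteSketch {
    // Hypothetical marker values; Pig's actual constants may differ.
    static final byte PLAIN_TUPLE = 1;
    static final byte INDEXED_TUPLE = 2;

    // Hypothetical stand-ins for DefaultTuple and IndexedTuple.
    static class PlainTuple {
        final String[] fields;
        PlainTuple(String... fields) { this.fields = fields; }
    }

    static class IndexedTupleSketch extends PlainTuple {
        final byte index; // which cogroup input the tuple came from
        IndexedTupleSketch(byte index, String... fields) {
            super(fields);
            this.index = index;
        }
    }

    // Writer: the first byte says which kind of tuple follows.
    static void writeDatum(DataOutput out, PlainTuple t) throws IOException {
        if (t instanceof IndexedTupleSketch) {
            out.writeByte(INDEXED_TUPLE);
            out.writeByte(((IndexedTupleSketch) t).index);
        } else {
            out.writeByte(PLAIN_TUPLE);
        }
        out.writeInt(t.fields.length);
        for (String f : t.fields) {
            out.writeUTF(f);
        }
    }

    // Reader: dispatch on the marker byte before decoding the body.
    static PlainTuple readDatum(DataInput in) throws IOException {
        byte marker = in.readByte();
        byte index = 0;
        if (marker == INDEXED_TUPLE) {
            index = in.readByte();
        } else if (marker != PLAIN_TUPLE) {
            throw new IOException("Unknown marker byte: " + marker);
        }
        String[] fields = new String[in.readInt()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = in.readUTF();
        }
        return marker == INDEXED_TUPLE
                ? new IndexedTupleSketch(index, fields)
                : new PlainTuple(fields);
    }

    // Serializes a tuple to a byte array and reads it back.
    static PlainTuple roundTrip(PlainTuple t) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeDatum(new DataOutputStream(buf), t);
        return readDatum(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
    }

    public static void main(String[] args) throws IOException {
        PlainTuple back = roundTrip(new IndexedTupleSketch((byte) 1, "alice", "52"));
        System.out.println("still indexed: " + (back instanceof IndexedTupleSketch));
    }
}
```

With the type recorded in the stream, a tuple written by the store comes back from the load as the same kind of tuple, so a later cast in the mapper has something valid to cast.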
To check whether my hypothesis is correct, I removed the RearrangeAdjuster from MRCompiler with this change in place. The unit tests passed, except for MRCompiler, which fails due to GoldenPlan issues. We need to run all the end-to-end tests against this patch to confirm that it works.
Split -> distinct or order -> cogroup fails
-------------------------------------------
Key: PIG-425
URL: https://issues.apache.org/jira/browse/PIG-425
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Shravan Matthur Narayanamurthy
Priority: Critical
Fix For: types_branch
Attachments: 425.patch
A script like:
{code}
a = load 'myfile' as (name:chararray, age:int, gpa:double);
split a into a1 if age > 50, a2 if name < 'm';
b2 = distinct a2;
b1 = order a1 by name;
c = cogroup b2 by name, b1 by name;
d = foreach c generate flatten(group), COUNT($1), COUNT($2);
store d into 'OUTPATH';
{code}
Will abort with the error:
{code}
08/09/09 11:46:50 ERROR mapReduceLayer.Launcher: Error message from task (map) tip_200809080906_0185_m_000000java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.IndexedTuple
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:81)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:135)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:75)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
{code}
The issue is that the RearrangeAdjuster in MRCompiler does not correctly recognize this as a cogroup, so it does not move the local rearrange out of the reduce and into the map.