Jul 20, 2011 at 8:16 pm
It depends on whether you want to do an inner or an outer (also called
co-group) merge join. If you are doing an inner merge join on two tables,
PigStorage satisfies all the criteria and can be used. If you want to
do an outer merge join (or an inner merge join on more than two tables),
then you need a loader that implements CollectableLoadFunc, which
PigStorage doesn't; only Zebra's TableLoader does.
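For the inner two-table case, a minimal sketch might look like the
following. The paths, the tab delimiter, and the field names are
hypothetical, and it assumes both inputs are already sorted in ascending
order on the join key.

```pig
-- Hypothetical inputs: both files are assumed to be pre-sorted on 'id'.
A = LOAD 'sorted_left' USING PigStorage('\t') AS (id, name);
B = LOAD 'sorted_right' USING PigStorage('\t') AS (id, score);

-- USING 'merge' requests the map-side sort-merge join.
C = JOIN A BY id, B BY id USING 'merge';

STORE C INTO 'joined_output';
```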
Hope it helps,
On Wed, Jul 20, 2011 at 12:54, Tomas Svarovsky wrote:
Not sure if this would be helpful, but the docs say that the default
PigStorage does implement that. I guess that your data needs to be
already sorted if you do not want to go through the reduce phase
during the join.
On Wed, Jul 20, 2011 at 12:13 PM, Ankur Jain wrote:
Thanks Ashutosh! Right, I realized that too yesterday. So, is there any
other loader that implements the CollectableLoadFunc interface required
by the merge join?
On Wed, Jul 20, 2011 at 10:22 AM, Ashutosh Chauhan wrote:
Zebra's TableLoader only works with data written out using Zebra's
TableStorer. So, you need to write the data using Zebra first, and then
load it using TableLoader and do the merge join.
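A rough sketch of that two-step flow is below. The jar path, the table
name, and the column name are hypothetical, and the TableStorer('') and
TableLoader('', 'sorted') arguments are assumptions based on the Zebra
examples I remember, so check them against the Zebra docs for your
version.

```pig
-- Script 1: write the data out as a Zebra table.
-- The jar path is a placeholder.
REGISTER /path/to/zebra.jar;

raw = LOAD 'my_input' USING PigStorage() AS (id);

-- Sort before storing so that the table is written in key order.
sorted_raw = ORDER raw BY id;

-- Assumption: an empty storage hint keeps the default column grouping.
STORE sorted_raw INTO 'my_zebra_table'
    USING org.apache.hadoop.zebra.pig.TableStorer('');

-- Script 2 (run separately, after script 1 has finished):
-- load the Zebra table for use in a merge join.
-- Assumption: '' selects all columns, 'sorted' asks for sorted access.
A = LOAD 'my_zebra_table'
    USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
```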
On Tue, Jul 19, 2011 at 14:28, Ankur Jain wrote:
I'm trying to do a map-side-only merge join in Pig using Zebra's
TableLoader. (My data allows merge join.) But I'm unable to use the
TableLoader. Even a simple script that loads a table and just stores it back
doesn't work -
A = load 'my_input' using org.apache.hadoop.zebra.pig.TableLoader('',
store A into 'my_output';
'my_input' is input directory containing a single file with just 1 column -
The error I get is -
"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal
error. Failed to find deleted column groupsjava.io.IOException: BT Schema
file doesn't exist: *file:/......./my_input/.btschema*"
I have tried specifying the schema using the 'AS' clause and the DESCRIBE
statement as well, but I get the same error. Is the .btschema file
required? Is there any documentation available on its format? (I tried
comma-separated column names with/without type info.)
I am also willing to work with any other loader that satisfies the merge
join constraints. Thanks in anticipation.