You're not missing anything obvious... what you're trying to do, on face
value, is not an easy thing to do. In M/R, joining is done based on
partitioning to the same reducer...how can you do that if you have a case
and foo is sent to reducer 1, bar to reducer 2? There's no way to know
where keys should be sent.
That said, there are options.
Option 1: a cross. Undesirable because of data explosion.
Option 2: If one of the data sets is large enough to fit in memory, you can
make a UDF that brings it in, and does the join for you. This is
essentially option 1.
Option 3: Less generically, exploit the join you're actually doing. In the
dummy example, it looks like you're checking if a token is contained in
another string. You could convert this into a join by tokenizing,
flattening, doing the join, etc. I don't know how close your real use case
is to what you posted.
2012/8/29 Mat Kelcey <firstname.lastname@example.org>
Considering the following two relations...
grunt> querys = load 'query' as (id:int, token:chararray);
grunt> dump querys
grunt> documents = load 'document' as (id:int, text:chararray);
grunt> dump documents;
(21,foo bar frog)
Is is possible to do a join where the query:token is not equal to but
contained in documents:text ?
(11,foo,21,foo bar frog)
(12,bar,21,foo bar frog)
(13,frog,21,foo bar frog)
I can certainly do this in Java map/reduce (as we all had to in the
dark days days before pig) but is there a way to hack this together
with a custom udf or some other weird join backdoor (customer
partitioner for a group or something whacky) ???
It's been a long day, maybe I'm just missing some super obvious..