This might be a Cascading issue rather than a Cascalog one. When doing a
(cross-join), I observe all data going through a single reducer.

I expected that when doing "A cross B", I'd get something like a mapper
for each A_i, that emits A_i joined with all B's.

I ended up loading all B's into memory, then emitting them via a
mapcatfn. This is equivalent, right? If so, does Cascalog just not have
the optimization that translates (cross-join) into this pattern?
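
For reference, the workaround described above amounts to a map-side cross join: one side is held in memory and every row of the other side is expanded against it. Below is a minimal sketch in plain Python, not Cascalog code. The name `cross_with_small_side` and the sample rows are hypothetical, and the sketch assumes the B side fits in memory.

```python
from typing import Iterable, Iterator, List, Tuple


def cross_with_small_side(
    a_rows: Iterable[str], b_rows: List[int]
) -> Iterator[Tuple[str, int]]:
    """Emit (a, b) for every streamed a and every in-memory b.

    This plays the role of the mapcat-style function: one input row in,
    len(b_rows) output rows out, with no shuffle or reducer involved.
    """
    for a in a_rows:      # streamed side, one row at a time
        for b in b_rows:  # small side, fully resident in memory
            yield (a, b)


# Hypothetical sample data: 2 rows on the A side, 3 on the B side.
pairs = list(cross_with_small_side(["a1", "a2"], [1, 2, 3]))
```

Because each A row is expanded independently, the work can be spread over as many mappers as there are input splits of A.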

--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  • Sam Ritchie at Dec 6, 2013 at 9:40 pm
    How would you do the translation if both sides of the join are the same
    size, or if neither side fits in memory?

    --
    Sam Ritchie (@sritchie)
    Paddleguru Co-Founder
    703.863.8561
    www.paddleguru.com <http://www.paddleguru.com/>
    Twitter <http://twitter.com/paddleguru> // Facebook <http://facebook.com/paddleguru>

  • Mason at Dec 6, 2013 at 9:50 pm
    I'd split the A's and B's into manageable chunks, then, in a mapper for
    each pair of chunks, emit the result of crossing that pair.

    A little more formally:

    1) Divide

        split (A's) => (A1's, A2's, ... AN's)
        split (B's) => (B1's, B2's, ... BN's)

    2) Conquer

         cross (A1's, B1's)
         cross (A1's, B2's)
         ...
         cross (A1's, BN's)
         cross (A2's, B1's)
         cross (A2's, B2's)
         ...
         cross (A2's, BN's)
         ...
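
The divide and conquer steps above can be sketched in plain Python as follows. This is an illustration only, not Cascalog or Cascading code. The names `split` and `chunked_cross` and the `chunk_size` parameter are hypothetical, and the chunk pairs run sequentially here where a real job would hand each pair to its own mapper.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def split(rows: Sequence[A], chunk_size: int) -> List[Sequence[A]]:
    """Divide: cut rows into consecutive chunks of at most chunk_size items."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def chunked_cross(
    a_rows: Sequence[A], b_rows: Sequence[B], chunk_size: int
) -> Iterator[Tuple[A, B]]:
    """Conquer: cross every (A chunk, B chunk) pair.

    Each pair is an independent unit of work, so only two chunks need
    to be in memory at any one time.
    """
    chunk_pairs = product(split(a_rows, chunk_size), split(b_rows, chunk_size))
    for a_chunk, b_chunk in chunk_pairs:
        for a in a_chunk:
            for b in b_chunk:
                yield (a, b)


# Hypothetical sample data: 3 x 3 rows, chunks of at most 2 rows each.
result = list(chunked_cross([1, 2, 3], ["x", "y", "z"], chunk_size=2))
```

The number of units of work is the number of A chunks times the number of B chunks, and every A chunk is read once per B chunk, so smaller chunks trade memory for repeated reads.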

  • Sam Ritchie at Dec 6, 2013 at 9:41 pm
    You've implemented a hash join, which is supported by Cascading and by
    the underlying Cascading DSL in Cascalog. Check out
    cascalog.cascading.operations for the hash-join function, and the tests
    for some examples.
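
For readers unfamiliar with the term, the general shape of a hash join is sketched below in plain Python. This is not the hash-join function from cascalog.cascading.operations, whose signature is not shown in this thread. The name `hash_join`, the key functions, and the sample rows are all hypothetical. The sketch assumes the `right` side fits in memory.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple, TypeVar

L = TypeVar("L")
R = TypeVar("R")


def hash_join(
    left: Iterable[L],
    right: Iterable[R],
    left_key: Callable[[L], Hashable],
    right_key: Callable[[R], Hashable],
) -> Iterator[Tuple[L, R]]:
    """Inner join: build a hash table on `right`, then probe it with `left`."""
    # Build phase: index the in-memory side by its join key.
    table: Dict[Hashable, List[R]] = defaultdict(list)
    for r in right:
        table[right_key(r)].append(r)
    # Probe phase: stream the other side and look up matches.
    for l in left:
        for r in table.get(left_key(l), []):
            yield (l, r)


# Keyed join on hypothetical sample rows, matching on the first field.
users = [("u1", "ann"), ("u2", "bob")]
visits = [("u1", "/home"), ("u2", "/docs"), ("u1", "/faq")]
joined = list(hash_join(visits, users, lambda v: v[0], lambda u: u[0]))

# A cross join is the degenerate case where every row shares one key.
crossed = list(hash_join([1, 2], ["x", "y"], lambda _: 0, lambda _: 0))
```

With a constant key, the probe phase pairs every streamed row with the whole in-memory side, which is the pattern described in the original question.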

    Apologies for the lack of docs here. I've been slammed by the new startup.

  • Mason at Dec 6, 2013 at 9:51 pm

    On 12/6/13 13:41 PM, Sam Ritchie wrote:
    > You've implemented a hash join, which is supported by cascading and by
    > the underlying cascading DSL in Cascalog. Check out
    > cascalog.cascading.operations for the hash-join function, and the
    > tests for some examples.

    Ah, thanks. I'll check it out.

    > Apologies for the lack of docs here. I've been slammed by the new startup.

    No need to apologize. I appreciate all the volunteer work you've done on
    Cascalog.

    -Mason

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Dec 6, '13 at 9:33p
active: Dec 6, '13 at 9:51p
posts: 5
users: 2
website: clojure.org
irc: #clojure

2 users in discussion

Mason: 3 posts Sam Ritchie: 2 posts
