Grokbase Groups Pig user January 2011
FAQ
I'm not sure if this can be done at the UDF level, or if it'd have to be
done lower level. Imagine you have a good candidate for a replicated join,
but beyond that you know most about the structure of one of the pieces of
information you are joining (for example, that you could build a binary
search tree from it and do your comparisons really quickly, or something).
Is there a way to make your own join, or extend the one in pig? I could
imagine a UDF that takes two bags, the left piece and the right piece,
constructs your join, etc, but I don't know that that would be as fast.

Any thoughts?

Search Discussions

  • Alan Gates at Jan 28, 2011 at 4:47 pm
    Depending on the join algorithm you may be able to implement it with
    cogroup, a custom UDF, and possibly a custom partitioner. I haven't
    finished reading the band join algorithm paper I sent a link for, but
    I suspect it requires some records to be duplicated (since records
    within the band will need to be sent to multiple reducers to match
    records from the other side). That you cannot do without implementing
    a custom join.

    For an example of how to implement a custom join take a look at https://issues.apache.org/jira/browse/PIG-792
    This has a lot of sampling code you won't have to worry about. But
    it will give you an idea of the logical and physical operators inside
    Pig that would be needed.

    Also, here's some input from Chris Olston, one of our research
    scientists at Yahoo with expertise in databases:

    >>>
    I have not read the paper you sent but it seems to be about so-called
    “band joins”, which are a special case of non-equijoin that arise
    frequently in practice, and offer obvious opportunities for locality-
    based strategies e.g. using indexes and (distributed) partitioning.
    One approach that would be consistent with the Pig “low-level”
    philosophy would be to expose “BAND JOIN” as an operator and have a
    corresponding implementation along the lines of what that paper
    proposes.

    Also, as you know Utkarsh’s original implementation of CROSS (still
    the same?) performs a “generalized fragment-and-replicate” strategy,
    which is a way to do arbitrary non-equi-joins in a way that spreads
    work onto lots of machines (CROSS can be seen as non-equi-join with a
    very promiscuous join predicate :). There are probably papers that try
    to optimize the NxM grid structure of the generalized f-and-r
    topology, based on the relative sizes of the inputs, the join
    selectivity, data distributions, etc. I think the paper that
    originally surfaced this idea is: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=250116
    . Not sure whether there were follow-on papers that try to do more
    optimization. Fast-forwarding to modern times, I believe the Almaden
    SIGMOD’10 paper might have investigated f-and-r join strategies for
    the map-reduce context:http://portal.acm.org/citation.cfm?doid=1807167.1807273
    . There’s also the Ullman paper that proposes (but does not evaluate
    empirically) some map-reduce join strategies:http://ilpubs.stanford.edu:8090/957/1/mapred-join-report.pdf
    <<<

    Alan.
    On Jan 28, 2011, at 7:35 AM, Jonathan Coveney wrote:

    I'm not sure if this can be done at the UDF level, or if it'd have
    to be
    done lower level. Imagine you have a good candidate for a replicated
    join,
    but beyond that you know most about the structure of one of the
    pieces of
    information you are joining (for example, that you could build a
    binary
    search tree from it and do your comparisons really quickly, or
    something).
    Is there a way to make your own join, or extend the one in pig? I
    could
    imagine a UDF that takes two bags, the left piece and the right piece,
    constructs your join, etc, but I don't know that that would be as
    fast.

    Any thoughts?
  • Renato Marroquín Mogrovejo at Feb 7, 2011 at 6:43 pm
    I found really interesting all those papers. I haven't finished reading the
    band join algorithm paper either, but there are a couple of things that
    intrigue me e.g. the Almaden paper compares its results against Pig version
    0.2 I mean I think the study made by them is great, but Pig is in a stable
    0.8 now, wouldn't Pig perform better now than then? Has Pig embraced any of
    the paper suggestions?
    Anyways, creating a custom join inside a UDF might not be suitable for some
    specialized types of join, but maybe for others such as Parallel
    Set-*Similarity
    Joins *would be easier (flamingo.ics.uci.edu/pub/sigmod10-vernica.pdf),
    don't you think it might be possible? I mean we could take advantage of not
    doing only raw MapReduce

    Renato M.

    2011/1/28 Alan Gates <gates@yahoo-inc.com>
    Depending on the join algorithm you may be able to implement it with
    cogroup, a custom UDF, and possibly a custom partitioner. I haven't
    finished reading the band join algorithm paper I sent a link for, but I
    suspect it requires some records to be duplicated (since records within the
    band will need to be sent to multiple reducers to match records from the
    other side). That you cannot do without implementing a custom join.

    For an example of how to implement a custom join take a look at
    https://issues.apache.org/jira/browse/PIG-792 This has a lot of sampling
    code you won't have to worry about. But it will give you an idea of the
    logical and physical operators inside Pig that would be needed.

    Also, here's some input from Chris Olston, one of our research scientists
    at Yahoo with expertise in databases:
    I have not read the paper you sent but it seems to be about so-called “band
    joins”, which are a special case of non-equijoin that arise frequently in
    practice, and offer obvious opportunities for locality-based strategies e.g.
    using indexes and (distributed) partitioning. One approach that would be
    consistent with the Pig “low-level” philosophy would be to expose “BAND
    JOIN” as an operator and have a corresponding implementation along the lines
    of what that paper proposes.

    Also, as you know Utkarsh’s original implementation of CROSS (still the
    same?) performs a “generalized fragment-and-replicate” strategy, which is a
    way to do arbitrary non-equi-joins in a way that spreads work onto lots of
    machines (CROSS can be seen as non-equi-join with a very promiscuous join
    predicate :). There are probably papers that try to optimize the NxM grid
    structure of the generalized f-and-r topology, based on the relative sizes
    of the inputs, the join selectivity, data distributions, etc. I think the
    paper that originally surfaced this idea is:
    http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=250116. Not sure
    whether there were follow-on papers that try to do more optimization.
    Fast-forwarding to modern times, I believe the Almaden SIGMOD’10 paper might
    have investigated f-and-r join strategies for the map-reduce context:
    http://portal.acm.org/citation.cfm?doid=1807167.1807273. There’s also the
    Ullman paper that proposes (but does not evaluate empirically) some
    map-reduce join strategies:
    http://ilpubs.stanford.edu:8090/957/1/mapred-join-report.pdf
    <<<

    Alan.


    On Jan 28, 2011, at 7:35 AM, Jonathan Coveney wrote:

    I'm not sure if this can be done at the UDF level, or if it'd have to be
    done lower level. Imagine you have a good candidate for a replicated join,
    but beyond that you know most about the structure of one of the pieces of
    information you are joining (for example, that you could build a binary
    search tree from it and do your comparisons really quickly, or something).
    Is there a way to make your own join, or extend the one in pig? I could
    imagine a UDF that takes two bags, the left piece and the right piece,
    constructs your join, etc, but I don't know that that would be as fast.

    Any thoughts?

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedJan 28, '11 at 3:36p
activeFeb 7, '11 at 6:43p
posts3
users3
websitepig.apache.org

People

Translate

site design / logo © 2021 Grokbase