FAQ
It seems like only full outer or full inner joins are supported. I was
hoping to do just a left outer join.

Is this supported or planned?

On the flip side, doing the outer join is about 8x faster than doing a
map/reduce over our dataset.

Thanks
--
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if
interested


  • Chris Douglas at Jul 3, 2008 at 1:09 am
    Hi Jason-
    It seems like only full outer or full inner joins are supported. I
    was hoping to do just a left outer join.

    Is this supported or planned?

    The full inner/outer joins are examples, really. You can define your
    own operations by extending o.a.h.mapred.join.JoinRecordReader or
    o.a.h.mapred.join.MultiFilterRecordReader and registering your new
    identifier with the parser by defining a property
    "mapred.join.define.<ident>" as your class.

    For a left outer join, JoinRecordReader is the correct base.
    InnerJoinRecordReader and OuterJoinRecordReader should make its use
    clear.
    On the flip side doing the Outer Join is about 8x faster than doing
    a map/reduce over our dataset.
    Cool! Out of curiosity, how are you managing your splits? -C
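    For readers following along, a plain-Java sketch of the left-outer
    semantics being discussed (this is an illustration of the expected
    output, not the Hadoop reader code; TreeMap stands in for a
    key-sorted input source):

```java
import java.util.*;

public class LeftOuterJoinSketch {
    // Left outer join over two key-sorted inputs: every left-side key is
    // emitted; the right-side value is attached when present, null otherwise.
    static List<String> leftOuterJoin(TreeMap<String, String> left,
                                      TreeMap<String, String> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            String rv = right.get(e.getKey()); // null when no right-side match
            out.add(e.getKey() + "\t" + e.getValue() + "\t" + rv);
        }
        return out;
    }
}
```

    A JoinRecordReader subclass would produce per-key tuples of this
    shape, with the emit logic supplied by the subclass.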
  • Jason Venner at Jul 3, 2008 at 4:56 am
    For the data joins, I let the framework do it - which means one
    partition per split - so I have to choose my partition count carefully
    to fill the machines.

    I had an error in my initial outer join mapper; the join map code now
    runs about 40x faster than the old brute-force read-it-all shuffle & sort.

  • Chris Douglas at Jul 3, 2008 at 6:58 pm
    Forgive me if you already know this, but the correctness of the
    map-side join is very sensitive to partitioning; if your input is
    sorted but equal keys go to different partitions, your results may be
    incorrect. Is your input such that the default partitioning is
    sufficient? Have you verified the correctness of your results? -C
  • Jason Venner at Jul 3, 2008 at 7:07 pm
    We are using the default partitioner. I am just about to start verifying
    my results; it took quite a while to work my way through the non-obvious
    issues of hand-writing MapFiles, things like the key and value classes
    being taken from the JobConf's output key/value classes.

    Question: I looked at the HashPartitioner (which we are using), and a
    key's partition is simply based on key.hashCode() %
    conf.getNumReduceTasks().
    How would I get equal keys going to different partitions? Clearly I
    have an understanding gap.

    Thanks!
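    For reference, Hadoop's HashPartitioner computes the partition
    essentially as below (it masks the sign bit so a negative hashCode
    cannot produce a negative index). Equal keys with the same key class
    and the same reduce count always land in the same partition; the
    assignment changes only when the partition count or key class differs
    between the jobs that wrote the inputs:

```java
public class HashPartitionSketch {
    // Sketch of HashPartitioner.getPartition: mask the sign bit, then mod
    // by the number of reduce tasks.
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```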


  • Chris Douglas at Jul 3, 2008 at 8:11 pm
    Sorry, I meant splits (partitions of input data). If you have n
    datasets and m splits per dataset, split i must contain the same keys
    across all n datasets. So if you're joining two datasets A and B
    sharing a key k: if split i from A contains any instances of k, then
    (a) split i from A must contain all instances of k from A, and (b)
    split i from B must contain all instances of k from B. -C
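    That invariant can be checked mechanically. A hypothetical sketch,
    representing each dataset as a list of per-split key sets:

```java
import java.util.*;

public class CoPartitionCheck {
    // True iff no key spans two splits within either dataset, and every
    // key shared by both datasets sits at the same split index in each.
    static boolean coPartitioned(List<Set<String>> a, List<Set<String>> b) {
        Map<String, Integer> wa = new HashMap<>(), wb = new HashMap<>();
        for (int i = 0; i < a.size(); i++)
            for (String k : a.get(i))
                if (wa.put(k, i) != null) return false; // k spans splits in A
        for (int i = 0; i < b.size(); i++)
            for (String k : b.get(i))
                if (wb.put(k, i) != null) return false; // k spans splits in B
        for (Map.Entry<String, Integer> e : wb.entrySet()) {
            Integer ia = wa.get(e.getKey());
            if (ia != null && !ia.equals(e.getValue())) return false;
        }
        return true;
    }
}
```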
  • Jason Venner at Jul 3, 2008 at 8:29 pm
    Ahh yes, it is absolutely critical that the partitioning in all of the
    datasets be the same :)
    We are currently assuming that:
    1) the default partitioner
    2) the same reduce count
    3) Text keys
    will guarantee that, but as I mentioned above, we are assuming ;)

    Testing is just starting.
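    Under those three assumptions the co-location property can be
    spot-checked with a stand-in partition function (String.hashCode
    here; Hadoop's Text has its own hashCode, but the determinism
    argument is the same):

```java
import java.util.*;

public class PartitionAssumptionCheck {
    // Stand-in for the default partitioner on a String key.
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Bucket keys the way two separate jobs with the same partitioner
    // and the same reduce count would.
    static List<Set<String>> bucket(Collection<String> keys, int numPartitions) {
        List<Set<String>> splits = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) splits.add(new HashSet<>());
        for (String k : keys) splits.get(partition(k, numPartitions)).add(k);
        return splits;
    }
}
```

    Any key present in both datasets necessarily lands at the same split
    index in each, so the join invariant holds as long as neither the
    partitioner nor the count changes between the jobs.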


Discussion Overview
group: common-user @ hadoop
posted: Jul 1, '08 at 4:25a
active: Jul 3, '08 at 8:29p
posts: 7
users: 2 (Jason Venner: 4 posts, Chris Douglas: 3 posts)
website: hadoop.apache.org...
irc: #hadoop
