[Hive-user] How HIVE manages a join

Cappa Roberto
Aug 6, 2010 at 6:46 am
Hi,

I cannot find any documentation about which algorithm Hive uses to translate JOIN clauses into Map-Reduce tasks.

In particular, suppose I have two tables A and B, each stored in a separate file, with each file split across Hadoop nodes. When I perform a JOIN with A.column = B.column, the framework has to compare the full data of the first file against the full data of the second file. How can Hadoop scan all possible combinations of values? If each node holds only a portion of each file, a complete comparison does not seem possible. Is one of the two files entirely replicated on each node? Or does Hive use another kind of strategy/optimization?

Thanks.
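
One general technique that addresses this question is the reduce-side (shuffle) join: rows from both tables are routed by a hash of the join key, so all rows sharing a key meet on the same reducer and no node needs a full copy of either file. The following is a minimal single-process sketch of that idea in Python. It is an illustration only, not Hive's actual implementation; the function name `shuffle_join`, the `num_reducers` parameter, and the sample tables are hypothetical.

```python
from collections import defaultdict


def shuffle_join(table_a, table_b, key_a, key_b, num_reducers=3):
    """Toy reduce-side equi-join over two lists of dict rows."""
    # One dict per simulated reducer: join key -> (rows from A, rows from B).
    partitions = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]

    # "Map" phase: route every row to a reducer chosen by hashing its join key.
    for row in table_a:
        k = row[key_a]
        partitions[hash(k) % num_reducers][k][0].append(row)
    for row in table_b:
        k = row[key_b]
        partitions[hash(k) % num_reducers][k][1].append(row)

    # "Reduce" phase: each reducer pairs A-rows and B-rows that share a key.
    joined = []
    for part in partitions:
        for rows_a, rows_b in part.values():
            for ra in rows_a:
                for rb in rows_b:
                    joined.append((ra, rb))
    return joined


if __name__ == "__main__":
    a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
    b = [{"id": 1, "v": 10}, {"id": 1, "v": 11}, {"id": 3, "v": 12}]
    for ra, rb in shuffle_join(a, b, "id", "id"):
        print(ra, rb)
```

The alternative raised in the question, replicating one whole table to every node, is the idea behind a map-side join, which is only practical when one of the tables is small enough to fit in memory.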

14 responses

  • Bharath vissapragada at Aug 6, 2010 at 10:44 am
    Roberto,

    You may find these links useful:

    http://www.slideshare.net/ragho/hive-icde-2010?src=related_normal&rel=2374551
    - Simple joins and optimizations

    http://www.slideshare.net/zshao/hive-user-meeting-march-2010-hive-team
    - New kinds of joins / features of Hive

    Thanks,

    Bharath.V
    4th year Undergraduate
    IIIT Hyderabad
  • Jeff Hammerbacher at Aug 7, 2010 at 3:33 am
    Yongqiang mentioned he was going to update the wiki with this information in
    the thread at http://hadoop.markmail.org/thread/hxd4uwwukuo46lgw.

    Yongqiang, have you gotten a chance to complete the sort merge bucket map
    join and the other skew join you mention in the above thread?

    Thanks,
    Jeff
  • Yongqiang he at Aug 7, 2010 at 6:46 am
    Yeah. The sort merge bucket mapjoin has been finished for some time
    and seems stable now. I implemented one skew join but haven't had a
    chance to look at the other skew join Namit mentioned to me. I
    definitely should have updated the wiki earlier. My bad.
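
For context on the sort merge bucket map join mentioned above: when both tables are bucketed and sorted on the join key, matching buckets can be joined with a single merge pass, without a shuffle. The following is a minimal Python sketch of the merge step for one pair of buckets. It illustrates the general sort-merge technique only, not Hive's code; the function name `sort_merge_join` and the (key, value) row layout are assumptions made for the example.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key has no partner yet, advance left
        elif lk > rk:
            j += 1  # right key has no partner yet, advance right
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lk:
                for jj in range(j, j_end):
                    out.append((lk, left[i][1], right[jj][1]))
                i += 1
            j = j_end
    return out


if __name__ == "__main__":
    bucket_a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    bucket_b = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
    print(sort_merge_join(bucket_a, bucket_b))
```

Because each input is read once in key order, the merge needs memory only for the current run of equal keys rather than for a whole table.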
  • Yongqiang he at Aug 10, 2010 at 9:12 pm
    The Hive Join wiki page says:
    "THIS PAGE WAS MOVED TO HIVE XDOCS ! DO NOT EDIT!Join Syntax"

    Where should I make the update?
  • Carl Steinbach at Aug 10, 2010 at 9:16 pm
    Hi Yongqiang,

    Please go ahead and update the wiki page. I will copy it over to version
    control when you are done.

    Thanks.

    Carl
  • Edward Capriolo at Aug 10, 2010 at 9:57 pm
    This page is already in version control:

    /home/edward/cassandra-handler/docs/xdocs/language_manual/joins.xml

    Edward
  • Edward Capriolo at Aug 10, 2010 at 9:58 pm
    Sorry.
    $hive_root/docs/xdocs/language_manual/joins.xml
  • Joydeep Sen Sarma at Aug 12, 2010 at 7:19 am
    i hate this message: 'THIS PAGE WAS MOVED TO HIVE XDOCS ! DO NOT EDIT!Join Syntax'

    why must edits to the wiki be banned if there are xdocs? hadoop has both.

    there will always be things that are not captured in xdocs. it's pretty sad to discourage free form edits by people who want to contribute without checking out source. (what is this - the 80s?)
  • Edward Capriolo at Aug 12, 2010 at 8:37 pm
    Joydeep,

    I am sorry. I put that there when I thought we were going to actively
    move to xdocs. You can remove it if you like.

    As I said in an earlier thread, the problem with the wiki is that no one
    actively updates it. Example:

    http://wiki.apache.org/hadoop/Hive/LanguageManual/Select
    Oops: Really, what about "in support"...
    https://issues.apache.org/jira/browse/HIVE-801

    Which is why I hold the opinion that all patches except bug fixes
    should probably come with xdocs. People are free to disagree.

    Edward
  • Akshaya iyengar at Aug 12, 2010 at 9:05 pm
    Hello,
    I apologize if this is off topic for the current thread. I was looking
    for the Hive architecture diagram on this page:
    http://wiki.apache.org/hadoop/Hive/Design . The PDF link doesn't seem to
    work for me either.

    It would be a great help if someone could direct me to this information.

    Thanks,
    Akshaya
  • Edward Capriolo at Aug 12, 2010 at 9:08 pm

    The wiki had all attachments removed because crackers were using it to
    post STUFF (use your imagination).

    Edward
  • John Sichi at Aug 12, 2010 at 9:09 pm
    All the attachments on the wiki disappeared a while ago. I noticed it recently on my design doc for views. I think probably somebody disabled attachment upload due to spam, and this made the existing attachments inaccessible as well. Probably need to open an INFRA ticket in Apache JIRA.

    JVS

    On Aug 12, 2010, at 2:05 PM, akshaya iyengar wrote:

    Hello,
    I apologize if this is off-topic for the current thread. I was looking for the Hive architecture diagram on this page: http://wiki.apache.org/hadoop/Hive/Design . The PDF link doesn't seem to work for me either.

    It would be a great help if someone could direct me to this information.

    Thanks,
    Akshaya

    On Thu, Aug 12, 2010 at 4:37 PM, Edward Capriolo wrote:
    Joydeep,

    I am sorry. I put that there when I thought we were going to actively move
    to xdocs. You can remove it if you like.

    As I said in an earlier thread, the problem with the wiki is that no one
    actively updates it. Example:

    http://wiki.apache.org/hadoop/Hive/LanguageManual/Select
    Oops. Really? What about "in support"...
    https://issues.apache.org/jira/browse/HIVE-801

    Which is why I hold the opinion that all patches except bug fixes
    should probably come with xdocs. People are free to disagree.

    Edward
    On Thu, Aug 12, 2010 at 3:16 AM, Joydeep Sen Sarma wrote:
    i hate this message: 'THIS PAGE WAS MOVED TO HIVE XDOCS ! DO NOT EDIT!Join Syntax'

    why must edits to the wiki be banned if there are xdocs? hadoop has both.

    there will always be things that are not captured in xdocs. it's pretty sad to discourage free form edits by people who want to contribute without checking out source. (what is this - the 80s?)
    ________________________________________
    From: Edward Capriolo [edlinuxguru@gmail.com]
    Sent: Tuesday, August 10, 2010 2:57 PM
    To: hive-user@hadoop.apache.org
    Cc: hive-dev@hadoop.apache.org
    Subject: Re: How HIVE manages a join

    Sorry.
    $hive_root/docs/xdocs/language_manual/joins.xml
    On Tue, Aug 10, 2010 at 5:57 PM, Edward Capriolo wrote:
    This page is already in version control.

    /home/edward/cassandra-handler/docs/xdocs/language_manual/joins.xml

    Edward
    On Tue, Aug 10, 2010 at 5:15 PM, Carl Steinbach wrote:
    Hi Yongqiang,
    Please go ahead and update the wiki page. I will copy it over to version
    control when you are done.
    Thanks.
    Carl
    On Tue, Aug 10, 2010 at 2:11 PM, yongqiang he <heyongqiangict@gmail.com> wrote:

    In the Hive Join wiki page, it says
    "THIS PAGE WAS MOVED TO HIVE XDOCS ! DO NOT EDIT!Join Syntax"

    Where should I make the update?
  • Raghu Murthy at Aug 12, 2010 at 9:13 pm
    The hive.pdf link in the Design page is this one:
    http://www.slideshare.net/namit_jain/hive-demo-paper-at-vldb-2009

    A later paper in ICDE'10 is available here:
    http://i.stanford.edu/~ragho/hive-icde2010.pdf

    Both of these papers and others are linked from:
    http://wiki.apache.org/hadoop/Hive/Presentations

    Hope this helps.


  • Akshaya iyengar at Aug 12, 2010 at 9:20 pm
    Thank you very much for the links.

    Akshaya

Related Discussions