Grokbase Groups Hive user March 2012
FAQ

[Hive-user] non-equality joins

Mahsa mofidpoor
Mar 13, 2012 at 4:17 pm
Hello,

Is there a reason behind not implementing non-equality joins in Hive? In
other words, is there any usage for theta-join, if implemented?

Thank you in advance for your response,
Mahsa
reply

Search Discussions

8 responses

  • Keith Wiley at Mar 13, 2012 at 4:50 pm
    https://cwiki.apache.org/Hive/languagemanual-joins.html

    "Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job."

    I admit, that isn't a very detailed answer, but it gives some indication of the reason for the discrepancy between Hive and other databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we all know this, I'm just reiterating the train of thought). The problem is that certain algorithms are exceedingly difficult to wedge into the MapReduce framework.

    That is as detailed as my personal insight can get. I've done a lot of MapReduce programming in Hadoop but I'm not a database expert and I don't really understand the steps involved in various kinds of table-joins, so I don't understand the particular ways in which certain database operations do or do not fit into MapReduce...but presumably nonequality joins (whatever those are :-D ) are particularly difficult to MapReduceify.

    Cheers!
    On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote:

    Hello,

    Is there a reason behind not implementing non-equality joins in Hive? In other words, is there any usage for theta-join, if implemented?

    Thank you in advance for your response,
    Mahsa

    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com

    "It's a fine line between meticulous and obsessive-compulsive and a slippery
    rope between obsessive-compulsive and debilitatingly slow."
    -- Keith Wiley
    ________________________________________________________________________________
  • Mahsa mofidpoor at Mar 13, 2012 at 5:02 pm
    Hi Keith,

    Do you know exactly how an algorithm should be in order to fit in the
    MapReduce framework? Could you refer me to some references?

    Thanks and Regards,
    Mahsa


    On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley wrote:

    https://cwiki.apache.org/Hive/languagemanual-joins.html

    "Hive does not support join conditions that are not equality conditions as
    it is very difficult to express such conditions as a map/reduce job."

    I admit, that isn't a very detailed answer, but it gives some indication
    of the reason for the discrepancy between Hive and other databases. Hive
    fundamentally operates on Hadoop, namely on MapReduce (we all know this,
    I'm just reiterating the train of thought). The problem is that certain
    algorithms are exceedingly difficult to wedge into the MapReduce framework.

    That is as detailed as my personal insight can get. I've done a lot of
    MapReduce programming in Hadoop but I'm not a database expert and I don't
    really understand the steps involved in various kinds of table-joins, so I
    don't understand the particular ways in which certain database operations
    do or do not fit into MapReduce...but presumably nonequality joins
    (whatever those are :-D ) are particularly difficult to MapReduceify.

    Cheers!
    On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote:

    Hello,

    Is there a reason behind not implementing non-equality joins in Hive? In
    other words, is there any usage for theta-join, if implemented?
    Thank you in advance for your response,
    Mahsa


    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com
    music.keithwiley.com

    "It's a fine line between meticulous and obsessive-compulsive and a
    slippery
    rope between obsessive-compulsive and debilitatingly slow."
    -- Keith Wiley

    ________________________________________________________________________________
  • Tucker, Matt at Mar 13, 2012 at 5:14 pm
    For theta joins, you'll have to convert the query to an equi-join, and then filter for non-equality in the WHERE clause. Depending upon the size of each table, you might consider looking at map-side joins, which will allow for doing non-equality filters during a join before it's passed to the reducers.

    Matt Tucker

    From: mahsa mofidpoor
    Sent: Tuesday, March 13, 2012 1:02 PM
    To: user@hive.apache.org
    Subject: Re: non-equality joins


    Hi Keith,

    Do you know exactly how an algorithm should be in order to fit in the MapReduce framework? Could you refer me to some references?

    Thanks and Regards,
    Mahsa



    On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley wrote:
    https://cwiki.apache.org/Hive/languagemanual-joins.html

    "Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job."

    I admit, that isn't a very detailed answer, but it gives some indication of the reason for the discrepancy between Hive and other databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we all know this, I'm just reiterating the train of thought). The problem is that certain algorithms are exceedingly difficult to wedge into the MapReduce framework.

    That is as detailed as my personal insight can get. I've done a lot of MapReduce programming in Hadoop but I'm not a database expert and I don't really understand the steps involved in various kinds of table-joins, so I don't understand the particular ways in which certain database operations do or do not fit into MapReduce...but presumably nonequality joins (whatever those are :-D ) are particularly difficult to MapReduceify.

    Cheers!
    On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote:

    Hello,

    Is there a reason behind not implementing non-equality joins in Hive? In other words, is there any usage for theta-join, if implemented?

    Thank you in advance for your response,
    Mahsa
    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com<http://keithwiley.com> music.keithwiley.com<http://music.keithwiley.com>

    "It's a fine line between meticulous and obsessive-compulsive and a slippery
    rope between obsessive-compulsive and debilitatingly slow."
    -- Keith Wiley
    ________________________________________________________________________________
  • Keith Wiley at Mar 13, 2012 at 5:29 pm
    Sounds like Matt possesses the proper combination of expertise in both databases and MapReduce to assist you. I'm bowing out as I honestly don't know advanced database concepts at all. In addition, hive offers hive-specific tools like Matt suggested (map-side joins) to help out, which I'm too new too to speculate on. I'm just starting hive this week as a matter of fact.

    The short answer on MapReduce algorithms is that the individual computational units can't communicate with each other (each mapper or each map() in fact cannot communicate with the others, likewise for reducers). That's one of the major distinctions between MapReduce and more general parallel processing frameworks like MPI. This is the wrong mailing list to go much deeper than that however.

    Thanks Matt.

    Best of luck Mahsa.
    On Mar 13, 2012, at 10:13 , Tucker, Matt wrote:

    For theta joins, you’ll have to convert the query to an equi-join, and then filter for non-equality in the WHERE clause. Depending upon the size of each table, you might consider looking at map-side joins, which will allow for doing non-equality filters during a join before it’s passed to the reducers.

    Matt Tucker

    From: mahsa mofidpoor
    Sent: Tuesday, March 13, 2012 1:02 PM
    To: user@hive.apache.org
    Subject: Re: non-equality joins


    Hi Keith,

    Do you know exactly how an algorithm should be in order to fit in the MapReduce framework? Could you refer me to some references?

    Thanks and Regards,
    Mahsa
    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com

    "Luminous beings are we, not this crude matter."
    -- Yoda
    ________________________________________________________________________________
  • Alan Gates at Mar 17, 2012 at 12:42 am
    There are algorithms for doing general theta-joins in parallel. Search Google on "theta joins parallel database" and you will find some interesting references. I am not aware of any tools that implement these yet. You can also do it via a cross join followed by a filter, but again you need special algorithms to do a cross in MapReduce, which Hive doesn't implement yet. See http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html (search for the section on Cross) for a discussion of how to do cross in MapReduce.

    Alan.
    On Mar 13, 2012, at 10:13 AM, Tucker, Matt wrote:

    For theta joins, you’ll have to convert the query to an equi-join, and then filter for non-equality in the WHERE clause. Depending upon the size of each table, you might consider looking at map-side joins, which will allow for doing non-equality filters during a join before it’s passed to the reducers.

    Matt Tucker

    From: mahsa mofidpoor
    Sent: Tuesday, March 13, 2012 1:02 PM
    To: user@hive.apache.org
    Subject: Re: non-equality joins


    Hi Keith,

    Do you know exactly how an algorithm should be in order to fit in the MapReduce framework? Could you refer me to some references?

    Thanks and Regards,
    Mahsa



    On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley wrote:
    https://cwiki.apache.org/Hive/languagemanual-joins.html

    "Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job."

    I admit, that isn't a very detailed answer, but it gives some indication of the reason for the discrepancy between Hive and other databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we all know this, I'm just reiterating the train of thought). The problem is that certain algorithms are exceedingly difficult to wedge into the MapReduce framework.

    That is as detailed as my personal insight can get. I've done a lot of MapReduce programming in Hadoop but I'm not a database expert and I don't really understand the steps involved in various kinds of table-joins, so I don't understand the particular ways in which certain database operations do or do not fit into MapReduce...but presumably nonequality joins (whatever those are :-D ) are particularly difficult to MapReduceify.

    Cheers!
    On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote:

    Hello,

    Is there a reason behind not implementing non-equality joins in Hive? In other words, is there any usage for theta-join, if implemented?

    Thank you in advance for your response,
    Mahsa

    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com

    "It's a fine line between meticulous and obsessive-compulsive and a slippery
    rope between obsessive-compulsive and debilitatingly slow."
    -- Keith Wiley
    ________________________________________________________________________________
  • Buddhika chamith at Mar 17, 2012 at 6:03 am
    Hi,

    I think matt's solution is the way to go for now. If you need some basic
    understanding on how reduce and map side joins work see [1] whether if it
    helps you.

    Regards
    Buddhika

    [1] http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
    On Sat, Mar 17, 2012 at 6:41 AM, Alan Gates wrote:

    There are algorithms for doing general theta-joins in parallel. Search
    Google on "theta joins parallel database" and you will find some
    interesting references. I am not aware of any tools that implement these
    yet. You can also do it via a cross join followed by a filter, but again
    you need special algorithms to do a cross in MapReduce, which Hive doesn't
    implement yet. See
    http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html(search for the section on Cross) for a discussion of how to do cross in
    MapReduce.

    Alan.
    On Mar 13, 2012, at 10:13 AM, Tucker, Matt wrote:

    For theta joins, you’ll have to convert the query to an equi-join, and
    then filter for non-equality in the WHERE clause. Depending upon the size
    of each table, you might consider looking at map-side joins, which will
    allow for doing non-equality filters during a join before it’s passed to
    the reducers.
    Matt Tucker

    From: mahsa mofidpoor
    Sent: Tuesday, March 13, 2012 1:02 PM
    To: user@hive.apache.org
    Subject: Re: non-equality joins


    Hi Keith,

    Do you know exactly how an algorithm should be in order to fit in the
    MapReduce framework? Could you refer me to some references?
    Thanks and Regards,
    Mahsa



    On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley wrote:
    https://cwiki.apache.org/Hive/languagemanual-joins.html

    "Hive does not support join conditions that are not equality conditions
    as it is very difficult to express such conditions as a map/reduce job."
    I admit, that isn't a very detailed answer, but it gives some indication
    of the reason for the discrepancy between Hive and other databases. Hive
    fundamentally operates on Hadoop, namely on MapReduce (we all know this,
    I'm just reiterating the train of thought). The problem is that certain
    algorithms are exceedingly difficult to wedge into the MapReduce framework.
    That is as detailed as my personal insight can get. I've done a lot of
    MapReduce programming in Hadoop but I'm not a database expert and I don't
    really understand the steps involved in various kinds of table-joins, so I
    don't understand the particular ways in which certain database operations
    do or do not fit into MapReduce...but presumably nonequality joins
    (whatever those are :-D ) are particularly difficult to MapReduceify.
    Cheers!
    On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote:

    Hello,

    Is there a reason behind not implementing non-equality joins in Hive?
    In other words, is there any usage for theta-join, if implemented?
    Thank you in advance for your response,
    Mahsa

    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com
    music.keithwiley.com
    "It's a fine line between meticulous and obsessive-compulsive and a slippery
    rope between obsessive-compulsive and debilitatingly slow."
    -- Keith Wiley
    ________________________________________________________________________________
  • Edward Capriolo at Mar 17, 2012 at 4:04 pm
    in general hive does not offer features it can not do well. Cross joins on
    any data set where one table is not very small do not scale in map reduce.
    So there is not a big win for offering syntax for it.

    Not talking about pig but one very common unnamed map reduce framework
    offers Many features that do not paralize into map reduce. I find this
    framework a total 'tease'.
    On Saturday, March 17, 2012, buddhika chamith wrote:
    Hi,

    I think matt's solution is the way to go for now. If you need some basic
    understanding on how reduce and map side joins work see [1] whether if it
    helps you.
    Regards
    Buddhika

    [1] http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
    On Sat, Mar 17, 2012 at 6:41 AM, Alan Gates wrote:

    There are algorithms for doing general theta-joins in parallel. Search
    Google on "theta joins parallel database" and you will find some
    interesting references. I am not aware of any tools that implement these
    yet. You can also do it via a cross join followed by a filter, but again
    you need special algorithms to do a cross in MapReduce, which Hive doesn't
    implement yet. See
    http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html(search
    for the section on Cross) for a discussion of how to do cross in
    MapReduce.
    Alan.
    On Mar 13, 2012, at 10:13 AM, Tucker, Matt wrote:

    For theta joins, you’ll have to convert the query to an equi-join, and
    then filter for non-equality in the WHERE clause. Depending upon the size
    of each table, you might consider looking at map-side joins, which will
    allow for doing non-equality filters during a join before it’s passed to
    the reducers.
    Matt Tucker

    From: mahsa mofidpoor
    Sent: Tuesday, March 13, 2012 1:02 PM
    To: user@hive.apache.org
    Subject: Re: non-equality joins


    Hi Keith,

    Do you know exactly how an algorithm should be in order to fit in the
    MapReduce framework? Could you refer me to some references?
    Thanks and Regards,
    Mahsa



    On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley wrote:
    https://cwiki.apache.org/Hive/languagemanual-joins.html

    "Hive does not support join conditions that are not equality
    conditions as it is very difficult to express such conditions as a
    map/reduce job."
    I admit, that isn't a very detailed answer, but it gives some
    indication of the reason for the discrepancy between Hive and other
    databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we
    all know this, I'm just reiterating the train of thought). The problem is
    that certain algorithms are exceedingly difficult to wedge into the
    MapReduce framework.
    That is as detailed as my personal insight can get. I've done a lot
    of MapReduce programming in Hadoop but I'm not a database expert and I
    don't really understand the steps involved in various kinds of table-joins,
    so I don't understand the particular ways in which certain database
    operations do or do not fit into MapReduce...but presumably nonequality
    joins (whatever those are :-D ) are particularly difficult to MapReduceify.
    Cheers!
    On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote:

    Hello,

    Is there a reason behind not implementing non-equality joins in
    Hive? In other words, is there any usage for theta-join, if implemented?
    Thank you in advance for your response,
    Mahsa
    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com
    music.keithwiley.com
    "It's a fine line between meticulous and obsessive-compulsive and a
    slippery
    rope between obsessive-compulsive and debilitatingly slow."
    -- Keith Wiley
    ________________________________________________________________________________
  • Mahsa mofidpoor at Mar 17, 2012 at 4:40 pm
    Hi Edward,

    Could you please clarify what you mean in your last paragraph? You found
    Pig Latin a week framework in terms of MapReduce?

    Thanks again for the response.
    Mahsa

    On Sat, Mar 17, 2012 at 12:04 PM, Edward Capriolo wrote:

    in general hive does not offer features it can not do well. Cross joins on
    any data set where one table is not very small do not scale in map reduce.
    So there is not a big win for offering syntax for it.

    Not talking about pig but one very common unnamed map reduce framework
    offers Many features that do not paralize into map reduce. I find this
    framework a total 'tease'.
    On Saturday, March 17, 2012, buddhika chamith wrote:
    Hi,

    I think matt's solution is the way to go for now. If you need some basic
    understanding on how reduce and map side joins work see [1] whether if it
    helps you.
    Regards
    Buddhika

    [1] http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
    On Sat, Mar 17, 2012 at 6:41 AM, Alan Gates wrote:

    There are algorithms for doing general theta-joins in parallel. Search
    Google on "theta joins parallel database" and you will find some
    interesting references. I am not aware of any tools that implement these
    yet. You can also do it via a cross join followed by a filter, but again
    you need special algorithms to do a cross in MapReduce, which Hive doesn't
    implement yet. See
    http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html(search for the section on Cross) for a discussion of how to do cross in
    MapReduce.
    Alan.
    On Mar 13, 2012, at 10:13 AM, Tucker, Matt wrote:

    For theta joins, you’ll have to convert the query to an equi-join,
    and then filter for non-equality in the WHERE clause. Depending upon the
    size of each table, you might consider looking at map-side joins, which
    will allow for doing non-equality filters during a join before it’s passed
    to the reducers.
    Matt Tucker

    From: mahsa mofidpoor
    Sent: Tuesday, March 13, 2012 1:02 PM
    To: user@hive.apache.org
    Subject: Re: non-equality joins


    Hi Keith,

    Do you know exactly how an algorithm should be in order to fit in the
    MapReduce framework? Could you refer me to some references?
    Thanks and Regards,
    Mahsa



    On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley wrote:
    https://cwiki.apache.org/Hive/languagemanual-joins.html

    "Hive does not support join conditions that are not equality
    conditions as it is very difficult to express such conditions as a
    map/reduce job."
    I admit, that isn't a very detailed answer, but it gives some
    indication of the reason for the discrepancy between Hive and other
    databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we
    all know this, I'm just reiterating the train of thought). The problem is
    that certain algorithms are exceedingly difficult to wedge into the
    MapReduce framework.
    That is as detailed as my personal insight can get. I've done a lot
    of MapReduce programming in Hadoop but I'm not a database expert and I
    don't really understand the steps involved in various kinds of table-joins,
    so I don't understand the particular ways in which certain database
    operations do or do not fit into MapReduce...but presumably nonequality
    joins (whatever those are :-D ) are particularly difficult to MapReduceify.
    Cheers!
    On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote:

    Hello,

    Is there a reason behind not implementing non-equality joins in
    Hive? In other words, is there any usage for theta-join, if implemented?
    Thank you in advance for your response,
    Mahsa
    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com
    music.keithwiley.com
    "It's a fine line between meticulous and obsessive-compulsive and a
    slippery
    rope between obsessive-compulsive and debilitatingly slow."
    -- Keith Wiley
    ________________________________________________________________________________

Related Discussions