Grokbase Groups Hive user March 2009
FAQ
Since Hive does not have an ORDER BY...yet what is the solution for getting
the top 10 rows based on a field without having your client in thrift
getting too much data back? Seems like it is possible to actually get too
much data but unfortunately you have to get all rows and sort by yourself.

Suhail

Search Discussions

  • Zheng Shao at Mar 26, 2009 at 1:43 am
    There is a SORT BY.

    You can do:
    SELECT * FROM tableA SORT BY c1 DESC;

    Then each of the partition will be sorted.

    However in order to get the global 10, we will need to do LIMIT 10 on top of
    that. LIMIT 10 and SORT BY do not work exactly as the user wants now.


    Zheng
    On Wed, Mar 25, 2009 at 3:23 PM, Suhail Doshi wrote:

    Since Hive does not have an ORDER BY...yet what is the solution for getting
    the top 10 rows based on a field without having your client in thrift
    getting too much data back? Seems like it is possible to actually get too
    much data but unfortunately you have to get all rows and sort by yourself.

    Suhail


    --
    Yours,
    Zheng
  • Suhail Doshi at Mar 26, 2009 at 4:03 am
    Ah okay, I guess I can simply just not do fetchAll() to grab the global ten
    so I do not mistakenly grab too much data.

    Suhail
    On Wed, Mar 25, 2009 at 6:43 PM, Zheng Shao wrote:

    There is a SORT BY.

    You can do:
    SELECT * FROM tableA SORT BY c1 DESC;

    Then each of the partition will be sorted.

    However in order to get the global 10, we will need to do LIMIT 10 on top
    of that. LIMIT 10 and SORT BY do not work exactly as the user wants now.


    Zheng
    On Wed, Mar 25, 2009 at 3:23 PM, Suhail Doshi wrote:

    Since Hive does not have an ORDER BY...yet what is the solution for
    getting the top 10 rows based on a field without having your client in
    thrift getting too much data back? Seems like it is possible to actually get
    too much data but unfortunately you have to get all rows and sort by
    yourself.

    Suhail


    --
    Yours,
    Zheng


    --
    http://mixpanel.com
    Blog: http://blog.mixpanel.com
  • Jeff Hammerbacher at Mar 26, 2009 at 5:50 pm
    Hey Zheng,

    What is the timeline and priority for doing a total ordering for ORDER BY
    support?

    Thanks,
    Jeff
    On Wed, Mar 25, 2009 at 9:02 PM, Suhail Doshi wrote:

    Ah okay, I guess I can simply just not do fetchAll() to grab the global ten
    so I do not mistakenly grab too much data.

    Suhail

    On Wed, Mar 25, 2009 at 6:43 PM, Zheng Shao wrote:

    There is a SORT BY.

    You can do:
    SELECT * FROM tableA SORT BY c1 DESC;

    Then each of the partition will be sorted.

    However in order to get the global 10, we will need to do LIMIT 10 on top
    of that. LIMIT 10 and SORT BY do not work exactly as the user wants now.


    Zheng
    On Wed, Mar 25, 2009 at 3:23 PM, Suhail Doshi wrote:

    Since Hive does not have an ORDER BY...yet what is the solution for
    getting the top 10 rows based on a field without having your client in
    thrift getting too much data back? Seems like it is possible to actually get
    too much data but unfortunately you have to get all rows and sort by
    yourself.

    Suhail


    --
    Yours,
    Zheng


    --
    http://mixpanel.com
    Blog: http://blog.mixpanel.com
  • Raghu Murthy at Mar 26, 2009 at 6:11 pm
    Right now there is already a way to get total ordering. You can do a SORT BY
    and specify one reducer.

    raghu
    On 3/26/09 10:49 AM, "Jeff Hammerbacher" wrote:

    Hey Zheng,

    What is the timeline and priority for doing a total ordering for ORDER BY
    support?

    Thanks,
    Jeff
    On Wed, Mar 25, 2009 at 9:02 PM, Suhail Doshi wrote:
    Ah okay, I guess I can simply just not do fetchAll() to grab the global ten
    so
    I do not mistakenly grab too much data.

    Suhail

    On Wed, Mar 25, 2009 at 6:43 PM, Zheng Shao wrote:
    There is a SORT BY.

    You can do:
    SELECT * FROM tableA SORT BY c1 DESC;

    Then each of the partition will be sorted.

    However in order to get the global 10, we will need to do LIMIT 10 on top of
    that. LIMIT 10 and SORT BY do not work exactly as the user wants now.


    Zheng


    On Wed, Mar 25, 2009 at 3:23 PM, Suhail Doshi <digitalwarfare@gmail.com>
    wrote:
    Since Hive does not have an ORDER BY...yet what is the solution for getting
    the top 10 rows based on a field without having your client in thrift
    getting too much data back? Seems like it is possible to actually get too
    much data but unfortunately you have to get all rows and sort by yourself.

    Suhail


    --
    Yours,
    Zheng


    --
    http://mixpanel.com
    Blog: http://blog.mixpanel.com
  • Zheng Shao at Mar 26, 2009 at 9:38 pm
    Hi Jeff,

    Besides the way of achieving total ordering as Raghu said (with 1 reducer),
    we can also first get a partitioned ordering, and then merge the partitions
    (preserving the order) when reading.
    The reducer step can be much faster because it's parallelized, but the
    reading is sequential so it will still take a long time to get all the data.



    However most of the use cases of a total ordering is just to get the top 10.

    The current work-around is:

    First store the top 10 from each partition to some temp table:
    INSERT OVERWRITE tableB
    REDUCE a.*
    USING 'head -n 10'
    AS (col1, col2, col3, col4, ...)
    FROM (SELECT * FROM tableA SORT BY col3 DESC, col4 ASC) a

    Second, set the #reducer to 1 and get the top 10 globally.
    set mapred.reduce.tasks=1;
    SELECT * FROM tableB SORT BY col3 DESC, col4 ASC LIMIT 10


    Zheng
    On Thu, Mar 26, 2009 at 11:10 AM, Raghu Murthy wrote:

    Right now there is already a way to get total ordering. You can do a SORT
    BY
    and specify one reducer.

    raghu
    On 3/26/09 10:49 AM, "Jeff Hammerbacher" wrote:

    Hey Zheng,

    What is the timeline and priority for doing a total ordering for ORDER BY
    support?

    Thanks,
    Jeff

    On Wed, Mar 25, 2009 at 9:02 PM, Suhail Doshi <digitalwarfare@gmail.com>
    wrote:
    Ah okay, I guess I can simply just not do fetchAll() to grab the global
    ten
    so
    I do not mistakenly grab too much data.

    Suhail

    On Wed, Mar 25, 2009 at 6:43 PM, Zheng Shao wrote:
    There is a SORT BY.

    You can do:
    SELECT * FROM tableA SORT BY c1 DESC;

    Then each of the partition will be sorted.

    However in order to get the global 10, we will need to do LIMIT 10 on
    top of
    that. LIMIT 10 and SORT BY do not work exactly as the user wants now.


    Zheng


    On Wed, Mar 25, 2009 at 3:23 PM, Suhail Doshi <
    digitalwarfare@gmail.com>
    wrote:
    Since Hive does not have an ORDER BY...yet what is the solution for
    getting
    the top 10 rows based on a field without having your client in thrift
    getting too much data back? Seems like it is possible to actually get
    too
    much data but unfortunately you have to get all rows and sort by
    yourself.

    --
    Yours,
    Zheng
  • Suhail Doshi at Apr 1, 2009 at 7:02 pm
    Is there a way to sort by a function such as count(1)?

    Suhail
    On Thu, Mar 26, 2009 at 2:38 PM, Zheng Shao wrote:

    Hi Jeff,

    Besides the way of achieving total ordering as Raghu said (with 1 reducer),
    we can also first get a partitioned ordering, and then merge the partitions
    (preserving the order) when reading.
    The reducer step can be much faster because it's parallelized, but the
    reading is sequential so it will still take a long time to get all the data.



    However most of the use cases of a total ordering is just to get the top
    10.

    The current work-around is:

    First store the top 10 from each partition to some temp table:
    INSERT OVERWRITE tableB
    REDUCE a.*
    USING 'head -n 10'
    AS (col1, col2, col3, col4, ...)
    FROM (SELECT * FROM tableA SORT BY col3 DESC, col4 ASC) a

    Second, set the #reducer to 1 and get the top 10 globally.
    set mapred.reduce.tasks=1;
    SELECT * FROM tableB SORT BY col3 DESC, col4 ASC LIMIT 10


    Zheng

    On Thu, Mar 26, 2009 at 11:10 AM, Raghu Murthy wrote:

    Right now there is already a way to get total ordering. You can do a SORT
    BY
    and specify one reducer.

    raghu
    On 3/26/09 10:49 AM, "Jeff Hammerbacher" wrote:

    Hey Zheng,

    What is the timeline and priority for doing a total ordering for ORDER BY
    support?

    Thanks,
    Jeff

    On Wed, Mar 25, 2009 at 9:02 PM, Suhail Doshi <digitalwarfare@gmail.com

    wrote:
    Ah okay, I guess I can simply just not do fetchAll() to grab the global
    ten
    so
    I do not mistakenly grab too much data.

    Suhail

    On Wed, Mar 25, 2009 at 6:43 PM, Zheng Shao wrote:
    There is a SORT BY.

    You can do:
    SELECT * FROM tableA SORT BY c1 DESC;

    Then each of the partition will be sorted.

    However in order to get the global 10, we will need to do LIMIT 10 on
    top of
    that. LIMIT 10 and SORT BY do not work exactly as the user wants now.


    Zheng


    On Wed, Mar 25, 2009 at 3:23 PM, Suhail Doshi <
    digitalwarfare@gmail.com>
    wrote:
    Since Hive does not have an ORDER BY...yet what is the solution for
    getting
    the top 10 rows based on a field without having your client in thrift
    getting too much data back? Seems like it is possible to actually get
    too
    much data but unfortunately you have to get all rows and sort by
    yourself.

    --
    Yours,
    Zheng


    --
    http://mixpanel.com
    Blog: http://blog.mixpanel.com
  • Suhail Doshi at Apr 1, 2009 at 7:06 pm
    Nevermind, count(1) as count works heh, thought it didn't.

    Suhail
    On Wed, Apr 1, 2009 at 12:01 PM, Suhail Doshi wrote:

    Is there a way to sort by a function such as count(1)?

    Suhail

    On Thu, Mar 26, 2009 at 2:38 PM, Zheng Shao wrote:

    Hi Jeff,

    Besides the way of achieving total ordering as Raghu said (with 1
    reducer), we can also first get a partitioned ordering, and then merge the
    partitions (preserving the order) when reading.
    The reducer step can be much faster because it's parallelized, but the
    reading is sequential so it will still take a long time to get all the data.



    However most of the use cases of a total ordering is just to get the top
    10.

    The current work-around is:

    First store the top 10 from each partition to some temp table:
    INSERT OVERWRITE tableB
    REDUCE a.*
    USING 'head -n 10'
    AS (col1, col2, col3, col4, ...)
    FROM (SELECT * FROM tableA SORT BY col3 DESC, col4 ASC) a

    Second, set the #reducer to 1 and get the top 10 globally.
    set mapred.reduce.tasks=1;
    SELECT * FROM tableB SORT BY col3 DESC, col4 ASC LIMIT 10


    Zheng

    On Thu, Mar 26, 2009 at 11:10 AM, Raghu Murthy wrote:

    Right now there is already a way to get total ordering. You can do a SORT
    BY
    and specify one reducer.

    raghu
    On 3/26/09 10:49 AM, "Jeff Hammerbacher" wrote:

    Hey Zheng,

    What is the timeline and priority for doing a total ordering for ORDER BY
    support?

    Thanks,
    Jeff

    On Wed, Mar 25, 2009 at 9:02 PM, Suhail Doshi <
    digitalwarfare@gmail.com>
    wrote:
    Ah okay, I guess I can simply just not do fetchAll() to grab the
    global ten
    so
    I do not mistakenly grab too much data.

    Suhail

    On Wed, Mar 25, 2009 at 6:43 PM, Zheng Shao wrote:
    There is a SORT BY.

    You can do:
    SELECT * FROM tableA SORT BY c1 DESC;

    Then each of the partition will be sorted.

    However in order to get the global 10, we will need to do LIMIT 10 on
    top of
    that. LIMIT 10 and SORT BY do not work exactly as the user wants now.


    Zheng


    On Wed, Mar 25, 2009 at 3:23 PM, Suhail Doshi <
    digitalwarfare@gmail.com>
    wrote:
    Since Hive does not have an ORDER BY...yet what is the solution for
    getting
    the top 10 rows based on a field without having your client in
    thrift
    getting too much data back? Seems like it is possible to actually
    get too
    much data but unfortunately you have to get all rows and sort by
    yourself.

    --
    Yours,
    Zheng


    --
    http://mixpanel.com
    Blog: http://blog.mixpanel.com


    --
    http://mixpanel.com
    Blog: http://blog.mixpanel.com

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedMar 26, '09 at 1:24a
activeApr 1, '09 at 7:06p
posts8
users4
websitehive.apache.org

People

Translate

site design / logo © 2022 Grokbase