On Wed, Apr 1, 2009 at 12:01 PM, Suhail Doshi wrote:
Is there a way to sort by a function such as count(1)?
Suhail
--
http://mixpanel.com
Blog: http://blog.mixpanel.com
On Thu, Mar 26, 2009 at 2:38 PM, Zheng Shao wrote:
Hi Jeff,
Besides achieving total ordering the way Raghu described (with 1
reducer), we can also first get a partitioned ordering, and then merge the
partitions (preserving the order) when reading.
The reducer step can be much faster because it is parallelized, but the
reading is sequential, so it will still take a long time to get all the data.
However, most use cases for a total ordering just need the top 10.
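The partition-then-merge idea above can be sketched in a few lines. This is only an illustration in Python, not Hive code: the `partitions` list stands in for hypothetical per-reducer outputs that are each already sorted in descending order, and `merge_sorted_partitions` is a made-up helper name.

```python
import heapq

# Hypothetical per-reducer outputs; each list is already sorted descending,
# as a per-partition SORT BY ... DESC would leave it.
partitions = [
    [97, 61, 12],
    [88, 45, 3],
    [99, 70, 20],
]


def merge_sorted_partitions(parts):
    """Lazily merge descending-sorted lists into one descending stream."""
    # reverse=True tells heapq.merge that the inputs are sorted largest-first.
    return heapq.merge(*parts, reverse=True)


merged = list(merge_sorted_partitions(partitions))
print(merged)
```

The merge itself is a single sequential pass over all rows, which is why reading the full result stays slow even when the sort ran in parallel.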
The current workaround is:
First, store the top 10 from each partition in a temp table:
INSERT OVERWRITE tableB
REDUCE a.*
USING 'head -n 10'
AS (col1, col2, col3, col4, ...)
FROM (SELECT * FROM tableA SORT BY col3 DESC, col4 ASC) a;
Second, set the number of reducers to 1 and get the top 10 globally:
set mapred.reduce.tasks=1;
SELECT * FROM tableB SORT BY col3 DESC, col4 ASC LIMIT 10;
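To see why the two-step workaround gives the same answer as a full global sort, here is a small Python simulation. It is a sketch under assumptions, not Hive itself: rows are tuples of (col1, col2, col3, col4) with a numeric col3, the sample data is invented, the helper names are hypothetical, and k is 2 instead of 10 to keep the example short.

```python
def sort_key(row):
    # Mirrors "SORT BY col3 DESC, col4 ASC" for rows (col1, col2, col3, col4).
    return (-row[2], row[3])


def top_k_per_partition(partitions, k):
    # Step 1: each "reducer" sorts its own rows and keeps the first k,
    # which is the role 'head -n 10' plays in the query above.
    return [sorted(rows, key=sort_key)[:k] for rows in partitions]


def global_top_k(partial_results, k):
    # Step 2: a single "reducer" sorts the surviving candidates and limits to k.
    candidates = [row for part in partial_results for row in part]
    return sorted(candidates, key=sort_key)[:k]


# Invented sample data: two partitions of three rows each.
partitions = [
    [("a", "x", 5, 1), ("b", "x", 9, 2), ("c", "x", 7, 3)],
    [("d", "y", 9, 1), ("e", "y", 2, 5), ("f", "y", 8, 4)],
]

result = global_top_k(top_k_per_partition(partitions, k=2), k=2)
print(result)
```

Any row in the global top k must also be in the top k of its own partition, so the second step only has to look at k rows per partition rather than the whole table.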
Zheng
--
Yours,
Zheng
On Thu, Mar 26, 2009 at 11:10 AM, Raghu Murthy wrote:
Right now there is already a way to get total ordering. You can do a SORT BY
and specify one reducer.
raghu
On 3/26/09 10:49 AM, "Jeff Hammerbacher" wrote:
Hey Zheng,
What is the timeline and priority for doing a total ordering for ORDER BY
support?
Thanks,
Jeff
On Wed, Mar 25, 2009 at 9:02 PM, Suhail Doshi <digitalwarfare@gmail.com> wrote:
Ah okay, I guess I can simply just not do fetchAll() to grab the global ten
so I do not mistakenly grab too much data.
Suhail
On Wed, Mar 25, 2009 at 6:43 PM, Zheng Shao wrote:
There is a SORT BY.
You can do:
SELECT * FROM tableA SORT BY c1 DESC;
Then each partition will be sorted.
However, in order to get the global top 10, we will need to do LIMIT 10 on top
of that. LIMIT 10 and SORT BY do not work exactly as the user wants now.
Zheng
On Wed, Mar 25, 2009 at 3:23 PM, Suhail Doshi <digitalwarfare@gmail.com> wrote:
Since Hive does not have an ORDER BY...yet, what is the solution for getting
the top 10 rows based on a field without having your client in thrift
getting too much data back? Seems like it is possible to actually get too
much data but unfortunately you have to get all rows and sort by yourself.