FAQ
Hi,

I have a table which i have partitioned as below


----------------------------------------------------------------------------------------------------------------------------------------------------
CREATE EXTERNAL TABLE IF NOT EXISTS accesslogs_demographics
(........)
PARTITIONED BY (year INT,month INT,day INT,hour INT)
row format delimited fields terminated by '^' LOCATION
'/user/hive/warehouse/accesslogs_demographics';
----------------------------------------------------------------------------------------------------------------------------------------------------

I have populated it with around 100 million records.

The problem is when i want to query for a date range.

When i query for a single date as below, i get good performance,
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q "
select count(*) from accesslogs_demographics where (year = 2013 and month =
01 and day = 28)"
2832739

real 0m14.082s
user 0m0.087s
sys 0m0.053s
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

But this kind of query i cannot use for range of dates like (year>2011 and
month >02) etc, since i need month 01 from year 2012 which will be ignored
if i use this query.

I thought of doing something like this. But this consumes more time than
previous query.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q "
select count(*) from accesslogs_demographics where (((year * 10000) +
(month * 100 ) + (day) ) = 20130128)"
2832739

real 3m0.713s
user 0m0.155s
sys 0m0.089s
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

My Question is, Is my partitioning correct way? and what is the best way to
make impala query in such scenario.

Thanks in Advance,

Suresh

Search Discussions

  • Inder Pall at Feb 6, 2013 at 7:30 am
    Can you please share how many nodes setup you are running this stuff on?

    On Wed, Feb 6, 2013 at 12:52 PM, Suresh K wrote:

    Hi,

    I have a table which i have partitioned as below



    ----------------------------------------------------------------------------------------------------------------------------------------------------
    CREATE EXTERNAL TABLE IF NOT EXISTS accesslogs_demographics
    (........)
    PARTITIONED BY (year INT,month INT,day INT,hour INT)
    row format delimited fields terminated by '^' LOCATION
    '/user/hive/warehouse/accesslogs_demographics';

    ----------------------------------------------------------------------------------------------------------------------------------------------------

    I have populated it with around 100 million records.

    The problem is when i want to query for a date range.

    When i query for a single date as below, i get good performance,

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q "
    select count(*) from accesslogs_demographics where (year = 2013 and month =
    01 and day = 28)"
    2832739

    real 0m14.082s
    user 0m0.087s
    sys 0m0.053s

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    But this kind of query i cannot use for range of dates like (year>2011 and
    month >02) etc, since i need month 01 from year 2012 which will be ignored
    if i use this query.

    I thought of doing something like this. But this consumes more time than
    previous query.


    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q "
    select count(*) from accesslogs_demographics where (((year * 10000) +
    (month * 100 ) + (day) ) = 20130128)"
    2832739

    real 3m0.713s
    user 0m0.155s
    sys 0m0.089s

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    My Question is, Is my partitioning correct way? and what is the best way
    to make impala query in such scenario.

    Thanks in Advance,

    Suresh

    --
    - Inder
    "You are average of the 5 people you spend the most time with"
  • Suresh K at Feb 6, 2013 at 7:33 am
    Hi,

    Currently i am testing with only one EC2 High memory instance node(7.5GB
    RAM). There are 2 nodes in the cluster, but the other node is down due to
    some issue.

    Regards,
    Suresh
    On Wednesday, 6 February 2013 13:00:03 UTC+5:30, Inder Pall wrote:

    Can you please share how many nodes setup you are running this stuff on?


    On Wed, Feb 6, 2013 at 12:52 PM, Suresh K <sureshk...@gmail.com<javascript:>
    wrote:
    Hi,

    I have a table which i have partitioned as below



    ----------------------------------------------------------------------------------------------------------------------------------------------------
    CREATE EXTERNAL TABLE IF NOT EXISTS accesslogs_demographics
    (........)
    PARTITIONED BY (year INT,month INT,day INT,hour INT)
    row format delimited fields terminated by '^' LOCATION
    '/user/hive/warehouse/accesslogs_demographics';

    ----------------------------------------------------------------------------------------------------------------------------------------------------

    I have populated it with around 100 million records.

    The problem is when i want to query for a date range.

    When i query for a single date as below, i get good performance,

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q "
    select count(*) from accesslogs_demographics where (year = 2013 and month =
    01 and day = 28)"
    2832739

    real 0m14.082s
    user 0m0.087s
    sys 0m0.053s

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    But this kind of query i cannot use for range of dates like (year>2011
    and month >02) etc, since i need month 01 from year 2012 which will be
    ignored if i use this query.

    I thought of doing something like this. But this consumes more time than
    previous query.


    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q "
    select count(*) from accesslogs_demographics where (((year * 10000) +
    (month * 100 ) + (day) ) = 20130128)"
    2832739

    real 3m0.713s
    user 0m0.155s
    sys 0m0.089s

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    My Question is, Is my partitioning correct way? and what is the best way
    to make impala query in such scenario.

    Thanks in Advance,

    Suresh

    --
    - Inder
    "You are average of the 5 people you spend the most time with"
  • Marcel Kornacker at Feb 6, 2013 at 7:33 am
    Hi Suresh, the reason that this query
    " select
    count(*) from accesslogs_demographics where (((year * 10000) + (month * 100
    ) + (day) ) = 20130128)"
    isn't fast is that Impala doesn't attempt to do partition pruning
    based on that expression, ie, it scans the whole table.

    Since you want to express ranges in your query, you could encode your
    date/time as a string or a bigint (01/28/2013 00:00.0 turns into
    2013012800), like this:
    ... PARTITIONED BY (datetime BIGINT)

    so that you can express your first query as
    select count(*) from accesslogs_demographics where datetime >=
    2013012800 and datetime < 2013012900
    and your second query as
    select count(*) from accesslogs_demographics where datetime >= 2011020100

    Marcel
    On Tue, Feb 5, 2013 at 11:22 PM, Suresh K wrote:
    Hi,

    I have a table which i have partitioned as below


    ----------------------------------------------------------------------------------------------------------------------------------------------------
    CREATE EXTERNAL TABLE IF NOT EXISTS accesslogs_demographics
    (........)
    PARTITIONED BY (year INT,month INT,day INT,hour INT)
    row format delimited fields terminated by '^' LOCATION
    '/user/hive/warehouse/accesslogs_demographics';
    ----------------------------------------------------------------------------------------------------------------------------------------------------

    I have populated it with around 100 million records.

    The problem is when i want to query for a date range.

    When i query for a single date as below, i get good performance,
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q " select
    count(*) from accesslogs_demographics where (year = 2013 and month = 01 and
    day = 28)"
    2832739

    real 0m14.082s
    user 0m0.087s
    sys 0m0.053s
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    But this kind of query i cannot use for range of dates like (year>2011 and
    month >02) etc, since i need month 01 from year 2012 which will be ignored
    if i use this query.

    I thought of doing something like this. But this consumes more time than
    previous query.

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q " select
    count(*) from accesslogs_demographics where (((year * 10000) + (month * 100
    ) + (day) ) = 20130128)"
    2832739

    real 3m0.713s
    user 0m0.155s
    sys 0m0.089s
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    My Question is, Is my partitioning correct way? and what is the best way to
    make impala query in such scenario.

    Thanks in Advance,

    Suresh
  • Suresh K at Feb 6, 2013 at 7:43 am
    Hi Marcel,

    Thanks, will give it a try. But I will miss out on stuffs like Group by
    Year, or Group by Month. Is there any way of partitioning where i can have
    this functionality too and also kick in pruning in the query?

    Regards,
    Suresh
    On Wednesday, 6 February 2013 13:03:54 UTC+5:30, Marcel Kornacker wrote:

    Hi Suresh, the reason that this query
    " select
    count(*) from accesslogs_demographics where (((year * 10000) + (month * 100
    ) + (day) ) = 20130128)"
    isn't fast is that Impala doesn't attempt to do partition pruning
    based on that expression, ie, it scans the whole table.

    Since you want to express ranges in your query, you could encode your
    date/time as a string or a bigint (01/28/2013 00:00.0 turns into
    2013012800), like this:
    ... PARTITIONED BY (datetime BIGINT)

    so that you can express your first query as
    select count(*) from accesslogs_demographics where datetime >=
    2013012800 and datetime < 2013012900
    and your second query as
    select count(*) from accesslogs_demographics where datetime >= 2011020100

    Marcel
    On Tue, Feb 5, 2013 at 11:22 PM, Suresh K wrote:
    Hi,

    I have a table which i have partitioned as below


    ----------------------------------------------------------------------------------------------------------------------------------------------------
    CREATE EXTERNAL TABLE IF NOT EXISTS accesslogs_demographics
    (........)
    PARTITIONED BY (year INT,month INT,day INT,hour INT)
    row format delimited fields terminated by '^' LOCATION
    '/user/hive/warehouse/accesslogs_demographics';
    ----------------------------------------------------------------------------------------------------------------------------------------------------
    I have populated it with around 100 million records.

    The problem is when i want to query for a date range.

    When i query for a single date as below, i get good performance,
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q " select
    count(*) from accesslogs_demographics where (year = 2013 and month = 01 and
    day = 28)"
    2832739

    real 0m14.082s
    user 0m0.087s
    sys 0m0.053s
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    But this kind of query i cannot use for range of dates like (year>2011 and
    month >02) etc, since i need month 01 from year 2012 which will be ignored
    if i use this query.

    I thought of doing something like this. But this consumes more time than
    previous query.

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q " select
    count(*) from accesslogs_demographics where (((year * 10000) + (month * 100
    ) + (day) ) = 20130128)"
    2832739

    real 3m0.713s
    user 0m0.155s
    sys 0m0.089s
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    My Question is, Is my partitioning correct way? and what is the best way to
    make impala query in such scenario.

    Thanks in Advance,

    Suresh
  • Marcel Kornacker at Feb 6, 2013 at 4:13 pm

    On Tue, Feb 5, 2013 at 11:43 PM, Suresh K wrote:
    Hi Marcel,

    Thanks, will give it a try. But I will miss out on stuffs like Group by
    Year, or Group by Month. Is there any way of partitioning where i can have
    this functionality too and also kick in pruning in the query?
    To extract the month, you'd write "(datetime DIV 10000) % 100", etc.,
    and then simply include that in your select list/group-by clause.
    Regards,
    Suresh

    On Wednesday, 6 February 2013 13:03:54 UTC+5:30, Marcel Kornacker wrote:

    Hi Suresh, the reason that this query
    " select
    count(*) from accesslogs_demographics where (((year * 10000) + (month *
    100
    ) + (day) ) = 20130128)"
    isn't fast is that Impala doesn't attempt to do partition pruning
    based on that expression, ie, it scans the whole table.

    Since you want to express ranges in your query, you could encode your
    date/time as a string or a bigint (01/28/2013 00:00.0 turns into
    2013012800), like this:
    ... PARTITIONED BY (datetime BIGINT)

    so that you can express your first query as
    select count(*) from accesslogs_demographics where datetime >=
    2013012800 and datetime < 2013012900
    and your second query as
    select count(*) from accesslogs_demographics where datetime >= 2011020100

    Marcel
    On Tue, Feb 5, 2013 at 11:22 PM, Suresh K wrote:
    Hi,

    I have a table which i have partitioned as below



    ----------------------------------------------------------------------------------------------------------------------------------------------------
    CREATE EXTERNAL TABLE IF NOT EXISTS accesslogs_demographics
    (........)
    PARTITIONED BY (year INT,month INT,day INT,hour INT)
    row format delimited fields terminated by '^' LOCATION
    '/user/hive/warehouse/accesslogs_demographics';

    ----------------------------------------------------------------------------------------------------------------------------------------------------

    I have populated it with around 100 million records.

    The problem is when i want to query for a date range.

    When i query for a single date as below, i get good performance,

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q "
    select
    count(*) from accesslogs_demographics where (year = 2013 and month = 01
    and
    day = 28)"
    2832739

    real 0m14.082s
    user 0m0.087s
    sys 0m0.053s

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    But this kind of query i cannot use for range of dates like (year>2011
    and
    month >02) etc, since i need month 01 from year 2012 which will be
    ignored
    if i use this query.

    I thought of doing something like this. But this consumes more time than
    previous query.


    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q "
    select
    count(*) from accesslogs_demographics where (((year * 10000) + (month *
    100
    ) + (day) ) = 20130128)"
    2832739

    real 3m0.713s
    user 0m0.155s
    sys 0m0.089s

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    My Question is, Is my partitioning correct way? and what is the best way
    to
    make impala query in such scenario.

    Thanks in Advance,

    Suresh
  • Inder Pall at Feb 6, 2013 at 7:48 am
    Another request - do you mind sharing the average record size in picture
    here. Just want to get a sense to how much data you are doing count(*) for
    100m records in 14 seconds.

    Also it would be very helpful if you could share any comparisons of timings
    w.r.t M/R if you have tried.

    Thanks,
    - Inder

    On Wed, Feb 6, 2013 at 1:03 PM, Marcel Kornacker wrote:

    Hi Suresh, the reason that this query
    " select
    count(*) from accesslogs_demographics where (((year * 10000) + (month * 100
    ) + (day) ) = 20130128)"
    isn't fast is that Impala doesn't attempt to do partition pruning
    based on that expression, ie, it scans the whole table.

    Since you want to express ranges in your query, you could encode your
    date/time as a string or a bigint (01/28/2013 00:00.0 turns into
    2013012800), like this:
    ... PARTITIONED BY (datetime BIGINT)

    so that you can express your first query as
    select count(*) from accesslogs_demographics where datetime >=
    2013012800 and datetime < 2013012900
    and your second query as
    select count(*) from accesslogs_demographics where datetime >= 2011020100

    Marcel
    On Tue, Feb 5, 2013 at 11:22 PM, Suresh K wrote:
    Hi,

    I have a table which i have partitioned as below


    ----------------------------------------------------------------------------------------------------------------------------------------------------
    CREATE EXTERNAL TABLE IF NOT EXISTS accesslogs_demographics
    (........)
    PARTITIONED BY (year INT,month INT,day INT,hour INT)
    row format delimited fields terminated by '^' LOCATION
    '/user/hive/warehouse/accesslogs_demographics';
    ----------------------------------------------------------------------------------------------------------------------------------------------------
    I have populated it with around 100 million records.

    The problem is when i want to query for a date range.

    When i query for a single date as below, i get good performance,
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q " select
    count(*) from accesslogs_demographics where (year = 2013 and month = 01 and
    day = 28)"
    2832739

    real 0m14.082s
    user 0m0.087s
    sys 0m0.053s
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    But this kind of query i cannot use for range of dates like (year>2011 and
    month >02) etc, since i need month 01 from year 2012 which will be ignored
    if i use this query.

    I thought of doing something like this. But this consumes more time than
    previous query.

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q " select
    count(*) from accesslogs_demographics where (((year * 10000) + (month * 100
    ) + (day) ) = 20130128)"
    2832739

    real 3m0.713s
    user 0m0.155s
    sys 0m0.089s
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    My Question is, Is my partitioning correct way? and what is the best way to
    make impala query in such scenario.

    Thanks in Advance,

    Suresh


    --
    - Inder
    "You are average of the 5 people you spend the most time with"
  • Suresh K at Feb 6, 2013 at 8:38 am
    Hi Inder,

    14 seconds are for counting *2832739 *records only in that particular
    partition. Each record is around 300 Bytes. I haven't tried with M/R or
    Hive, will try that and share the timings.

    Regards,
    Suresh
    On Wednesday, 6 February 2013 13:18:35 UTC+5:30, Inder Pall wrote:

    Another request - do you mind sharing the average record size in picture
    here. Just want to get a sense to how much data you are doing count(*) for
    100m records in 14 seconds.

    Also it would be very helpful if you could share any comparisons of
    timings w.r.t M/R if you have tried.

    Thanks,
    - Inder


    On Wed, Feb 6, 2013 at 1:03 PM, Marcel Kornacker <mar...@cloudera.com<javascript:>
    wrote:
    Hi Suresh, the reason that this query
    " select
    count(*) from accesslogs_demographics where (((year * 10000) + (month * 100
    ) + (day) ) = 20130128)"
    isn't fast is that Impala doesn't attempt to do partition pruning
    based on that expression, ie, it scans the whole table.

    Since you want to express ranges in your query, you could encode your
    date/time as a string or a bigint (01/28/2013 00:00.0 turns into
    2013012800), like this:
    ... PARTITIONED BY (datetime BIGINT)

    so that you can express your first query as
    select count(*) from accesslogs_demographics where datetime >=
    2013012800 and datetime < 2013012900
    and your second query as
    select count(*) from accesslogs_demographics where datetime >= 2011020100

    Marcel

    On Tue, Feb 5, 2013 at 11:22 PM, Suresh K <sureshk...@gmail.com<javascript:>>
    wrote:
    Hi,

    I have a table which i have partitioned as below


    ----------------------------------------------------------------------------------------------------------------------------------------------------
    CREATE EXTERNAL TABLE IF NOT EXISTS accesslogs_demographics
    (........)
    PARTITIONED BY (year INT,month INT,day INT,hour INT)
    row format delimited fields terminated by '^' LOCATION
    '/user/hive/warehouse/accesslogs_demographics';
    ----------------------------------------------------------------------------------------------------------------------------------------------------
    I have populated it with around 100 million records.

    The problem is when i want to query for a date range.

    When i query for a single date as below, i get good performance,
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q " select
    count(*) from accesslogs_demographics where (year = 2013 and month = 01 and
    day = 28)"
    2832739

    real 0m14.082s
    user 0m0.087s
    sys 0m0.053s
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    But this kind of query i cannot use for range of dates like (year>2011 and
    month >02) etc, since i need month 01 from year 2012 which will be ignored
    if i use this query.

    I thought of doing something like this. But this consumes more time than
    previous query.

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    [root@cloudera-1 ~]# time impala-shell --impalad=localhost:21000 -q " select
    count(*) from accesslogs_demographics where (((year * 10000) + (month * 100
    ) + (day) ) = 20130128)"
    2832739

    real 3m0.713s
    user 0m0.155s
    sys 0m0.089s
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    My Question is, Is my partitioning correct way? and what is the best way to
    make impala query in such scenario.

    Thanks in Advance,

    Suresh


    --
    - Inder
    "You are average of the 5 people you spend the most time with"

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupimpala-user @
categorieshadoop
postedFeb 6, '13 at 7:22a
activeFeb 6, '13 at 4:13p
posts8
users3
websitecloudera.com
irc#hadoop

People

Translate

site design / logo © 2022 Grokbase