Grokbase Groups Hive user June 2011
FAQ
Hi,

I want to test the use of indexes in hive. For this I created anindex, I
launched a first query above, I changed the settings on Hive and ran my
query on my database table:

CREATE INDEX index ON TABLE table(key) as 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX index ON table REBUILD;
INSERT OVERWRITE DIRECTORY "/tmp/index-result2" SELECT `_bucketname` ,
`_offsets` FROM default__table_index__ x WHERE x.key=100;
SET hive.index.compact.file=/tmp/index_result2;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
SELECT key, col2, sum(col3) FROM tableWHERE key=100 group BY col2;



No error but I am not sure that the index is really used. Indeed I get the
same performance with and without the index.
When I look at the logs (tasktracker, datanode, job_config), I see no call to
the directory / tmp / index_result2.
How do I know if my index has been really used?

Thanks for your help,

--
Guillaume WEILL

Search Discussions

  • Martin Konicek at Jun 8, 2011 at 11:27 am
    Hi,

    I was testing indexes today as well and the index definitely got used.
    You should be able to see this when you run two separate queries:
    INSERT OVERWRITE DIRECTORY "/tmp/index-result2" ...
    SELECT ...

    The SELECT was faster for me than without the index. In your case the
    time might be spent in the GROUP BY and maybe you have little data so
    the times look the same.

    What is not so good is that index can't be partitioned on different
    columns than the table. E.g. I would like to partition the table on date
    and the index on region (I can't partition the table on both date&region
    bc there are thousands of regions and that would create huge directory
    structure in HDFS, which I read is not recommended).

    Martin
    On 08/06/2011 11:28, Guillaume WEILL wrote:
    Hi,

    I want to test the use of indexes in hive. For this I created anindex,
    I launched a first query above, I changed the settings on Hive and ran
    my query on my database table:

    CREATE INDEX index ON TABLE table(key) as 'COMPACT' WITH DEFERRED REBUILD;
    ALTER INDEX index ON table REBUILD;
    INSERT OVERWRITE DIRECTORY "/tmp/index-result2" SELECT `_bucketname` , `_offsets` FROM default__table_index__ x WHERE x.key=100;

    SET hive.index.compact.file=/tmp/index_result2;
    SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
    SELECT key, col2, sum(col3) FROM tableWHERE key=100 group BY col2;



    No error but I am not sure that the index is really used. Indeed I get
    the same performance with and without the index.
    When I look at the logs (tasktracker, datanode, job_config), I see no
    call to the directory / tmp / index_result2.
    How do I know if my index has been really used?

    Thanks for your help,

    --
    Guillaume WEILL
  • Guillaume WEILL at Jun 8, 2011 at 4:09 pm
    Thanks for your reply.

    Could you say me how much time have you saved with the index and the time
    used on a query without query? The amount of data of your table could be
    helpful too.
    This is to verify your point of view about the amount of data because I am
    really not impressed by performance of my index.
    Personally I work on 100 GB.

    Guillaume



    2011/6/8 Martin Konicek <martin.konicek@gmail.com>
    Hi,

    I was testing indexes today as well and the index definitely got used. You
    should be able to see this when you run two separate queries:

    INSERT OVERWRITE DIRECTORY "/tmp/index-result2" ...
    SELECT ...

    The SELECT was faster for me than without the index. In your case the time
    might be spent in the GROUP BY and maybe you have little data so the times
    look the same.

    What is not so good is that index can't be partitioned on different columns
    than the table. E.g. I would like to partition the table on date and the
    index on region (I can't partition the table on both date&region bc there
    are thousands of regions and that would create huge directory structure in
    HDFS, which I read is not recommended).

    Martin

    On 08/06/2011 11:28, Guillaume WEILL wrote:

    Hi,

    I want to test the use of indexes in hive. For this I created anindex, I
    launched a first query above, I changed the settings on Hive and ran my
    query on my database table:

    CREATE INDEX index ON TABLE table(key) as 'COMPACT' WITH DEFERRED REBUILD;
    ALTER INDEX index ON table REBUILD;
    INSERT OVERWRITE DIRECTORY "/tmp/index-result2" SELECT `_bucketname` ,
    `_offsets` FROM default__table_index__ x WHERE x.key=100;

    SET hive.index.compact.file=/tmp/index_result2;
    SET
    hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
    SELECT key, col2, sum(col3) FROM tableWHERE key=100 group BY col2;



    No error but I am not sure that the index is really used. Indeed I get the
    same performance with and without the index.
    When I look at the logs (tasktracker, datanode, job_config), I see no call
    to the directory / tmp / index_result2.
    How do I know if my index has been really used?

    Thanks for your help,

    --
    Guillaume WEILL
  • Martin Konicek at Jun 8, 2011 at 7:10 pm
    Oh I had very little data (a few MB) - I am just testing whether
    features work at all before I try it on bigger data (for example
    compression does not work for me
    http://wiki.apache.org/hadoop/Hive/CompressedStorage ; also not sure
    whether indexes work on compressed files).

    The index got used because the second query

    SELECT key, col2 FROM table WHERE key=100 -- no GROUP BY used

    was almost instant as opposed to when not using the index (then it did
    one full map scan). Not sure what the total speedup is but there SHOULD
    be speed up for some queries or otherwise indexes would be useless. Try
    comparing a query with just a WHERE condition without a GROUP BY.

    Martin
    On 08/06/2011 18:08, Guillaume WEILL wrote:
    Thanks for your reply.

    Could you say me how much time have you saved with the index and the
    time used on a query without query? The amount of data of your table
    could be helpful too.
    This is to verify your point of view about the amount of data because
    I am really not impressed by performance of my index.
    Personally I work on 100 GB.

    Guillaume



    2011/6/8 Martin Konicek <martin.konicek@gmail.com

    Hi,

    I was testing indexes today as well and the index definitely got
    used. You should be able to see this when you run two separate
    queries:

    INSERT OVERWRITE DIRECTORY "/tmp/index-result2" ...
    SELECT ...

    The SELECT was faster for me than without the index. In your case
    the time might be spent in the GROUP BY and maybe you have little
    data so the times look the same.

    What is not so good is that index can't be partitioned on
    different columns than the table. E.g. I would like to partition
    the table on date and the index on region (I can't partition the
    table on both date&region bc there are thousands of regions and
    that would create huge directory structure in HDFS, which I read
    is not recommended).

    Martin


    On 08/06/2011 11:28, Guillaume WEILL wrote:

    Hi,

    I want to test the use of indexes in hive. For this I created
    anindex, I launched a first query above, I changed the
    settings on Hive and ran my query on my database table:

    CREATE INDEX index ON TABLE table(key) as 'COMPACT' WITH
    DEFERRED REBUILD;
    ALTER INDEX index ON table REBUILD;
    INSERT OVERWRITE DIRECTORY "/tmp/index-result2" SELECT
    `_bucketname` , `_offsets` FROM default__table_index__ x
    WHERE x.key=100;

    SET hive.index.compact.file=/tmp/index_result2;
    SET
    hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
    SELECT key, col2, sum(col3) FROM tableWHERE key=100 group BY
    col2;



    No error but I am not sure that the index is really used.
    Indeed I get the same performance with and without the index.
    When I look at the logs (tasktracker, datanode, job_config), I
    see no call to the directory / tmp / index_result2.
    How do I know if my index has been really used?

    Thanks for your help,

    --
    Guillaume WEILL

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedJun 8, '11 at 9:29a
activeJun 8, '11 at 7:10p
posts4
users2
websitehive.apache.org

2 users in discussion

Guillaume WEILL: 2 posts Martin Konicek: 2 posts

People

Translate

site design / logo © 2021 Grokbase