FAQ
[Hive]: Add ability to compute statistics on hive tables
--------------------------------------------------------

Key: HADOOP-4488
URL: https://issues.apache.org/jira/browse/HADOOP-4488
Project: Hadoop Core
Issue Type: New Feature
Components: contrib/hive
Reporter: Ashish Thusoo
Assignee: Ashish Thusoo


Add commands to collect partition and column level statistics in hive.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Search Discussions

  • Ashish Thusoo (JIRA) at Oct 22, 2008 at 6:58 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641939#action_12641939 ]

    Ashish Thusoo commented on HADOOP-4488:
    ---------------------------------------

    Type of statistics:
    The following types of statistics can be collected on hive partitions ->

    For each partition of the table:
    1. Number of Rows
    2. Size of the partition
    3. Average size of a row
    4. Number of blocks

    For a column in the partition:
    1. Number of distinct values
    2. Number of null values
    3. minimum 3 values
    4. maximum 3 values
    5. Histogram: Frequency histogram or a height balanced histogram (the former has equi range bins while the later has the same height for all the bins)

    The column level statistics could also be calculated for distributions in an average block

    Language Elements:
    ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS - this computes the partition level statistics
    ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR ALL COLUMNS SIZE n - this computes the column level statistics for all columns with n being the number of bins in the historgram
    ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR COLUMNS SIZE m c1 SIZE n1, c2 SIZE n2, c3 - this computes the column level statistics for columns c1 (using n1 bins for the histogram), c2(using n2 bins) and c3 (using the defaut m bins)

    We can later extend these so that these commands can work on samples and be able to extrapolate the results to the entire data set. For that we could use the ESTIMATE STATISTICS SAMPLE n ROWS or ESTIMATE STATISTICS SAMPLE n%

    e.g.

    ANALYZE TABLE <t> PARTITION(<partitionspec>) ESTIMATE STATISTICS 10%

    More details on the actual implementation to follow...

    [Hive]: Add ability to compute statistics on hive tables
    --------------------------------------------------------

    Key: HADOOP-4488
    URL: https://issues.apache.org/jira/browse/HADOOP-4488
    Project: Hadoop Core
    Issue Type: New Feature
    Components: contrib/hive
    Reporter: Ashish Thusoo
    Assignee: Ashish Thusoo

    Add commands to collect partition and column level statistics in hive.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Prasad Chakka (JIRA) at Oct 22, 2008 at 7:12 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641947#action_12641947 ]

    Prasad Chakka commented on HADOOP-4488:
    ---------------------------------------

    some comments and questions

    1- For each partition (or table for non-partitioned tables), we should store number of files as well (so we can optimize on number of mappers)

    2- We should make the number of bins optional and use default. We might need some trial and error to figure out the optional number depending on number of distinct values/rowcount.

    3- how do you do distinct values for floats? by rounding them or not store at all?

    4- for string we could store stats for some prefix of the string?

    5- in histograms, we should store number distinct values as well in the bucket.

    6- can we store correlation between two columns? it would help figuring out selectivity more accurately.


    [Hive]: Add ability to compute statistics on hive tables
    --------------------------------------------------------

    Key: HADOOP-4488
    URL: https://issues.apache.org/jira/browse/HADOOP-4488
    Project: Hadoop Core
    Issue Type: New Feature
    Components: contrib/hive
    Reporter: Ashish Thusoo
    Assignee: Ashish Thusoo

    Add commands to collect partition and column level statistics in hive.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ashish Thusoo (JIRA) at Oct 23, 2008 at 8:31 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642266#action_12642266 ]

    Ashish Thusoo commented on HADOOP-4488:
    ---------------------------------------

    all good points... comments are as follows:

    for 1. yes we can store this relatively easily - will add it.
    for 2. the number of bins is optional and not mandatory. We can store the system default as we do for the other variables in hive-conf.xml
    for 3. I am just planning to store the distinct values - no rounding or not storing them at all. Don't want to overload the semantics of this. Not sure how useful rounding is given that
    for 4. there are a number of other useful stats about strings, clearly prefixes are useful for like 'xyz%' kind of operations. We can perhaps add these later considering that we do not even have the base level stats. We can discuss this more to see what makes sense for like and regex kind of predicates.
    for 5. possible... though if we have sufficient number of bins the utility of this stat decreases. But will evaluate this nonetheless.
    for 6. implementable though computationally prohibitive and it is not very clear as to how much benefit this would give - clearly if most of the columns are weekly correlated (independent) then this is not of much use and many times that is quite true. Again this is more advanced stuff. Probably better in a follow on after the base level stats are working...

    Will also add to this list the avg size per column that you were mentioning yesterday.

    So the new list is:

    Table stats:
    1. # rows
    2. size of partition
    3. Avg size of a row
    4. # blocks
    5. # files

    Columns stats:
    1. # distinct values
    2. # null values
    3. min 3 values
    4. max 3 values
    5. histogram: frequency and height balanced.
    6. avg size of column
    [Hive]: Add ability to compute statistics on hive tables
    --------------------------------------------------------

    Key: HADOOP-4488
    URL: https://issues.apache.org/jira/browse/HADOOP-4488
    Project: Hadoop Core
    Issue Type: New Feature
    Components: contrib/hive
    Reporter: Ashish Thusoo
    Assignee: Ashish Thusoo

    Add commands to collect partition and column level statistics in hive.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupcommon-dev @
categorieshadoop
postedOct 22, '08 at 6:44p
activeOct 23, '08 at 8:31p
posts4
users1
websitehadoop.apache.org...
irc#hadoop

1 user in discussion

Ashish Thusoo (JIRA): 4 posts

People

Translate

site design / logo © 2022 Grokbase