Hi,

I have a table that is partitioned by hour. I want to list the partitions that fall before a certain time limit. The problem is that the result of the query is unpredictable across consecutive runs: sometimes it gives the expected result, but sometimes the same query returns the list of all partitions without applying the date filter.

Table schema:

CREATE TABLE blah (EVENT_TIME STRING, SEQUENCE_NUMBER STRING, MACHINE_NAME STRING, APP_INSTANCE_NAME STRING, LOG_TYPE STRING, MESSAGE STRING)
PARTITIONED BY (TIME_IN_HOUR STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES
("input.regex" = "(?m)(?s)^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d{3}Z) (\\d+)? ?([^ ]+) ([^ ]+) ([^ ]+) (.*$)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s")
STORED AS TEXTFILE;

Query:

hive -S -e "add jar /usr/lib/hive/lib/hive-serde-0.5.0.jar; add jar /usr/lib/hive/lib/hive_contrib.jar; select distinct time_in_hour from blah where time_in_hour < \"2010_10_17T01\";"
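
One way to cross-check the expected partition list without running a MapReduce job might be to filter the output of `SHOW PARTITIONS` in the shell. The sketch below assumes each output line has the form `time_in_hour=<value>`. The function name `partitions_before` is made up for illustration, and the string comparison only matches chronological order because of the zero-padded `YYYY_MM_DDTHH` naming used in the query.

```shell
#!/bin/sh
# Hypothetical helper: reads lines like "time_in_hour=2010_10_16T23" on stdin
# and prints the partition values that sort strictly before the cutoff ($1).
partitions_before() {
    # Appending "" forces awk to compare both operands as strings.
    awk -F= -v cutoff="$1" '($2 "") < (cutoff "") { print $2 }'
}
```

A possible invocation, reusing the table name from the schema, would be `hive -S -e "show partitions blah;" | partitions_before 2010_10_17T01`.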

Note: I noticed the same unpredictable behavior when using the Hive CLI.
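
Since the symptom is that identical runs disagree, a small loop that repeats a command and compares each result with the first may help capture a failing run. This is only a sketch: `outputs_consistent` is a hypothetical name, and it accepts any command, so the hive command shown in the query is just one possible argument.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and report whether every run
# produced the same standard output as the first run.
# Usage: outputs_consistent N command [args...]
# Returns 0 if all runs match, 1 if a run differs, 2 if the command fails.
outputs_consistent() {
    runs=$1
    shift
    first=$("$@") || return 2
    i=1
    while [ "$i" -lt "$runs" ]; do
        current=$("$@") || return 2
        if [ "$current" != "$first" ]; then
            echo "run $((i + 1)) differs from run 1" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

If the row order of the output is not guaranteed, it is probably worth wrapping the command so that its output is piped through `sort` before comparison; otherwise a reordering would be reported as a difference.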

Any pointers or thoughts are greatly appreciated.

Thanks,
Ankita



The information contained in this email message and its attachments is intended only for the private and confidential use of the recipient(s) named above, unless the sender expressly agrees otherwise. Transmission of email over the Internet is not a secure communications medium. If you are requesting or have requested the transmittal of personal data, as defined in applicable privacy laws by means of email or in an attachment to email, you must select a more secure alternate means of transmittal that supports your obligations to protect such personal data. If the reader of this message is not the intended recipient and/or you have received this email in error, you must take no action based on the information in this email and you are hereby notified that any dissemination, misuse or copying or disclosure of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by email and delete the original message.


  • Bakshi, Ankita at Oct 20, 2010 at 5:51 pm
    Attached are two such MapReduce jobs that produced different results.
    Job ID 58 produced the list of all partitions without the date filter applied, and job ID 62 produced the expected output.

    Thanks,
    Ankita

    ________________________________
    From: Bakshi, Ankita
    Sent: Wednesday, October 20, 2010 10:33 AM
    To: '[email protected]'
    Subject: Unpredictable results when using partitions in where clause


  • Bakshi, Ankita at Oct 20, 2010 at 11:53 pm
    Please ignore this question. It was a bug on our side.


Discussion Overview
group: user @
categories: hive, hadoop
posted: Oct 20, '10 at 5:34p
active: Oct 20, '10 at 11:53p
posts: 3
users: 1
website: hive.apache.org
