Grokbase Groups Hive user July 2009
FAQ
Hi,
I am a beginner at Hive's SQL so I am sorry if this question is answered
somewhere else. I tried to find the answer in Wiki, but no luck.

I have a dataset in which one of the columns is text. I need to count
number of records that match certain regex on that column. There are a
number of different regexes that I need to count records for and it
seems that there should be a way to do it in one pass through the data
with hive.

So how do I do it?

This doesn't seem to work (on Hive 0.3):
select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
LIKE "%%othertext%%") as CountB from sometable;


Thank you for your help in advance!


[unrelated: We've written a simple AWS S3 log format deserializer for
Hive. It needs some polishing - if anyone needs it or wants to polish it
up for inclusion, let me know]

--
Andraz Tori, CTO
Zemanta Ltd, New York, London, Ljubljana
www.zemanta.com
mail: andraz@zemanta.com
tel: +386 41 515 767
twitter: andraz, skype: minmax_test

Search Discussions

  • Saurabh Nanda at Jul 27, 2009 at 6:58 pm
    I think you can do that with regex replace and GROUP BY. Something
    like this- select replaced_col, count(1) from (select
    regex_replace(original_col, '.*(text1|text2|text3).*', '$1') as
    replaced_col from table)
    On 7/28/09, Andraz Tori wrote:
    Hi,
    I am a beginner at Hive's SQL so I am sorry if this question is answered
    somewhere else. I tried to find the answer in Wiki, but no luck.

    I have a dataset in which one of the columns is text. I need to count
    number of records that match certain regex on that column. There are a
    number of different regexes that I need to count records for and it
    seems that there should be a way to do it in one pass through the data
    with hive.

    So how do I do it?

    This doesn't seem to work (on Hive 0.3):
    select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
    LIKE "%%othertext%%") as CountB from sometable;


    Thank you for your help in advance!


    [unrelated: We've written a simple AWS S3 log format deserializer for
    Hive. It needs some polishing - if anyone needs it or wants to polish it
    up for inclusion, let me know]

    --
    Andraz Tori, CTO
    Zemanta Ltd, New York, London, Ljubljana
    www.zemanta.com
    mail: andraz@zemanta.com
    tel: +386 41 515 767
    twitter: andraz, skype: minmax_test



    --
    http://nandz.blogspot.com
    http://foodieforlife.blogspot.com
  • Saurabh Nanda at Jul 27, 2009 at 7:00 pm
    I think you can do that with regex replace and GROUP BY. Something
    like this- select replaced_col, count(1) from (select
    regex_replace(original_col, '.*(text1|text2|text3).*', '$1') as
    replaced_col from table) a group by a.replaced_col
    On 7/28/09, Andraz Tori wrote:
    Hi,
    I am a beginner at Hive's SQL so I am sorry if this question is answered
    somewhere else. I tried to find the answer in Wiki, but no luck.

    I have a dataset in which one of the columns is text. I need to count
    number of records that match certain regex on that column. There are a
    number of different regexes that I need to count records for and it
    seems that there should be a way to do it in one pass through the data
    with hive.

    So how do I do it?

    This doesn't seem to work (on Hive 0.3):
    select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
    LIKE "%%othertext%%") as CountB from sometable;


    Thank you for your help in advance!


    [unrelated: We've written a simple AWS S3 log format deserializer for
    Hive. It needs some polishing - if anyone needs it or wants to polish it
    up for inclusion, let me know]

    --
    Andraz Tori, CTO
    Zemanta Ltd, New York, London, Ljubljana
    www.zemanta.com
    mail: andraz@zemanta.com
    tel: +386 41 515 767
    twitter: andraz, skype: minmax_test



    --
    http://nandz.blogspot.com
    http://foodieforlife.blogspot.com
  • David Lerman at Jul 27, 2009 at 7:21 pm
    Saraubh, seems like this approach would fail if a particular line might
    match more than one regex, for example if you're looking for lines that
    match "dog" or "good" and one line contains "dogood".

    Andraz, have you tried something like:

    select sum(if(textcolumn like "...", 1, 0)) as CountA, sum(if(textcolumn
    like "...", 1, 0)) as CountB from sometable;
    On 7/27/09 3:00 PM, "Saurabh Nanda" wrote:

    I think you can do that with regex replace and GROUP BY. Something
    like this- select replaced_col, count(1) from (select
    regex_replace(original_col, '.*(text1|text2|text3).*', '$1') as
    replaced_col from table) a group by a.replaced_col
    On 7/28/09, Andraz Tori wrote:
    Hi,
    I am a beginner at Hive's SQL so I am sorry if this question is answered
    somewhere else. I tried to find the answer in Wiki, but no luck.

    I have a dataset in which one of the columns is text. I need to count
    number of records that match certain regex on that column. There are a
    number of different regexes that I need to count records for and it
    seems that there should be a way to do it in one pass through the data
    with hive.

    So how do I do it?

    This doesn't seem to work (on Hive 0.3):
    select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
    LIKE "%%othertext%%") as CountB from sometable;


    Thank you for your help in advance!


    [unrelated: We've written a simple AWS S3 log format deserializer for
    Hive. It needs some polishing - if anyone needs it or wants to polish it
    up for inclusion, let me know]

    --
    Andraz Tori, CTO
    Zemanta Ltd, New York, London, Ljubljana
    www.zemanta.com
    mail: andraz@zemanta.com
    tel: +386 41 515 767
    twitter: andraz, skype: minmax_test



    --
    http://nandz.blogspot.com
    http://foodieforlife.blogspot.com
  • Zheng Shao at Jul 27, 2009 at 8:45 pm
    Hi Andraz,

    I just opened a JIRA for AWS S3 log format.
    Can you attach a patch file to: https://issues.apache.org/jira/browse/HIVE-693 ?

    For your question, I think the approach suggested by David Lerman
    should work fine.

    Thanks,
    Zheng

    On Mon, Jul 27, 2009 at 11:35 AM, Andraz Toriwrote:
    Hi,
    I am a beginner at Hive's SQL so I am sorry if this question is answered
    somewhere else. I tried to find the answer in Wiki, but no luck.

    I have a dataset in which one of the columns is text. I need to count
    number of records that match certain regex on that column. There are a
    number of different regexes that I need to count records for and it
    seems that there should be a way to do it in one pass through the data
    with hive.

    So how do I do it?

    This doesn't seem to work (on Hive 0.3):
    select sum(textcolumn LIKE "%%sometext%%") as CountA, sum(textcolumn
    LIKE "%%othertext%%") as CountB from sometable;


    Thank you for your help in advance!


    [unrelated: We've written a simple AWS S3 log format deserializer for
    Hive. It needs some polishing - if anyone needs it or wants to polish it
    up for inclusion, let me know]

    --
    Andraz Tori, CTO
    Zemanta Ltd, New York, London, Ljubljana
    www.zemanta.com
    mail: andraz@zemanta.com
    tel: +386 41 515 767
    twitter: andraz, skype: minmax_test




    --
    Yours,
    Zheng
  • Andraz Tori at Jul 28, 2009 at 9:57 pm
    Thanks for the counting solution!

    Zhao, I've uploaded the S3 log parser to HIVE-693.

    Among other things, I also noticed a Hive bug today: when using hive in
    server mode (via python) to import 400 different partitions one after
    another, datanode started reporting "too many open files" errors in its
    logs. The analysis showed that Hive is not closing connections to
    datanode at all when doing loads like this:

    LOAD DATA LOCAL INPATH '/home/neith/xshairlogs//mapped-2009-04-24.gz'
    OVERWRITE INTO TABLE shairlogs PARTITION (pdate='2009-04-24')


    [don't know if that happens also in CLI mode - doing 300 commands like
    that is a bit too tedious to see if it can be reproduced :]


    bye
    andraz

    From Zheng Shao <zsh...@gmail.com> Subject Re: counting different
    regexes in a single pass Date Mon, 27 Jul 2009 20:46:24 GMT
    Hi Andraz,

    I just opened a JIRA for AWS S3 log format.
    Can you attach a patch file to: https://issues.apache.org/jira/browse/HIVE-693 ?

    For your question, I think the approach suggested by David Lerman
    should work fine.

    --
    Andraz Tori, CTO
    Zemanta Ltd, New York, London, Ljubljana
    www.zemanta.com
    mail: andraz@zemanta.com
    tel: +386 41 515 767
    twitter: andraz, skype: minmax_test

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedJul 27, '09 at 6:34p
activeJul 28, '09 at 9:57p
posts6
users4
websitehive.apache.org

People

Translate

site design / logo © 2021 Grokbase