Grokbase Groups Hive user May 2011
(As an FYI I'm relatively new to Hive and have no previous SQL
experience, so have been struggling a bit with the Language manual which
seems to assume previous SQL experience)

Suppose I have a table with a column that contains domain names (e.g.
hadoop.apache.org). I want to perform a count of all second-level
domains, so that hadoop.apache.org and hive.apache.org would count in
the same bucket.

Now I could count things for a particular second-level domain like this:

SELECT
  year, month, day, hour, COUNT(1) AS count
FROM
  domainlog
WHERE
  year = 2011 AND
  month = 05 AND
  day = 15 AND
  domain RLIKE ".*[.]apache[.]org"
GROUP BY
  year, month, day, hour

However, I'm not seeing a way to sum over all second-level domains
rather than a particular one. I basically want to group everything using
a regular expression along the lines of ".*[.][^.]*[.][^.]*" and then
output one line with a count for each common portion. Any pointers in
the right direction would be welcome.

Thanks
- Adam


  • Loren Siebert at May 23, 2011 at 8:12 pm
    Use regexp_extract() and group on that.
    http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#String_Functions

    For example, to get the class C IP block from an IP address in a column called 'host':

    regexp_extract(host, '(.*)\\.\\d{1,3}', 1) classc
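
    Applied to your domain case, a sketch along these lines might work (untested; it assumes the domain column holds bare names like hadoop.apache.org, with no trailing dot):

    ```sql
    -- group by the last two labels, i.e. the second-level domain
    SELECT
      regexp_extract(domain, '([^.]+\\.[^.]+)$', 1) AS sld,
      COUNT(1) AS count
    FROM
      domainlog
    WHERE
      year = 2011 AND month = 05 AND day = 15
    GROUP BY
      regexp_extract(domain, '([^.]+\\.[^.]+)$', 1)
    ```

    Note that Hive makes you repeat the expression in the GROUP BY rather than referring to the alias.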



  • Adam Phelps at May 24, 2011 at 1:07 am
    That function does seem to be approximately what I need, but I can't
    quite get its usage right for what I'm attempting to do.

    As an example, here's a try where I'm narrowing it down to only
    subdomains of facebook.com:

    SELECT
      regexp_extract(qname, '^((.*[.])|(.{0}))[^.]+\\.[^.]+\\.', 1),
      COUNT(1) AS count
    FROM
      querylog
    WHERE
      year = 2011 AND
      month = 05 AND
      day = 15 AND
      hour = 00 AND
      qname RLIKE "^((.*[.])|(.{0}))facebook[.]com[.]$"
    GROUP BY
      regexp_extract(qname, '^((.*[.])|(.{0}))[^.]+\\.[^.]+\\.', 1)

    What I want to output in this case is something like
    facebook.com XXXXXX

    What I actually get goes like this:

    0-101.channel. 52
    0-105.channel. 50
    0-109.channel. 70
    0-11.channel. 96
    0-112.channel. 52
    0-116.channel. 53
    0-123.channel. 48

    That is actually counting each subdomain of facebook.com separately,
    rather than rolling everything up to the second-level domain itself. I
    tried different int values for the second argument of regexp_extract()
    but didn't find one that did what I need.
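
    One direction I may try next is capturing the last two labels directly, instead of the leading prefix; a rough sketch (untested, assuming qname keeps its trailing dot):

    ```sql
    -- capture the final "label.label." pair, e.g. "facebook.com"
    SELECT
      regexp_extract(qname, '([^.]+\\.[^.]+)\\.$', 1),
      COUNT(1) AS count
    FROM
      querylog
    WHERE
      year = 2011 AND month = 05 AND day = 15 AND hour = 00 AND
      qname RLIKE "^((.*[.])|(.{0}))facebook[.]com[.]$"
    GROUP BY
      regexp_extract(qname, '([^.]+\\.[^.]+)\\.$', 1)
    ```

    That should put www.facebook.com. and 0-101.channel.facebook.com. into the same facebook.com bucket.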

    Thanks
    - Adam

  • Luke Forehand at May 23, 2011 at 8:21 pm
    Hello,

    Can someone please provide an example of how to store a serialized
    object in a Hive field? A field type of byte array, binary, or blob is
    what I'm really looking for, but if something less trivial is involved,
    some instruction would be much appreciated. This object is used by a
    custom UDF later in the processing pipeline.

    -Luke
  • Steven Wong at May 23, 2011 at 11:59 pm
    Hive does not support a blob data type. One option is to store your binary data encoded as a string (such as base64) and define the column in Hive as string.
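
    For example (a sketch; it assumes your Hive build ships the base64()/unbase64() string UDFs — otherwise you can do the encoding/decoding in your own loader or UDF code):

    ```sql
    -- store the serialized object as base64 text
    CREATE TABLE blobs (id INT, payload STRING);

    -- decode back to raw bytes at query time for the UDF
    SELECT id, unbase64(payload) FROM blobs;
    ```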


  • Luke Forehand at May 24, 2011 at 2:31 pm
    Steven,

    Thanks for your reply! I have written it the way you mentioned, based
    on an earlier post on this mailing list. I'm concerned about having to
    encode/decode the string as base64, and I'm wondering how much this
    will impact my job run time.

    I have also written a UDF that emits a byte array, stored in a field of
    type array<tinyint>. When reading this field, the ObjectInspector is a
    ListObjectInspector with a primitive Java byte inspector for the list
    elements. Reading this field in the UDF feels clunky because I have to
    iterate over the list, copying each element into a byte array, before I
    can use it.

    Given both approaches, which one do you think has the least performance
    overhead?

    Thanks,
    Luke


  • Steven Wong at May 24, 2011 at 7:24 pm
    I claim no experience storing blobs in Hive, but it sounds to me like using an array/list will be quite inefficient, in terms of both size and run time.



Discussion Overview
group: user@hive.apache.org
categories: hive, hadoop
posted: May 23, '11 at 8:04p
active: May 24, '11 at 7:24p
posts: 7
users: 4
website: hive.apache.org
