Grokbase Groups Pig user June 2011
Hello,

I'm having an issue with regex in Pig.

Specifically, I'm loading an Apache access log and trying to break out the
bits from the query string:

logs = LOAD '$input' using logloader as (remoteHost:CHARARRAY,
hyphen:CHARARRAY, hyphen2:CHARARRAY, time:CHARARRAY, method:CHARARRAY,
uri:CHARARRAY, protocol:CHARARRAY, statusCode:CHARARRAY,
responseSize:CHARARRAY, treferer:CHARARRAY, agent:CHARARRAY);

full_logs = FOREACH logs GENERATE time, uri, FLATTEN(REGEX_EXTRACT(uri,
'id=[0-9]', 2));

The uri looks like:
/khello.html?ref=http%3A%2F%2Fwww.google.com%2F&k=4165427574dfdb75e0a37a8c13ab757d4273a283&id=1234

However, when I run this simple Pig script, I get the uri but not the 'id'
parameter.

I then tried using "\d" instead of [0-9], but it still doesn't work.

I tried both [0-9] and \d in PHP and I get 'id=1' and '1', so I'm not sure
what I'm doing wrong.

Thanks in advance.


  • Jonathan Coveney at Jun 18, 2011 at 1:04 am
    REGEX_EXTRACT doesn't need to be flattened, since it returns a single
    chararray. In this case, use:

    REGEX_EXTRACT(uri,'id=(\\d*)',0); --returns id=1234
    or
    REGEX_EXTRACT(uri,'id=(\\d*)',1); --returns 1234

    You were missing the *, which is why PHP only grabbed the 1. In Pig, the
    pattern also had no capturing group, so index 2 had nothing to return.

  • Irooniam at Jun 18, 2011 at 1:14 am
    Awesome, works as advertised.

    Thanks for the help Jonathan.
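
The group-index behaviour discussed in this thread can be checked outside Pig. The sketch below is plain Java, on the assumption that Pig's REGEX_EXTRACT is built on java.util.regex and returns null when the requested group does not exist. The class name RegexGroupDemo and the helper extract are hypothetical and only mimic that behaviour; they are not Pig's own code. Note that the doubled backslash in the Pig literal 'id=(\\d*)' is written the same way in a Java string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that imitates the assumed behaviour of REGEX_EXTRACT.
final class RegexGroupDemo {

    // Returns the requested group of the first match, or null when the
    // pattern does not match or has no group with that index.
    static String extract(String input, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find() && group <= m.groupCount()) {
            return m.group(group);
        }
        return null;
    }

    public static void main(String[] args) {
        // The sample uri from the original question, joined onto one line.
        String uri = "/khello.html?ref=http%3A%2F%2Fwww.google.com%2F"
                + "&k=4165427574dfdb75e0a37a8c13ab757d4273a283&id=1234";

        // No capturing groups, so group 2 does not exist.
        System.out.println(extract(uri, "id=[0-9]", 2));   // null
        // Group 0 is the whole match; without a quantifier only one digit.
        System.out.println(extract(uri, "id=[0-9]", 0));   // id=1
        // With a quantifier and a capturing group.
        System.out.println(extract(uri, "id=(\\d*)", 0));  // id=1234
        System.out.println(extract(uri, "id=(\\d*)", 1));  // 1234
    }
}
```

Because \d* also matches zero digits, a uri ending in "id=" would yield an empty string for group 1; \d+ may be preferable if at least one digit is required.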

