I guess my last message was obvious/stupid since I am not getting any
responses, but hopefully I won't be 0/2.

I love using Pig and I think it's a fantastic tool for creating complex
map-reduce programs quickly. That said, I am having 2 problems in
addition to the one below. Hopefully I am just missing something easy
and someone can shoot me a quick response.

I have written my own eval func that extracts events from our event log.
It then splits the event by some arbitrary regex and then finds the last
match from that event that does not match another regex. The queries are
as follows.

eventlog = LOAD '/user/hadoop/index8mbGZnotes/{1205478000254_1205857683529.gz,1205857686408_1206295646386.gz,1206295646442_1206757710701.gz,1206757712403_1207113039900.gz,1207113039930_1207205997234.gz}' USING PigStorage(' ');
filterDate = FILTER eventlog BY $1 >= '1204358400000' AND $1 <= '1209625200000';
filterCh = FILTER filterDate BY $15 eq 'Sony' OR $15 eq 'Dell' OR $15 eq 'HP';
filter1 = FILTER filterCh BY ($5 == 11 AND $6 == 15 AND $7 == 406);
filtered = FOREACH filter1 GENERATE LastPageExtractor($8,'.*(ui/cancel.*)|(.*ui/error.*)','[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}'), $15;
grouped = GROUP filtered BY ($0, $1);
resultUnordered = FOREACH grouped GENERATE FLATTEN(group), FLATTEN(COUNT(filtered)) PARALLEL 14;

The func is LastPageExtractor(inputValue, excludeRegex, splitRegex)
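
For readers following along, here is a minimal sketch of the logic described above, in plain Java without the Pig EvalFunc wrapper. It assumes one reading of the description: split the input on splitRegex, then walk the pieces from last to first and return the first one that does not fully match excludeRegex. The class name, the method name, and the sample event string are made up for illustration; the real function and log format may differ.

```java
import java.util.regex.Pattern;

class LastPageSketch {
    // Hypothetical helper: returns the last piece of `input` (split on
    // `splitRegex`) that does not fully match `excludeRegex`, or null if
    // every piece is empty or excluded.
    static String lastPage(String input, String excludeRegex, String splitRegex) {
        if (input == null) {
            return null;
        }
        Pattern exclude = Pattern.compile(excludeRegex);
        String[] pieces = input.split(splitRegex);
        // Walk backwards so the first acceptable piece is the last one in the event.
        for (int i = pieces.length - 1; i >= 0; i--) {
            String piece = pieces[i].trim();
            if (!piece.isEmpty() && !exclude.matcher(piece).matches()) {
                return piece;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample event; the two regexes are the ones from the query above.
        String event = "ui/home 10:15:00 04/02/2008 ui/plans 10:16:30 04/02/2008 ui/cancel";
        String exclude = ".*(ui/cancel.*)|(.*ui/error.*)";
        String split = "[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}";
        System.out.println(lastPage(event, exclude, split)); // prints "ui/plans"
    }
}
```

In this sample the final piece, ui/cancel, is excluded, so the sketch falls back to ui/plans.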

This all works fine, but I would like to change my split regex to
\\|+[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}
However, when I do that I get this:

Exception in thread "Thread-6"
org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at
line 1, column 93. Encountered: "|" (124), after : "\'\\"

Is there some special escape sequence I should know about? I searched
escape in PigLatin Wiki and found nothing.
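
One possible workaround, offered as an untested assumption rather than a confirmed fix: write the literal pipe as a character class, [|]+, so the regex contains no backslash at all. The working queries above already carry an unescaped | inside a quoted string, which suggests the lexer trips on the backslash rather than on the pipe. The sketch below only checks, in plain Java, that the two spellings split a string the same way; it says nothing about what the Pig parser accepts. The class name and the sample input are hypothetical.

```java
import java.util.Arrays;

class PipeSplitSketch {
    // Timestamp part copied from the split regex in the post.
    static final String TS = "[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}";

    // Backslash form: the Java literal "\\|+" is the regex \|+ (one or more pipes).
    static final String BACKSLASH_FORM = "\\|+" + TS;

    // Character-class form: [|]+ also means one or more literal pipes,
    // but needs no backslash.
    static final String CLASS_FORM = "[|]+" + TS;

    static String[] split(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        // Made-up sample input.
        String event = "ui/home||10:15:00 04/02/2008ui/plans";
        System.out.println(Arrays.toString(split(event, BACKSLASH_FORM))); // [ui/home, ui/plans]
        System.out.println(Arrays.toString(split(event, CLASS_FORM)));     // [ui/home, ui/plans]
    }
}
```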

The second problem is that I am not able to register jars/funcs
without packaging them into the pig.jar in the
org.apache.pig.impl.builtin package. I have tried everything I can think
of and everything in the documentation. When I register the jar with
PigServer.registerJar and use the fully qualified function name,
all the task trackers fail with:

java.lang.RuntimeException: could not instantiate
'telespree.analytics.pig.LastPageExtractor' with arguments '[]'

I do:

server.registerJar("c:\\telespree.jar");

and

filtered = FOREACH filter1 GENERATE telespree.analytics.pig.LastPageExtractor($8,'.*(ui/cancel.*)|(.*ui/error.*)','[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}'), $15;");

I even tried to put these functions in the default package in pig.jar
since I saw in the code you do lookups with
packageImportList.add("");
packageImportList.add("org.apache.pig.builtin.");
packageImportList.add("com.yahoo.pig.yst.sds.ULT.");
packageImportList.add("org.apache.pig.impl.builtin.");

So I figured the "" import would find my function, but alas I
get the same error:
java.lang.RuntimeException: could not instantiate 'LastPageExtractor'
with arguments '[]'

However if I package them in org.apache.pig.impl.builtin it all works
fine.

Any help on these 3 areas would be much appreciated!

-Michael




-----Original Message-----
From: Michael Harris
Sent: Wednesday, April 02, 2008 10:47 AM
To: pig-user@incubator.apache.org
Subject: MapReduceLauncher static fields

Hello,



I have written a pig application that does a fixed set of queries
on-demand through a web interface. I am trying to get the progress of
the queries from the PigServer, but I have noticed that the source of
the progress data is all static fields in the MapReduceLauncher. Clearly
my webapp must be able to handle multiple concurrent pig queries (and be
thread-safe) and I would like to report the progress of each individual
query (job set) to the end user. Do these static fields mean that I
would get the progress of multiple concurrent queries initiated by
different PigServer instances? Or would I get the overall progress of
the MapReduceLauncher for all queries currently being executed?



Thanks,
Michael
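
The concern in this question can be illustrated with a small, self-contained Java sketch. It does not show how Pig's MapReduceLauncher actually behaves; both classes below are hypothetical stand-ins. It only demonstrates the general point that a static field is shared by every instance, while an instance field gives each query its own value.

```java
class StaticProgressSketch {
    // Hypothetical launcher whose progress lives in a static field,
    // so every instance reads and writes the same value.
    static class SharedLauncher {
        static double progress;
        void report(double p) { progress = p; }
        double progress() { return progress; }
    }

    // Hypothetical alternative that keeps progress per instance.
    static class PerQueryLauncher {
        private volatile double progress;
        void report(double p) { progress = p; }
        double progress() { return progress; }
    }

    public static void main(String[] args) {
        SharedLauncher a = new SharedLauncher();
        SharedLauncher b = new SharedLauncher();
        a.report(0.25);
        b.report(0.90);
        System.out.println(a.progress()); // prints 0.9: a sees b's update

        PerQueryLauncher c = new PerQueryLauncher();
        PerQueryLauncher d = new PerQueryLauncher();
        c.report(0.25);
        d.report(0.90);
        System.out.println(c.progress()); // prints 0.25: c keeps its own value
    }
}
```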


  • Alan Gates at Apr 8, 2008 at 10:44 pm
    The issue with not being able to escape regular expressions looks like a
    bug. You should file a JIRA so that it gets addressed.

    As for not being able to instantiate your function when it's in another
    jar, we have not seen that happen in this situation. But we have not tested it
    extensively on Windows either. Could you post your jar file (or one
    that reproduces it with a simple function if your function is complex)?

    Alan.


  • Mridul Muralidharan at Apr 10, 2008 at 9:27 am
    Hi Michael,

    Not sure about the character escaping, but I do have my UDFs in jars
    independent of the Pig jars, and that works fine for me. You might want to
    check for path issues?

    Regards,
    Mridul

  • Mridul Muralidharan at Apr 10, 2008 at 9:39 am

    Also check whether the UDF has an empty constructor (or no constructor at all).
    IIRC Pig uses the null (no-argument) constructor to create the UDF.

    Regards,
    Mridul
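
The constructor point can be checked outside Pig with plain Java reflection. The sketch below is an illustration under assumptions, not Pig's actual loading code: the class names are hypothetical, and the error text is copied from the post only to make the connection visible. It shows that a class with no declared constructor can be created with no arguments, while a class that declares only a constructor taking arguments cannot.

```java
import java.lang.reflect.Constructor;

class NoArgInstantiationSketch {
    // Hypothetical UDF-like class: no declared constructor, so the compiler
    // supplies a public no-argument one.
    public static class WithDefault { }

    // Hypothetical UDF-like class that only declares a one-argument constructor.
    public static class WithoutDefault {
        public WithoutDefault(String splitRegex) { }
    }

    // Tries to build an instance with no arguments, as a framework that
    // only knows the class might do.
    static Object instantiate(Class<?> type) {
        try {
            Constructor<?> ctor = type.getConstructor();
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(
                "could not instantiate '" + type.getName() + "' with arguments '[]'", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiate(WithDefault.class) != null); // prints true
        try {
            instantiate(WithoutDefault.class);
        } catch (RuntimeException e) {
            // The cause is a NoSuchMethodException for the missing no-arg constructor.
            System.out.println(e.getCause().getClass().getSimpleName());
        }
    }
}
```

If the UDF in question does have a usable no-argument constructor, this cause is ruled out and the path or classpath on the task trackers is the next thing to check.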


Discussion Overview
group: user @
categories: pig, hadoop
posted: Apr 4, '08 at 4:45p
active: Apr 10, '08 at 9:39a
posts: 4
users: 3
website: pig.apache.org
