Grokbase Groups Pig user October 2010
Hi everybody,

I'm trying to use vanilla Pig 0.7.0 to generate monthly consolidations
of log files with relatively long lines: 95 fields and growing, of which
I'll be using just 7. Just so I didn't have to declare all the fields in
the LOAD command, I tried to define the schema in my first
FOREACH...GENERATE, so the first lines of my script look like this:

input = LOAD '/tmp/test.log';
A = FILTER input BY SIZE(*) >= 95;
B = FOREACH A GENERATE (long)$94, (chararray)$93, (long)$16, (long)$27,
(long)$23, (int)$2, (int)$3
AS publisher, associate, site, category,
story, hits, comments;

As you can guess by now, Pig complains while still parsing:

ERROR 1000: Error during parsing. Invalid alias: category in null

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
during parsing. Invalid alias: associate in null
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
at
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:73)

Am I overlooking anything? Should I give up and declare a 95-field
schema? Write a LOAD UDF? Or is there a simpler way to do what I want?

Thank you!
Marcos Rubinelli


  • Renato Marroquín Mogrovejo at Oct 22, 2010 at 2:16 am
    Hi Marcos, just a quick question: have you checked whether your data
    has all the fields in every row? Maybe you are dealing with sparse data
    and, because of the amount of data, you are not noticing it.
    First, what does your data look like? I would first try a
    subset of the data, and then write my own UDF to parse the lines and
    return just the values I want.


    Renato M.
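
The check Renato suggests (do all rows really have every field?) can be run outside Pig on a sample of the log before touching the script. The following is only a minimal sketch in Python, not something from the thread: it assumes the log is tab-delimited (the default when LOAD has no USING clause), and the function names `field_count_histogram` and `short_lines` are made up for illustration.

```python
from collections import Counter
from itertools import islice


def field_count_histogram(lines, delimiter="\t", limit=None):
    """Map each observed field count to the number of lines that have it.

    `limit` restricts the check to the first N lines, i.e. a subset of the data.
    An empty line counts as having one (empty) field.
    """
    sample = islice(lines, limit) if limit is not None else lines
    return Counter(len(line.rstrip("\r\n").split(delimiter)) for line in sample)


def short_lines(lines, min_fields=95, delimiter="\t"):
    """Yield (line_number, field_count) for lines with fewer than min_fields fields."""
    for number, line in enumerate(lines, start=1):
        count = len(line.rstrip("\r\n").split(delimiter))
        if count < min_fields:
            yield number, count
```

Passing an open file handle for the log (for example `/tmp/test.log` from the original post) with `limit=10000` gives a quick histogram; if every sampled line reports 95 or more fields, sparse rows are probably not the cause of the problem.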

  • Bryce Poole at Oct 22, 2010 at 2:30 am

    I believe each expression in the FOREACH statement needs its own AS
    clause, since AS names only the expression immediately before it:

    B = FOREACH A GENERATE (long)$94 AS publisher, (chararray)$93 AS associate,
    (long)$16 AS site, (long)$27 AS category,
    (long)$23 AS story, (int)$2 AS hits, (int)$3 AS comments;

    Hope that helps,
    Bryce

posted Oct 20, '10 at 2:28p
active Oct 22, '10 at 2:30a