Grokbase Groups Pig user April 2010
Hello,

We have a file hierarchy we want to be accessible with MR/Hive/Pig. That
way everyone can pick their favorite :)

Currently the layout looks like this.

/user/root/data/datepartition1/subpartition2/{sequence file1, sequence fileN}

I have just installed pig-0.6.0. I am trying to follow the advice here:
http://stackoverflow.com/questions/2423949/storing-data-to-sequencefile-from-apache-pig

REGISTER /opt/pig-0.6.0/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
raw = load 'datafile' USING SequenceFileLoader as (version:chararray, id:int, date:chararray);

2010-04-20 12:10:46,821 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. org.apache.pig.impl.logicalLayer.FrontendException cannot be cast to java.lang.Error

[root@rs01 piggybank]# more /root/pig_1271779744816.log
Pig Stack Trace
---------------
ERROR 2999: Unexpected internal error. org.apache.pig.impl.logicalLayer.FrontendException cannot be cast to java.lang.Error

java.lang.ClassCastException: org.apache.pig.impl.logicalLayer.FrontendException cannot be cast to java.lang.Error
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1440)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:738)
        at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:386)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:352)

So either I have hit a bug or I have done something wrong. It looks like a
bug, because if Pig cannot cast the error correctly, something is wrong.

Two questions:
1) Can I load all the files in a directory rather than operating on one
file?

raw = load '/datadir/*' USING SequenceFileLoader as (version:chararray, id:int, date:chararray);
rather than
raw = load '/datafile' USING SequenceFileLoader as (version:chararray, id:int, date:chararray);

2) PigStorage seems to let me specify a tab delimiter. How does one specify
a tab delimiter with SequenceFileLoader? Or does one have to pass the entire
line to some other Pig component to be tokenized?

Thank you,

  • Dmitriy Ryaboy at Apr 20, 2010 at 5:11 pm
    Edward,
    The sequence file loader in piggybank is more a proof of concept than a real
    loader. It works, but only if your data happens to match exactly the format
    it expects -- namely, key-value pairs where both the key and the value are
    one of {Text, IntWritable, LongWritable, FloatWritable, DoubleWritable,
    BooleanWritable, ByteWritable}. You seem to be loading 3 columns, which
    doesn't match this format.

    I am not sure what you mean by a delimiter for a SequenceFile. SequenceFiles
    are binary, not character-delimited. Are you storing a string as a value,
    and trying to interpret said string? If that's the case, then as you
    suggested, the thing to do is to load that string as a value and use
    TOKENIZE or some other string parsing function to extract your fields. Or
    write a loader that knows that the user wants to apply further processing to
    values, and does that in the loader.
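
    A minimal sketch of the "load the string, then tokenize it" approach,
    assuming the file holds Text keys and Text values. The aliases and column
    names (raw, words, k, v, word) are illustrative and do not come from the
    thread. Note that TOKENIZE yields a bag of words rather than positional
    fields, so a strict split on tabs would need another string function or a
    custom UDF.

    ```pig
    REGISTER /opt/pig-0.6.0/contrib/piggybank/java/piggybank.jar;
    DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

    -- Declare exactly two columns: the SequenceFile key and its value.
    raw = LOAD 'datafile' USING SequenceFileLoader AS (k:chararray, v:chararray);

    -- TOKENIZE breaks the value string into a bag of words;
    -- FLATTEN turns that bag into one output row per word.
    words = FOREACH raw GENERATE k, FLATTEN(TOKENIZE(v)) AS word;
    DUMP words;
    ```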

    As for your other question, Pig works with Hadoop globs, not just individual
    files (though some loaders may not support that -- PigStorage definitely
    does).
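
    For comparison, a sketch of a glob load with PigStorage, which does take a
    delimiter argument. The path is a hypothetical directory tree of
    tab-delimited text files, not the SequenceFile layout from the original
    post.

    ```pig
    -- Hypothetical path: every file two levels below /user/root/textdata.
    raw = LOAD '/user/root/textdata/*/*' USING PigStorage('\t')
          AS (version:chararray, id:int, date:chararray);
    ```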

    -Dmitriy
  • Edward Capriolo at Apr 20, 2010 at 6:14 pm

    Dmitriy,

    Thank you for your help. My columns are Text, Text and I have confirmed the
    Pig loader is working fine on a single file. I was specifying multiple
    columns because it seems that PigStorage can split on tabs with an argument.
    I was hoping the SequenceFileLoader was the same.

    In summary, SequenceFileLoader works on single files but not on globs, and
    it requires a further step to tokenize.

    Is a SequenceFileLoader that takes globs on the roadmap? :)

    Regards,
    Edward
  • Dmitriy Ryaboy at Apr 20, 2010 at 6:36 pm
    Like I said, proof of concept. Patches accepted :-).
    It shouldn't be too hard to make it work with globs in 0.7 -- iirc, in 0.7
    the functionality that interprets the load string and figures out globs is
    factored out so that it can be reused by all loaders without needing to
    extend PigStorage.

    -D
  • Edward Capriolo at Apr 20, 2010 at 6:39 pm

    Unrelated question: does Pig have an IRC channel on freenode? #pig seems to
    be invite-only.
  • Alan Gates at Apr 20, 2010 at 7:25 pm
    No. It might be useful though. AFAIK no one monitors #pig.

    Alan.

Discussion Overview
group: user @
categories: pig, hadoop
posted: Apr 20, '10 at 4:37p
active: Apr 20, '10 at 7:25p
posts: 6
users: 3
website: pig.apache.org
