Hi folks,

I'm having a problem with a Pig job I wrote; it is throwing exceptions
in the map phase. I'm using the latest SVN of Pig, compiled against
the Hadoop15 jar included in SVN. My cluster is running Hadoop 0.15.1
on Java 1.6.0_03. Here's the pig job (which I ran through grunt):

A = LOAD 'netflix/netflix.csv' USING PigStorage(',') AS
(movie,user,rating,date);
B = GROUP A BY movie;
C = FOREACH B GENERATE group, COUNT(A.user) as ratingcount,
AVG(A.rating) as averagerating;
D = ORDER C BY averagerating;
STORE D INTO 'output/output.tsv';

A large number of jobs fail (but not all, some succeed) with the
following exception:

error: Error message from task (map) tip_200712051644_0002_m_000003
java.lang.RuntimeException: Unexpected data while reading tuple from
binary file
at org.apache.pig.impl.io.DataBagFileReader$myIterator.next(DataBagFileReader.java:81)
at org.apache.pig.impl.io.DataBagFileReader$myIterator.next(DataBagFileReader.java:41)
at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:89)
at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:273)
at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:216)
at org.apache.pig.impl.eval.FuncEvalSpec$1.add(FuncEvalSpec.java:105)
at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.<init>(GenerateSpec.java:165)
at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:77)
at org.apache.pig.impl.mapreduceExec.PigCombine.reduce(PigCombine.java:101)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:439)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:364)
at org.apache.pig.impl.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:309)
at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:242)
at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:273)
at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:56)
at org.apache.pig.impl.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

As a comparison, the following job runs successfully:

A = LOAD 'netflix/netflix.csv' USING PigStorage(',') AS
(movie,user,rating,date);
B = FILTER A BY movie == '8';
C = GROUP B BY movie;
D = FOREACH C GENERATE group, COUNT(B.user) as ratingcount,
AVG(B.rating) as averagerating;
DUMP D;

Any help in tracking this down would be greatly appreciated. So far,
Pig is looking really slick and I'd love to write more advanced
programs with it.

Thanks,
Andrew Hitchcock


  • Utkarsh Srivastava at Dec 6, 2007 at 1:33 am
    Alan, this is a problem with the combiner part (the problem of
    putting an indexed tuple directly into the bag, the first point in my
    comment about the combiner patch that was committed). Some of the
    mappers that spill their bags to disk have a problem reading them
    back, because what was written out was an indexed tuple, while what
    the reader expects is a regular Tuple.


    Utkarsh
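
To make the failure mode described above concrete, here is a minimal Python sketch of a writer/reader mismatch. It is not Pig's code: the one-byte type tags, the record layout, and every function name are invented for illustration, since the thread does not show Pig's actual serialization format. The only point is that a reader expecting a plain tuple rejects a record that was written as an indexed tuple.

```python
import io
import struct

# Hypothetical type tags; Pig's real on-disk format is not shown in the thread.
TUPLE_MARKER = 0x01
INDEXED_TUPLE_MARKER = 0x02


def write_tuple(out, fields):
    """Write a plain tuple: tag, field count, then each field as a 32-bit int."""
    out.write(struct.pack(">BI", TUPLE_MARKER, len(fields)))
    for field in fields:
        out.write(struct.pack(">i", field))


def write_indexed_tuple(out, index, fields):
    """Write an indexed tuple: a different tag plus an extra index field."""
    out.write(struct.pack(">BIi", INDEXED_TUPLE_MARKER, len(fields), index))
    for field in fields:
        out.write(struct.pack(">i", field))


def read_tuple(inp):
    """Read back a plain tuple; anything else is treated as corrupt data."""
    marker, count = struct.unpack(">BI", inp.read(5))
    if marker != TUPLE_MARKER:
        raise RuntimeError("Unexpected data while reading tuple from binary file")
    return [struct.unpack(">i", inp.read(4))[0] for _ in range(count)]


def spill_and_read_back(write_fn):
    """Simulate a spill: write to a buffer, rewind, read with the plain reader."""
    buf = io.BytesIO()
    write_fn(buf)
    buf.seek(0)
    return read_tuple(buf)


plain_result = spill_and_read_back(lambda out: write_tuple(out, [8, 5]))

try:
    spill_and_read_back(lambda out: write_indexed_tuple(out, 0, [8, 5]))
    indexed_failed = False
except RuntimeError:
    indexed_failed = True
```

In this toy model the mismatch only surfaces on the read-back path, which is consistent with the report that only some map tasks fail: a bag that never spills is never re-read from disk.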





  • Alan Gates at Dec 6, 2007 at 5:06 pm
    Utkarsh,

    I can submit a patch for this today. Do you know of a simple test case
    that reproduces the error?

    Alan.



  • Ted Dunning at Dec 6, 2007 at 5:20 pm
    Does anybody in the pig developer community have a reaction to Jaql yet?

    My impression is that they have done some very interesting work. Things I
    like:

    A) specific and direct access to map/reduce in a functional programming
    syntax.

    B) data has a concrete syntactic form that can be displayed and understood
    along with other concrete forms that guarantee to keep the same semantics in
    terms of tagged data elements. This universal tagging in the data makes a
    lot of run-time schema things pretty trivial. It also allows test data to
    be written into a script or example program and allows that test data to be
    processed to a concrete result without involving the cluster.

    C) they keep some of the best parts of pig like group and co-group.


    Things I don't like:

    1) Doesn't do map-reduce for all operations yet (presumably coming).

    2) Doesn't have a provision for displaying the map-reduce version of the
    program.

    3) Not open source.


    Does anybody else have any thoughts on this?
  • Utkarsh Srivastava at Dec 7, 2007 at 8:19 pm
    Jaql is very much in the same spirit as Pig, and in fact the language
    is quite similar. (They've chosen to sprinkle in some SQL-style
    declarative clauses, such as WHERE clauses attached to many of the
    operators, whereas in Pig we've explicitly avoided having operators
    do multiple different kinds of things.) You would do a WHERE clause
    in Pig by writing an explicit FILTER statement.

    Jaql is tied to JSON data, whereas Pig is data-format-agnostic. Pig
    can operate over JSON data as a special case. To demonstrate this, I
    put together a JSON StorageFunction for Pig, and examples of how it
    can be used (both attached). With this function, Pig can operate
    over JSON data in much the same way that Jaql does. (It requires the
    latest version of Pig; so if you want to try it please refresh from
    SVN first.)

    Some other observations:

    > A) specific and direct access to map/reduce in a functional programming
    > syntax.

    If a language has primitives for per-record processing, grouping, and
    group-wise aggregation, which both Pig and Jaql do, then direct
    access to map-reduce is just syntactic sugar on top of these primitives.

    In Pig, Map-Reduce is written as:

    A = foreach input generate flatten(Map(*));
    B = group A by $0;
    C = foreach B generate Reduce(*);

    Where "Map" and "Reduce" are user-supplied Pig functions.

    If people really want map-reduce as a programming abstraction, where
    the "group" operation is implicit, it would be easy to add this as a
    macro in Pig.
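
As a rough illustration of the three-statement recipe above, here is a small Python sketch that runs the same map, group, reduce sequence in memory. The function names and the sample records are made up for the example; it only mirrors the shape of the Pig statements and says nothing about how Pig executes them.

```python
from collections import defaultdict


def run_map_group_reduce(records, map_fn, reduce_fn):
    # A = foreach input generate flatten(Map(*));
    mapped = [pair for record in records for pair in map_fn(record)]

    # B = group A by $0;
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # C = foreach B generate Reduce(*);
    return [reduce_fn(key, values) for key, values in groups.items()]


def rating_map(record):
    """Hypothetical Map: emit (movie, rating) for each input record."""
    movie, _user, rating = record
    return [(movie, rating)]


def rating_reduce(movie, ratings):
    """Hypothetical Reduce: count and average the ratings of one movie."""
    return (movie, len(ratings), sum(ratings) / len(ratings))


# Made-up sample data in (movie, user, rating) form.
sample = [("8", "u1", 4), ("8", "u2", 2), ("9", "u1", 5)]
result = sorted(run_map_group_reduce(sample, rating_map, rating_reduce))
```

Here the group step is spelled out explicitly, which is the part a map-reduce macro would hide.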

    > B) data has a concrete syntactic form that can be displayed and understood
    > along with other concrete forms that guarantee to keep the same semantics in
    > terms of tagged data elements. This universal tagging in the data makes a
    > lot of run-time schema things pretty trivial. It also allows test data to
    > be written into a script or example program and allows that test data to be
    > processed to a concrete result without involving the cluster.

    Pig's "maps" give very similar functionality:
    (1) the schema can vary from record to record (i.e., each record can
    have a different set of fields)
    (2) operations can reference the schema of a record at run-time, just
    like in Jaql.

    In fact, "map" structures are the bread-and-butter of JSON.
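
The two properties listed above can be pictured with JSON objects parsed into Python dictionaries. This is only an analogy for how a map-valued record behaves, with made-up sample records; it does not use Pig or the attached JSON storage function.

```python
import json

# Made-up records: the second one carries a field the first one lacks.
lines = [
    '{"movie": "8", "rating": 4}',
    '{"movie": "9", "rating": 5, "date": "2005-09-06"}',
]
records = [json.loads(line) for line in lines]

# (1) The set of fields can differ from record to record.
field_sets = [sorted(record.keys()) for record in records]

# (2) An operation can inspect a record's fields at run time.
movies_with_date = [record["movie"] for record in records if "date" in record]
```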


    Utkarsh
  • Ted Dunning at Dec 8, 2007 at 12:57 am
    Utkarsh,

    Thanks for your comments. I think I must have been a little unclear on some
    of my statements. See below for more.

    On 12/7/07 12:18 PM, "Utkarsh Srivastava" wrote:

    > Jaql is tied to JSON data, whereas Pig is data-format-agnostic.

    I get the impression that Jaql is tied less to JSON than it appears at
    first. In particular, it looked to me like the on-disk format of data files
    could be more flexible. Certainly adding an abstraction layer for any
    record reader would be trivial. Similarly, there is nothing that says or
    requires that they actually pass around JSON encoded strings internally and
    there are several statements that imply that they actually pass around data
    structures whose only relationship to JSON is of data to a printable form.

    >> A) specific and direct access to map/reduce in a functional programming
    >> syntax.
    >
    > If a language has primitives for per-record processing, grouping, and
    > group-wise aggregation, which both Pig and Jaql do, then direct
    > access to map-reduce is just syntactic sugar on top of these primitives.

    Hmmm.... The key-word here is functional. Jaql is a higher-order functional
    language with lambda. And map-reduce is a function that operates on
    functions and data together. The only thing I might like better is a
    curried version of map-reduce as a function of two functions that returns a
    function that processes data (fast).

    Pig doesn't do anything like this and the difference appears to me to be
    much more than syntactic sugar. Having the functional representation gives
    you the guts of programmatic transformations essentially for free. This is
    important.

    I can't tell if Jaql thinks of data processing expressions as functional
    compositions, but if it does, very cool things can become doable.

    You are nearly right that in terms of expressive power, Jaql's explicit
    map-reduce is only sugar, but this is only true if you limit yourself to
    record processing primitives. If it is a full-scale first-class
    higher-order function, then it is a different beast altogether.
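
The curried form described above, a function of two functions that returns a function over data, can be sketched in Python as follows. This is an illustration of the idea only, not Jaql or Pig syntax; `make_map_reduce`, `compose`, and the sample data are names and values invented for the example.

```python
from collections import defaultdict


def make_map_reduce(map_fn, reduce_fn):
    """Take a mapper and a reducer; return a function that processes data."""
    def run(records):
        groups = defaultdict(list)
        for record in records:
            for key, value in map_fn(record):
                groups[key].append(value)
        # Emit (key, reduced value) pairs so stages can be chained.
        return [(key, reduce_fn(key, values)) for key, values in groups.items()]
    return run


def compose(*stages):
    """Chain data-processing functions left to right."""
    def pipeline(data):
        for stage in stages:
            data = stage(data)
        return data
    return pipeline


# Stage 1: number of ratings per movie.
count_by_movie = make_map_reduce(
    lambda record: [(record[0], 1)],
    lambda _movie, ones: sum(ones),
)

# Stage 2: how many movies have each rating count.
count_histogram = make_map_reduce(
    lambda movie_and_count: [(movie_and_count[1], 1)],
    lambda _count, ones: sum(ones),
)

# Because each stage is an ordinary function, stages are values a program
# can pass around and combine.
histogram = compose(count_by_movie, count_histogram)

# Made-up sample data in (movie, user) form.
sample = [("8", "u1"), ("8", "u2"), ("9", "u1")]
per_movie = dict(count_by_movie(sample))
hist = sorted(histogram(sample))
```

The `compose` helper is the kind of program-level manipulation that becomes available once a map-reduce step is a first-class function rather than a fixed sequence of statements.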

    > In Pig, Map-Reduce is written as:
    >
    > A = foreach input generate flatten(Map(*));
    > B = group A by $0;
    > C = foreach B generate Reduce(*);

    And here is an important difference. The expression [foreach input generate
    flatten(Map(*))] CANNOT be expressed in Pig in functional form. There isn't
    something equivalent to [lambda(Map) return lambda(input) {foreach input
    generate flatten(Map(*))}]. If that were available, then I would be able to
    write programs that manipulate program expressions in very interesting ways.

    Just as importantly, what you have provided is a recipe for computing, but
    not a function. Providing mapreduce as a function is important for
    supporting programmatic transformations.

    > If people really want map-reduce as a programming abstraction, where
    > the "group" operation is implicit, it would be easy to add this as a
    > macro in Pig.

    Indeed, but macros do not make a functional language.

    Pig's lazy evaluation semantics remind me quite a bit of functional
    programming. Why stop halfway?
  • Utkarsh Srivastava at Dec 6, 2007 at 6:44 pm
    There doesn't seem to be a simple test case to reproduce this,
    because the problem happens only when we spill to disk.

    Utkarsh
  • Benjamin Reed at Dec 6, 2007 at 8:50 pm
    The simple test case would be to add more records than fit in max memory to
    a BigDataBag after calling distinct. Right?

    ben
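
The shape of such a test, forcing a spill by exceeding an in-memory limit and then reading everything back, might look like the Python sketch below. `SpillableBag` is a toy stand-in written for this example; it is not Pig's BigDataBag, it does not model the distinct call, and its limit counts records rather than bytes.

```python
import os
import pickle
import tempfile


class SpillableBag:
    """Toy bag: keeps up to max_in_memory items in memory, spills the rest."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_file = None
        self.spilled = 0

    def add(self, item):
        if len(self.memory) < self.max_in_memory:
            self.memory.append(item)
            return
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        # Always append at the end, even if an earlier read moved the cursor.
        self.spill_file.seek(0, os.SEEK_END)
        pickle.dump(item, self.spill_file)
        self.spilled += 1

    def __iter__(self):
        yield from self.memory
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled):
                yield pickle.load(self.spill_file)


# Add more records than the in-memory limit so the spill path is exercised.
items = [(str(movie), movie % 5) for movie in range(10)]
bag = SpillableBag(max_in_memory=3)
for item in items:
    bag.add(item)

read_back = list(bag)
```

A regression test along these lines would assert that at least one record was spilled and that the records read back equal the records added.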
  • Alan Gates at Dec 6, 2007 at 11:26 pm
    Andrew,

    I've uploaded a patch that I think will fix your issue. You can find it
    here:
    https://issues.apache.org/jira/secure/attachment/12371190/pig7.patch
    If you get a chance, could you test and see if this resolves your issue?

    Alan.

  • Andrew Hitchcock at Dec 7, 2007 at 12:41 am
    The job gets past the point where it failed before, but it still died.
    The error was an IOException, so I think it is a problem with my
    cluster. I'm running the job again and I'll report back.

    Thanks very much for the fast response. We are very grateful.
    Andrew
    On Dec 6, 2007 3:23 PM, Alan Gates wrote:
    Andrew,

    I've uploaded a patch that I think will fix your issue. You can find it
    here:
    https://issues.apache.org/jira/secure/attachment/12371190/pig7.patch If
    you get a chance, could you test and see if this resolves your issue?

    Alan.


    Utkarsh Srivastava wrote:
    Alan, this is a problem with the combiner part (the problem of putting
    an indexed tuple directly into the bag, which was the first point in my
    comment about the combiner patch that was committed). Some of the
    mappers that spill their bags to disk have a problem reading them back,
    because what was written out was an indexed tuple, while what is
    expected to be read is a regular Tuple.


    Utkarsh
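
The write/read mismatch Utkarsh describes can be illustrated with a small sketch. This is not Pig's code or its on-disk format: it is written in Python rather than Pig's Java, and the record layout, the marker byte, and the names `write_tuple`, `write_indexed_tuple`, `read_tuple`, and `read_all` are all assumptions made up for illustration. The point is only that a writer which appends an extra index to each record will desynchronize a reader that expects plain tuples.

```python
import io
import struct

# Hypothetical type byte for a plain tuple (NOT Pig's real encoding).
TUPLE_MARKER = 0x01


def write_tuple(out, fields):
    """Write a plain tuple: marker byte, field count, length-prefixed fields."""
    out.write(struct.pack(">BI", TUPLE_MARKER, len(fields)))
    for field in fields:
        data = field.encode("utf-8")
        out.write(struct.pack(">I", len(data)))
        out.write(data)


def write_indexed_tuple(out, index, fields):
    """Write an 'indexed' tuple: a plain tuple followed by an extra index."""
    write_tuple(out, fields)
    out.write(struct.pack(">I", index))


def read_tuple(inp):
    """Read one plain tuple; return None at a clean end of stream."""
    header = inp.read(5)
    if not header:
        return None
    if len(header) < 5:
        raise RuntimeError("Unexpected data while reading tuple from binary file")
    marker, count = struct.unpack(">BI", header)
    if marker != TUPLE_MARKER:
        # The reader has landed on bytes it does not recognize as a tuple.
        raise RuntimeError("Unexpected data while reading tuple from binary file")
    fields = []
    for _ in range(count):
        (length,) = struct.unpack(">I", inp.read(4))
        fields.append(inp.read(length).decode("utf-8"))
    return tuple(fields)


def read_all(inp):
    """Read plain tuples until the stream is exhausted."""
    result = []
    while True:
        record = read_tuple(inp)
        if record is None:
            return result
        result.append(record)


if __name__ == "__main__":
    spill = io.BytesIO()
    # The "combiner" spills indexed tuples (placeholder field values)...
    write_indexed_tuple(spill, 0, ["8", "u1"])
    write_indexed_tuple(spill, 0, ["8", "u2"])
    spill.seek(0)
    # ...but the reader expects plain tuples, so the second read fails.
    try:
        read_all(spill)
    except RuntimeError as err:
        print("read failed:", err)
```

In this sketch the first record reads back fine, and the failure only shows up when the reader hits the unexpected index bytes before the second record. That is consistent with the report that only mappers which spill their bags to disk hit the exception, since a bag that is never written out is never read back through this path.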





  • Alan Gates at Dec 7, 2007 at 12:56 am
    I finally managed to reproduce your error on my end, and then tested it
    against my fix, which did resolve the issue. I'll be checking in the
    fix shortly.

    Alan.

  • Andrew Hitchcock at Dec 7, 2007 at 1:03 am
    I ran the job again and it succeeded without problems. Thanks again!

Discussion Overview
group: user @
categories: pig, hadoop
posted: Dec 5, '07 at 11:50p
active: Dec 8, '07 at 12:57a
posts: 12
users: 5
website: pig.apache.org
