Hello,

I'm having an issue with a script that uses an EvalFunc I wrote. The problem
is that the final output contains characters I am not expecting: trailing
commas, followed by what I'm guessing are null fields that are not displayed.

Snippet:
C = FOREACH B GENERATE FLATTEN(B) as (f1:int,f2:int);
grunt> DUMP C;
(2,3)
(2,4)
(2,5)
(3,4)
(3,5)
(4,5)
(2,3)
(2,4)
(2,5)
(3,4)
(3,5)
(4,5)

D = GROUP C by (f1,f2);
grunt> describe D;
D: {group: (f1: int,f2: int),C: {f1: int,f2: int}}

grunt> DUMP D;
((2,3,),{(2,3,),(2,3,)})
((2,4,),{(2,4,),(2,4,)})
((2,5,),{(2,5,),(2,5,)})
((3,4,),{(3,4,),(3,4,)})
((3,5,),{(3,5,),(3,5,)})
((4,5,),{(4,5,),(4,5,)})

My question is, what are these extra comma/null fields in each tuple? I
expected the first row to read as:
((2,3),{(2,3),(2,3)})

It seems related: when I run 'ILLUSTRATE C', I get an exception:
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
    at org.apache.pig.pen.util.ExampleTuple.get(ExampleTuple.java:80)
    at org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190)
    at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:86)
    at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:69)
    at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:143)
    at org.apache.pig.PigServer.getExamples(PigServer.java:785)
    at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:555)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:246)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
    at org.apache.pig.Main.main(Main.java:357)

Excruciating detail below:

My script:
REGISTER udf.jar
A = LOAD '/pig_input/co.txt' as (line:chararray);
B = FOREACH A GENERATE com.thumbplay.pig.NormalizeListUDF(line) as B;
C = FOREACH B GENERATE FLATTEN(B) as (f1:int,f2:int);
D = GROUP C by (f1,f2);
E = FOREACH D GENERATE group, COUNT(C);
STORE E INTO 'output' USING PigStorage(',');

Here's what I'm trying to do:
For input:
A,1,2,3
B,1,2,3

Produce combinations for each row (My UDF does this):
(1,2),(1,3),(2,3)
(1,2),(1,3),(2,3)

Flatten them:
(1,2),
(1,3),
(2,3),
(1,2),
(1,3),
(2,3)

Group and count them:
(1,2),2
(1,3),2
(2,3),2

  • Daniel Dai at Dec 9, 2010 at 12:04 am
    That is not expected. I suspect something is wrong inside
    NormalizeListUDF. Make sure the bag of tuples you build inside your UDF
    really has the schema (int, int). If you can post your UDF, I can tell
    you more.

    Daniel

  • Michael Moss at Dec 9, 2010 at 2:50 pm
    Thanks, Daniel.

    My UDF, with its schema (which I suspect is the culprit), is below. I've
    tried excluding the "outputSchema()" method entirely, as well as several
    variations:

    (Full source here: http://pastie.org/1362084)

    public class NormalizeListUDF extends EvalFunc<DataBag>
    {
        public DataBag exec(Tuple input) throws IOException
        {
            if (input == null || input.size() == 0)
                return null;
            try
            {
                DataBag output = DefaultBagFactory.getInstance().newDefaultBag();

                List<Object> tuples = input.getAll();
                String line = (String) tuples.remove(0);
                line = line.trim();
                String[] items = line.split(",");

                for (int i = 1; i < items.length - 1; i++)
                {
                    for (int j = i + 1; j < items.length; j++)
                    {
                        int num1 = Integer.parseInt(items[i]);
                        int num2 = Integer.parseInt(items[j]);

                        Tuple t = TupleFactory.getInstance().newTuple(1);

                        if (num1 < num2)
                        {
                            t.set(0, num1 + "," + num2);
                        }
                        else if (num2 < num1)
                        {
                            t.set(0, num2 + "," + num1);
                        }
                        output.add(t);
                    }
                }
                return output;
            }
            catch (Exception e)
            {
                throw WrappedIOException.wrap("Caught exception processing input row ", e);
            }
        }

        public Schema outputSchema(Schema input)
        {
            try
            {
                List<Schema.FieldSchema> fields = new ArrayList<Schema.FieldSchema>();
                Schema.FieldSchema f1 = new Schema.FieldSchema("f1", DataType.INTEGER);
                Schema.FieldSchema f2 = new Schema.FieldSchema("f2", DataType.INTEGER);
                fields.add(f1);
                fields.add(f2);
                Schema tupleInner = new Schema(fields);
                Schema.FieldSchema tupleSchema = new Schema.FieldSchema("t1", tupleInner,
                        DataType.TUPLE);

                Schema bagInner = new Schema(tupleSchema);
                Schema.FieldSchema bagSchema = new Schema.FieldSchema("bag", bagInner,
                        DataType.BAG);
                return new Schema(bagSchema);
            }
            catch (Exception e)
            {
                return null;
            }
        }
    }
  • Daniel Dai at Dec 10, 2010 at 12:07 am
    In your UDF:
        if (num1 < num2)
        {
            t.set(0, num1 + "," + num2);
        }
        else if (num2 < num1)
        {
            t.set(0, num2 + "," + num1);
        }

    You actually put only one item into the tuple, so your UDF generates a
    bag of one-field tuples, not two-field tuples.
    I think what you meant is:
        if (num1 < num2)
        {
            t.set(0, num1);
            t.set(1, num2);
        }
        else if (num2 < num1)
        {
            t.set(0, num2);
            t.set(1, num1);
        }

    Daniel
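
A note on why the one-field tuple is easy to miss: a tuple holding the single string "2,3" and a tuple holding the two integers 2 and 3 look the same once printed with comma separators, which matches what DUMP C shows. The sketch below models this in plain Java. It is not Pig's Tuple class; TupleRendering and render() are hypothetical, and the "padded" case is only a guess at what happens downstream, consistent with the (2,3,) seen in DUMP D and with the "Index: 1, Size: 1" exception, not something the thread confirms. Separately, for the two-field version of the UDF, the tuple would presumably also need to be created with two fields (newTuple(2) rather than newTuple(1)), since setting index 1 on a one-field tuple is out of range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a tuple as a list of fields. This is NOT Pig's Tuple
// class; TupleRendering and render() are hypothetical illustrations.
class TupleRendering {

    // Render fields the way the DUMP output in this thread looks:
    // comma-separated inside parentheses, with null shown as empty text.
    static String render(List<Object> fields) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            Object f = fields.get(i);
            sb.append(f == null ? "" : f.toString());
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // What the original UDF builds: ONE field containing the text "2,3".
        List<Object> oneField = new ArrayList<>();
        oneField.add(2 + "," + 3);

        // What the declared schema (f1:int, f2:int) promises: TWO fields.
        List<Object> twoFields = Arrays.<Object>asList(2, 3);

        // Assumption: the one-field tuple read through a two-field schema,
        // with the missing second field treated as null.
        List<Object> padded = Arrays.<Object>asList("2,3", null);

        System.out.println(render(oneField));
        System.out.println(render(twoFields));
        System.out.println(render(padded));
    }
}
```

The first two calls print the same text even though the field counts differ, so the printed output alone cannot tell the two shapes apart; checking the number of fields in each tuple is the more reliable test.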


Discussion Overview
group: user @
categories: pig, hadoop
posted: Dec 8, '10 at 9:50p
active: Dec 10, '10 at 12:07a
posts: 4
users: 2
website: pig.apache.org
