So, I made a dumb little Python script that parses a Pig script, sees what
stores there are, uses Pig's DESCRIBE to get the schema of each relation
being stored, and then uses that info to make a new file that has the proper
loader/schema. I felt this was useful because I found myself making
intermediate stores and then finding it pretty difficult to write the proper
loader if there were a lot of columns (especially remembering the types).

However, it seems that the result from DESCRIBE is not adequate to do a
load. For example, I have test.txt which is literally just random pairs of
numbers:

1 2
1 3
1 4
2 5
2 6
3 7
3 8
4 9
5 10
6 11
7 12
8 13
8 14
8 15

and so on.

I do this:

t1 = LOAD 'test.txt' AS (n1:int, n2:int);
t2 = GROUP t1 BY n1;
t3 = GROUP t2 BY group;

DESCRIBE t3;
STORE t3 INTO 'output.txt';

The query runs without a hitch, but there is an issue.

This is what DESCRIBE gives:

t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}

However, this won't let you load the file.

The output has the form
x{(y,{(a,b)}

And I'm not really sure how to go about even creating a loader that would
properly load it. It seems pretty complicated to store and then load
anything that isn't a flat file. Is this by design? Is there an easier way
to go from the schema that DESCRIBE prints to the schema you'd use to load
the data?

I'm curious what people do in practice. I could probably extend my script to
convert the DESCRIBE schema into a loading schema (if the Pig loader can
load things that have brackets and all that?), but I want to know what the
limitations are.

As always, I apologize if there is an easy answer to this. Thanks.


  • Dmitriy Ryaboy at Dec 28, 2010 at 11:08 pm
    Try using BinStorage instead of the text-based PigStorage

    D
  • Jonathan Coveney at Dec 28, 2010 at 11:24 pm
    Thanks. Is there any particular downside to this if you get to the millions and hundreds of millions of rows, or is it just the lack of simple use with non-Pig systems?

    Sent via BlackBerry

    -----Original Message-----
    From: Dmitriy Ryaboy <dvryaboy@gmail.com>
    Date: Tue, 28 Dec 2010 15:08:15
    To: <user@pig.apache.org>
    Reply-To: user@pig.apache.org
    Subject: Re: Possible deficiency in describe?

    Try using BinStorage instead of the text-based PigStorage

    D
  • Dmitriy Ryaboy at Dec 28, 2010 at 11:42 pm
    BinStorage is more efficient and doesn't have the trouble with nested data
    representations you encountered in PigStorage. The downside is only that
    it's not human-readable, and that it might change between versions of Pig
    (though so far we have resisted the urge, iirc)

    D
  • Thejas M Nair at Dec 29, 2010 at 12:49 am
    The BinStorage format should not change between Pig versions. It is like an interface: it should not change unless there is a very strong reason.
    It used to be the format used to (de)serialize data between Pig stages, but when changes were made to optimize the format as part of JIRA PIG-1472, a new format/loader was used instead of changing BinStorage.

    -Thejas
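
A minimal sketch of the BinStorage suggestion from the replies above, reusing the relations from the original question. The output path 'output_bin' and the alias t3_again are illustrative names that do not appear in the thread, and the note about positional access on reload is an assumption worth confirming with DESCRIBE on your own Pig version:

```pig
-- Build the nested relation exactly as in the original question.
t1 = LOAD 'test.txt' AS (n1:int, n2:int);
t2 = GROUP t1 BY n1;
t3 = GROUP t2 BY group;

-- Store in Pig's binary format rather than the default text-based PigStorage.
-- 'output_bin' is a hypothetical output path.
STORE t3 INTO 'output_bin' USING BinStorage();

-- Later (for example in a follow-up script), read it back with the same
-- storage function. No AS clause is given here, so fields are assumed to be
-- addressable only by position ($0, $1, ...).
t3_again = LOAD 'output_bin' USING BinStorage();
DESCRIBE t3_again;
DUMP t3_again;
```

Because the nesting is carried by the binary format, nothing has to parse brackets out of text on reload. The trade-off, as noted in the thread, is that the stored files are not human-readable.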

Discussion Overview
group: user @
categories: pig, hadoop
posted: Dec 28, '10 at 10:09p
active: Dec 29, '10 at 12:49a
posts: 5
users: 3
website: pig.apache.org
