Grokbase Groups Pig user April 2013
Hi everyone,

I would like to override the input schema in AvroStorage to make a pig
script robust to schema evolution. For example, suppose a new field is
added to an avro schema with a default value of null. If the input to a
pig script using this field includes both old and new data, AvroStorage
will merge the input schemas from the old and new data. However, if the
input includes only old data, the new schema will not be available to
AvroStorage and pig will fail to interpret the script with an error such
as "projected field [newField] does not exist in schema". If AvroStorage
accepted an input schema, the script would be valid for both the new and
old data. Is there any plan to implement this?

Thanks,
Steve
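
The failure mode described in this question can be modelled with a small sketch. It is a hypothetical Python model, not AvroStorage code: the field names `id` and `name`, the `override_schema` parameter, and both function names are assumptions made for illustration only.

```python
class ProjectionError(Exception):
    """Stands in for Pig's 'projected field ... does not exist' failure."""


def load_fields(file_schemas, override_schema=None):
    """Return the field names a script can see after loading.

    Without an override, the visible fields are the union of the fields
    found in the input files. With an override, the supplied schema wins.
    """
    if override_schema is not None:
        return list(override_schema)
    merged = []
    for fields in file_schemas:
        for name in fields:
            if name not in merged:
                merged.append(name)
    return merged


def project(visible_fields, name):
    """Fail when the script references a field the load schema lacks."""
    if name not in visible_fields:
        raise ProjectionError(
            f"projected field [{name}] does not exist in schema")
    return name


old = ["id", "name"]               # schema before the change (hypothetical)
new = ["id", "name", "newField"]   # schema after adding newField

project(load_fields([old, new]), "newField")                  # old + new data: fine
project(load_fields([old], override_schema=new), "newField")  # override: fine
try:
    project(load_fields([old]), "newField")                   # old data only
except ProjectionError as err:
    print(err)
```

With only old data and no override, the merged schema never contains `newField`, so the projection fails before any record is read.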

  • Enns, Steven at Apr 27, 2013 at 10:25 pm
    Resending now that I am subscribed :)
  • Cheolsoo Park at May 1, 2013 at 4:11 am
    Hi Steven,

    The new AvroStorage will let you specify the input schema:
    https://issues.apache.org/jira/browse/PIG-3015

    In fact, somebody made the same request in a comment on the JIRA, which I am
    copying and pasting below:

    Furthermore, we occasionally have issues with pig jobs picking the old
    schema when we have a schema update. Manually specifying the schema would
    fix this and give us more flexibility in defining the data we want pig to
    pull from a file.

    This JIRA is a work in progress, but hopefully it will be in the next major
    release.

    Thanks,
    Cheolsoo


  • Enns, Steven at May 1, 2013 at 5:29 pm
    Cool thanks!
  • Viraj Bhat at May 3, 2013 at 3:35 am
    Hi Cheolsoo/Pig User Group,
    I am using the Pig 0.11 piggybank AvroStorage. When merging multiple schemas in which default values have been specified in the Avro schema, AvroStorage puts nulls in the merged data set instead of the defaults.
    Is this a known bug in the current implementation of AvroStorage? Here is an example provided by one of my colleagues. The final dataset should contain the defaults ("NU", 0, "OU") for all values where the columns do not exist.
    ==> Employee3.avro <==
    {
    "type" : "record",
    "name" : "employee",
    "fields":[
             {"name" : "name", "type" : "string", "default" : "NU"},
             {"name" : "age", "type" : "int", "default" : 0 },
             {"name" : "dept", "type": "string", "default" : "DU"}
    ]
    }

    ==> Employee4.avro <==
    {
    "type" : "record",
    "name" : "employee",
    "fields":[
             {"name" : "name", "type" : "string", "default" : "NU"},
             {"name" : "age", "type" : "int", "default" : 0},
             {"name" : "dept", "type": "string", "default" : "DU"},
             {"name" : "office", "type": "string", "default" : "OU"}
    ]
    }

    ==> Employee6.avro <==
    {
    "type" : "record",
    "name" : "employee",
    "fields":[
             {"name" : "name", "type" : "string", "default" : "NU"},
             {"name" : "lastname", "type": "string", "default" : "LNU"},
             {"name" : "age", "type" : "int","default" : 0},
             {"name" : "salary", "type": "int", "default" : 0},
             {"name" : "dept", "type": "string","default" : "DU"},
             {"name" : "office", "type": "string","default" : "OU"}
    ]
    }

    The pig script:
    employee = load '$input' using org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
    describe employee;
    dump employee;

    The call:
    dump_employees.pig employee{3,4,6}.ser

    The output:
    employee: {name: chararray,age: int,dept: chararray,lastname: chararray,salary: int,office: chararray}

    (Milo,30,DH,,,)
    (Asmya,34,PQ,,,)
    (Baljit,23,RS,,,)
    (Pune,60,Astrophysics,Warriors,5466,UTA)
    (Rajsathan,20,Biochemistry,Royals,1378,Stanford)
    (Chennai,50,Microbiology,Superkings,7338,Hopkins)
    (Mumbai,20,Applied Math,Indians,4468,UAH)
    (Praj,54,RMX,,,Champaign)
    (Buba,767,HD,,,Sunnyvale)
    (Manku,375,MS,,,New York)

    Regards,
    Viraj

  • Cheolsoo Park at May 3, 2013 at 5:01 am
    Hi Viraj,

    Yes, that's a known bug. Here is what happens:

    1) Let's say there are two schemas, X and Y.
    2) AvroStorage creates a tuple whose size == max( sizeOf(X), sizeOf(Y) ).
    3) Fields are filled in as values are read. But if no values are found,
    those fields are left as null.

    If you'd like to fix it, please take a look at PigAvroRecordReader.java:
    http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java

    In particular, see how mProtoTuple is initialized and updated.

    Thanks,
    Cheolsoo
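
The three steps above can be sketched as follows. This is a hypothetical Python model of the described behaviour, not the piggybank `PigAvroRecordReader` code; the function names and the `DEFAULTS` table are assumptions for illustration. The field order comes from the `describe employee` output and the defaults from the Employee6 schema in Viraj's message.

```python
# Hypothetical model of the behaviour described in the thread; this is not
# the piggybank PigAvroRecordReader code.

# Merged field order, taken from the `describe employee` output.
MERGED_FIELDS = ["name", "age", "dept", "lastname", "salary", "office"]

# Defaults declared in the Employee6 schema.
DEFAULTS = {
    "name": "NU",
    "lastname": "LNU",
    "age": 0,
    "salary": 0,
    "dept": "DU",
    "office": "OU",
}


def fill_with_nulls(record):
    """Reported behaviour: fields absent from the record are left as None."""
    return tuple(record.get(field) for field in MERGED_FIELDS)


def fill_with_defaults(record):
    """Expected behaviour: absent fields take the schema default."""
    return tuple(record.get(field, DEFAULTS[field]) for field in MERGED_FIELDS)


# A record written with the three-field Employee3 schema.
milo = {"name": "Milo", "age": 30, "dept": "DH"}

print(fill_with_nulls(milo))     # ('Milo', 30, 'DH', None, None, None)
print(fill_with_defaults(milo))  # ('Milo', 30, 'DH', 'LNU', 0, 'OU')
```

The first result corresponds to the `(Milo,30,DH,,,)` row in the reported output; the second is what Viraj expected to see.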




