Grokbase Groups Pig user March 2011
Hello,

I read that it is good practice to declare the schema in Pig Script as well as in the UDF (by implementing outputSchema), because of performance reasons.

Now in my case I have an EvalFunc that takes a chararray as input and produces a tuple with a dynamic number of chararrays (it creates its result via .newTuple(List list)).
How can I specify a schema for an unknown number of elements?

Best,
Will

  • Jonathan Coveney at Mar 9, 2011 at 5:12 pm
    Is the size of the tuple fixed for any given instance of the UDF, or does
    it change on a row-by-row basis? If it's the former, you can have a
    constructor that takes the number of fields, and outputSchema can use that.

    Barring that, declaring the schema is "good practice", but it's not
    necessary. Your script will work without it, but DESCRIBE will get thrown off.
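
A minimal sketch of the constructor approach, written in plain Java with no Pig dependency so it can run on its own. The class name `FixedWidthSplitter`, the field aliases `f0`, `f1`, ... and the comma-splitting behaviour are assumptions made for illustration; in a real UDF the same idea would live in an `EvalFunc` subclass, with the count arriving as a constructor argument and `outputSchema` building one chararray field per position.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a UDF whose tuple width is fixed per instance. */
class FixedWidthSplitter {
    private final int fieldCount;

    // The width is chosen once, when the instance is constructed.
    FixedWidthSplitter(int fieldCount) {
        if (fieldCount < 1) {
            throw new IllegalArgumentException("fieldCount must be at least 1");
        }
        this.fieldCount = fieldCount;
    }

    /** Describes the output as a tuple schema string, one chararray per field. */
    String describeSchema() {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < fieldCount; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append("f").append(i).append(":chararray");
        }
        return sb.append(")").toString();
    }

    /** Splits on commas, then pads with nulls or truncates to exactly fieldCount fields. */
    List<String> exec(String input) {
        String[] parts = input.split(",", -1);
        List<String> out = new ArrayList<>(fieldCount);
        for (int i = 0; i < fieldCount; i++) {
            out.add(i < parts.length ? parts[i] : null);
        }
        return out;
    }
}
```

Because every row produced by one instance has the same width, the schema can be stated up front; this does not help when the width varies from row to row.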

  • Lai Will at Mar 9, 2011 at 5:16 pm
    It's the latter..

    You can imagine my EvalFunc as
    ArrayList<String> booksRead(Person p) {}

    So for a list of people I get a List of ArrayList<String> of different lengths..

  • Mridul Muralidharan at Mar 10, 2011 at 1:09 am
    In which case, can't you model that as a bag?
    I imagine something like a tuple with fields person:chararray,
    books_read:bag{ (name:chararray, isbn:chararray) }, etc.?

    Of course, it will work as a bag only if the tuple contained within it has
    a fixed schema :-) (unless you repeat this process N times as required!)

    Regards,
    Mridul
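
To make the bag suggestion concrete, here is a rough plain-Java sketch with no Pig dependency. A bag is modelled as a list of tuples and a tuple as a list of strings; the class name `BooksReadBagSketch` is hypothetical, and the tuple is simplified to a single `name` field rather than the name/isbn pair mentioned above. The point is that the declared schema stays the same however many books a person has read.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of returning a bag instead of a variable-width tuple. */
class BooksReadBagSketch {
    // One fixed schema, independent of the number of titles in the bag.
    static final String SCHEMA = "books_read:bag{(name:chararray)}";

    /** Wraps each title in its own one-field tuple; the list of tuples models the bag. */
    static List<List<String>> toBag(List<String> titles) {
        List<List<String>> bag = new ArrayList<>();
        for (String title : titles) {
            bag.add(Collections.singletonList(title));
        }
        return bag;
    }
}
```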
  • Lai Will at Mar 11, 2011 at 6:11 pm
    I could, but then I would not be able to use a FilterFunc on the bag

    (e.g. get all the people who have read "xyz").

    I would either have to flatten the bag and then filter, or wrap the bag in another tuple.
    Both seem to be unnecessary overhead.

    Is my thinking correct?

    Best,
    Will
  • Dmitriy Ryaboy at Mar 11, 2011 at 7:45 pm
    All arguments to funcs are automatically wrapped in a tuple anyway.

    So let's say we want to write a BagContains filter:

    foo = group stuff by key;
    foreach foo generate ( BagContains(stuff, 'magic') ? 1 : 0);

    Then you'd write BagContains to take a Tuple of 2 args: the first field is
    a bag, the second is the value to look for.

    Similarly you can filter by IsEmpty(myBag), etc.

    D
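
The core check inside such a BagContains function could look roughly like the following. This is a plain-Java sketch with no Pig dependency: the bag is modelled as a list of tuples and each tuple as a list of strings, and the class name `BagContainsSketch` is hypothetical. A real implementation would unpack the bag and the search value from the two fields of the input tuple before running the same loop.

```java
import java.util.List;

/** Hypothetical model of the containment test a BagContains UDF would perform. */
class BagContainsSketch {
    /** Returns true if any tuple in the bag has a field equal to needle. */
    static boolean bagContains(List<List<String>> bag, String needle) {
        if (bag == null || needle == null) {
            return false;
        }
        for (List<String> tuple : bag) {
            if (tuple != null && tuple.contains(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

An empty or missing bag simply yields false, so no flattening or extra wrapping is needed before filtering.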
  • Deepak kumar v at Mar 17, 2011 at 9:12 am
    I have a UDF whose output is a tuple of the following format:

    ( [ 'Key'#{ ( chararray,[ 'key', chararray]) }] )


    I am able to specify the output schema for the outer tuple and the inner map.
    I also need to specify a schema for the key, for the ValueBag within the map,
    for the tuples within the ValueBag, and for the items within each tuple.

    Regards,
    Deepak
  • Alan Gates at Mar 17, 2011 at 4:20 pm
    Currently there is no way to specify the schema for values in the map
    up front. You have to cast them when you bring them out of the map.
    We hope to resolve that in 0.9.

    Alan.
  • Daniel Dai at Mar 18, 2011 at 1:24 am
    In 0.9, you can use the syntax:
    m:[{(c:chararray, m1:[chararray])}]

    Daniel
