Grokbase Groups Pig dev November 2010
Hi,

Is there any way in Pig for a LoadFunc to retrieve the schema definition
entered by the user in the AS clause?

e.g. A = LOAD '$INPUT' USING MyLoader() AS (a:int, b:int);



My question comes from the following problem:

I'm writing a Loader that adds partition fields to the schema, e.g.
daydate, day, year, month, etc.

These partitions are used to filter out entire folders in the storage
location.

I want to use the FILTER statement to filter by these keys.
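
A minimal sketch of the folder pruning described above, written as plain Java with no Pig dependency. It assumes the partition folders are named key=value, as in the path /log/type1/daydate=2010-11-01 mentioned later in this thread. The class PartitionFolderPruner and its prune method are hypothetical names for illustration, not part of Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: keeps only the folders that belong
// to the requested partition so the others are never read.
class PartitionFolderPruner {

    // Assumes each partition appears as a "key=value" path segment.
    static List<String> prune(List<String> folders, String key, String wanted) {
        String segment = key + "=" + wanted;
        List<String> kept = new ArrayList<>();
        for (String folder : folders) {
            // Compare whole path segments so "daydate=2010-11-01" does not
            // accidentally match a longer segment that merely contains it.
            if (Arrays.asList(folder.split("/")).contains(segment)) {
                kept.add(folder);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> folders = Arrays.asList(
            "/log/type1/daydate=2010-10-31",
            "/log/type1/daydate=2010-11-01");
        System.out.println(prune(folders, "daydate", "2010-11-01"));
    }
}
```

In a real loader this decision would be driven by the filter Pig hands over, which is the part that is not happening in the AS-clause case.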



So if I create a Loader that returns its own schema, the following works and
Pig calls LoadMetadata.setPartitionFilter correctly.

e.g.

-- the loader parses this schema string and also adds the partition field daydate
A = LOAD '$INPUT' USING MyLoader('a:int, b:int');

F = FILTER A BY daydate == '2010-11-01';

STORE F INTO '$OUTPUT';





But if the Loader does not return a schema and the schema is instead defined by
the user in the AS clause, Pig never calls LoadMetadata.setPartitionFilter and
the partition filtering never happens.

e.g.

A = LOAD '$INPUT' USING MyLoader() AS (a:int, b:int, daydate:chararray);

F = FILTER A BY daydate == '2010-11-01';

STORE F INTO '$OUTPUT';



Any suggestions?



Thanks,

Gerrit


  • Alan Gates at Nov 10, 2010 at 9:56 pm
    To answer your direct question: no, there is currently no provision in
    the interface for Pig to provide the user-defined schema to the load
    function.

    But it seems like the real solution to your problem is that
    LoadMetadata.setPartitionFilter ought to be called regardless of
    whether the loader returns a schema. Is there a technical reason we
    don't do that?

    Alan.
  • Gerrit Jansen van Vuuren at Nov 11, 2010 at 9:31 am
    Hi,

    I guess it should only call setPartitionFilter when
    LoadMetadata.getPartitionKeys returns a non-null value. Currently
    getPartitionKeys is only called if the Loader returns a schema.
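
The proposed behaviour could be sketched roughly as follows. This is an illustration only: PartitionAwareLoader, RecordingLoader and PartitionFilterPlanner are hypothetical stand-ins that borrow the two method names from this thread, with simplified signatures, and are not Pig's LoadMetadata interface or its optimizer code.

```java
import java.util.Arrays;

// Hypothetical stand-in for the loader; NOT Pig's LoadMetadata interface.
interface PartitionAwareLoader {
    String[] getPartitionKeys();             // null means "no partitions"
    void setPartitionFilter(String filter);  // simplified: filter as text
}

// Test double that remembers whether a filter was pushed to it.
class RecordingLoader implements PartitionAwareLoader {
    private final String[] keys;
    String receivedFilter; // stays null unless setPartitionFilter is called

    RecordingLoader(String[] keys) { this.keys = keys; }

    public String[] getPartitionKeys() { return keys; }

    public void setPartitionFilter(String filter) { receivedFilter = filter; }
}

class PartitionFilterPlanner {
    // Asks for partition keys whether or not the loader returned a schema,
    // and pushes the filter only when the keys are non-null and the
    // filtered column is one of them.
    static boolean pushFilter(PartitionAwareLoader loader, String column, String filter) {
        String[] keys = loader.getPartitionKeys();
        if (keys == null || !Arrays.asList(keys).contains(column)) {
            return false;
        }
        loader.setPartitionFilter(filter);
        return true;
    }
}
```

The point of the sketch is the order of the checks: the loader's schema is never consulted, only the result of getPartitionKeys.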


    Should I create a Jira and try proposing a fix for this?

    Cheers,
    Gerrit


  • Thejas M Nair at Nov 11, 2010 at 3:18 pm
    Yes, setPartitionFilter can be called only if Pig knows the partition
    columns. Without knowing the partition columns, the partition filter
    cannot be extracted.
    If a user specifies a schema in the load statement, Pig finds the
    partition columns by locating the columns returned by getPartitionKeys()
    in the user-defined schema, using the mapping from the schema returned by
    getSchema() to the user-specified schema. I.e., Pig assumes that the
    columns returned by getPartitionKeys() are columns in the schema returned
    by getSchema().
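
One way to read the mapping described above is as a lookup by position. The sketch below is an interpretation of that description, not Pig's actual code; PartitionColumnMapper and mapToUserColumns are hypothetical names, and schemas are reduced to lists of column names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the positional mapping; not Pig's code.
class PartitionColumnMapper {

    // For each partition key, finds its position in the loader's schema and
    // returns the user-defined column name at that same position.
    static List<String> mapToUserColumns(List<String> loaderSchema,
                                         List<String> userSchema,
                                         List<String> partitionKeys) {
        List<String> mapped = new ArrayList<>();
        if (loaderSchema == null) {
            // Without a loader schema there is nothing to map against,
            // which matches the case where no partition filter is pushed.
            return mapped;
        }
        for (String key : partitionKeys) {
            int position = loaderSchema.indexOf(key);
            if (position >= 0 && position < userSchema.size()) {
                mapped.add(userSchema.get(position));
            }
        }
        return mapped;
    }
}
```

Under this reading, a loader that returns no schema leaves the list empty, so no column of the AS clause can be recognised as a partition column.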

    In your case, does getPartitionKeys return columns that are specified in
    the user-defined schema?

    Yes, please open a Jira, and let's discuss it there. I think at least the
    javadoc might need to be updated.

    -Thejas

  • Gerrit Jansen van Vuuren at Nov 11, 2010 at 3:45 pm
    Hi,



    I've created a Jira for this issue:
    https://issues.apache.org/jira/browse/PIG-1717

    In my case getPartitionKeys will always return the keys that are defined
    as partitions in the path, so if the user loads from
    /log/type1/daydate=2010-11-01, the partition key returned is always
    "daydate".
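
A rough sketch of deriving partition keys from the load path, assuming key=value folder names as in the example path. PathPartitionKeys is a hypothetical helper, and returning null for an unpartitioned path is an assumption chosen to match the non-null check discussed earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig.
class PathPartitionKeys {

    // Collects the key name of every "key=value" segment, in path order.
    static String[] partitionKeys(String location) {
        List<String> keys = new ArrayList<>();
        for (String segment : location.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                keys.add(segment.substring(0, eq));
            }
        }
        // Assumption: null (not an empty array) signals "no partitions".
        return keys.isEmpty() ? null : keys.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] keys = partitionKeys("/log/type1/daydate=2010-11-01");
        System.out.println(String.join(",", keys));
    }
}
```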



    Currently the following code does not cause the loader to be notified of
    the partition filter:

    A = LOAD 'input' USING MyLoader() AS (q, p, daydate);

    F = FILTER A BY daydate == '2010-11-01';

    If Pig could in some way call getPartitionKeys and so learn that daydate
    is a partition, all would work well.





    Cheers,






Discussion Overview
group: dev @
categories: pig, hadoop
posted: Nov 5, '10 at 3:13p
active: Nov 11, '10 at 3:45p
posts: 5
users: 3
website: pig.apache.org
