Grokbase Groups Pig user May 2011
Is there a way to store the headers (the title of each column) using the STORE
command in a Pig script (STORE out3 INTO '$OUTPUT' USING PigStorage();)? Right
now it stores only the data. I read somewhere that Pig 0.8 can store the header
with the map reduce option. Do we have to supply extra parameters?

Thanks, Deepak

--
"Please consider the environment before printing this e-mail"

The Newspaper Marketing Agency: Opening Up Newspapers:
www.nmauk.co.uk

This e-mail and any attachments are confidential, may be legally privileged and are the property of
News International Limited (which is the holding company for the News International group, is
registered in England under number 81701 and whose registered office is 3 Thomas More Square,
London E98 1XY, VAT number GB 243 8054 69), on whose systems they were generated.

If you have received this e-mail in error, please notify the sender immediately and do not use,
distribute, store or copy it in any way. Statements or opinions in this e-mail or any attachment are
those of the author and are not necessarily agreed or authorised by News International Limited or
any member of its group. News International Limited may monitor outgoing or incoming emails as
permitted by law. It accepts no liability for viruses introduced by this e-mail or attachments.


  • Dmitriy Ryaboy at May 25, 2011 at 3:14 pm
    You can try PigStorageSchema from the piggybank.

    -----Original Message-----
    From: "Subhramanian, Deepak" <deepak.subhramanian@newsint.co.uk>
    To: user@pig.apache.org
    Sent: 5/25/2011 5:28 AM
    Subject: Storing Headers in Pig Output File

  • Subhramanian, Deepak at May 25, 2011 at 3:49 pm
    I tried PigStorageSchema. For some reason it doesn't create the headers.
    Is it because I am loading the data using another UDF?

    This is the command I used in the Pig script:

    STORE out INTO '$OUTPUT' USING org.apache.pig.piggybank.storage.PigStorageSchema();

    Thanks, Deepak

    --
    Deepak Subhramanian
    Data & Analytics
    News International, Digital Technology
    Email: deepak.subhramanian@newsint.co.uk

  • Subhramanian, Deepak at May 25, 2011 at 4:13 pm
    Hi, I just realized that it is creating a .pig_header file in the same output
    directory. I guess I need to create a new UDF. Also, when I group, it
    prepends the tag aggregated::group:: to the header columns. Isn't FLATTEN
    supposed to remove the group?

    cat .pig_header
    aggregated::group::AdvertiserID null::Advertiser
    aggregated::group::OrderID aggregated::group::AdID
    aggregated::group::CreativeID aggregated::group::CreativeVersion
    aggregated::group::CreativeSizeID aggregated::group::SiteID
    aggregated::group::PageID aggregated::group::Keyword
    aggregated::Impressions
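
Editor's sketch: the prefixed names in the header above could also be cleaned up after the job, outside Pig. The snippet below is only an illustration, not something proposed in the thread. It assumes the header columns are tab-separated, and the variable names are hypothetical. Note that stripping the prefixes can leave duplicate column names, which is what the prefixes exist to prevent.

```shell
# Header line with the first two column names from the listing above,
# assumed to be tab-separated.
header=$(printf 'aggregated::group::AdvertiserID\tnull::Advertiser')

# Remove everything up to and including the last "::" in each column name.
# The pattern cannot cross whitespace, so each column is handled on its own.
clean_header=$(printf '%s\n' "$header" | sed -E 's/[^[:space:]]*:://g')

printf '%s\n' "$clean_header"
```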


  • Dmitriy Ryaboy at May 25, 2011 at 5:30 pm
    Can you explain what UDF you are looking for?
    The intended usage for the .pig_header file is to cat it:

    hadoop fs -cat myresults/.pig_header myresults/part*

    (which drops the header right on top of your data).

    We don't want to put the header inside the data files because that can
    break subsequent processing.

    As for the field names, that's a Pig feature; the prefixes are there for
    disambiguation. If you don't like them, you can rename the fields:
    FLATTEN(aggregated) as (advertiserId, Advertiser, OrderId, ....)



    D
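
Editor's sketch: the concatenation described in the message above can be tried locally without a cluster. The example uses plain `cat` on a temporary directory as a stand-in for `hadoop fs -cat`. The directory layout, file names, and sample rows are made up for illustration.

```shell
# Local stand-in for the job's output directory (hypothetical sample data).
out_dir=$(mktemp -d) || exit 1

# The header lives in a hidden side file, separate from the data files.
printf 'AdvertiserID\tImpressions\n' > "$out_dir/.pig_header"

# Two part files, as two reducers might produce them.
printf '101\t5\n102\t7\n' > "$out_dir/part-r-00000"
printf '103\t2\n' > "$out_dir/part-r-00001"

# The part-* glob does not match dotfiles, so the header is named
# explicitly and listed first. The cluster equivalent from the thread is:
#   hadoop fs -cat myresults/.pig_header myresults/part*
cat "$out_dir/.pig_header" "$out_dir"/part-* > "$out_dir/result.tsv"

cat "$out_dir/result.tsv"
```

Because the header is a separate file, the part files stay free of header lines and remain safe to feed into later jobs.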

  • Subhramanian, Deepak at May 25, 2011 at 7:52 pm
    Thanks for the inputs. I am looking for a UDF which I can use to store the
    headers in the pig output file.
  • Dmitriy Ryaboy at May 25, 2011 at 8:23 pm
    Still not clear on how you expect a UDF to help. Normally when we say
    UDFs, we mean functions that work on individual tuples. They don't have
    anything to do with how you store data.

    You probably mean a StoreFunc. Since in this case you want a StoreFunc
    that changes the file format, as opposed to writing a side file
    like PigStorageSchema does, you'll need to go pretty deep: write a
    whole StoreFunc + OutputFormat + RecordWriter stack.





  • Alan Gates at May 25, 2011 at 8:29 pm
    The issue you are going to have is that you will want only one of your
    reducers, the one writing part-0000, to put out the header line.
    Otherwise it will get repeated at various points through your data.
    I'm not sure whether the necessary info is available to the store
    function to decide whether it is writing part-0000 or not. Unless
    your data is quite large, you can use PigStorageSchema as is and add
    the "cat" line Dmitriy suggests as the last line in your Pig Latin
    script. That will cause an extra read and write of the data, but it
    will produce one file with the header in the right place without
    requiring a new store function.

    Alan.
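
Editor's sketch: the repeated-header problem described above is easy to reproduce and check for locally. The example builds a merged file in which two writers each emitted the header, counts the header lines, and keeps only the first one. The file contents and the `count_header` helper are hypothetical. The repair step also assumes that no data row is identical to the header.

```shell
# Simulated merged output where two writers each wrote the header line.
merged=$(mktemp) || exit 1
fixed=$(mktemp) || exit 1
header=$(printf 'AdvertiserID\tImpressions')
{
    printf '%s\n' "$header"
    printf '101\t5\n'
    printf '%s\n' "$header"
    printf '103\t2\n'
} > "$merged"

# Count lines that are exactly the header
# (-x: whole-line match, -F: fixed string, -c: count).
count_header() {
    grep -cxF -- "$1" "$2"
}

dup_count=$(count_header "$header" "$merged")
printf 'header lines in merged file: %s\n' "$dup_count"

# Keep the first header line and drop any later repeats.
awk -v h="$header" '$0 == h { if (seen++) next } { print }' "$merged" > "$fixed"
```

A count other than 1 in the final file would mean a header is missing or repeated somewhere in the data.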
  • Subhramanian, Deepak at May 26, 2011 at 11:02 am
    I thought any Java class extension was a UDF. Thanks, Dmitriy, for clarifying.
    Yes, I meant extending the StoreFunc. I guess I will use
    PigStorageSchema for the time being as I am tight on my deadlines, and use
    cat to concatenate the header. I didn't realize that we can use cat
    directly in the Pig script, and that is why I thought of extending the
    StoreFunc. Thanks, Alan, for your inputs.

    I will have to read more on how the output part files are created on HDFS
    so that I can combine all the part files at the end of the Pig script for a
    final output if the file size is very big.
    On 25 May 2011 21:22, Dmitriy Ryaboy wrote:

    Still not clear on how you expect a UDF to help. Normally when we say
    UDFs, we mean functions that work on individual tuples. They don't have
    anything to do with how you store data.

    You probably mean StoreFunc; since in this case you want a StoreFunc
    that messes with the file format, as opposed to writing a side file
    like PigStorageSchema does, you'll need to go pretty deep -- write a
    whole StoreFunc + OutputFormat + RecordWriter stack.

    On Wed, May 25, 2011 at 12:51 PM, Subhramanian, Deepak wrote:

    Thanks for the inputs. I am looking for a UDF which I can use to store the
    headers in the pig output file.

    On 25 May 2011 18:30, Dmitriy Ryaboy wrote:

    Can you explain what UDF you are looking for?
    The intended usage for the .pig_header file is to cat it:

    hadoop fs -cat myresults/.pig_header myresults/part*

    (which drops the header right on top of your data).

    We don't want to put the header inside the data files because that can
    break subsequent processing.

    As for names of the fields, that's a pig feature, it's there for
    disambiguation. If you don't like it, you can rename the fields:
    FLATTEN(aggregated) as (advertiserId, Advertiser, OrderId, ....)



    D

    On Wed, May 25, 2011 at 9:00 AM, Subhramanian, Deepak wrote:

    Hi, I just realized that it is creating a .pig_header file in the same output
    directory. I guess I need to create a new UDF. Also, when I am grouping, it is
    adding the tag aggregated::group:: to the header columns. Isn't FLATTEN
    supposed to remove the group?

    cat .pig_header
    aggregated::group::AdvertiserID null::Advertiser
    aggregated::group::OrderID aggregated::group::AdID
    aggregated::group::CreativeID aggregated::group::CreativeVersion
    aggregated::group::CreativeSizeID aggregated::group::SiteID
    aggregated::group::PageID aggregated::group::Keyword
    aggregated::Impressions



    On 25 May 2011 16:48, Subhramanian, Deepak <deepak.subhramanian@newsint.co.uk> wrote:

    I tried the PigStorageSchema. For some reason it doesn't create the headers.
    Is it because I am loading the data using another UDF?

    This is the command I used in the pigscript..

    STORE out INTO '$OUTPUT' USING
    org.apache.pig.piggybank.storage.PigStorageSchema();

    Thanks, Deepak

  • Subhramanian, Deepak at May 26, 2011 at 6:20 pm
    I tried using fs -cat and sh -cat to combine the header and the
    output file into a new file, but it is not working. Does Hadoop provide an
    option to combine two files into a new file from within a Pig script?

    This is the command I used at the end of the pig script.

    STORE out3 INTO '$OUTPUT' USING
    org.apache.pig.piggybank.storage.PigStorageSchema();

    sh -cat $OUTPUT/.pig_header $OUTPUT/part* > $OUTPUT/top10adv.csv


    hadoop fs -ls pigdbck/output/top10advperimpfileh5
    Found 4 items
    -rw-r--r--   1 root supergroup         30 2011-05-26 17:52 /user/root/pigdbck/output/top10advperimpfileh5/.pig_header
    -rw-r--r--   1 root supergroup        361 2011-05-26 17:52 /user/root/pigdbck/output/top10advperimpfileh5/.pig_schema
    drwxr-xr-x   - root supergroup          0 2011-05-26 17:51 /user/root/pigdbck/output/top10advperimpfileh5/_logs
    -rw-r--r--   1 root supergroup        117 2011-05-26 17:52 /user/root/pigdbck/output/top10advperimpfileh5/part-r-00000

  • Dmitriy Ryaboy at May 26, 2011 at 7:07 pm
    Try -getmerge

    D
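
As a rough illustration of the result being aimed at (header first, then the part files in name order), here is a minimal shell sketch that works on a local copy of the output directory. It is a stand-in, not the HDFS-side command: plain `cat` replaces the Hadoop merge, and the function name `merge_with_header` and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch only: concatenate a PigStorageSchema header with the part files
# from a LOCAL copy of the output directory. merge_with_header is a
# hypothetical helper name, not part of Pig or Hadoop.
merge_with_header() {
    dir=$1   # local directory holding .pig_header and part-* files
    out=$2   # destination file for the merged result

    # .pig_header is a hidden file, so it is named explicitly; the part-*
    # glob expands in sorted order, which keeps the parts in sequence.
    cat "$dir/.pig_header" "$dir"/part-* > "$out"
}
```

Once the directory has been fetched locally (for example with `hadoop fs -copyToLocal`), a call such as `merge_with_header ./top10advperimpfileh5 top10adv.csv` would produce a single file with the header on top.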

  • Subhramanian, Deepak at May 27, 2011 at 1:47 pm
    Thanks. getmerge saves the output file locally, so I had to move it back to
    the HDFS directory. Is there a way to append a datetime to the final output file?

    fs -cp $OUTPUT/.pig_header $OUTPUT/part* $OUTPUT/tmp
    fs -getmerge $OUTPUT/tmp top10advertisers.csv
    fs -rm results/top10advertisers.csv
    fs -moveFromLocal top10advertisers.csv results
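
If the aim is a date and time in the output file name, one option is to build the name outside Pig and pass it in as a parameter, much as `$OUTPUT` is already supplied to the script. A minimal sketch, assuming a POSIX shell; the helper name `stamped_name`, the `OUTFILE` parameter, and the timestamp format are illustrative choices, not anything from the thread.

```shell
#!/bin/sh
# Sketch only: build a file name with a timestamp suffix.
# stamped_name is a hypothetical helper; the optional third argument lets a
# caller supply a fixed stamp instead of the current time.
stamped_name() {
    base=$1                                # e.g. top10advertisers
    ext=$2                                 # e.g. csv
    stamp=${3:-$(date +%Y%m%d_%H%M%S)}     # default: now, as YYYYMMDD_HHMMSS
    printf '%s_%s.%s\n' "$base" "$stamp" "$ext"
}
```

The result could then be handed to the script through Pig's parameter substitution, for example `pig -param OUTFILE=$(stamped_name top10advertisers csv)` alongside however `$OUTPUT` is currently supplied, and referenced as `$OUTFILE` in the `fs -getmerge` and `fs -moveFromLocal` lines.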


Discussion Overview
group: user @
categories: pig, hadoop
posted: May 25, '11 at 12:28p
active: May 27, '11 at 1:47p
posts: 12
users: 3
website: pig.apache.org
