Grokbase Groups Pig user June 2011
Hello all-

I've got a quick question and Google isn't proving to be much help.

I've got a big file that has a few lines in it prefixed with a pound sign (#) to indicate they are to be ignored. I would like to LOAD this file using PigStorage. Is there a way to do this, or is it handled automatically?

The data might look something like this:

# Data Source: Project A
# Contact MMoore with Questions
# SenderId RecipientId
1 2
3 5
6 7
#2 1
3 6
11 7

Thanks!
-Michael

______________________________________
Michael Moore :: Michael.Moore@jhuapl.edu
The Johns Hopkins University Applied Physics Laboratory
0B7B17EE1AE2A80B pgp
BC31 A861 9726 8211 F79F 7E21 0B7B 17EE 1AE2 A80B pgp fingerprint


  • William Dowling at Jun 7, 2011 at 7:13 pm
    Can you stream it through

    grep -v '^#'

    ?

    William F Dowling
    Sr Technical Specialist, Software Engineering
    Thomson Reuters
    0 +1 215 823 3853



    From: Moore, Michael A.
    Sent: Tuesday, June 07, 2011 3:04 PM
    To: user@pig.apache.org
    Subject: Loading Files with Comment Lines



  • Moore, Michael A. at Jun 7, 2011 at 7:14 pm
    Possibly. Can I do that if the file is already in HDFS?
  • William Dowling at Jun 7, 2011 at 7:17 pm
    I do that kind of streaming on hdfs files using Hadoop streaming, outside of pig. I assume you could do it from inside pig too, but haven’t tested.



  • Moore, Michael A. at Jun 7, 2011 at 7:19 pm
    Hmm, thanks for the reply. Anyone have a Pig way of doing this? I'd rather not write a UDF to look for comment lines, but I can do so if I have to. This seems like something PigStorage or the like should handle.
  • Daniel Eklund at Jun 7, 2011 at 7:18 pm
    I agree with the pre-processing step... BUT, in case the data is big
    (i.e. pound signs scattered over terabytes), you could first load each
    line into a relation as one big chararray, filter, and then split it into
    columns... I have many similar issues where the default loader won't
    handle something, and I have been using this 'design pattern'. Something like:

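    -- TextLoader keeps each line whole; the default PigStorage would split it on tabs
    A = LOAD 'yourfile' USING TextLoader() AS (data:chararray);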
    B = FILTER A by SUBSTRING(data,0,1) != '#';
    C = FOREACH B generate SOMETOKENIZEUDF(data) as ( .. your columns...);

    I've become a big fan of the python udfs, and you could easily use them as
    your own 'loader' in the third step above.

    I will not vouch for the efficiency of the approach.
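
As a rough sketch of the third step, a Python UDF standing in for SOMETOKENIZEUDF might look like the following. The module name tokenize_udf.py, the function name tokenize_pair, the two-column int schema, and the whitespace delimiter are all assumptions made for illustration and do not come from the thread. The sketch also assumes Pig makes the outputSchema decorator available when the script is registered with Jython; a no-op fallback is defined so the function can be tried under plain Python as well.

```python
# tokenize_udf.py: hypothetical UDF module; names and schema are assumptions.
try:
    outputSchema  # assumed to be provided by Pig when registered with Jython
except NameError:
    # Fallback so the function can be exercised outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("t:(sender:int, recipient:int)")
def tokenize_pair(line):
    """Split a 'SenderId RecipientId' line on whitespace into two ints."""
    if line is None:
        return None
    parts = line.split()
    if len(parts) != 2:
        return None  # malformed line: hand Pig a null instead of failing
    try:
        return (int(parts[0]), int(parts[1]))
    except ValueError:
        return None  # non-numeric field, e.g. a leftover comment line
```

In a Pig script this would presumably be registered with something like REGISTER 'tokenize_udf.py' USING jython AS myfuncs; and called as myfuncs.tokenize_pair(data) in the FOREACH. Returning None for bad lines is a design choice of this sketch: it lets a later FILTER drop nulls rather than failing the job on one malformed record.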
  • Alan Gates at Jun 7, 2011 at 8:26 pm
    A = load 'input' as (x, y);
    B = filter A by SUBSTRING(x, 0, 1) != '#';
    ...
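
To see which rows a filter like that keeps, here is a small plain-Python emulation (not Pig) run over the sample data from the original question. It is only a sketch: it assumes the columns are separated by whitespace (Pig's default PigStorage splits on tabs), and SAMPLE, keep_row, and load_rows are names invented for the illustration.

```python
# Sample data copied from the original question.
SAMPLE = """\
# Data Source: Project A
# Contact MMoore with Questions
# SenderId RecipientId
1 2
3 5
6 7
#2 1
3 6
11 7
"""


def keep_row(x):
    """Mirror SUBSTRING(x, 0, 1) != '#': keep rows whose first field does not start with '#'."""
    return x[0:1] != "#"


def load_rows(text):
    """Split each line into fields and keep only the non-comment rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if fields and keep_row(fields[0]):
            rows.append(tuple(fields[:2]))
    return rows


rows = load_rows(SAMPLE)
print(rows)
# [('1', '2'), ('3', '5'), ('6', '7'), ('3', '6'), ('11', '7')]
```

Both the header comments and the commented-out data row "#2 1" are dropped, because the test looks only at the first character of the first field.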

  • Moore, Michael A. at Jun 8, 2011 at 2:44 pm
    Brilliant! Thanks Alan!


Discussion Overview
group: user @
categories: pig, hadoop
posted: Jun 7, '11 at 7:04p
active: Jun 8, '11 at 2:44p
posts: 8
users: 4
website: pig.apache.org
