Hi all,



Hadoop lets us specify multiple input paths for a job. Does Pig have an
equivalent? Essentially, is it possible to specify multiple paths in a
single LOAD command? For example, I have n input paths that I need to
load for processing. The only option I can see right now is to use n
variables with n LOAD commands and do a UNION at the end.

For example:

Raw1 = LOAD '$inputPath1/*' using PigStorage('\t');
Raw2 = LOAD '$inputPath2/*' using PigStorage('\t');
...
RawN = LOAD '$inputPathN/*' using PigStorage('\t');
Raw = UNION Raw1, Raw2, ..., RawN;



Can anyone let me know if there is a simpler way to do this, such as a
single LOAD statement?



Thanks

Pallavi


  • Zjffdu at Jul 9, 2009 at 12:06 pm
    You can use a glob pattern to match the paths:

    For example:

    Raw1 = LOAD '{inputPath1,inputPath2,...}/*' using PigStorage('\t');

    This will load all the data under inputPath1, inputPath2, ...

    Hadoop supports this pattern mechanism internally.



    -----Original Message-----
    From: Palleti, Pallavi
    Sent: July 8, 2009 20:34
    To: pig-user@hadoop.apache.org
    Subject: Specifying multiple input paths in LOAD command

  • Thejas Nair at Jul 9, 2009 at 1:06 pm
    In my experience, each entry inside {} has to be a single directory name;
    it can't be a path containing several directories.
    This does not work: LOAD '{/d1/abc/def/f1,/d1/abc/xyz/f1}'
    This works: LOAD '/d1/abc/{def,xyz}/f1'

    -Thejas

  • Daniel Dai at Jul 9, 2009 at 2:48 pm
    PIG-252 (https://issues.apache.org/jira/browse/PIG-252) addresses this issue.

    Instead of using union, you can try this:

    Raw = LOAD '$inputPathprefix{1,2,3,4}/*' using PigStorage('\t');
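
Pulling the replies in this thread together, a single-LOAD version might look like the sketch below. The directory layout is the one from Thejas's example; the parameter name `dirs` and the script name `load_many.pig` are hypothetical and only for illustration. Following Thejas's observation, the alternation is kept to a single path component rather than whole paths.

```pig
-- Sketch only. Assumed layout (taken from the example earlier in the thread):
--   /d1/abc/def/f1 and /d1/abc/xyz/f1
-- 'dirs' is a hypothetical parameter holding a comma-separated list of
-- directory names, supplied when the script is launched, for example:
--   pig -param dirs=def,xyz load_many.pig
-- After parameter substitution the path below reads '/d1/abc/{def,xyz}/f1'.
Raw = LOAD '/d1/abc/{$dirs}/f1' using PigStorage('\t');
```

If this behaves as expected on your Pig and Hadoop versions, it replaces the n LOAD statements and the UNION with one relation, and the set of directories can change per run without editing the script.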



