FAQ
Hi:

I would like to load multiple files in my Pig Latin program, such as

A = LOAD '<regular expression>' ...

What types of regular expressions does Pig Latin support for matching file
names? Thanks.



--
tp


  • Kevin Weil at Nov 5, 2008 at 4:14 pm
    Charles,

    I'm sure you've read the documentation about being able to load full
    directories. Load also supports syntax along the lines of

    LOAD 'mylog-{a,b,c}' USING ...

    to load mylog-a, mylog-b, and mylog-c for processing. For something more
    fine-grained, you could wrap your Pig script in a higher-level language like
    Python that turns your regex into a list of files and then fills in the
    appropriate files in the load expression.

    Kevin
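
A minimal sketch of the wrapper idea described above, assuming the directory listing has already been obtained and is passed in as a plain list. The function name `build_load_statement` and the file names 081013 and 081020 are hypothetical, chosen only for illustration. The USING clause is left out because the thread elides it, and the brace alternatives contain no '/', which is the form reported to work later in this thread.

```python
import re


def build_load_statement(alias, directory, filenames, pattern):
    """Return a Pig LOAD statement covering the names that fully match `pattern`.

    `filenames` is a plain list of names inside `directory`; how that listing
    is obtained is outside the scope of this sketch.
    """
    regex = re.compile(pattern)
    matched = sorted(name for name in filenames if regex.fullmatch(name))
    if not matched:
        raise ValueError("no file names matched %r" % pattern)
    if len(matched) == 1:
        glob = matched[0]
    else:
        # Brace alternatives without '/' in them, as in 'mylog-{a,b,c}'.
        glob = "{" + ",".join(matched) + "}"
    return "%s = LOAD '%s/%s';" % (alias, directory.rstrip("/"), glob)


# Hypothetical listing of /user/middleware/click/0810.
names = ["081011", "081012", "081013", "081020"]
stmt = build_load_statement("A", "/user/middleware/click/0810", names, r"08101[12]")
print(stmt)
# A = LOAD '/user/middleware/click/0810/{081011,081012}';
```

The generated string can then be substituted into the Pig script before it is submitted.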

  • Charles du at Nov 18, 2008 at 12:27 am
    I tried the following command, but Pig Latin does not recognize it:

    A = LOAD '{/user/middleware/click/0810/081011,/user/middleware/click/0811/081112}/*'

    I tried the equivalent hadoop command, and it lists all files under these
    two directories as expected:

    hadoop dfs -ls {/user/middleware/click/0810/081011,/user/middleware/click/0811/081112}/*

    I tried another command, and it works:

    A = LOAD '/user/middleware/click/0810/{081011, 081012}/*'

    It looks to me as though I cannot put '/' inside '{}' in Pig Latin. Is that
    right? Is there a way to get around it? Thanks.


  • Ian Holsman at Nov 18, 2008 at 12:53 am
    Does
    A = LOAD '/user/middleware/click/0811/0811*'
    work for you?



  • Charles du at Nov 18, 2008 at 1:43 am
    This works. The problem is that I need to get files from different
    directories: a few from 0810 and a few from 0811. Thanks.

    Chuang

  • Ian Holsman at Nov 18, 2008 at 2:02 am
    charles du wrote:
    This works. The problem is that I need to get files from different
    directories: a few from 0810 and a few from 0811. Thanks.

    */* works as well.



  • Charles du at Nov 19, 2008 at 4:06 pm
    Thanks for your reply.

    I have multiple directories, and I want to get a subset of files from each
    directory. For example, I have two directories, A and B, each with three
    files:
    A/: A1, A2, A3
    B/: B4, B5, B6

    How can I load A/A3, B/B4, and B/B5 as input? Right now I do it using
    UNION, and I am wondering whether there is a better way.

    hadoop dfs -ls {A/A3, B/B4, B/B5}

    lists these three files. I tried a similar thing in Pig Latin, as follows,
    and it does not work:

    Data = Load '{A/A3, B/B4, B/B5}' ...

    What is the right way to write it in Pig Latin?

    Thanks.
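
The UNION approach mentioned above can be generated from a list of paths rather than written by hand. The following is a rough sketch, not a recommended final form: the helper name `union_load_script` and the `Data_0`, `Data_1`, ... aliases are hypothetical, and the USING clause elided in the thread is omitted, so the generated statements would rely on Pig's default loader.

```python
def union_load_script(alias, paths):
    """Build one LOAD statement per path plus a UNION that combines them."""
    if not paths:
        raise ValueError("at least one path is required")
    if len(paths) == 1:
        # A single path needs no UNION.
        return "%s = LOAD '%s';" % (alias, paths[0])
    # Hypothetical intermediate aliases: Data_0, Data_1, ...
    parts = ["%s_%d" % (alias, i) for i in range(len(paths))]
    lines = ["%s = LOAD '%s';" % (part, path) for part, path in zip(parts, paths)]
    lines.append("%s = UNION %s;" % (alias, ", ".join(parts)))
    return "\n".join(lines)


script = union_load_script("Data", ["A/A3", "B/B4", "B/B5"])
print(script)
# Data_0 = LOAD 'A/A3';
# Data_1 = LOAD 'B/B4';
# Data_2 = LOAD 'B/B5';
# Data = UNION Data_0, Data_1, Data_2;
```

This keeps the script correct regardless of how the glob is parsed, at the cost of one load statement per file.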


  • Kevin Weil at Dec 17, 2008 at 10:13 am
    Is there any resolution on this, by chance? In his Nov 19 example, Charles
    wants to load A/A3 and B/B4 in one load statement, rather than using a union
    of multiple load statements.

    Thanks,
    Kevin
  • Alan Gates at Dec 17, 2008 at 4:55 pm
    Pig just passes the file names it gets on to Hadoop, so I would expect
    whatever works in Hadoop to work in Pig. How does the / not work
    inside {} in Pig? Does it give an error, or just not read the correct
    files? A JIRA should be filed on this so we can track it and get it
    fixed.

    Alan.
  • Kevin Weil at Dec 17, 2008 at 8:21 pm
    Alan,

    It fails with Hadoop on top of the stack, but perhaps Pig isn't passing it
    in correctly from further down the stack? I can do

    hadoop dfs -ls dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}

    from the command line, but if my load statement looks like

    files = LOAD 'dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}' USING ...

    then I get the stack below. I have filed JIRA PIG-569
    (https://issues.apache.org/jira/browse/PIG-569) about this. I have a
    scripting language running around Pig generating load statements and so
    forth -- is there any workaround to make this into a single load statement,
    rather than having to load each individually and union them?

    Thanks,
    Kevin

    2008-12-17 12:02:28,480 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
    2008-12-17 12:02:28,480 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed
    2008-12-17 12:02:28,480 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - java.io.IOException: Unable to get collect for pattern dir{dir1/subdir1,dir2/subdir2,dir3/subdir3} [Failed to obtain glob for dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}]
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:231)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:40)
        at org.apache.pig.impl.io.FileLocalizer.globMatchesFiles(FileLocalizer.java:486)
        at org.apache.pig.impl.io.FileLocalizer.fileExists(FileLocalizer.java:455)
        at org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:108)
        at org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
        at org.apache.pig.impl.io.ValidatingInputFileSpec.(PigInputFormat.java:200)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
    Caused by: org.apache.pig.backend.datastorage.DataStorageException: Failed to obtain glob for dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}
        ... 13 more
    Caused by: java.io.IOException: Illegal file pattern: Expecting set closure character or end of range, or } for glob {dir1 at 5
        at org.apache.hadoop.fs.FileSystem$GlobFilter.error(FileSystem.java:1084)
        at org.apache.hadoop.fs.FileSystem$GlobFilter.setRegex(FileSystem.java:1069)
        at org.apache.hadoop.fs.FileSystem$GlobFilter.(FileSystem.java:953)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:962)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:962)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:962)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:902)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:862)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asCollection(HDataStorage.java:215)
    ... 12 more
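
The innermost error names the glob as `{dir1`, which is what remains if the pattern is cut at each '/' before the braces are parsed. That reading is an inference from the message and the repeated `globPathsLevel` frames, not something confirmed in this thread. The Python sketch below only illustrates the inference, then expands the braces into explicit paths that a wrapper script could feed to separate load statements. Both function names are hypothetical, and nested braces are assumed not to occur. Why the hadoop command line succeeds is also not established here; one possibility is that the shell expands the braces before Hadoop sees them.

```python
import re


def split_into_levels(pattern):
    """Split a path pattern on '/', one component per directory level."""
    return pattern.split("/")


def expand_braces(pattern):
    """Expand non-nested {a,b,c} groups into a list of explicit paths."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        # strip() tolerates the spaces used in '{A/A3, B/B4, B/B5}'.
        expanded.extend(expand_braces(head + alternative.strip() + tail))
    return expanded


pattern = "dir/{dir1/subdir1,dir2/subdir2,dir3/subdir3}"

components = split_into_levels(pattern)
print(components)
# ['dir', '{dir1', 'subdir1,dir2', 'subdir2,dir3', 'subdir3}']

paths = expand_braces(pattern)
print(paths)
# ['dir/dir1/subdir1', 'dir/dir2/subdir2', 'dir/dir3/subdir3']
```

The second component, `{dir1`, has an opening brace with no closing brace, matching the "for glob {dir1 at 5" message in the trace.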


Discussion Overview
group: user @
categories: pig, hadoop
posted: Nov 5, '08 at 1:54a
active: Dec 17, '08 at 8:21p
posts: 10
users: 4
website: pig.apache.org
