Grokbase Groups Pig dev June 2008
Allow multiple paths in the load statement
------------------------------------------

Key: PIG-252
URL: https://issues.apache.org/jira/browse/PIG-252
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich

From Tom White:
I'm having a problem loading data from multiple paths in Pig. What I'm trying to do is to load data from a range of dates, so I would like to specify an input of two globbed paths:

x = LOAD '2008/05/{26,27,28,29,30,31},2008/06/{1,2}';

Pig doesn't seem to like this, though, as it tries to interpret it as a single path. The best I can do is to use UNION:

x1 = LOAD '2008/05/{26,27,28,29,30,31}';
x2 = LOAD '2008/06/{1,2}';
x = UNION x1, x2;

The downside to this is that I want to parameterize my paths, and having a separate script for each number of paths in the input is cumbersome.

Is there a better way of doing this? Are there any plans to support multiple paths, and/or PathFilters?
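
One way to avoid a separate script per number of paths is to build the whole input as a single glob string outside Pig and hand it to one script as a single parameter. The following is only a sketch: the helper name `date_range_glob` is hypothetical, the YYYY/MM/D directory layout is inferred from the example paths above, and it assumes the loader accepts a nested brace glob, which Pig did not at the time of this report.

```python
from datetime import date, timedelta
from itertools import groupby


def date_range_glob(start: date, end: date) -> str:
    """Build one glob covering every day from start to end, inclusive.

    Hypothetical helper: assumes a YYYY/MM/D layout (zero-padded month,
    unpadded day), matching the example paths in this issue.
    """
    if start > end:
        raise ValueError("start must not be after end")

    # Enumerate every day in the range; the list is already in date order.
    days = [start + timedelta(days=n) for n in range((end - start).days + 1)]

    # Group consecutive days by (year, month) and emit one brace group each.
    groups = []
    for (year, month), members in groupby(days, key=lambda d: (d.year, d.month)):
        day_list = ",".join(str(d.day) for d in members)
        groups.append(f"{year:04d}/{month:02d}/{{{day_list}}}")

    # A single month needs no outer braces; several months are wrapped
    # in one outer alternation so the result is still a single path string.
    if len(groups) == 1:
        return groups[0]
    return "{" + ",".join(groups) + "}"


print(date_range_glob(date(2008, 5, 26), date(2008, 6, 2)))
```

The resulting string could then replace the literal path in a single LOAD statement, so the script itself would not change as the date range grows.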


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Tom White (JIRA) at Jun 5, 2008 at 2:22 pm
    [ https://issues.apache.org/jira/browse/PIG-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602656#action_12602656 ]

    Tom White commented on PIG-252:
    -------------------------------

    By making globs more powerful (HADOOP-3498), we would be able to say:

    {code}
    x = LOAD '{2008/05/{26,27,28,29,30,31},2008/06/{1,2}}';
    {code}
  • Pi Song (JIRA) at Jun 5, 2008 at 2:34 pm
    [ https://issues.apache.org/jira/browse/PIG-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602663#action_12602663 ]

    Pi Song commented on PIG-252:
    -----------------------------

    After HADOOP-3498, I think we will need just a parser change in Pig to not reject "," and "{}" in filenames.
  • Daniel Dai (JIRA) at Jun 25, 2008 at 9:17 pm
    [ https://issues.apache.org/jira/browse/PIG-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608196#action_12608196 ]

    Daniel Dai commented on PIG-252:
    --------------------------------

    In mapreduce mode, Pig passes the filename to Hadoop without any filtering. Globs are expanded by "org.apache.hadoop.fs.FileSystem.globStatus" or "org.apache.hadoop.fs.FileSystem.globPaths". Once [HADOOP-3498|https://issues.apache.org/jira/browse/HADOOP-3498] is fixed, Pig should handle this automatically.
    Some work is also under way to add globbing in local mode ([PIG-279|https://issues.apache.org/jira/browse/PIG-279]). We need to include this in the glob parsing for local mode as well.
  • Daniel Dai (JIRA) at Jul 1, 2008 at 6:45 pm
    [ https://issues.apache.org/jira/browse/PIG-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-252:
    ---------------------------

    Attachment: localglobbing.patch

    Pig will use Hadoop's default mode as its local-mode execution engine, so supporting globbing should be no different in local mode than in mapreduce mode. Pig will pass the unfiltered glob string to Hadoop ("org.apache.hadoop.fs.FileSystem.globPaths"), so once [HADOOP-3498|https://issues.apache.org/jira/browse/HADOOP-3498] is fixed, Pig should automatically benefit from it. The only remaining problem is that there is still some file-existence-checking code specific to local mode, which we need to remove. I attached a patch for reference (it targets branches/types).
  • Tom White (JIRA) at Sep 5, 2008 at 10:58 am
    [ https://issues.apache.org/jira/browse/PIG-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628612#action_12628612 ]

    Tom White commented on PIG-252:
    -------------------------------

    HADOOP-3498 has been committed, so once Pig uses Hadoop 0.19.0 this issue can be closed.
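
To make the nested glob proposed in this thread concrete, the sketch below shows which paths '{2008/05/{26,27,28,29,30,31},2008/06/{1,2}}' stands for when braces may contain slashes and other brace groups. It is an illustration only, not Hadoop's implementation: `expand_braces` is a hypothetical helper that does plain string expansion, whereas the Hadoop glob methods named above also match the result against the file system.

```python
def expand_braces(pattern: str) -> list[str]:
    """Expand {a,b} alternations, including nested ones, into plain paths."""
    # Without an opening brace there is nothing to expand.
    start = pattern.find("{")
    if start == -1:
        return [pattern]

    # Scan to the matching close brace, noting commas at the top level
    # of this group (commas inside nested groups are left for recursion).
    depth = 0
    commas = []
    end = -1
    for i in range(start, len(pattern)):
        ch = pattern[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                end = i
                break
        elif ch == "," and depth == 1:
            commas.append(i)
    if end == -1:
        raise ValueError(f"unbalanced braces in {pattern!r}")

    # Split the group body at its top-level commas.
    bounds = [start] + commas + [end]
    alternatives = [pattern[a + 1:b] for a, b in zip(bounds, bounds[1:])]

    # Substitute each alternative and expand whatever braces remain.
    prefix, suffix = pattern[:start], pattern[end + 1:]
    results = []
    for alt in alternatives:
        results.extend(expand_braces(prefix + alt + suffix))
    return results


for path in expand_braces("{2008/05/{26,27,28,29,30,31},2008/06/{1,2}}"):
    print(path)
```

The outer group contributes the two month-level alternatives, and each inner group then expands to its individual days, giving eight paths in total for this pattern.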
