Grokbase Groups Pig user August 2009
Hello Everyone,

I am trying to write Pig scripts for my project. The problem I am facing is
that I want to load different files into the same variable. Is it possible to
do this without modifying the Loader? I have read about Hadoop globbing. Does
anyone have a solution for this?

I know I can load all files of a given directory into a single variable.
But is it possible to load specific files from that directory? Or specific
files from different directories into the same load variable?

I also know about the UNION strategy, but that adds one map-reduce job and I
want to avoid that.

Any suggestions are welcome.

Pankil

  • Olga Natkovich at Aug 26, 2009 at 6:17 pm
    Pankil,

    You have a couple of options:

    (1) If you disable the multiquery support, you can take advantage of the
    full Hadoop globbing capabilities, which are likely to be sufficient.
    (2) If you need to use multiquery, only single-pattern globs are
    supported, so you would not be able to specify multiple unrelated
    directories. If that is not sufficient, you will need to use UNION, but
    it might not significantly impact your performance. I would try that
    first before trying a custom solution.
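
    A minimal sketch of the UNION fallback from option (2), assuming two
    hypothetical directories and alias names that are made up for
    illustration. Each LOAD uses a single-pattern glob, and UNION merges the
    two relations into one:

    ```pig
    -- Hypothetical paths: each LOAD uses one single-pattern glob.
    logs_a = LOAD '/data/dir1/*.log' USING PigStorage();
    logs_b = LOAD '/data/dir2/*.log' USING PigStorage();

    -- Merge both inputs into one relation for the rest of the script.
    all_logs = UNION logs_a, logs_b;
    ```

    Whether this costs an extra map-reduce job in practice depends on the
    rest of the script, so it is worth measuring before ruling it out.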

    Olga

  • Mridul Muralidharan at Aug 26, 2009 at 7:10 pm
    Hi Pankil,

    As Thejas pointed out in the other thread, you can use the globbing that
    Hadoop supports:
    http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)


    Regards,
    Mridul

  • Pankil Doshi at Aug 27, 2009 at 12:17 am
    Which version of Hadoop supports globbing? Or do I have to apply a patch
    for it? And will it be compatible with Pig 0.3.0? Has anyone tested it?

    Pankil

  • Zjffdu at Aug 30, 2009 at 2:46 pm
    The Hadoop version that Pig currently uses (0.18.3) will be OK for you.

    e.g. raw = LOAD '/data/*.log' USING PigStorage();

    This statement loads all the files in /data with the .log extension.
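
    Along the same lines, a single pattern can pick out specific files
    rather than everything in a directory. The following is only a sketch:
    the paths, file names and aliases are hypothetical, and the wildcard
    forms are the ones described in the Hadoop FileSystem.globStatus javadoc.
    Whether each form is accepted may depend on the Hadoop version bundled
    with your Pig release.

    ```pig
    -- '?' matches exactly one character, e.g. log1.txt or logA.txt.
    one_char = LOAD '/data/log?.txt' USING PigStorage();

    -- '[1-3]' matches one character from the range: log1.txt, log2.txt, log3.txt.
    ranged = LOAD '/data/log[1-3].txt' USING PigStorage();

    -- '{jan,feb}' matches either string: jan.log or feb.log.
    picked = LOAD '/data/{jan,feb}.log' USING PigStorage();
    ```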


  • Mridul Muralidharan at Aug 31, 2009 at 12:10 am

    Pankil Doshi wrote:
    Which version of Hadoop supports globbing? Or do I have to apply a patch
    for it? And will it be compatible with Pig 0.3.0? Has anyone tested it?

    Someone from the Pig team can give details of the actual versions.
    But I have been using globbing for quite a while now, and I think all
    versions of Pig that you can get your hands on should be able to
    support it!

    Regards,
    Mridul

    PS: IIRC there are differences between Hadoop globbing and bash globbing,
    so you might want to look at the javadoc.
  • Daniel Dai at Sep 3, 2009 at 2:03 pm
    Pig passes the filename directly to Hadoop, so globbing support is
    provided by the underlying Hadoop. Hadoop 0.18 supports only
    single-pattern globs. Hadoop 0.19/0.20 support globbing for multiple
    unrelated directories. The latest Pig release (0.3) bundles Hadoop 0.18,
    so you can only use single-pattern globbing with that release.
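
    As an untested sketch of what a multi-directory glob could look like
    when Pig runs against Hadoop 0.19 or later: the paths and alias below
    are hypothetical, and the brace alternation spanning directories is an
    assumption based on the Hadoop glob syntax, not something confirmed for
    a specific Pig release.

    ```pig
    -- Assumes Hadoop 0.19 or later underneath Pig.
    -- The alternation spans path components, so one LOAD reads
    -- /data/dir1/a.log and /data/dir2/b.log into the same relation.
    both = LOAD '/data/{dir1/a.log,dir2/b.log}' USING PigStorage();
    ```

    With the Hadoop 0.18 bundled in Pig 0.3, the same inputs would need two
    LOAD statements combined with UNION.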



Discussion Overview
group: user @
categories: pig, hadoop
posted: Aug 26, '09 at 5:22p
active: Sep 3, '09 at 2:03p
posts: 7
users: 5
website: pig.apache.org
