Storage split question, load asterisks, user-defined job names
Hi

First of all, I'm using an old version of Pig, the one that ran on Hadoop
12.1, and yes, I will upgrade soon...

Here are some requests/questions, based on my use of Pig so far:

1: If you have 1 billion files (purposely exaggerating) where approximately
50% of the files are related to one segment and 50% to another segment,
then I guess the Pig script for isolating the segments would be something
like the following:

files = LOAD 'path/to/1_billion_files' AS (segment);
segmentA = FILTER files BY (segment == 'a');
segmentB = FILTER files BY (segment == 'b');

STORE segmentA INTO 'segmentA.dat';
STORE segmentB INTO 'segmentB.dat';

So the question is: are all 1 billion files read and filtered twice? If so
(I guess they are), would it be possible to do
something like this (just to avoid the overhead of 1 billion extra reads):

STORE SPLIT segmentA INTO 'segmentA.dat', segmentB INTO 'segmentB.dat';

2: Would it be possible to allow the use of asterisks in Pig's LOAD
statement?

files = LOAD 'batches/*/batch/*/segments';

3: Would you allow user-defined Hadoop job names when executing a script? I
have a feeling that this one is in the newest version, true?

I appreciate any comments anyone might have, thanks :-)

Br Casper


  • Alan Gates at Feb 15, 2008 at 4:06 pm

    Casper Rasmussen wrote:
    Hi

    First of all, I'm using an old version of Pig, the one that ran on Hadoop
    12.1, and yes, I will upgrade soon...

    Here are some requests/questions, based on my use of Pig so far:

    1: If you have 1 billion files (purposely exaggerating) where approximately
    50% of the files are related to one segment and 50% to another segment,
    then I guess the Pig script for isolating the segments would be something
    like the following:

    files = LOAD 'path/to/1_billion_files' AS (segment);
    segmentA = FILTER files BY (segment == 'a');
    segmentB = FILTER files BY (segment == 'b');

    STORE segmentA INTO 'segmentA.dat';
    STORE segmentB INTO 'segmentB.dat';

    So the question is: are all 1 billion files read and filtered twice? If so
    (I guess they are), would it be possible to do
    something like this (just to avoid the overhead of 1 billion extra reads):

    STORE SPLIT segmentA INTO 'segmentA.dat', segmentB INTO 'segmentB.dat';
    Yes, currently all 1B files are read and filtered twice. No, your split
    suggestion won't work, yet. Right now pig views all jobs as a tree of
    operations, with a given store (or dump) command as a root. To do what
    you want we need to view the commands as a graph, with multiple heads,
    which it can evaluate simultaneously. We're working in that direction
    but it will be a while before we're there.
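
    [Editor's note: later Pig releases did add a SPLIT operator that reads the
    input once and routes each record into multiple relations by condition.
    A rough sketch of what that looks like; the aliases and paths here are
    illustrative, and whether the two STOREs share a single pass over the data
    depends on the Pig version's multi-query support:

    ```pig
    -- Read the input once, routing each record by its segment value.
    files = LOAD 'path/to/1_billion_files' AS (segment:chararray);
    SPLIT files INTO segA IF segment == 'a', segB IF segment == 'b';

    STORE segA INTO 'segmentA.dat';
    STORE segB INTO 'segmentB.dat';
    ```
    ]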
    2: Would it be possible to allow the use of asterisks in Pig's LOAD
    statement?

    files = LOAD 'batches/*/batch/*/segments';
    The latest versions of Pig use Hadoop pattern matching on their file
    paths, so the above command would work.
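
    [Editor's note: Hadoop's glob syntax covers more than bare asterisks;
    question marks, character classes, and brace alternation are also
    matched. A sketch with made-up paths:

    ```pig
    -- '*' matches any name, '?' one character, '{a,b}' either alternative.
    files = LOAD 'batches/*/batch/batch-{01,02}/segments' AS (segment:chararray);
    ```
    ]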
    3: Would you allow user-defined Hadoop job names when executing a script?
    I have a feeling that this one is in the newest version, true?
    We don't yet allow users to define their job names, but we certainly
    have had requests to do so.
    I appreciate any comments anyone might have, thanks :-)

    Br Casper
    Alan.
  • Benjamin Francisoud at Feb 15, 2008 at 4:17 pm
    Alan Gates wrote:
    3: Would you allow user-defined Hadoop job names when executing a script?
    I have a feeling that this one is in the newest version, true?
    We don't yet allow users to define their job names, but we certainly
    have had requests to do so.
    +1 (maybe in PigContext?)
  • Olga Natkovich at Feb 15, 2008 at 4:36 pm
    Actually, we do allow users to set the job name:

    set job.name 'foo'

    http://wiki.apache.org/pig/Grunt

    Olga
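
    [Editor's note: the same `set` command works at the top of a Pig script,
    so the name applies before any job launches. A minimal sketch; the job
    name, path, and aliases below are illustrative:

    ```pig
    -- Name the Hadoop job so it is easy to find in the JobTracker UI.
    set job.name 'segment-split';
    files = LOAD 'path/to/files' AS (segment:chararray);
    segA = FILTER files BY segment == 'a';
    STORE segA INTO 'segmentA.dat';
    ```
    ]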
  • Casper Rasmussen at Feb 15, 2008 at 5:46 pm
    Cool, even without the 'store split', it's nice working with Pig, and my
    current work is built on the fact that the storage point is the root of
    the operations, so for now nothing is wasted :-)

    Thanks...
    On Fri, Feb 15, 2008 at 5:32 PM, Olga Natkovich wrote:

    Actually, we do allow users to set the job name:

    set job.name 'foo'.

    http://wiki.apache.org/pig/Grunt

    Olga


Discussion Overview
group: user
categories: pig, hadoop
posted: Feb 15, '08 at 2:55p
active: Feb 15, '08 at 5:46p
posts: 5
users: 4
website: pig.apache.org
