FAQ
Hi,
Can Pig recognize the names of the files it is importing? If so, how? I want
to combine the files according to their filenames.

Example:
google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....

Sort and combine them by name, then output two files, google_all.csv and
baidu_all.csv, from a Pig script.


Best Regards,
Jumping Qu

------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under
budget.)


  • Romain Rigaux at Mar 2, 2010 at 7:00 pm
    Hi,

    In Pig 0.6 you can extend PigStorage and grab the name of the file with
    something like this:

    public class MyLoader extends PigStorage {

        private String fileName;

        @Override
        public void bindTo(String fileName, BufferedPositionedInputStream is,
                           long offset, long end) throws IOException {
            super.bindTo(fileName, is, offset, end);
            // In your case, match with a regexp and keep only the group with
            // the name (e.g. google, baidu).
            this.fileName = fileName;
        }

        @Override
        public Tuple getNext() throws IOException {
            Tuple next = super.getNext();
            if (next != null) {
                // Append the source file name as an extra last field.
                next.append(fileName);
            }
            return next;
        }
    }

    Then you can group on the name and split on it.
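    For example, the extra column could then be used to split and store the
    rows per source. A rough sketch (the jar name, paths, and schema below
    are illustrative, not from this thread):

    -- register the jar containing the custom loader (name is illustrative)
    REGISTER myloader.jar;

    A = LOAD 'input/path/*.csv' USING MyLoader()
            AS (f1:chararray, f2:chararray, fileName:chararray);

    -- fileName is the extra last column appended by the loader, assumed to
    -- be already reduced to the source name (e.g. 'google', 'baidu')
    SPLIT A INTO google IF fileName == 'google',
                 baidu  IF fileName == 'baidu';

    STORE google INTO 'output/path/google_all';
    STORE baidu  INTO 'output/path/baidu_all';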

    Thanks,

    Romain
  • Romain Rigaux at Mar 3, 2010 at 6:22 pm
    Actually I was using another loader and I just tried with PigStorage (Pig
    0.6) and it seems to work too.

    If your input file has two columns, this will have the expected schema
    and data:

    A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);

    A: {f1: chararray,f2: chararray,filename: chararray}

    If you do "tuple.set(tuple.getLength() - 1, fileName)" instead, your
    third column will be null.

    So in practice the loader loads the data "independently" and then
    "casts" it to the schema you provided. That said, I don't claim it is a
    very clean solution.

    Thanks,

    Romain

    2010/3/2 Mridul Muralidharan <mridulm@yahoo-inc.com>
    I am not sure this will work as you expect.
    Depending on which implementation of PigStorage you end up using, it
    might exhibit different behavior.

    If I am not wrong, currently, if you specify something like:

    A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);

    your code will end up generating a tuple of 4 fields, with fileName
    always being null and the actual filename you inserted through
    MyLoader ending up as the 4th field (and so not "seen" by Pig; I am
    not sure what happens if you do a join, etc. with this tuple, though.
    Essentially, the runtime is not consistent with the script schema).


    Note: this is implementation-specific behavior, which could probably be
    worked around with the implementation-specific hack
    "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
    the last field expected].

    As you would expect, that is brittle code.


    From a while back, I remember facing issues with Pig's implicit
    conversion to/from bytearray, the implicit projection that was
    introduced, the insertion of nulls to extend tuples to the specified
    schema (the behavior above), etc., so you would become dependent on
    implementation changes.

    I don't think BinStorage and PigStorage were written with inheritance
    in mind.


    Regards,
    Mridul
  • Zaki rahaman at Mar 3, 2010 at 7:13 pm
    In this case, why wouldn't you simply use globbing in your load
    statements? Something like:

    baidu = LOAD 'input/path/*baidu*' AS (schema);
    google = LOAD 'input/path/*google*' AS (schema);

    STORE baidu INTO 'output/path/baidu_all';
    STORE google INTO 'output/path/google_all';

    --
    Zaki Rahaman
  • Jumping at Mar 4, 2010 at 1:06 am
    Thanks all of you guys.


    Best Regards,
    Jumping Qu

    ------
    Don't tell me how many enemies we have, but where they are!
    (ADV:Perl -- It's like Java, only it lets you deliver on time and under
    budget.)

  • Zaki Rahaman at Mar 4, 2010 at 1:29 am
    Just curious,

    What solution did you use?

    Sent from my iPhone
  • Jumping at Mar 4, 2010 at 1:45 am
    I am using MapReduce on Amazon; there is another problem: how can I use
    two "$INPUT" parameters in a Pig script?

    Best Regards,
    Jumping Qu

    ------
    Don't tell me how many enemies we have, but where they are!
    (ADV:Perl -- It's like Java, only it lets you deliver on time and under
    budget.)

  • Zaki Rahaman at Mar 4, 2010 at 1:59 am
    Even if you're using Amazon Elastic MapReduce, you can specify
    additional named parameters when running scripts. You can put variable
    placeholders in your script and then pass the values from the console,
    or specify defaults. You can also run your scripts in interactive mode
    so that you have complete control over execution. And you can always
    hardcode when all else fails.
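    As a rough sketch (the script name and schema below are illustrative,
    not from this thread), a script can declare the placeholders with
    defaults and have them overridden at submission time:

    -- combine.pig: parameter placeholders with default values
    %default INPUT  'input/path/*baidu*'
    %default OUTPUT 'output/path/baidu_all'

    data = LOAD '$INPUT' AS (f1:chararray, f2:chararray);
    STORE data INTO '$OUTPUT';

    -- overriding the defaults when submitting the script, e.g.:
    --   pig -param INPUT='input/path/*google*' -param OUTPUT='output/path/google_all' combine.pig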

    Sent from my iPhone
  • Romain Rigaux at Mar 4, 2010 at 6:48 am
    Or you can just call the script twice with:

    $INPUT='input/path/*baidu*'
    $OUTPUT='output/path/baidu_all'

    then

    $INPUT='input/path/*google*'
    $OUTPUT='output/path/google_all'
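    Concretely, assuming a parameterized script like the one sketched above
    (the script name combine.pig is illustrative), the two runs could be
    submitted as:

    pig -param INPUT='input/path/*baidu*' -param OUTPUT='output/path/baidu_all' combine.pig
    pig -param INPUT='input/path/*google*' -param OUTPUT='output/path/google_all' combine.pig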

    Thanks,

    Romain

Discussion Overview
group: user
categories: pig, hadoop
posted: Mar 1, '10 at 11:10a
active: Mar 4, '10 at 6:48a
posts: 9
users: 3
website: pig.apache.org
