Grokbase Groups Pig user June 2011
Hi,

I have some files in hdfs://path/load/ like this:
file_29_00001
file_47_00001
file_16_00001
...
These files are generated by other M/R jobs. Each file contains only one
column, and the number in the file name between 'file_' and '_00001' is an
id.
I want to add the id to the loaded data, like this (I think I need to
write a LoadFunc to get the id):
a = load '/path/load/' using com.company.pig.GetIDFromFileName();
dump a;
-- here the relation 'a' will have two columns: one is the original column and
-- the other is the id.

My questions are:
1. Is there an existing function that can get the id from the file
name?
2. I think this method in Pig 0.6.0 could help me:
bindTo(String fileName, BufferedPositionedInputStream in, long offset, long end)
Specifies a portion of an InputStream to read tuples.
(http://pig.apache.org/docs/r0.6.0/api/org/apache/pig/builtin/PigStorage.html)
but I can't find the same method in Pig 0.8.1.
Which method can I use to work with the input file in the Pig 0.8.1 API?

Thanks very much.


  • Daniel Dai at Jun 14, 2011 at 5:27 pm
    Check http://wiki.apache.org/pig/PigStorageWithInputPath. You will also
    need to disable split combination: -Dpig.noSplitCombination=true

    Daniel
    On 06/13/2011 04:07 AM, Jameson Li wrote:
    [...]
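
For the id itself, once the loader from the wiki page has the input path, the rest is plain string handling. Below is a minimal sketch in plain Java with no Pig dependency. The class name `FileIdExtractor` and its method are hypothetical, and it assumes every file follows the `file_<id>_<digits>` naming shown in the original post. In Pig 0.8 the role of the old `bindTo` is, as far as I can tell, taken by `LoadFunc.prepareToRead`, which receives the split being read, so a custom loader could call a helper like this from there. Disabling split combination presumably matters because a combined split can cover several small files, in which case one path no longer identifies the file for every record.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls the id out of names such as file_29_00001. */
final class FileIdExtractor {
    // "file_", then the id digits (captured), then "_" and a numeric suffix.
    private static final Pattern NAME = Pattern.compile("file_(\\d+)_\\d+");

    /** Returns the id as a string, or null if the name does not match. */
    static String extractId(String path) {
        // Keep only the last path component so directory names cannot match.
        String name = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = NAME.matcher(name);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("hdfs://path/load/file_29_00001")); // prints 29
    }
}
```

Returning null for names that do not match leaves it to the caller to decide whether to skip such files or fail the job.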
  • Jameson Li at Jun 16, 2011 at 1:10 pm
    Great. Following the wiki page
    http://wiki.apache.org/pig/PigStorageWithInputPath and
    setting -Dpig.noSplitCombination=true, I can get the file name in
    Pig.

    But I have another problem.
    I modified the UDF code, rebuilt it with ant, and generated a new jar file (I am
    sure the jar file has been updated):
    pig -x local
    register /home/user/project/lib/myUDF.jar
    a = load 'aaa';
    b = foreach a generate com.company.pig.myUDF();
    dump b;

    I found that the result still comes from the old jar file and UDF class, so I
    think the UDF classes have been cached somewhere.

    Am I right?
    And how can I use the newest jar file after recompiling?

    Thanks very much.

  • Daniel Dai at Jun 16, 2011 at 6:26 pm
    That should not happen. Pig does not cache myUDF.jar. Every run will pick up
    myUDF.jar again from /home/user/project/lib.

    Daniel
  • Jameson Li at Jun 17, 2011 at 2:06 am
    Sorry, this was my mistake.
    My newest jar file is /home/user/project/lib/myUDF.jar, but there
    was an old jar file in the Pig lib dir $PIG_HOME/lib (/opt/pig/lib).
    Unfortunately, even after registering /home/user/project/lib/myUDF.jar,
    Pig first scans for UDF classes in the jar files in its own lib dir when the
    script is executed, so the old class was used.

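
To check which copy of a class actually wins in a case like this, the JVM can report where a class was loaded from. This is a minimal sketch, assuming it is run with the same classpath that Pig uses. `ClassOrigin` is a hypothetical diagnostic helper, not part of Pig.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper, not part of Pig. */
final class ClassOrigin {
    /** Describes where the JVM loaded the given class from. */
    static String describe(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes from the JDK itself (bootstrap loader) have no code source.
        if (src == null || src.getLocation() == null) {
            return cls.getName() + " <- (bootstrap or unknown)";
        }
        return cls.getName() + " <- " + src.getLocation();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, e.g. com.company.pig.myUDF.
        // Defaults to this class when no argument is given.
        Class<?> cls = args.length > 0 ? Class.forName(args[0]) : ClassOrigin.class;
        System.out.println(describe(cls));
    }
}
```

If the printed location points at a jar under the Pig lib dir rather than the registered jar, the stale copy is the one being used.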
  • Jameson Li at Jun 17, 2011 at 9:47 am
    Another question:

    The class org.apache.pig.piggybank.storage.MultiStorage can help me store
    the Pig output into
    different directories.
    But I want the output files not to contain the field at 'splitFieldIndex'.
    For example:
    Input file:
    id name
    --------
    1 jack
    1 tom
    1 lily
    2 cat
    2 pig
    2 bird

    After using MultiStorage('/my/home/output','0', 'bz2', '\\t'), I would
    normally get the files and contents below:
    1/1-0
    ------
    1 jack
    1 tom
    1 lily

    2/2-0
    ------
    2 cat
    2 pig
    2 bird

    I want to get these files and contents instead:
    1/1-0
    ------
    jack
    tom
    lily

    2/2-0
    ------
    cat
    pig
    bird

    Is there a switch I can use to control whether the stored files
    contain the 'splitFieldIndex' field?

    I have looked at the code, and it seems that the answer is no.
    Maybe I have to write another class, such as
    MultiStorageSwithWriteKey, that extends the class MultiStorageSwithKey.
    Am I right?

    Thanks very much.


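
The behaviour asked for in the MultiStorage question can be shown without Pig. The sketch below is a stand-alone illustration in plain Java and is not MultiStorage code. The class `SplitWithoutKey` and its method are hypothetical. It groups rows by the split field and writes out only the remaining fields, which is the per-key content a custom store function would have to produce. Whether any piggybank release offers a switch for this is not confirmed here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone illustration only; this is not MultiStorage code. */
final class SplitWithoutKey {
    /**
     * Groups rows by the field at splitFieldIndex and joins the remaining
     * fields with the delimiter, so the key itself is left out of each line.
     */
    static Map<String, List<String>> split(List<String[]> rows,
                                           int splitFieldIndex,
                                           String delimiter) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < row.length; i++) {
                if (i != splitFieldIndex) {
                    kept.add(row[i]); // every field except the split key
                }
            }
            out.computeIfAbsent(row[splitFieldIndex], k -> new ArrayList<>())
               .add(String.join(delimiter, kept));
        }
        return out;
    }
}
```

With the six rows from the example and split field 0, key "1" maps to the lines jack, tom, lily and key "2" to cat, pig, bird.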

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jun 13, '11 at 11:08a
active: Jun 17, '11 at 9:47a
posts: 6
users: 2
website: pig.apache.org

2 users in discussion

Jameson Li: 4 posts
Daniel Dai: 2 posts
