Hey,

I have a bunch of files where the filename is significant. I'm loading the
files by supplying the top-level directory that contains them. Is
there a way to capture each file's name and append it to the tuples of
data read from that file?

-Kim


  • Dmitriy Ryaboy at Feb 4, 2011 at 3:49 am
    In Pig 0.6, you can hook into bindTo() and save the file name.

    In Pig 0.8 you have to find your way to the underlying InputSplit via
    PigSplit.getWrappedSplit(), cast it to FileSplit, and call getPath()
    on it... I think. I haven't done this.

    This will totally break if you have splitCombination turned on, of
    course, as Pig can silently move to a different file under you, so
    you'd have to turn that off.

    D
  • Kim Vogt at Feb 4, 2011 at 4:09 am
    Thanks Dmitriy!

    I'm using Pig 0.8 and, I think, no splitCombination. I accept this
    challenge and will keep you pig'ites updated.

    -Kim
  • Dexin Wang at Feb 4, 2011 at 4:33 am
    Similarly, is it possible to insert literal values into a tuple stream?

    For example, when I invoke my Pig script, I already know what the data
    source is (say, it's from filename_2011-02-03), so I can just pass it to
    Pig using -param, and I want to insert this known file name into the tuple
    stream. How can I do that?

    For example, I have:

    grunt> A = LOAD 'aa' AS (f1, f2);
    grunt> DUMP A;
    (aa,bb)
    (cc,dd)

    I want to do something like:

    grunt> B = FOREACH A GENERATE f1, "filename-2011-02-03";

    Thanks.
  • Kim Vogt at Feb 4, 2011 at 5:40 am
    This should work:

    grunt> B = FOREACH A GENERATE f1, 'filename-2011-02-03';

    or

    grunt> B = FOREACH A GENERATE f1, '$paramName';

    -Kim
  • Dexin Wang at Feb 4, 2011 at 5:44 am
    Wow, I almost got it right. Double quotes fail; single quotes work.

    Thanks.
  • Kim Vogt at Feb 4, 2011 at 5:54 am
    And to include the filename in the tuple with the data, I copied PigStorage
    (I'm loading CSV), added a private PigSplit field (mSplit), set it in
    prepareToRead(), and added this code before returning the tuple in
    getNext():

    // mSplit is the PigSplit saved in prepareToRead()
    if (mSplit != null) {
        // Unwrap the Hadoop split to get the path of the file being read
        FileSplit fs = (FileSplit) mSplit.getWrappedSplit();
        Path p = fs.getPath();
        // Append the full path as the last field of the tuple
        mProtoTuple.add(p.toString());
    }

    And it works! Thanks again :-)

    -Kim
  • Dmitriy Ryaboy at Feb 4, 2011 at 6:11 am
    There's a CSV loader in the piggybank that does proper CSV escaping,
    if you are interested.
  • Kim Vogt at Feb 4, 2011 at 7:30 pm
    I switched to using the CSVLoader in piggybank, and appended the filepath to
    the current RecordReader instead.

    -Kim
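
A note on Kim's loader snippet earlier in the thread: it appends the full path (p.toString()), while the original question asked for the file name. The sketch below shows one way to keep only the last path component before adding it to the row. It is plain Java with no Pig or Hadoop dependency; the class name FileNameField, the helpers baseName and appendFileName, and the example path are all hypothetical, and the List<Object> merely stands in for the loader's mProtoTuple.

```java
import java.util.ArrayList;
import java.util.List;

public class FileNameField {

    // Returns the last component of a slash-separated path or URI string.
    // If there is no slash, the input is returned unchanged.
    public static String baseName(String path) {
        int slash = path.lastIndexOf('/');
        return slash >= 0 ? path.substring(slash + 1) : path;
    }

    // Appends the file name as the last field of the row being built.
    // Does nothing when no path is known, mirroring the null check in the thread.
    public static void appendFileName(List<Object> protoTuple, String path) {
        if (path != null) {
            protoTuple.add(baseName(path));
        }
    }

    public static void main(String[] args) {
        // Stand-in for the fields already parsed from one input line.
        List<Object> row = new ArrayList<>();
        row.add("aa");
        row.add("bb");

        // Hypothetical path; inside a loader it would come from the FileSplit.
        appendFileName(row, "hdfs://namenode/data/filename_2011-02-03.csv");

        System.out.println(row); // prints [aa, bb, filename_2011-02-03.csv]
    }
}
```

Inside a real loader, Hadoop's Path.getName() should give the same last component directly, so a string helper like this is mainly useful when only the path string is at hand.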

Discussion Overview
group: user @
categories: pig, hadoop
posted: Feb 3, '11 at 11:53p
active: Feb 4, '11 at 7:30p
posts: 9
users: 3
website: pig.apache.org
