Hi,
I want to know how to get the actual line number of each record in the
input file from within the mapper.

The key that TextInputFormat generates is the byte offset within the
file. So how can I find the global line number in the mapper?

Thanks
--
Pei

  • Harsh J at Apr 28, 2011 at 4:39 am
    Hello Pei,
    On Thu, Apr 28, 2011 at 6:58 AM, Pei HE wrote:
    > The key, which TextInputFormat generates, is the bytes offset in the
    > file. So, how can I find the global line offset in the mapper?

    This is not achievable unless your records have a fixed byte length (in
    which case you can divide the byte offset by the record length to get
    the line number). Otherwise, you can try pre-building and maintaining an
    index, but looking up such a structure for every record may get slow.
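
For the fixed-length case, the "divide and find" step might look like the sketch below. It is plain Java with no Hadoop dependency; the class name, the method name, and the 10-byte record length are assumptions made for illustration, and the record length is taken to include the trailing newline.

```java
public final class FixedRecordLineNumber {
    private FixedRecordLineNumber() {}

    /**
     * Returns the zero-based line number of the record that starts at
     * byteOffset, assuming every record (newline included) occupies
     * exactly recordLength bytes.
     */
    public static long lineNumber(long byteOffset, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        if (byteOffset < 0 || byteOffset % recordLength != 0) {
            // An offset that is not a multiple of the record length means
            // the fixed-length assumption does not hold for this input.
            throw new IllegalArgumentException("offset is not on a record boundary");
        }
        return byteOffset / recordLength;
    }

    public static void main(String[] args) {
        // Hypothetical layout: 9 data bytes plus '\n' = 10 bytes per record.
        System.out.println(lineNumber(0, 10));
        System.out.println(lineNumber(30, 10));
    }
}
```

In a mapper, `byteOffset` would be the key that TextInputFormat supplies for each record; add 1 to the result if one-based line numbers are wanted.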

    Sometimes it's also fine to process each file whole in a single mapper
    instead of letting it be split across several; the task's input record
    counter can then serve as the line number.

    --
    Harsh J
  • Soren Flexner at Apr 28, 2011 at 6:00 am
    This is a bit out of left field, but you could add a 'key' field at the
    beginning of each record (which you would arrange to be the record
    "number"), and then use the key/value text input format
    (KeyValueTextInputFormat). Your keys are then the record numbers.

    This might be prohibitive if your data is already on HDFS and you have a
    lot of it, since adding the counter key and copying the new dataset to HDFS
    might be a significant time investment in itself.
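
A rough sketch of that preprocessing pass, in plain Java run over the data before it is loaded: each line is prefixed with its zero-based number and a tab, on the assumption that the job keeps the tab separator that KeyValueTextInputFormat uses by default. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public final class LineNumberPrefixer {
    private LineNumberPrefixer() {}

    /**
     * Copies every line from in to out, prefixed with its zero-based line
     * number and a tab. Returns the number of lines written.
     */
    public static long prefix(Reader in, Writer out) {
        try {
            BufferedReader reader = new BufferedReader(in);
            BufferedWriter writer = new BufferedWriter(out);
            long lineNumber = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(Long.toString(lineNumber));
                writer.write('\t'); // assumed key/value separator
                writer.write(line);
                writer.write('\n');
                lineNumber++;
            }
            writer.flush();
            return lineNumber;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        long count = prefix(new StringReader("alpha\nbeta\n"), out);
        System.out.print(out);
        System.out.println(count + " records written");
    }
}
```

This pass is sequential over the whole file, which is the up-front cost described above; records that already contain tabs are unaffected as long as only the first tab is treated as the separator.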

    -s
