Changing LineRecordReader algorithm so that it does not need to skip backwards in the stream
--------------------------------------------------------------------------------------------

Key: HADOOP-4010
URL: https://issues.apache.org/jira/browse/HADOOP-4010
Project: Hadoop Core
Issue Type: Improvement
Components: mapred
Affects Versions: 0.19.0
Reporter: Abdul Qadeer
Assignee: Abdul Qadeer
Fix For: 0.19.0


The current LineRecordReader algorithm needs to move backwards in the stream (in its constructor) to position itself correctly. It moves back one byte from the start of its split, tries to read a record (i.e. a line), and throws that line away, because it is certain the line will be handled by some other mapper. This algorithm is awkward and inefficient when used on a compressed stream, where data reaches the LineRecordReader through a codec. (In the current implementation Hadoop does not split a compressed file at all; it makes one split from the start to the end of the file, so a single mapper handles it. We are currently working on a BZip2 codec for which splitting works with Hadoop, so this proposed change will make it possible to handle plain and compressed streams uniformly.)

In the new algorithm, each mapper always skips its first line, because it is certain that line has already been read by some other mapper. Consequently, each mapper must finish its reading at a record boundary, which is always beyond its upper split limit. With this change, the LineRecordReader no longer needs to move backwards in the stream.
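For clarity, the rule described above can be modeled in isolation. The class below is a hypothetical stand-alone sketch of the proposed behaviour, not code from the attached patch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Hadoop's actual LineRecordReader): the reader
// for split [start, end) skips its first line unless the split begins at
// offset 0, then keeps emitting whole lines as long as a line STARTS at or
// before the upper split limit. The last emitted line therefore ends beyond
// the limit, and no backward seek is ever required.
public class SplitLineReader {
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Skip the (possibly partial) first line; the previous split's
            // reader reads past its own limit and consumes it.
            while (pos < data.length && data[pos] != '\n') pos++;
            if (pos < data.length) pos++;          // step over the newline
        }
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, lineStart, pos - lineStart));
            if (pos < data.length) pos++;          // step over the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbb\nccc\nddd\n".getBytes();
        // Two 8-byte splits: every line is read by exactly one reader.
        System.out.println(readSplit(data, 0, 8));   // [aaa, bbb, ccc]
        System.out.println(readSplit(data, 8, 16));  // [ddd]
    }
}
```

Note how the line "ccc", which starts exactly at the second split's boundary, is read by the first reader and skipped by the second, so no record is lost or duplicated.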


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Abdul Qadeer (JIRA) at Aug 23, 2008 at 12:47 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Abdul Qadeer updated HADOOP-4010:
    ---------------------------------

    Attachment: Hadoop-4010.patch

    Code to implement the suggested changes in LineRecordReader.
  • Abdul Qadeer (JIRA) at Aug 23, 2008 at 12:47 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Abdul Qadeer updated HADOOP-4010:
    ---------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Aug 26, 2008 at 5:48 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625634#action_12625634 ]

    Hadoop QA commented on HADOOP-4010:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12388782/Hadoop-4010.patch
    against trunk revision 688936.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    -1 javadoc. The javadoc tool appears to have generated 1 warning message.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed core unit tests.

    -1 contrib tests. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3103/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3103/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3103/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3103/console

    This message is automatically generated.
  • Chris Douglas (JIRA) at Aug 26, 2008 at 7:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4010:
    ----------------------------------

    Status: Open (was: Patch Available)

    Though TestMiniMRDFSSort is probably related to HADOOP-3950 and TestDatanodeDeath has been seen elsewhere (HADOOP-3628), the other unit tests should pass or be modified to reflect new semantics. In the latter case, this should be marked as an incompatible change.

    The comment in this patch explains the intent of the change more than the code it annotates. The reasoning is useful and appropriate to the JIRA, but the comment in the code should explain the algorithm.
  • Abdul Qadeer (JIRA) at Aug 31, 2008 at 6:04 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Abdul Qadeer updated HADOOP-4010:
    ---------------------------------

    Hadoop Flags: [Incompatible change]
  • Abdul Qadeer (JIRA) at Aug 31, 2008 at 6:06 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Abdul Qadeer updated HADOOP-4010:
    ---------------------------------

    Attachment: Hadoop-4010_version2.patch

    Bug fixes.
  • Abdul Qadeer (JIRA) at Aug 31, 2008 at 6:08 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Abdul Qadeer updated HADOOP-4010:
    ---------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Sep 1, 2008 at 11:07 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627549#action_12627549 ]

    Hadoop QA commented on HADOOP-4010:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12389242/Hadoop-4010_version2.patch
    against trunk revision 690641.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed core unit tests.

    -1 contrib tests. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3151/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3151/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3151/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3151/console

    This message is automatically generated.
  • Chris Douglas (JIRA) at Sep 2, 2008 at 3:25 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4010:
    ----------------------------------

    Status: Open (was: Patch Available)

    Canceling patch while unit test failures are resolved.

    It looks like cacheString and cacheString2 aren't getting broken up as they should be from xargs. Does this handle files with a single line?

    I also don't understand the change to TestLineInputFormat. NLineInputFormat is a special case, but it should still work. The last split is a special case because N may not evenly divide the number of input lines; the first split shouldn't be a special case unless it is also the last.
  • Abdul Qadeer (JIRA) at Sep 2, 2008 at 5:43 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627581#action_12627581 ]

    Abdul Qadeer commented on HADOOP-4010:
    --------------------------------------

    (1) In TestLineInputFormat, as you mentioned, an equal number of lines
    is placed in each split, except the last one. Due to the new LineRecordReader
    algorithm, the first split will process one more line than the other
    mappers. For this reason I am leaving out the first split as well.

    (2) About the caching test failure, I am not really sure what is happening.
    I tried the LineRecordReader in isolation on the same kind of test and it
    works. Something is going wrong in the symlink stuff. I want to debug
    the test case, but doing so in Eclipse gives an error that the WebApps are
    not on the classpath, when in fact I have put them on the Eclipse classpath.
    Any suggestions for debugging this test case?

    Thanks,
    Abdul Qadeer





  • Chris Douglas (JIRA) at Sep 2, 2008 at 11:18 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627626#action_12627626 ]

    Chris Douglas commented on HADOOP-4010:
    ---------------------------------------

    bq. Due to new LineRecordReader algorithm, the first split will process one more line as compared to other mappers

    That's probably not going to be acceptable to users of NLineInputFormat. Users employing N formatted lines to initialize and run a mapper may find their jobs no longer work if the input is offset or if a map receives N+1 lines. If this is necessary for the new algorithm, rewriting or somehow accommodating this case may be required.

    bq. Something is going wrong in symlink stuff. I want to debug the test case but doing so in Eclipse gives error[...]

    Sorry, I don't use eclipse. It looks like the symlink resolution is working; both cache files are picked up as arguments from the input file. At a glance, what appears to be going wrong is newline detection or propagation between invocations of cat from xargs, a bad interaction with streaming (it also uses LineRecordReader, IIRC), or input exercising an edge case for LineRecordReader. Since it sounds like you've ruled out the latter, have you tried running a streaming job like the one in the testcase? I suspect the cache isn't necessary to reproduce this.
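To make the N+1 concern above concrete, here is a hypothetical toy model (illustrative code, not Hadoop's implementation): NLineInputFormat places split boundaries exactly at line boundaries, N lines per split. Combined with the skip-first-line / read-past-the-limit rule, the first mapper receives N+1 lines and the second only N-1:

```java
// Toy model of the new reader rule applied to NLineInputFormat-style splits:
// skip the first (possibly partial) line unless the split starts at offset 0,
// then count lines while a line starts at or before the upper limit.
public class NLineSplitDemo {
    static int countLines(byte[] data, int start, int end) {
        int pos = start, count = 0;
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            if (pos < data.length) pos++;          // step over the newline
        }
        while (pos <= end && pos < data.length) {
            while (pos < data.length && data[pos] != '\n') pos++;
            if (pos < data.length) pos++;          // step over the newline
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "l1\nl2\nl3\nl4\n".getBytes();  // 4 lines, N = 2
        // Splits at exact line boundaries: [0,6) and [6,12).
        System.out.println(countLines(data, 0, 6));   // 3 lines: N+1
        System.out.println(countLines(data, 6, 12));  // 1 line:  N-1
    }
}
```

The first reader, having no line to skip, reads one line past its boundary; the second reader skips that same line as its "first" line, so the per-mapper line counts drift away from N.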
  • Abdul Qadeer (JIRA) at Sep 4, 2008 at 1:24 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628224#action_12628224 ]

    Abdul Qadeer commented on HADOOP-4010:
    --------------------------------------

    bq. Due to new LineRecordReader algorithm, the first split will process one more line as compared to other mappers

    bq. That's probably not going to be acceptable to users of NLineInputFormat. Users employing N formatted lines to initialize and run a mapper may find their jobs no longer work if the input is offset or if a map receives N+1 lines. If this is necessary for the new algorithm, rewriting or somehow accommodating this case may be required.

    I have changed NLineInputFormat to work with the new LineRecordReader algorithm.
    The diff of the file follows. After this change I don't need to make any changes
    to the TestLineInputFormat test case.


    --- src/mapred/org/apache/hadoop/mapred/lib/NLineInputFormat.java (revision 687954)
    +++ src/mapred/org/apache/hadoop/mapred/lib/NLineInputFormat.java (working copy)
    @@ -93,10 +93,19 @@
           long begin = 0;
           long length = 0;
           int num = -1;
    -      while ((num = lr.readLine(line)) > 0) {
    +      while ((num = lr.readLine(line)) > 0) {
             numLines++;
             length += num;
             if (numLines == N) {
    +          //NLineInputFormat uses LineRecordReader, which
    +          //always reads at least one character out of its
    +          //upper split boundary. So to use LineRecordReader
    +          //such that there are N lines in each split, we move
    +          //back the upper split limits of each split by one
    +          //character.
    +          if (begin == 0) {
    +            length--;
    +          }
               splits.add(new FileSplit(fileName, begin, length, new String[]{}));
               begin += length;
               length = 0;

Changing LineRecordReader algorithm so that it does not need to skip backwards in the stream
    --------------------------------------------------------------------------------------

    Key: HADOOP-4010
    URL: https://issues.apache.org/jira/browse/HADOOP-4010
    Project: Hadoop Core
    Issue Type: Improvement
    Components: mapred
    Affects Versions: 0.19.0
    Reporter: Abdul Qadeer
    Assignee: Abdul Qadeer
    Fix For: 0.19.0

    Attachments: Hadoop-4010.patch, Hadoop-4010_version2.patch


  • Abdul Qadeer (JIRA) at Sep 16, 2008 at 8:00 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Abdul Qadeer updated HADOOP-4010:
    ---------------------------------

    Attachment: Hadoop-4010_version3.patch

(1) The code comments in LineRecordReader are condensed.

(2) NLineInputFormat is changed so that it works with the new LineRecordReader. Each mapper is guaranteed to get N lines, except for the last split.

(3) TestMultipleCachefiles.java is updated. The output of this test case depended on how a file is assigned to a mapper; please see HADOOP-4182 (https://issues.apache.org/jira/browse/HADOOP-4182) for details.
  • Abdul Qadeer (JIRA) at Sep 16, 2008 at 8:02 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Abdul Qadeer updated HADOOP-4010:
    ---------------------------------

    Hadoop Flags: (was: [Incompatible change])
    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Sep 17, 2008 at 9:22 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631700#action_12631700 ]

    Hadoop QA commented on HADOOP-4010:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12390169/Hadoop-4010_version3.patch
    against trunk revision 696149.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3282/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3282/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3282/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3282/console

    This message is automatically generated.
  • Owen O'Malley (JIRA) at Sep 18, 2008 at 10:45 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632426#action_12632426 ]

    owen.omalley edited comment on HADOOP-4010 at 9/18/08 3:44 PM:
    ----------------------------------------------------------------

    The back skip was put in to handle a strange corner case:

    {quote}
    a b c \r \n d e f \r \n g h i \r \n
    {quote}

    Assume the split is between the first \r and \n. The right answer is:

    first: "abc", "def"
    second: "ghi"

    But what I believe your patch will do is:

    first: "abc", "def"
    second: "def", "ghi"

    because it will spot the \n and assume the second line should be handled.


  • Owen O'Malley (JIRA) at Sep 18, 2008 at 10:45 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Owen O'Malley updated HADOOP-4010:
    ----------------------------------

    Status: Open (was: Patch Available)

  • Abdul Qadeer (JIRA) at Sep 19, 2008 at 1:47 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632486#action_12632486 ]

    Abdul Qadeer commented on HADOOP-4010:
    --------------------------------------



    Just to make sure I understand correctly, you mean
    that if there are two splits such that

    a b c \r is one split while
    \n d e f \r \n g h i \r \n is the second split.

    start = 0; end = 3 for the first split
    start = 3; end = 14 for the second split

    For Split 1:

    (1) Constructor will not throw away first line because
    start != 0 will fail.
    (2) In the next method, the first read line will return
    abc and current pos = 5 (i.e. points to d)
    So in the next iteration of next(), the check that
    while (pos <= end) will fail because pos = 5; end = 3


    For Split 2:
    (1) Constructor will try to throw first line. After that
    pos = 5 (i.e. points to d)
(2) next() will read def and ghi

So it looks okay to me. Have I missed something?
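The trace above can be checked mechanically with a toy reader (illustrative only; names invented) that treats \r, \n, or \r\n as a line terminator, with non-initial splits discarding their first line:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the trace above (names invented). A terminator is
// "\r", "\n", or "\r\n"; the discarded first line of a non-initial
// split may be empty when the split starts inside a "\r\n" pair.
class CrLfBoundary {
    static int readLine(byte[] d, int pos, StringBuilder out) {
        while (pos < d.length && d[pos] != '\r' && d[pos] != '\n') {
            out.append((char) d[pos]);
            pos++;
        }
        if (pos < d.length && d[pos] == '\r') pos++;
        if (pos < d.length && d[pos] == '\n') pos++;
        return pos;
    }

    static List<String> readSplit(byte[] d, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) pos = readLine(d, pos, new StringBuilder()); // discard
        while (pos <= end && pos < d.length) {
            StringBuilder line = new StringBuilder();
            pos = readLine(d, pos, line);
            records.add(line.toString());
        }
        return records;
    }
}
```

With the split placed inside the \r\n pair of "abc\r\ndef\r\nghi\r\n" (second split starting at the \n), the first reader yields only abc, and the second discards an empty fragment and yields def and ghi, matching the walkthrough: no line is duplicated or lost.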

  • Devaraj Das (JIRA) at Oct 1, 2008 at 2:00 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Devaraj Das updated HADOOP-4010:
    --------------------------------

    Fix Version/s: (was: 0.19.0)
  • Chris Douglas (JIRA) at Jan 14, 2009 at 1:53 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4010:
    ----------------------------------

    Hadoop Flags: [Reviewed]
    Status: Patch Available (was: Open)
  • Chris Douglas (JIRA) at Jan 14, 2009 at 1:53 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663601#action_12663601 ]

    Chris Douglas commented on HADOOP-4010:
    ---------------------------------------

    It looks like the [original|http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/mapred/TextInputFormat.java?view=log&pathrev=373800#rev209524] commit of the back skip is ancient (soon after Nutch was moved out of the Incubator). After going over possible cases with Owen, it looks like removing the backup and changing LineRecordReader to have its end condition as {{pos <= end}} will work. After reading your explanation in HADOOP-4182, the TestMultipleCachefiles change looks OK.

    +1, assuming all unit tests still pass.
  • Hadoop QA (JIRA) at Jan 16, 2009 at 6:07 am
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664411#action_12664411 ]

    Hadoop QA commented on HADOOP-4010:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12390169/Hadoop-4010_version3.patch
    against trunk revision 734870.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests. The patch passed core unit tests.

    -1 contrib tests. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3755/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3755/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3755/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3755/console

    This message is automatically generated.
  • Chris Douglas (JIRA) at Jan 20, 2009 at 11:17 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4010:
    ----------------------------------

    Resolution: Fixed
    Fix Version/s: 0.21.0
    Hadoop Flags: [Incompatible change, Reviewed] (was: [Reviewed])
    Status: Resolved (was: Patch Available)

    I committed this. Thanks, Abdul
