better fault tolerance for corrupted text files
-----------------------------------------------

Key: HADOOP-3144
URL: https://issues.apache.org/jira/browse/HADOOP-3144
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.15.3
Reporter: Joydeep Sen Sarma


Every once in a while we encounter corrupted text files (corrupted at the source, prior to copying into Hadoop). Inevitably, some of the data looks like one extremely long line, and Hadoop trips over trying to stuff it into an in-memory object and hits an OutOfMemoryError. The code looks the same in trunk as well.

So we are looking for an option on TextInputFormat (and the like) to ignore long lines. Ideally, we would just skip errant lines above a certain size limit.
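
To make the request concrete, here is a minimal, self-contained sketch of the desired behavior (BoundedLineReader and readBoundedLine are illustrative names, not Hadoop code): any newline-terminated record longer than a given limit is consumed and discarded instead of buffered whole, so a corrupt "line" can no longer exhaust memory.

    import java.io.IOException;
    import java.io.InputStream;

    public class BoundedLineReader {
        // Returns the next record no longer than maxLen characters, silently
        // skipping over-long records; returns null at end of stream. Over-long
        // input is consumed but never buffered, so memory stays bounded.
        public static String readBoundedLine(InputStream in, int maxLen)
                throws IOException {
            StringBuilder buf = new StringBuilder();
            boolean tooLong = false;
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    if (!tooLong) {
                        return buf.toString();  // a well-formed record
                    }
                    tooLong = false;            // oversized record ends here;
                    continue;                   // scan on into the next one
                }
                if (tooLong) {
                    continue;                   // consume, but do not buffer
                }
                if (buf.length() >= maxLen) {
                    buf.setLength(0);           // limit exceeded: discard the
                    tooLong = true;             // buffer and mark for skipping
                    continue;
                }
                buf.append((char) b);           // note: assumes single-byte chars
            }
            return (!tooLong && buf.length() > 0) ? buf.toString() : null;
        }
    }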

  • Zheng Shao (JIRA) at Apr 5, 2008 at 12:17 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144.patch
  • Zheng Shao (JIRA) at Apr 5, 2008 at 12:17 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Assignee: Zheng Shao
    Release Note: Added a new configuration variable, mapred.linerecordreader.maxlength (default: Long.MAX_VALUE), specifying the maximum line length. All lines longer than this are discarded. The default behavior does not change.
    Status: Patch Available (was: Open)

    This patch also makes sure that the LineReader does not read too far into the next split when it's not necessary.
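
    For reference, a sketch of how a job might enable the new limit (the property name and default come from the release note above; MaxLengthSetup and the 1 MB figure are illustrative):

        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.TextInputFormat;

        public class MaxLengthSetup {
            public static JobConf configure(Class<?> jobClass) {
                JobConf conf = new JobConf(jobClass);
                conf.setInputFormat(TextInputFormat.class);
                // Property name and default (Long.MAX_VALUE) per the release
                // note; here any line longer than 1 MB is skipped.
                conf.setLong("mapred.linerecordreader.maxlength", 1024L * 1024L);
                return conf;
            }
        }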
  • Zheng Shao (JIRA) at Apr 5, 2008 at 4:40 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144.patch

    Updated.
  • Zheng Shao (JIRA) at Apr 5, 2008 at 4:40 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: (was: 3144.patch)
  • Hadoop QA (JIRA) at Apr 5, 2008 at 7:06 am

    Hadoop QA commented on HADOOP-3144:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12379465/3144.patch
    against trunk revision 643282.

    @author +1. The patch does not contain any @author tags.

    tests included -1. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs -1. The patch appears to cause Findbugs to fail.

    core tests -1. The patch failed core unit tests.

    contrib tests -1. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2173/testReport/
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2173/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2173/console

  • Zheng Shao (JIRA) at Apr 5, 2008 at 8:22 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144.patch

    Added a test.
  • Zheng Shao (JIRA) at Apr 5, 2008 at 8:22 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: (was: 3144.patch)
  • Zheng Shao (JIRA) at Apr 5, 2008 at 6:40 pm

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144.patch

    Corrected a warning in "ant compile-contrib".
  • Zheng Shao (JIRA) at Apr 5, 2008 at 6:42 pm

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: (was: 3144.patch)
  • dhruba borthakur (JIRA) at Apr 7, 2008 at 6:25 am

    dhruba borthakur commented on HADOOP-3144:
    ------------------------------------------

    Hi Zheng, I see that many of the changes in the patch are whitespace-only. This makes the patch longer and harder to review. Would it be possible for you to re-submit the patch with the whitespace-only changes removed? Thanks.
  • Zheng Shao (JIRA) at Apr 7, 2008 at 8:01 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144.ignore.spaces.patch

    Generated by "svn diff -x -uw"
  • Zheng Shao (JIRA) at Apr 7, 2008 at 8:05 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: (was: 3144.patch)
  • Zheng Shao (JIRA) at Apr 9, 2008 at 9:56 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144-ignore-spaces-2.patch

    Fixed a bug.
  • Zheng Shao (JIRA) at Apr 9, 2008 at 9:58 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: (was: 3144-ignore-spaces-2.patch)
  • Zheng Shao (JIRA) at Apr 9, 2008 at 9:58 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144-ignore-spaces-2.patch

    Fixed a bug.
  • Zheng Shao (JIRA) at Apr 9, 2008 at 9:58 am

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: (was: 3144.ignore.spaces.patch)
  • Zheng Shao (JIRA) at Apr 11, 2008 at 9:15 pm

    Zheng Shao commented on HADOOP-3144:
    ------------------------------------

    Hi Dhruba,

    Did you get a chance to look at the new diff?

    Zheng

  • Zheng Shao (JIRA) at Apr 22, 2008 at 6:18 pm

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144-ignore-spaces-3.patch

    Changed the name to make sure the Hadoop patch tester picks it up.

  • Zheng Shao (JIRA) at Apr 23, 2008 at 8:58 pm

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Status: In Progress (was: Patch Available)
  • Zheng Shao (JIRA) at Apr 23, 2008 at 8:58 pm

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Status: Patch Available (was: In Progress)
  • Hadoop QA (JIRA) at Apr 23, 2008 at 10:28 pm

    Hadoop QA commented on HADOOP-3144:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12380714/3144-ignore-spaces-3.patch
    against trunk revision 645773.

    @author +1. The patch does not contain any @author tags.

    tests included +1. The patch appears to include 3 new or modified tests.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2311/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2311/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2311/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2311/console

  • Chris Douglas (JIRA) at Apr 25, 2008 at 11:05 pm

    Chris Douglas commented on HADOOP-3144:
    ---------------------------------------

    * Do these limits really need to be longs? Changing the public API of readLine seems unnecessary when an int should be, and has been, sufficient.
    * There is some odd spacing around LineRecordReader::157,268 that makes it difficult to tell which block the closing brace belongs to.
    * I'm not sure I understand the skip logic. For the case where a line is larger than 64k (the buffer size), it looks like this reads up to a threshold, then discards input that exceeds what was requested, then returns the next record as the segment between that threshold and the following newline (i.e. the trailing bytes of the too-long record). Is this accurate? Instead of getting a random segment of a record, wouldn't it be preferable to discard input until the next record boundary is found?
  • Zheng Shao (JIRA) at Apr 26, 2008 at 1:47 am

    Zheng Shao commented on HADOOP-3144:
    ------------------------------------

    * That it used to be sufficient does not mean it will be sufficient in the future; that's why we have open64. The cost of using a long instead of an int is minimal, and we avoid potential overflow problems. The only interesting use of this return value is accumulating the number of bytes read, which definitely should be stored in a long. So I don't see a problem here.

    * I will fix the spacing problem once we reach a consensus on the other issues.

    * The skip logic skips the whole long line, not just "maxLineLength" bytes.

    The reason for "maxBytesToConsume" is to tell readLine the end of this block; there is no reason for readLine to scan through tens of gigabytes of data searching for an end of line when the current block is only 128MB. This is actually what was happening on our cluster, for a binary file that a user mistakenly treated as a text file: all the map jobs just swamped the cluster. The only use of maxBytesToConsume is to let readLine know when to stop. What would be the best way to fix this?

  • Chris Douglas (JIRA) at Apr 26, 2008 at 2:39 am

    Chris Douglas commented on HADOOP-3144:
    ---------------------------------------

    bq. That it used to be sufficient does not mean it will be sufficient in the future; that's why we have open64. The cost of using a long instead of an int is minimal, and we avoid potential overflow problems

    True, but it's accumulating bytes read from a text file into memory for a single record. It's not at all obvious to me that this requires a long. Future-proofing a case that would be a total disaster for the rest of the framework seems premature, particularly when the change is to a generic text parser. If someone truly needs to slurp >2GB of text data _per record_, surely their requirements justify a less general RecordReader. It's not the cost of the int that concerns me, but the API change to support a case that's not only degenerate, but implausible.

    bq. The reason for "maxBytesToConsume" is to tell readLine the end of this block; there is no reason for readLine to scan through tens of gigabytes of data searching for an end of line when the current block is only 128MB.

    A far more portable solution for what this expresses would be an InputFormat generating a subclass of FileSplit annotated with a hard limit enforced by the RecordReader (i.e. it returns EOF at some position within the file). Some of this will inevitably be done as part of the Hadoop archive work (HADOOP-3307). As a workaround, don't point text readers at binary data. ;)
  • Joydeep Sen Sarma (JIRA) at Apr 26, 2008 at 6:59 am

    Joydeep Sen Sarma commented on HADOOP-3144:
    -------------------------------------------

    We did not point the text reader at a binary file. We had a corrupted text file filled with a long section of junk.

    Given that this is a problem that can happen to anyone (we just happen to be the lucky first), and everyone uses TextInputFormat to read text files, why shouldn't the safeguard be built into TextInputFormat? What's the downside? Does it make sense to buy insurance only after an accident? Do we wait for people to hit such a problem and then say: oh, but you should have used 'SafeTextInputFormat'?

    Portable across what? I looked at HADOOP-3307 earlier today, and I don't see how it's remotely related. Enlighten us.

    I am sorry, but I am mildly irritated by the comments here. We are aware of the concept of subclassing, and we can write our own InputFormat, thank you very much. The whole point of going through this procedure is to contribute back to the community something of general benefit. Either the argument is that this is not of general benefit, or that the cost outweighs the benefit. Neither argument has been made.
  • Chris Douglas (JIRA) at Apr 26, 2008 at 9:56 am

    Chris Douglas commented on HADOOP-3144:
    ---------------------------------------

    I was going by Zheng's last comment, i.e. "This is actually what was happening on our cluster, for a binary file that a user mistakenly treated as a text file." I didn't mean anything by it.

    Assuming we're discussing the third bullet in the last two comments:

    If one is loading corrupted data into HDFS, then I don't think it's fair to assume that the most generic of text readers can do anything with it. I mention the archive format because it seems unavoidable that opening a file in an archive will return a bounded stream within a large, composite file, i.e. be agnostic to the particular InputFormat employed but act a lot like a bounded FileSplit. If that's the sort of thing you could use to deal with sketchy data, then it seemed a useful issue to monitor. Alternatively, a new InputFormat that generates bounded splits for text files to recover from this condition might work for your case, and probably for others' if you felt like contributing it.

    The insurance analogy doesn't seem to describe this error. It's not like a car accident; it's like filling one's gas tank with creamed corn. Though the driver had every reason to believe it was gasoline, and is understandably angry that his engine is full of creamed corn, anger at the car for failing to run on the creamed corn is misspent. Though I like the idea in general, i.e. skipping unexpectedly long lines or even just truncating records, my original question was trying to determine whether it skipped to the next record, continued reading bytes into the next record from wherever it stopped, or quit for extremely long lines. At a glance, it looked like it continued reading wherever it left off in the stream, but I haven't looked at it as closely as the contributor and wanted to ask after its behavior. I'm still curious how, exactly, this patch effects its solution.
  • Joydeep Sen Sarma (JIRA) at Apr 26, 2008 at 4:21 pm

    Joydeep Sen Sarma commented on HADOOP-3144:
    -------------------------------------------

    One of the founding principles of map-reduce as described in the Google paper (and perhaps one of its most remarkable differences from general database systems) was the notion of being tolerant of bad data: if you see a few rows of bad data, skip them.

    We try to do this in application land as much as possible. However, there is nothing we can do if Hadoop throws an out-of-memory error; hence this fix belongs in Hadoop core. Zheng's fix does skip to the next available record (if it falls within the split); otherwise an EOF is returned.

    HADOOP-3307 is way out there; it's a solution for small files. If the file were small, we wouldn't have a problem to begin with (as you say, the input is bounded); this problem only affects large files. If you read the description of HADOOP-3307 carefully, you will notice it says it has no impact on map-reduce. The problem we are trying to solve is a map-reduce problem: it applies whether the file comes from an archive (HADOOP-3307), the local file system, or HDFS (or any file system, for that matter).
  • Chris Douglas (JIRA) at Apr 28, 2008 at 12:08 am

    Chris Douglas commented on HADOOP-3144:
    ---------------------------------------

    bq. One of the founding principles of map-reduce as described in the Google paper (and perhaps one of its most remarkable differences from general database systems) was the notion of being tolerant of bad data: if you see a few rows of bad data, skip them.

    It's been a while since I read the paper, but isn't recovery effected by skipping the record that caused a failure on the map (HADOOP-153)? Recovery from corrupted data without re-executing the map sounds like a solution for a less generic format than LineRecordReader. Detecting and failing/discarding a map because its output is corrupt is application code, I agree, and it looks like Zheng has a very reasonable, general workaround (more below).

    Given the re-execution model, the "correct" and more general fix would be to fail the map (with an OOM exception) and skip the range that had already been read. If it read into the following split, then it need not be rescheduled, because we know that another task had already scanned up to the next record boundary (or failed trying). If one wants to fail the task earlier, then specifying a "SafeTextInputFormat" isn't a terrible burden, but you have a point: a property that controls special cases for TextInputFormat is more usable. Without HADOOP-153, the point is moot, and perhaps this fix is more pressing as a consequence.

    bq. Zheng's fix does skip to the next available record (if it falls within the split); otherwise an EOF is returned.

    That's not a full description of what it does, though. I took a closer look, and it doesn't do what I had assumed, i.e. define both a max line length and force a hard limit for reading into the following split (which is why the archive format didn't seem like a non sequitur). It defines a single new property for the maximum line length, which prevents the situation in this JIRA by terminating the record reader once it is past the end of the split, having consumed the maximum line length. Since it takes the maximum of what remains in the split and the aforementioned length as the limit, the situation I asked about (i.e. returning the trailing part of a record as a single record) doesn't occur. Since it defaults to Long.MAX_VALUE, there's no issue with existing code. That's all I was trying to determine. The API change (changing the return type of readLine from {{int}} to {{long}}) makes more sense in this context, but it still seems unnecessary.
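
    For concreteness, a sketch of the control flow described above (the field and method names are illustrative, not the patch's actual code):

        // Assumed context (illustrative): org.apache.hadoop.io.LongWritable,
        // org.apache.hadoop.io.Text; reader fields pos (current offset),
        // end (split boundary), maxLineLength, lineReader, and LOG.
        public boolean next(LongWritable key, Text value) throws IOException {
            while (pos < end) {                 // stop once past the split
                key.set(pos);
                // The max() noted above: the scan budget is never smaller
                // than the line limit, so a record straddling the split
                // boundary is read whole rather than truncated.
                long budget = Math.max(end - pos, maxLineLength);
                long bytesRead = lineReader.readLine(value, maxLineLength, budget);
                if (bytesRead == 0) {
                    return false;               // end of file
                }
                pos += bytesRead;
                if (bytesRead <= maxLineLength) {
                    return true;                // a normal record
                }
                // Too long: consumed but not returned; log it and keep going.
                LOG.info("Skipped line of size " + bytesRead
                         + " at position " + (pos - bytesRead));
            }
            return false;                       // no record for this split
        }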
  • Joydeep Sen Sarma (JIRA) at Apr 28, 2008 at 4:04 am

    Joydeep Sen Sarma commented on HADOOP-3144:
    -------------------------------------------
    bq. isn't recovery effected by skipping the record that caused a failure on the map (HADOOP-153)?

    Thanks for pointing this out. That JIRA is not fixed, and it looks like there's still a debate about the right approach. It seems that even if it were fixed, the LineRecordReader would have to implement an additional API to skip to the next record boundary (to skip the bad record on map re-try), so we would need similar code, albeit under a different API.

    That said, I am not sure I agree with the design of HADOOP-153. It's not clear to me why it doesn't suffice to let the RecordReaders skip bad records (as they must be able to do even with HADOOP-153's additional APIs), but that's a separate discussion.

    What's the status of HADOOP-153? It seems that, depending on where it goes, these changes may conflict or overlap.
  • Chris Douglas (JIRA) at Apr 30, 2008 at 6:14 pm

    Chris Douglas updated HADOOP-3144:
    ----------------------------------

    Status: Open (was: Patch Available)

    I don't think HADOOP-153 can completely replace this fix, particularly in the near term. The *-3 patch looks good, but the API change should be reverted and the indentation fixed. It might also make sense to upgrade the warning from INFO to WARN.
  • Zheng Shao (JIRA) at May 1, 2008 at 9:34 pm

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Status: Patch Available (was: Open)
  • Zheng Shao (JIRA) at May 1, 2008 at 9:34 pm

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144-4.patch

    Fixed the problems mentioned in Chris's comment.

  • Hadoop QA (JIRA) at May 2, 2008 at 1:35 am

    Hadoop QA commented on HADOOP-3144:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12381276/3144-4.patch
    against trunk revision 645773.

    @author +1. The patch does not contain any @author tags.

    tests included +1. The patch appears to include 3 new or modified tests.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2362/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2362/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2362/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2362/console

  • Zheng Shao (JIRA) at May 2, 2008 at 7:16 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593851#action_12593851 ]

    Zheng Shao commented on HADOOP-3144:
    ------------------------------------

    Chris, will you have time to take a look at this today?

  • Chris Douglas (JIRA) at May 3, 2008 at 12:30 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593938#action_12593938 ]

    Chris Douglas commented on HADOOP-3144:
    ---------------------------------------

    This looks good, but I don't understand this change:
    {noformat}
    @@ -166,7 +207,7 @@
         boolean skipFirstLine = false;
         if (codec != null) {
           in = new LineReader(codec.createInputStream(fileIn), job);
    -      end = Long.MAX_VALUE;
    +      end = Integer.MAX_VALUE;
         } else {
           if (start != 0) {
             skipFirstLine = true;
    {noformat}

    Is this to avoid the overflow from the cast to int in next()?

    Instead, in the call to readLine in next:
    {noformat}
    +      int newSize = in.readLine(value, maxLineLength,
    +                                (int)Math.max(end-pos, (long)maxLineLength));
    {noformat}

    It might be better to use (with appropriate casts):
    {noformat}
    +      int newSize = in.readLine(value, maxLineLength,
    +                                Math.max(Math.min(end - pos, Integer.MAX_VALUE), maxLineLength));
    {noformat}

    That would make Long.MAX_VALUE correct while avoiding the overflow, right?
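
    To make the overflow concrete, here is a minimal, self-contained sketch (illustrative only, not code from the patch; the split end, position, and line-length values are made up): a bare (int) cast of end - pos wraps to a negative number once the remaining bytes exceed Integer.MAX_VALUE, while clamping with Math.min before the cast stays correct.
    {noformat}
    // Illustrative sketch, not from the patch: shows why the bare cast
    // overflows and how clamping before the cast avoids it.
    public class OverflowSketch {
      public static void main(String[] args) {
        long end = 3L * 1024 * 1024 * 1024;  // hypothetical split end: 3 GB
        long pos = 0L;                       // hypothetical current position
        int maxLineLength = 1024 * 1024;     // hypothetical 1 MB line limit

        // Bare cast: 3 GB does not fit in an int, so the value wraps negative.
        int overflowed = (int) Math.max(end - pos, (long) maxLineLength);
        System.out.println(overflowed);      // prints -1073741824

        // Clamping to Integer.MAX_VALUE before the cast keeps the result sane.
        int safe = (int) Math.max(Math.min(end - pos, (long) Integer.MAX_VALUE),
                                  (long) maxLineLength);
        System.out.println(safe);            // prints 2147483647
      }
    }
    {noformat}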
  • Zheng Shao (JIRA) at May 5, 2008 at 9:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144-5.patch

    Fixed the two problems from Chris's comments:

    1. Restored Long.MAX_VALUE;
    2. That line contained a bug: I should have used min instead of max.

  • Zheng Shao (JIRA) at May 5, 2008 at 9:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Release Note: (was: Added a new configuration variable: mapred.linerecordreader.maxlength (default: Long.MAX_VALUE) to specify the length of the longest lines. All lines longer than this will be thrown away. Default behavior does not change.)
    Status: Patch Available (was: In Progress)
  • Zheng Shao (JIRA) at May 5, 2008 at 9:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Status: In Progress (was: Patch Available)
  • Hadoop QA (JIRA) at May 5, 2008 at 10:54 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594377#action_12594377 ]

    Hadoop QA commented on HADOOP-3144:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12381447/3144-5.patch
    against trunk revision 653607.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2399/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2399/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2399/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2399/console

  • Zheng Shao (JIRA) at May 5, 2008 at 10:58 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594380#action_12594380 ]

    Zheng Shao commented on HADOOP-3144:
    ------------------------------------

    Chris, got time to take a look again?

  • Chris Douglas (JIRA) at May 6, 2008 at 12:36 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594414#action_12594414 ]

    Chris Douglas commented on HADOOP-3144:
    ---------------------------------------

    The value calculated for maxBytesToConsume in next() can still cause the overflow error you mentioned above; it's not sufficient to cast it to int, unfortunately. The purpose of this patch is to prevent LineRecordReader from reading absurdly far into the next split, right? Is there something wrong with max(min(end - pos, MAX_INT), maxLineLen)?
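
    Expressed as a helper, the suggestion looks like the sketch below (a hypothetical class and method name; the quoted snippets inline this computation in next() instead). With the clamp, keeping end = Long.MAX_VALUE for compressed streams is safe, because min(end - pos, Integer.MAX_VALUE) caps the value before the cast.
    {noformat}
    // Hypothetical helper illustrating the suggested expression; the
    // patch inlines this computation in next() rather than naming it.
    class MaxBytesToConsume {
      static int compute(long pos, long end, int maxLineLength) {
        // Clamp the bytes remaining in the split to int range first, then
        // guarantee room for at least one maximum-length line.
        return (int) Math.max(Math.min(end - pos, (long) Integer.MAX_VALUE),
                              (long) maxLineLength);
      }
    }
    {noformat}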
  • Zheng Shao (JIRA) at May 6, 2008 at 1:08 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594430#action_12594430 ]

    Zheng Shao commented on HADOOP-3144:
    ------------------------------------

    You are right. I was looking at line 221 and fixed a bug there in the last patch.

    The new patch is coming.

  • Zheng Shao (JIRA) at May 6, 2008 at 1:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Status: In Progress (was: Patch Available)
  • Zheng Shao (JIRA) at May 6, 2008 at 1:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144-6.patch
  • Zheng Shao (JIRA) at May 6, 2008 at 1:14 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Status: Patch Available (was: In Progress)
  • Hadoop QA (JIRA) at May 6, 2008 at 2:32 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594446#action_12594446 ]

    Hadoop QA commented on HADOOP-3144:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12381468/3144-6.patch
    against trunk revision 653638.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2402/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2402/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2402/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2402/console

  • Chris Douglas (JIRA) at May 6, 2008 at 2:32 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594445#action_12594445 ]

    Chris Douglas commented on HADOOP-3144:
    ---------------------------------------

    I didn't see that bytesConsumed in readLine had changed from long to int in the last patch. If maxBytesToConsume is set to Integer.MAX_VALUE, as many cases would have it, then the overflow might still be missed (in the unlikely event this is hit before an OOM exception). Either bytesConsumed should be a long and min(bytesConsumed, Integer.MAX_VALUE) returned, or the overflow should be detected.

    The indentation at LineRecordReader:157 is still funky.

    Has this been tested against the corrupted data?
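
    A minimal sketch of the first option (a simplified, byte-at-a-time stand-in for LineReader.readLine, not the real buffered implementation over Text): accumulate the consumed count in a long and clamp only once, at the int conversion on return, so the counter cannot wrap even when maxBytesToConsume is Integer.MAX_VALUE.
    {noformat}
    import java.io.IOException;
    import java.io.InputStream;

    // Simplified stand-in for LineReader.readLine: the long accumulator
    // cannot wrap, and the clamp happens once, at the int conversion.
    class ReadLineSketch {
      static int readLine(InputStream in, StringBuilder line,
                          int maxLineLength, int maxBytesToConsume)
          throws IOException {
        long bytesConsumed = 0;  // long: Integer.MAX_VALUE limits stay safe
        int b;
        while (bytesConsumed < maxBytesToConsume && (b = in.read()) != -1) {
          bytesConsumed++;
          if (b == '\n') {
            break;                         // consumed the newline; line is done
          }
          if (line.length() < maxLineLength) {
            line.append((char) b);         // drop bytes past the line-length cap
          }
        }
        return (int) Math.min(bytesConsumed, Integer.MAX_VALUE);
      }
    }
    {noformat}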
  • Zheng Shao (JIRA) at May 6, 2008 at 6:46 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Attachment: 3144-7.patch

    Changed bytesConsumed back to long.

  • Zheng Shao (JIRA) at May 6, 2008 at 6:46 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Status: In Progress (was: Patch Available)
  • Zheng Shao (JIRA) at May 6, 2008 at 6:46 am
    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Zheng Shao updated HADOOP-3144:
    -------------------------------

    Status: Patch Available (was: In Progress)
