Hello,

I have a simple map-reduce job that reads in gzipped files and converts them
to lzo compression. Some of the files are not properly zipped, which results
in Hadoop throwing a "java.io.EOFException: Unexpected end of input stream"
error and causes the job to fail. Is there a way to catch this exception
and tell Hadoop to just ignore the file and move on? I think the exception
is being thrown by the class reading in the gzip file and not by my mapper
class. Is this correct? Is there a way to handle this type of error
gracefully?

Thank you!

~Ed


  • Harsh J at Oct 21, 2010 at 10:37 am
    If it occurs eventually as your record reader reads it, then you may
    use a MapRunner class instead of a Mapper IFace/Subclass. This way,
    you may try/catch over the record reader itself, and call your map
    function only on valid next()s. I think this ought to work.

    You can set it via JobConf.setMapRunnerClass(...).

    Ref: MapRunner API @
    http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html

    --
    Harsh J
    www.harshj.com
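
    A minimal sketch of what Harsh describes, using the old mapred API. The
    class name is illustrative; it relies on MapRunner's protected getMapper()
    accessor to reach the configured mapper, and treats an EOFException from
    the record reader as "skip the rest of this input":

        import java.io.EOFException;
        import java.io.IOException;

        import org.apache.hadoop.mapred.MapRunner;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.RecordReader;
        import org.apache.hadoop.mapred.Reporter;

        // Wraps the record-reader loop so a truncated gzip stream skips the
        // remainder of that input instead of failing the whole task.
        public class SkipCorruptMapRunner<K1, V1, K2, V2>
            extends MapRunner<K1, V1, K2, V2> {

          @Override
          public void run(RecordReader<K1, V1> input,
                          OutputCollector<K2, V2> output,
                          Reporter reporter) throws IOException {
            try {
              K1 key = input.createKey();
              V1 value = input.createValue();
              while (input.next(key, value)) {   // EOFException surfaces here
                getMapper().map(key, value, output, reporter);
              }
            } catch (EOFException e) {
              // Corrupt input: note it and fall through to cleanup.
              reporter.setStatus("Skipping corrupt input: " + e.getMessage());
            } finally {
              getMapper().close();               // mirror MapRunner's own cleanup
            }
          }
        }

    It would be hooked in next to the usual mapper setup, e.g.
    conf.setMapRunnerClass(SkipCorruptMapRunner.class).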
  • Ed at Oct 21, 2010 at 3:24 pm
    Hello,

    The MapRunner class looks promising. I noticed it is in the deprecated
    mapred package, but I didn't see an equivalent class in the mapreduce
    package. Is this going to be ported to mapreduce, or is it no longer
    being supported? Thanks!

    ~Ed
  • Ed at Oct 21, 2010 at 4:15 pm
    Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
    before), and it doesn't look like MapRunner is deprecated, so I'll try
    catching the error there and will report back if it's a good solution.
    Thanks!

    ~Ed
  • Ed at Oct 21, 2010 at 5:29 pm
    Sorry to keep spamming this thread. It looks like the correct way to
    implement MapRunnable using the new mapreduce classes (instead of the
    deprecated mapred ones) is to override the run() method of the Mapper
    class. This is actually nice and convenient, since everyone should
    already be using the Mapper class
    (org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>)
    for their mappers.

    ~Ed
  • Ed at Oct 21, 2010 at 5:29 pm
    Thanks Tom! Didn't see your post before posting =)
  • Ed at Oct 21, 2010 at 6:08 pm
    I overwrote the run() method in the mapper with a run() method (below) that
    catches the EOFException. The mapper and reducer now complete but the
    outputted lzo file from the reducer throws an "Unexpected End of File error"
    when decompressing it indicating something did not clean up properly. I
    can't think of why this could be happening as the map() method should only
    be called on input that was properly decompressed (anything that can't be
    decompressed will throw an Exception that is being caught). The reducer
    then should not even know that the mapper hit an EOFException in the input
    gzip file, and yet the output lzo file still has the unexpected end of file
    problem (I'm using the kevinweil lzo libraries). Is there some call that
    needs to be made that will close out the mapper and ensure that the lzo
    output from the reducer is formatted properly? Thank you!

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        try {
            setup(context);
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
            cleanup(context);
        } catch (EOFException e) {
            // logError() and mFileName are from the surrounding mapper class
            logError(context, "EOFException: Corrupt gzip file " + mFileName);
        }
    }
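
    One caveat with the run() above: if the EOFException fires mid-loop,
    cleanup(context) is never reached, so anything opened in setup() stays
    open. A variant that keeps the catch but moves cleanup into a finally
    block (a sketch; logError() and mFileName are helpers from Ed's own
    mapper, not Hadoop API):

        @Override
        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            try {
                while (context.nextKeyValue()) {
                    // EOFException from a truncated gzip stream surfaces here
                    map(context.getCurrentKey(), context.getCurrentValue(), context);
                }
            } catch (EOFException e) {
                // Corrupt input: record it and let the task finish normally
                logError(context, "EOFException: Corrupt gzip file " + mFileName);
            } finally {
                cleanup(context); // runs even when the input is truncated
            }
        }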

  • Ed at Oct 21, 2010 at 10:00 pm
    So the overridden run() method was a red herring. The real problem appears
    to be that I use MultipleOutputs (the new mapreduce API version) for my
    reducer output. I posted a different thread since it's not really related
    to the original question here. For everyone who was curious, it turns out
    overriding the run() method and catching the EOFException works beautifully
    for processing files that might be corrupt or have errors. Thanks!

    ~Ed
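
    Ed's MultipleOutputs thread is not reproduced here, but a common cause of
    truncated compressed output with the new-API MultipleOutputs is failing to
    close it in the reducer's cleanup(), which leaves the underlying record
    writers without their final compressed blocks. A sketch of the usual
    lifecycle (the class name, key/value types, and the "converted" named
    output are illustrative, not from this thread):

        import java.io.IOException;

        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

        public class LzoConvertReducer extends Reducer<Text, Text, Text, Text> {

          private MultipleOutputs<Text, Text> mos;

          @Override
          protected void setup(Context context) {
            mos = new MultipleOutputs<Text, Text>(context);
          }

          @Override
          protected void reduce(Text key, Iterable<Text> values, Context context)
              throws IOException, InterruptedException {
            for (Text value : values) {
              mos.write("converted", key, value); // named output configured on the job
            }
          }

          @Override
          protected void cleanup(Context context)
              throws IOException, InterruptedException {
            mos.close(); // flushes and finalizes every underlying record writer
          }
        }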
  • Tom White at Oct 21, 2010 at 4:44 pm

    The equivalent functionality is in org.apache.hadoop.mapreduce.Mapper#run.

    Cheers
    Tom
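
    For reference, the run() that Tom points to looks roughly like this in the
    0.21-era Mapper source, which is why wrapping it in a try/catch is such a
    small, self-contained override:

        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
            cleanup(context);
        }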

Discussion Overview
group: common-user
category: hadoop
posted: Oct 19, 2010 at 10:45 pm
active: Oct 21, 2010 at 10:00 pm
posts: 9
users: 3 (Ed: 7 posts, Harsh J: 1 post, Tom White: 1 post)
website: hadoop.apache.org...
irc: #hadoop
