Hi Everybody,

I'm working on a project where I have to read a large set of compressed
(gzip) files. I'm using Python with Hadoop Streaming. However, I have a
problem: some of the compressed files are corrupt, and they are killing
my map/reduce jobs.
My environment is the following:
Hadoop-0.18.3 (CDH1)


Do you have any recommendations for handling this case?
How can I catch that exception in Python so that my jobs don't fail?
How can I identify these files in Python and move them to a corrupt-file
folder?

I really appreciate any recommendations.

Xavier
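[Editor's note: one way to answer the last question, identifying bad files up front and moving them aside, is to try to decompress each file end-to-end before submitting the job. A minimal sketch over a local directory; the function and directory names are made up, and on HDFS you would do the equivalent with `hadoop fs` commands:]

```python
import gzip
import shutil
import zlib
from pathlib import Path

def quarantine_corrupt_gz(src_dir, quarantine_dir):
    """Fully decompress every .gz in src_dir; move unreadable ones aside."""
    quarantine = Path(quarantine_dir)
    quarantine.mkdir(parents=True, exist_ok=True)
    corrupt = []
    for path in sorted(Path(src_dir).glob("*.gz")):
        try:
            with gzip.open(path, "rb") as f:
                # Read in chunks so large files don't exhaust memory.
                while f.read(1 << 20):
                    pass
        except (OSError, EOFError, zlib.error):
            # gzip raises OSError for bad headers and zlib.error for a
            # corrupt deflate stream (the Java side reports the latter
            # as java.util.zip.ZipException).
            shutil.move(str(path), str(quarantine / path.name))
            corrupt.append(path.name)
    return corrupt
```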


  • Jeff Hammerbacher at Oct 19, 2009 at 6:02 pm
    Hey Xavier,

    The functionality you are looking for was added in 0.19 and later:
    http://issues.apache.org/jira/browse/HADOOP-3828. If you upgrade your
    cluster to CDH2, you should be good to go.

    Regards,
    Jeff
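    [Editor's note: the skip-bad-records machinery can be switched on from the streaming command line with -D properties. A hedged sketch; the property names come from the SkipBadRecords/JobConf API of that era and the input/output paths are placeholders, so verify everything against your exact Hadoop version before relying on it:]

```shell
# Sketch: tolerate a few bad records / failed map tasks in a streaming job.
# Property names and the streaming jar path vary by release -- check your docs.
#   mapred.skip.attempts              task attempts before skip mode starts
#   mapred.skip.map.max.skip.records  records that may be skipped around a failure
#   mapred.max.map.failures.percent   share of failed map tasks the job tolerates
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.skip.attempts=2 \
  -D mapred.skip.map.max.skip.records=1 \
  -D mapred.max.map.failures.percent=5 \
  -input /data/in -output /data/out \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py
```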
  • Xavier Quintuna at Oct 19, 2009 at 6:53 pm
    Hi Jeff,
    Thanks for the suggestion. However, I'm running a small (2-machine)
    cluster with CDH2, with a folder that contains two files, one corrupt
    and the other not, and I still get the exception and the streaming job
    is killed.
    That's fine, but I want to know a way to handle this exception
    (java.util.zip.ZipException: invalid block type, or any other) using
    streaming (Python).

    I'd really appreciate it if you could point me to a way to catch the
    exception.

    Thanks again

    Xavier
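    [Editor's note: the ZipException is thrown by the framework's record reader before any data reaches the Python mapper, so it cannot be caught from mapper code as written. A common workaround is to make the job's input a list of file paths and do the decompression inside the mapper, where the failure is catchable. A minimal sketch of the catchable part; the function name is invented, and fetching the raw bytes, e.g. by shelling out to hadoop fs -cat, is left out:]

```python
import zlib

def read_gzip_stream(raw_bytes):
    """Decompress one gzip payload.

    Returns (lines, None) on success, or (None, message) when the stream
    is corrupt -- the Python-side counterpart of Java's
    java.util.zip.ZipException.
    """
    try:
        # wbits=47 (32 + 15) lets zlib auto-detect the gzip header.
        data = zlib.decompress(raw_bytes, 47)
    except zlib.error as exc:
        return None, str(exc)
    return data.decode("utf-8", "replace").splitlines(), None
```

    [In a streaming mapper whose input lines are HDFS paths, you would fetch each file's bytes, call this function, emit key/value pairs for good files, and print corrupt paths to sys.stderr so a later step can move them to a quarantine folder.]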



    -----Original Message-----
    From: Jeff Hammerbacher
    Sent: Monday, October 19, 2009 11:02 AM
    To: common-user@hadoop.apache.org
    Subject: Re: How to IO catch exceptions using python


Discussion Overview
group: common-user @ hadoop
posted: Oct 19, 2009 at 5:58 PM
active: Oct 19, 2009 at 6:53 PM
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop
