Large amount of corruption after balancer
Hello everyone,

I just recently started running the balancer to fix job errors where a particular task runs out of local disk; however, I've noticed that I usually end up with a significant amount of corruption after it completes. Has anyone else observed this behavior?

I'm using Cloudera's CDH2 distribution (0.20.1+169.89) and the balancer completes successfully.

Thanks.

Nick Jones

  • Allen Wittenauer at Oct 27, 2010 at 1:44 pm
    On Oct 27, 2010, at 6:40 AM, Jones, Nick wrote:

    > I just recently started running the balancer to fix job errors where a particular task runs out of local disk; however, I've noticed that I usually end up with a significant amount of corruption after it completes. Has anyone else observed this behavior?

    With Apache, no.

    > I'm using Cloudera's CDH2 distribution (0.20.1+169.89) and the balancer completes successfully.

    Sounds like a bug in their distro.
  • Patrick Angeles at Oct 27, 2010 at 1:54 pm
    Nick,

    The corruption may have been caused by running out of disk space. At that
    point, even after rebalancing, you will still have corruption. Under normal
    circumstances, balancing by itself should not result in corruption.

    Regards,

    - Patrick
  • Jones, Nick at Oct 27, 2010 at 2:02 pm
    Hi Patrick,

    I started by running fsck /, which reported healthy. I also know from the JobTracker that nothing was running while the balancer ran. I filed a ticket with Cloudera, but I would still appreciate any insight you or others may have.

    Thanks again,

    Nick Jones

  • Michael Segel at Oct 27, 2010 at 2:18 pm
    Uhm...

    I see that you're still running CDH2.
    You may want to go to CDH3b3.

    We tended to see corruption too, albeit in our HBase files.
    What happens if you bounce your cloud, wait 5-10 minutes for things to sort themselves out, and then try running fsck?

  • Brian Bockelman at Oct 27, 2010 at 2:48 pm
    Hi Nick,

    What do you mean by "corruption" and how do you determine this? The way the balancer is implemented, I would be surprised if it could cause corruption without you also seeing corruption day-to-day.

    Brian
  • Jones, Nick at Oct 27, 2010 at 4:58 pm
    Hi Brian,

    I'm seeing thousands of corrupt blocks reported by fsck (corrupt blocks, not under-replication errors). We hadn't seen any corruption until I started running the balancer.

    I did try Michael's suggestion of bouncing the cloud. I originally saw ~25% under-replication, but after the replication leveled out I still have about the same number of blocks showing up as corrupt.

    Nick Jones
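
One way to keep the two categories in Nick's message apart (corrupt blocks versus under-replicated blocks) is to pull the counters out of a saved fsck report. The following is only a minimal sketch: it assumes the report contains summary lines of the form `<label>: <integer>`, and the function name `parse_fsck_summary` is hypothetical, not part of Hadoop. The exact labels depend on the Hadoop version, so check them against your own output.

```python
import re

# Assumption: summary lines look like "<label>:<whitespace><integer> ...".
# The lookahead skips values such as "2.5" so only whole counters are kept.
SUMMARY_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z \-()]*?):\s+(\d+)(?![\d.])")


def parse_fsck_summary(text):
    """Return a dict mapping each summary label to its integer counter."""
    counts = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            counts[match.group(1).strip()] = int(match.group(2))
    return counts
```

Saving a report before and after a balancer run and comparing the two dictionaries would show whether the corrupt-block counter actually changes across the run, independently of any temporary under-replication.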

  • Brian Bockelman at Oct 27, 2010 at 6:46 pm
    Hi Nick,

    Sorry, I can only come up with a longshot theory. If the files are corrupted, then that would have happened when block A got copied from node X to node Y. The copying logic is independent of the balancer; the balancer just requests that the copy be made. After the transfer, the destination block gets checksummed and the checksum is reported to the NN. If the file is then over-replicated, one of the other copies gets deleted.

    If there's a bug in the deletion logic and the new copy is corrupt, you could end up deleting the wrong copy (a good one instead of the corrupt one). A bug like this was fixed in 0.18/0.19. If you additionally only have 2 replicas, you would end up with a corrupt block.

    You should be able to see this sequence in your NN logs. Look to see when the NN realized a given block was first corrupted. Pick your favorite corrupt block name, and grep out its history from the logs.

    However, let's say your cluster is corrupting data at a network level at a large scale. Then, why would you see it only with the balancer running?

    It's hard to see this as a plausible scenario, but, on the other hand, something happened. It's possible it's just an outright coincidence.

    Brian
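
Brian's suggestion to grep a corrupt block's history out of the NN logs can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of the NameNode log format: it only assumes that log lines mention blocks by a name such as `blk_123`, optionally followed by a suffix, and the function name `block_history` is hypothetical.

```python
import re


def block_history(log_lines, block_name):
    """Return, in log order, every line that mentions block_name.

    The negative lookahead keeps "blk_123" from also matching "blk_1234".
    """
    pattern = re.compile(re.escape(block_name) + r"(?!\d)")
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]
```

Passing an open log file as `log_lines` works because file objects iterate line by line. Reading the result from top to bottom should show when the block was copied, when a replica was deleted, and when the NN first reported it as corrupt.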

Discussion Overview
group: common-user @
categories: hadoop
posted: Oct 27, '10 at 1:41p
active: Oct 27, '10 at 6:46p
posts: 8
users: 5
website: hadoop.apache.org...
irc: #hadoop
