We are continuing to see a small, consistent amount of block
corruption leading to file loss. We have been upgrading our cluster
lately, which means we've been doing a rolling decommissioning of our
nodes (and then adding them back with more disks!).

Previously, when I've had time to investigate this very deeply, I've
found issues like these:

https://issues.apache.org/jira/browse/HADOOP-4692
https://issues.apache.org/jira/browse/HADOOP-4543

I suspect that these issues cause some or all of our problems.

I also saw that one of our nodes was at 100.2% full; I think this is
due to the same issue; Hadoop's actual usage of the file system is
greater than the max capacity because some of the blocks were truncated.

I'd have to check with our sysadmins, but I think we've lost about
200-300 files during the upgrade process. Right now, there are about
900 chronically under-replicated blocks; in the past, that has meant
the only remaining replica is actually corrupt, and Hadoop keeps trying
to re-replicate it, failing each time, without realizing the source is
corrupt.
To some extent, this whole issue is caused because we only have enough
space for 2 replicas; I'd imagine that at 3 replicas, the issue would
be much harder to trigger.

Any suggestions? For us, file loss is something we can deal with (not
necessarily fun to deal with, of course), but that might not be the
case in the future.

Brian


  • Doug Cutting at Dec 8, 2008 at 6:03 pm

    Brian Bockelman wrote:
    To some extent, this whole issue is caused because we only have enough
    space for 2 replicas; I'd imagine that at 3 replicas, the issue would be
    much harder to trigger.
    The unfortunate reality is that if you run a configuration that's
    different than most you'll likely run into bugs that others do not. (As
    a side note, this is why we should try to minimize configuration
    options, so that everyone is running much the same system.) Hopefully
    squashing the two bugs you've filed will substantially help things.

    Doug
  • Steve Loughran at Dec 9, 2008 at 12:32 pm

    Doug Cutting wrote:
    Brian Bockelman wrote:
    To some extent, this whole issue is caused because we only have enough
    space for 2 replicas; I'd imagine that at 3 replicas, the issue would
    be much harder to trigger.
    The unfortunate reality is that if you run a configuration that's
    different than most you'll likely run into bugs that others do not. (As
    a side note, this is why we should try to minimize configuration
    options, so that everyone is running much the same system.)
    Alternatively, "why we should be exploring the configuration space more
    widely"
  • Doug Cutting at Dec 9, 2008 at 5:47 pm

    Steve Loughran wrote:
    Alternatively, "why we should be exploring the configuration space more
    widely"
    Are you volunteering?

    Doug
  • Arv Mistry at Dec 9, 2008 at 6:51 pm
    I'm using hadoop 0.17.0. Unfortunately I can't upgrade to 0.19.0 just
    yet.

    I'm trying to control the number of extraneous files. I noticed that
    the following log files are produced by Hadoop:

    On Slave
    - userlogs (for each map/reduce job)
    - stderr
    - stdout
    - syslog
    - datanode .log file
    - datanode .out file
    - tasktracker .log file
    - tasktracker .out file

    On Master
    - jobtracker .log file
    - jobtracker .out file
    - namenode .log file
    - namenode .out file
    - secondarynamenode .log file
    - secondarynamenode .out file
    - job .xml file
    - history
    - xml file for job


    Does anybody know how to configure Hadoop so I don't have to delete
    these files manually, or so that they don't get created at all?

    For the history files, I set hadoop.job.history.user.location to none in
    the hadoop-site.xml file but I still get the history files created.
    Also, I set hadoop.root.logger=WARN in log4j.properties, but I still
    see INFO messages in the datanode, jobtracker, etc. logs.

    Thanks, in advance

    Cheers Arv
  • Amareshwari Sriramadasu at Dec 10, 2008 at 4:06 am

    Arv Mistry wrote:

    I'm using hadoop 0.17.0. Unfortunately I can't upgrade to 0.19.0 just
    yet.

    I'm trying to control the number of extraneous files. I noticed that
    the following log files are produced by Hadoop:

    On Slave
    - userlogs (for each map/reduce job)
    - stderr
    - stdout
    - syslog
    - datanode .log file
    - datanode .out file
    - tasktracker .log file
    - tasktracker .out file

    On Master
    - jobtracker .log file
    - jobtracker .out file
    - namenode .log file
    - namenode .out file
    - secondarynamenode .log file
    - secondarynamenode .out file
    - job .xml file
    - history
    - xml file for job


    Does anybody know how to configure Hadoop so I don't have to delete
    these files manually, or so that they don't get created at all?

    For the history files, I set hadoop.job.history.user.location to none in
    the hadoop-site.xml file but I still get the history files created.
    Setting hadoop.job.history.user.location to "none" only affects the
    history location specified for the user; the JobTracker still has its
    own history location. History is cleaned up after a month.

    Userlogs are cleaned up after "mapred.userlog.retain.hours", which
    defaults to 24 hours (see the configuration sketch after this
    message).

    Thanks
    Amareshwari
    Also, I set hadoop.root.logger=WARN in log4j.properties, but I still
    see INFO messages in the datanode, jobtracker, etc. logs.

    Thanks, in advance

    Cheers Arv
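
    For reference, the retention setting mentioned above is an ordinary
    site configuration property. A minimal hadoop-site.xml sketch (the
    6-hour value is only an illustration, not a recommendation):

        <!-- inside the <configuration> element of hadoop-site.xml -->
        <property>
          <name>mapred.userlog.retain.hours</name>
          <value>6</value>
        </property>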
  • Steve Loughran at Dec 10, 2008 at 11:35 am

    Doug Cutting wrote:
    Steve Loughran wrote:
    Alternatively, "why we should be exploring the configuration space
    more widely"
    Are you volunteering?

    Doug
    Not yet. I think we have a fair bit of what is needed, and it would make
    for some interesting uses of the Yahoo!-HP-Intel cirrus testbed.

    There was some good work done at U Maryland on Skoll, a tool for
    efficiently exploring the configuration space of an application so as to
    find defects:

    http://www.cs.umd.edu/~aporter/Docs/skollJournal.pdf
    http://www.cs.umd.edu/~aporter/MemonPorterSkoll.ppt

    What you need is something like that applied to the hadoop config space
    and a set of tests that will show up problems in a timely manner.

    -steve
  • Edward Capriolo at Dec 9, 2008 at 6:59 pm
    Also it might be useful to strongly word hadoop-default.conf as many
    people might not know a downside exists for using 2 rather than 3 as
    the replication factor. Before reading this thread I would have
    thought 2 to be sufficient.
  • Brian Bockelman at Dec 9, 2008 at 7:19 pm

    On Dec 9, 2008, at 4:58 PM, Edward Capriolo wrote:

    Also it might be useful to strongly word hadoop-default.conf as many
    people might not know a downside exists for using 2 rather than 3 as
    the replication factor. Before reading this thread I would have
    thought 2 to be sufficient.
    I think 2 should be sufficient, but running with 2 replicas instead of
    3 exposes some namenode bugs which are harder to trigger.

    For example, let's say your system has 100 nodes and 1M blocks. Let's
    say a namenode bug affects the replica of block X on node Y and the
    namenode doesn't realize it. Then, there is a 1% chance that when
    another node goes down, the block becomes missing. If this bug is
    cumulative or affects many blocks (I suspect about 500-1000 blocks are
    problematic out of 1M), you're almost guaranteed to lose data whenever
    a single node goes down.

    On the other hand, if you have 1000 block replica problems on the same
    cluster with 3 replicas, then in order to lose files, two of the
    replica problems must affect the same block and the node which goes
    down must hold the third replica. The probability of this happening is
    (1e-6) * (1e-6) * (1/100) = 1e-14, or 0.000000000001%.

    So, even allowing that I may have gotten some of the probability
    calculations wrong, a site running with 2 replicas is more than 10
    orders of magnitude more likely to discover inconsistencies or other
    bugs in the namenode than a site with 3 replicas.

    Accordingly, these sites are the "canaries in the coal mine" to
    discover NameNode bugs.

    Brian
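
    Taking the figures in Brian's message at face value (100 nodes, 1M
    blocks), the back-of-the-envelope arithmetic above can be transcribed
    directly. This is only a sketch of the estimate as stated, not a
    general availability model; the class and variable names are made up:

        // Reproduces the per-node-failure loss estimates quoted above.
        public class ReplicaLossEstimate {
            public static void main(String[] args) {
                double nodes = 100;   // nodes in the cluster
                double blocks = 1e6;  // total blocks

                // 2 replicas: one replica of block X is already bad, so the
                // block is lost if the node holding the only good copy fails.
                double loss2 = 1.0 / nodes;                              // 1%

                // 3 replicas: two replica problems must hit the same block
                // and the failing node must hold the third copy.
                double loss3 = (1.0 / blocks) * (1.0 / blocks) / nodes;  // 1e-14

                System.out.printf("2 replicas: %.0e, 3 replicas: %.0e%n",
                                  loss2, loss3);
                System.out.printf("difference: ~%.0f orders of magnitude%n",
                                  Math.log10(loss2 / loss3));            // ~12
            }
        }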
  • Raghu Angadi at Dec 9, 2008 at 7:33 pm

    Brian Bockelman wrote:
    On Dec 9, 2008, at 4:58 PM, Edward Capriolo wrote:

    Also it might be useful to strongly word hadoop-default.conf as many
    people might not know a downside exists for using 2 rather than 3 as
    the replication factor. Before reading this thread I would have
    thought 2 to be sufficient.
    I think 2 should be sufficient, but running with 2 replicas instead of 3
    exposes some namenode bugs which are harder to trigger.
    Whether 2 is sufficient or not, I completely agree with the latter
    part. We should treat this as what I think it fundamentally is: fixing
    the Namenode.

    I guess lately some of these bugs either became more likely or some
    similar bugs crept in.

    Sticking with 3 is very good advice for maximizing reliability, but
    from an opportunistic developer's point of view a big cluster running
    with replication of 2 is a great test case :-) .. overall I think it
    is a good thing for Hadoop.

    Raghu.
  • Songting Chen at Dec 9, 2008 at 7:35 pm
    Is there a way for the Map process to know it's the end of records?

    I need to flush some additional data at the end of the Map process, but wondering where I should put that code.

    Thanks,
    -Songting
  • Owen O'Malley at Dec 9, 2008 at 7:43 pm

    On Dec 9, 2008, at 11:35 AM, Songting Chen wrote:

    Is there a way for the Map process to know it's the end of records?

    I need to flush some additional data at the end of the Map process,
    but wondering where I should put that code.
    The close() method is called at the end of the map.

    -- Owen
  • Aaron Kimball at Dec 10, 2008 at 3:35 am
    That's true, but you should be aware that you no longer have an
    OutputCollector available in the close() method. So if you are planning to
    have each mapper emit some sort of "end" record along to the reducer, you
    can't do so there. In general, there is not a good solution to that; you
    should rethink your algorithm if possible so that you don't need to do that.


    (I am not sure what happens if you memoize the OutputCollector you got as a
    parameter to your map() method and try to use it. Probably nothing good.)

    - Aaron

    On Tue, Dec 9, 2008 at 11:42 AM, Owen O'Malley wrote:


    On Dec 9, 2008, at 11:35 AM, Songting Chen wrote:

    Is there a way for the Map process to know it's the end of records?
    I need to flush some additional data at the end of the Map process, but
    wondering where I should put that code.
    The close() method is called at the end of the map.

    -- Owen
  • Owen O'Malley at Dec 10, 2008 at 4:19 am

    On Dec 9, 2008, at 7:34 PM, Aaron Kimball wrote:

    That's true, but you should be aware that you no longer have an
    OutputCollector available in the close() method.
    True, but in practice you can keep a handle to it from the map method
    and it will work perfectly. This is required for both streaming and
    pipes to work. (Both of them do their processing asynchronously, so
    the close needs to wait for the subprocess to finish. Because of this,
    the contract with the Mapper and Reducer is very loose, and the
    collect method may be called in between calls to the map method.) In
    the context object API (HADOOP-1230), the API will include the context
    object in cleanup(), to make it clear that cleanup can also write
    records.

    -- Owen
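
    A minimal sketch of the pattern Owen describes, against the old
    org.apache.hadoop.mapred API used in these releases; the class name
    and the "end" record below are illustrative, not part of Hadoop:

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        // Keeps a handle to the OutputCollector passed to map() and uses it
        // in close() to flush one final record after the last input record.
        public class FlushingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

          private OutputCollector<Text, Text> out;  // saved from map()
          private long records = 0;

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, Text> output, Reporter reporter)
              throws IOException {
            out = output;                 // remember the collector for close()
            records++;
            output.collect(new Text("record"), value);
          }

          @Override
          public void close() throws IOException {
            if (out != null) {            // emit a final summary/"end" record
              out.collect(new Text("end"), new Text(Long.toString(records)));
            }
          }
        }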
  • Brian Bockelman at Dec 9, 2008 at 7:38 pm

    On Dec 9, 2008, at 5:31 PM, Raghu Angadi wrote:

    Brian Bockelman wrote:
    On Dec 9, 2008, at 4:58 PM, Edward Capriolo wrote:
    Also it might be useful to strongly word hadoop-default.conf as many
    people might not know a downside exists for using 2 rather than 3 as
    the replication factor. Before reading this thread I would have
    thought 2 to be sufficient.
    I think 2 should be sufficient, but running with 2 replicas instead
    of 3 exposes some namenode bugs which are harder to trigger.
    Whether 2 is sufficient or not, I completely agree with the latter
    part. We should treat this as what I think it fundamentally is:
    fixing the Namenode.

    I guess lately some of these bugs either became more likely or some
    similar bugs crept in.

    Sticking with 3 is very good advice for maximizing reliability, but
    from an opportunistic developer's point of view a big cluster
    running with replication of 2 is a great test case :-) .. overall I
    think it is a good thing for Hadoop.
    Well, we're most likely here to stay: this is the secondary site for
    most of these files. As long as we can indeed identify lost files,
    retransferring them is fairly automated. The number of unique files
    at this site is around 0.1% or less of the total, and we plan on
    setting only those to 3 replicas.

    So, we'll be happy to provide whatever logs or debugging info is
    needed, as long as someone cares to keep on fixing bugs.

    Brian

Discussion Overview
group: common-user
categories: hadoop
posted: Dec 6, '08 at 1:00a
active: Dec 10, '08 at 11:35a
posts: 15
users: 10
website: hadoop.apache.org...
irc: #hadoop
