FAQ
Secondary Namenode fails to fetch image and edits files
-------------------------------------------------------

Key: HDFS-1065
URL: https://issues.apache.org/jira/browse/HDFS-1065
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 0.20.2
Reporter: Dmytro Molkov


We recently started experiencing problems where the Secondary NameNode fails to fetch the image from the NameNode. The basic problem is described in HDFS-1024, but that JIRA dealt only with possible data corruption. Since then we have reached a point where we could not compact the fsimage at all, because the fetch failed 100% of the time.

Here is what we have found out:
- The fetch still fails with the same exception as in HDFS-1024 (Jetty closes the connection before the file is fully sent).
- We suspect the underlying cause is excessive garbage collection on the NameNode (1/5 of all time is spent in garbage collection). The reason for that may be the bug fixed by HADOOP-6577 (we have a lot of large RPC requests, which means we constantly allocate and free a lot of memory).
- Because of the GC pauses, the transfer speed drops to 700Kb/s.

Having said all of that, the current mechanism of fetching the image is still potentially flawed: when dealing with large images, the NameNode is under the stress of sending multi-gigabyte files over the wire to the client while still serving requests.
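For context, the current fetch is essentially a single HTTP GET that the Secondary NameNode issues against the NameNode's /getimage servlet, streaming the whole file over one connection. A minimal sketch of an equivalent client-side fetch; the host, port, and destination path below are placeholder assumptions, and the real values come from configuration:

```java
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class ImageFetchSketch {
    // Streams the contents of a URL to a local file, as the secondary
    // does for the fsimage. Any hiccup on the server side (e.g. Jetty
    // closing the connection mid-transfer) surfaces here as an IOException.
    static long fetchToFile(URL src, String dest) throws Exception {
        long total = 0;
        try (InputStream in = src.openStream();
             OutputStream out = new FileOutputStream(dest)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```

Usage would look like fetchToFile(new URL("http://namenode:50070/getimage?getimage=1"), "/tmp/fsimage.ckpt") — a single long-lived connection, which is exactly what a GC-stalled Jetty drops mid-transfer.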

This JIRA is to discuss possible ways of separating the NameNode from the image fetching done by the Secondary NameNode.
One thought we had was fetching the image using SCP rather than an HTTP download from the NameNode.
This way the NameNode would be under less pressure; on the other hand, it would introduce new components that are not exactly under Hadoop's control (an SSH client and server).
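The SCP variant would amount to shelling out to an external scp binary, which is the "components not under hadoop control" concern: the JVM only sees an exit code. A rough sketch, with the host and paths as hypothetical placeholders:

```java
import java.util.Arrays;
import java.util.List;

public class ScpFetchSketch {
    // Builds the scp command line the secondary would shell out to.
    // Host and paths are placeholders; real values would come from
    // configuration.
    static List<String> buildScpCommand(String nnHost, String remoteImage,
                                        String localDest) {
        return Arrays.asList("scp", "-q", nnHost + ":" + remoteImage, localDest);
    }

    // Runs the external scp and reports its exit status. Failures in
    // ssh setup, host keys, etc. are all outside Hadoop's control.
    static int runScp(String nnHost, String remoteImage, String localDest)
            throws Exception {
        Process p = new ProcessBuilder(
                buildScpCommand(nnHost, remoteImage, localDest))
                .inheritIO()
                .start();
        return p.waitFor(); // non-zero exit means the copy failed
    }
}
```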
To deal with possible data corruption in the SCP copy, we would also want to extend CheckpointSignature to carry a checksum of the file, so it can be verified on the client side.
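The verification step could look roughly like this; the choice of MD5 and the idea that the expected value travels in an extended CheckpointSignature are assumptions, since the issue only proposes the extension:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;

public class ImageChecksumSketch {
    // Computes an MD5 digest over the fetched image file; the secondary
    // would compare this against the checksum carried in the (extended,
    // hypothetical) CheckpointSignature before trusting the copy.
    static String md5Hex(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Rejects a downloaded image whose digest does not match the expected
    // checksum, so a corrupted SCP copy never reaches the checkpoint.
    static void verify(String path, String expected) throws Exception {
        String actual = md5Hex(path);
        if (!actual.equals(expected)) {
            throw new IOException("Checksum mismatch on downloaded image: "
                    + "expected " + expected + " but got " + actual);
        }
    }
}
```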

Please let me know what you think.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Dmytro Molkov (JIRA) at Nov 5, 2010 at 6:30 pm
    [ https://issues.apache.org/jira/browse/HDFS-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Dmytro Molkov resolved HDFS-1065.
    ---------------------------------

    Resolution: Duplicate

    This issue is being worked on in HDFS-1481, so closing this one as a duplicate.

Discussion Overview
group: hdfs-dev @ hadoop
posted: Mar 24, '10 at 6:39p
active: Nov 5, '10 at 6:30p
posts: 2
users: 1
website: hadoop.apache.org...
irc: #hadoop

1 user in discussion

Dmytro Molkov (JIRA): 2 posts
