ReplicationMonitor thread goes away
------------------------------------

Key: HADOOP-1486
URL: https://issues.apache.org/jira/browse/HADOOP-1486
Project: Hadoop
Issue Type: Bug
Components: dfs
Affects Versions: 0.12.3
Reporter: Koji Noguchi
Fix For: 0.14.0


Saw many over/under replicated blocks in fsck output.

.out file showed


Exception in thread "org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor@2785982c" java.lang.IllegalArgumentException: Unexpected non-existing data node: /99.9.99.0/99.9.99.42:99999
at org.apache.hadoop.net.NetworkTopology.checkArgument(NetworkTopology.java:379)
at org.apache.hadoop.net.NetworkTopology.isOnSameRack(NetworkTopology.java:424)
at org.apache.hadoop.dfs.FSNamesystem$ReplicationTargetChooser.chooseTarget(FSNamesystem.java:2853)
at org.apache.hadoop.dfs.FSNamesystem$ReplicationTargetChooser.chooseTarget(FSNamesystem.java:2816)
at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2658)
at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1774)
at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1723)
at java.lang.Thread.run(Thread.java:619)

(same as HADOOP-1232)

And, jstack showed no ReplicationMonitor thread.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • dhruba borthakur (JIRA) at Jun 12, 2007 at 5:05 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503926 ]

    dhruba borthakur commented on HADOOP-1486:
    ------------------------------------------

    The ReplicationMonitor thread hit a RuntimeException, as described in HADOOP-1232. The Namenode server threads do not catch RuntimeExceptions. The real fix is to find the cause of HADOOP-1232, but there are a few additional things that we can do to address this issue:

    1. Make the namenode exit when a system thread encounters a RuntimeException. Have another daemon that monitors HDFS processes and restarts them if they die.

    2. Make the namenode fall into safemode when a system thread encounters a RuntimeException.

    3. Make the namenode exit when a system thread encounters a RuntimeException. It will remain dead until an administrator manually intervenes.

    I prefer option 1.
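
The failure mode behind these options can be reproduced in miniature. The following is a minimal sketch, not code from Hadoop: `DyingThreadDemo` and `runAndCapture` are hypothetical names used only for illustration. It shows that a RuntimeException escaping `run()` ends that one thread while the rest of the JVM carries on, which matches the symptom of a live namenode with no ReplicationMonitor thread in jstack.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; it does not exist in Hadoop.
class DyingThreadDemo {

    // Runs body on its own thread and returns the Throwable that killed
    // the thread, or null if the body returned normally.
    static Throwable runAndCapture(Runnable body) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread t = new Thread(body, "ReplicationMonitor-demo");
        // Without a handler, the JVM only prints "Exception in thread ..."
        // to stderr and the thread is gone.
        t.setUncaughtExceptionHandler((thread, err) -> seen.set(err));
        t.start();
        try {
            t.join(); // the calling thread keeps running after the worker dies
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        Throwable cause = runAndCapture(() -> {
            throw new IllegalArgumentException("Unexpected non-existing data node");
        });
        System.out.println("worker thread died with: " + cause);
    }
}
```

Each of the three options amounts to a different reaction at the point where the handler (or an equivalent catch clause) sees the exception.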
  • dhruba borthakur (JIRA) at Jun 15, 2007 at 7:04 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Attachment: catchThrowable.patch

    The ReplicationMonitor thread catches all types of exceptions, logs them, sleeps for 5 seconds and then continues from the beginning.
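
The behaviour described for this patch can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of catchThrowable.patch: `ResilientMonitor`, `Work`, `retryDelayMs` and the in-memory `logged` list are hypothetical names, and the list stands in for a real logger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a monitor thread body; not the FSNamesystem code.
class ResilientMonitor implements Runnable {

    // One pass of the monitor's work (assumed shape).
    interface Work {
        void doWork() throws Exception;
    }

    private final Work work;
    private final long retryDelayMs;          // the comment above mentions 5 seconds
    private volatile boolean running = true;
    final List<Throwable> logged = new ArrayList<>(); // stands in for a logger

    ResilientMonitor(Work work, long retryDelayMs) {
        this.work = work;
        this.retryDelayMs = retryDelayMs;
    }

    void stop() {
        running = false;
    }

    @Override
    public void run() {
        while (running) {
            try {
                work.doWork();
            } catch (Throwable t) {
                // Record the failure, back off, then start the loop again
                // instead of letting the thread die.
                synchronized (logged) {
                    logged.add(t);
                }
                try {
                    Thread.sleep(retryDelayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}
```

Note that the loop resumes without undoing anything the failed pass had already changed.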
  • dhruba borthakur (JIRA) at Jun 18, 2007 at 10:27 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Priority: Blocker (was: Major)

    This one might merit going into the 0.13 release as early as possible. Marking it as a blocker pending further discussion.
  • Hairong Kuang (JIRA) at Jun 20, 2007 at 9:20 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506680 ]

    Hairong Kuang commented on HADOOP-1486:
    ---------------------------------------

    > The ReplicationMonitor thread catches all types of exceptions, logs them, sleeps for 5 seconds and then continues from the beginning.

    This solution makes sure that the ReplicationMonitor does not go away in case of a RuntimeException. But is it possible that this solution leaves the namenode in an inconsistent state? What if the ReplicationMonitor is in the middle of updating some data structures when the RuntimeException occurs? If this is possible, option 1 might be a safer solution.
  • dhruba borthakur (JIRA) at Jun 21, 2007 at 6:26 am
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur reassigned HADOOP-1486:
    ----------------------------------------

    Assignee: dhruba borthakur
  • dhruba borthakur (JIRA) at Jun 25, 2007 at 6:54 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Attachment: catchThrowable2.patch

    Merged patch with latest trunk.
  • dhruba borthakur (JIRA) at Jun 25, 2007 at 6:54 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Attachment: (was: catchThrowable.patch)
  • dhruba borthakur (JIRA) at Jun 25, 2007 at 6:58 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Status: Patch Available (was: Open)

    I think this patch alleviates the problem in the short term. Longer term, we can experiment with Java Service Wrapper (HADOOP-1525) to create a service process that monitors and recreates Hadoop daemons as and when necessary.
  • Hadoop QA (JIRA) at Jun 25, 2007 at 7:32 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507964 ]

    Hadoop QA commented on HADOOP-1486:
    -----------------------------------

    +1

    http://issues.apache.org/jira/secure/attachment/12360507/catchThrowable2.patch applied and successfully tested against trunk revision r549977.

    Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/327/testReport/
    Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/327/console
  • Doug Cutting (JIRA) at Jun 25, 2007 at 11:09 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508015 ]

    Doug Cutting commented on HADOOP-1486:
    --------------------------------------

    Like Hairong, I am not completely comfortable with this patch. Wouldn't it be safer to, in the added catch clause, set fsRunning to be false so that the namenode exits when an unexpected exception is encountered? And, also, shouldn't we explicitly try to fix the IllegalArgumentException problem that caused this?
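
The alternative suggested here can be sketched as below. Only the name `fsRunning` comes from the comment above; its type, the `FailFastMonitor` class and the way the flag is shared are assumptions made for this illustration and are not taken from FSNamesystem.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: the monitor clears a shared "running" flag on any
// unexpected failure, so loops that poll the flag wind down.
class FailFastMonitor implements Runnable {
    private final Runnable work;
    private final AtomicBoolean fsRunning; // assumed to be shared with the server's other loops

    FailFastMonitor(Runnable work, AtomicBoolean fsRunning) {
        this.work = work;
        this.fsRunning = fsRunning;
    }

    @Override
    public void run() {
        while (fsRunning.get()) {
            try {
                work.run();
            } catch (Throwable t) {
                System.err.println("Monitor failed, requesting shutdown: " + t);
                // Fail fast rather than continue on possibly inconsistent state.
                fsRunning.set(false);
            }
        }
    }
}
```

Compared with catching and retrying, this trades availability for safety: the process stops instead of running on after a failure in the middle of an update.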
  • dhruba borthakur (JIRA) at Jun 25, 2007 at 11:58 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508019 ]

    dhruba borthakur commented on HADOOP-1486:
    ------------------------------------------

    I too like option 1 that I had listed earlier: make the namenode exit when it encounters a runtime exception. The question that remained unanswered is whether to have a monitoring daemon that restarts the namenode automatically without any manual intervention. Do you have any suggestions in this regard?

    Also, I would rather fix the real cause of this bug: HADOOP-1232. But we do not yet have a fix for that one. Leaving the namenode up and running without the ReplicationMonitor thread is not an option because blocks do not get replicated.


  • Raghu Angadi (JIRA) at Jun 26, 2007 at 1:40 am
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508033 ]

    Raghu Angadi commented on HADOOP-1486:
    --------------------------------------

    Should we log StringUtils.stringifyException(t) instead of "t"?
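
The difference between the two logging choices can be illustrated with plain JDK calls. The helper below is a hypothetical equivalent written for this example; it is not Hadoop's `StringUtils.stringifyException` and makes no claim about how that method is implemented.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper, shown only to contrast "t" with a full stack trace.
class StackTraces {

    // Renders the throwable together with its stack trace as one string.
    static String stackTraceToString(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        t.printStackTrace(writer);
        writer.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        Throwable t = new IllegalArgumentException("Unexpected non-existing data node");
        // String concatenation uses toString(): class name and message only.
        System.out.println("short: " + t);
        // The full form also carries the "at ..." frames needed to locate the failure.
        System.out.println("full:  " + stackTraceToString(t));
    }
}
```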

  • Doug Cutting (JIRA) at Jun 26, 2007 at 6:07 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508264 ]

    Doug Cutting commented on HADOOP-1486:
    --------------------------------------

    > whether to have a monitoring daemon that restarts namenode automatically

    It seems safe to restart the namenode in this case. I'd simply add a loop to NameNode.main() that creates and starts a new NameNode when the existing namenode exits unexpectedly. We should only restart if it's stopping due to an error, and not due to an explicit call to stop(). So perhaps NameNode#join() could return a boolean indicating whether it's exiting normally or should be restarted, and the catch in the ReplicationMonitor should call a NameNode method to trigger that kind of exit. Does this sound workable?
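
A rough sketch of the restart loop proposed above follows. It is an illustration only: `RestartLoop`, the `Server` interface and the `maxRestarts` cap are assumptions introduced here, and the boolean-returning `join()` mirrors the suggestion in the comment rather than any existing NameNode method.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a main() loop that recreates a server after an
// abnormal exit.
class RestartLoop {

    interface Server {
        // Blocks until the server stops; returns true if it stopped because
        // of an error and should be restarted, false after a normal stop().
        boolean join();
    }

    // Starts servers until one exits normally or maxRestarts is exhausted
    // (the cap is an added assumption). Returns how many servers were started.
    static int runWithRestarts(Supplier<Server> factory, int maxRestarts) {
        int starts = 0;
        while (true) {
            Server server = factory.get();
            starts++;
            boolean shouldRestart = server.join();
            if (!shouldRestart || starts - 1 >= maxRestarts) {
                return starts;
            }
        }
    }
}
```

The cap is not part of the proposal; it is included so that a server that fails immediately on every start cannot loop forever.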
  • dhruba borthakur (JIRA) at Jun 28, 2007 at 12:09 am
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508690 ]

    dhruba borthakur commented on HADOOP-1486:
    ------------------------------------------

    I think it might be dangerous to make NameNode.main() create a new NameNode object if the original one dies. The original instance of the Namenode would have used up lots of old-memory. If we create a new instance of the NameNode within the same JVM, then the GC process might take a while before the memory situation stabilizes. Is it ok if I exit the namenode-jvm completely and leave it to the administrator to restart the namenode if necessary?
  • Doug Cutting (JIRA) at Jun 28, 2007 at 7:15 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508912 ]

    Doug Cutting commented on HADOOP-1486:
    --------------------------------------

    > If we create a new instance of the NameNode within the same JVM, then the GC process might take a while before the memory situation stabilizes.

    That's possible, I suppose, but it's also possible that the GC might handle this well. GC time is often proportional to the amount of non-garbage, which would be small on restart.

    > Is it ok if I exit the namenode-jvm completely and leave it to the administrator to restart the namenode if necessary?

    Sure, that'd be okay. But if the namenode auto-restarts slowly, the admin can always kill and restart it manually, so I don't see the harm in it attempting to auto-restart. Restarting slowly isn't worse than being down, is it? So my instinct would be to try auto-restarting. It shouldn't cause data loss, and might indeed help in many cases, so why not?
    ReplicationMonitor thread goes away
    ------------------------------------

    Key: HADOOP-1486
    URL: https://issues.apache.org/jira/browse/HADOOP-1486
    Project: Hadoop
    Issue Type: Bug
    Components: dfs
    Affects Versions: 0.12.3
    Reporter: Koji Noguchi
    Assignee: dhruba borthakur
    Priority: Blocker
    Fix For: 0.14.0

    Attachments: catchThrowable2.patch


  • dhruba borthakur (JIRA) at Jul 3, 2007 at 12:06 am
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Status: Open (was: Patch Available)
  • dhruba borthakur (JIRA) at Jul 3, 2007 at 12:06 am
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Attachment: (was: catchThrowable2.patch)
  • dhruba borthakur (JIRA) at Jul 3, 2007 at 12:08 am
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Attachment: namenodeRestart.patch

    The Replication Monitor catches the RuntimeException and signals the namenode to restart. The namenode gracefully shuts down existing threads and starts all over again.

    A unit test to test this feature is attached.
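The restart approach described above can be sketched roughly as follows. This is only an illustration, not the contents of namenodeRestart.patch: the class `RestartingService`, the method `runWithRestarts`, and the `maxRestarts` cap are hypothetical names introduced here, and the real namenode has many more threads and resources to shut down.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

/** Hypothetical supervisor: restarts a monitor thread that died from a RuntimeException. */
final class RestartingService {

    /**
     * Runs task on a fresh thread, passing the attempt number (0-based).
     * If the thread hits a RuntimeException it raises a restart flag and the
     * supervisor starts over, up to maxRestarts restarts.
     * Returns how many times the task was started.
     */
    static int runWithRestarts(IntConsumer task, int maxRestarts) {
        int starts = 0;
        while (true) {
            final int attempt = starts;
            AtomicBoolean restartRequested = new AtomicBoolean(false);
            Thread monitor = new Thread(() -> {
                try {
                    task.accept(attempt);
                } catch (RuntimeException e) {
                    // Instead of letting the thread vanish silently, signal the owner.
                    restartRequested.set(true);
                }
            }, "ReplicationMonitor");
            monitor.start();
            starts++;
            try {
                monitor.join(); // wait until the old thread is fully stopped
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            if (!restartRequested.get() || starts > maxRestarts) {
                return starts;
            }
        }
    }
}
```

The cap on restarts is an assumption added for the sketch so that a persistent failure cannot loop forever.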
  • dhruba borthakur (JIRA) at Jul 10, 2007 at 10:13 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Attachment: namenodeRestart2.patch

    Exit the JVM when the ReplicationMonitor thread encounters a runtime exception.
  • dhruba borthakur (JIRA) at Jul 10, 2007 at 10:13 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Attachment: (was: namenodeRestart.patch)
  • dhruba borthakur (JIRA) at Jul 10, 2007 at 10:17 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511586 ]

    dhruba borthakur edited comment on HADOOP-1486 at 7/10/07 3:15 PM:
    -------------------------------------------------------------------

    Exit the JVM when the ReplicationMonitor thread encounters a runtime exception. Restarting the namenode within the same JVM instance was not easy, especially because not all resources were getting existed by NameNode.stop().



    was:
    Exit the JVM when the ReplicationMonitor thread encounters a runtime exception.
  • dhruba borthakur (JIRA) at Jul 10, 2007 at 10:19 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511586 ]

    dhruba borthakur edited comment on HADOOP-1486 at 7/10/07 3:18 PM:
    -------------------------------------------------------------------

    Exit the JVM when the ReplicationMonitor thread encounters a runtime exception. Restarting the namenode within the same JVM instance was not easy, especially because not all resources were getting released by NameNode.stop().



    was:
    Exit the JVM when the ReplicationMonitor thread encounters a runtime exception. Restarting the namenode within the same JVM instance was not easy, especially because not all resources were getting existed by NameNode.stop().

  • dhruba borthakur (JIRA) at Jul 10, 2007 at 10:29 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-1486:
    -------------------------------------

    Status: Patch Available (was: Open)

    Exit the Namenode when the ReplicationMonitor thread encounters a RuntimeException. It would have been nice to be able to restart the namenode within the context of the same JVM, but a lot of work is needed to gracefully release all previously allocated resources.
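A minimal sketch of the exit-on-failure idea, for readers who want to see its shape. It is not the code in namenodeRestart2.patch: `FailFastMonitor`, `runAndCaptureExit`, the injected `exitHook`, and the `maxRounds` bound are all hypothetical, introduced so the example can run without actually killing the JVM. In a real process the hook would presumably be something like `Runtime.getRuntime()::exit`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

/** Hypothetical stand-in for a background thread such as ReplicationMonitor. */
final class FailFastMonitor implements Runnable {
    private final Runnable work;        // one round of periodic work
    private final IntConsumer exitHook; // called with an exit code on failure
    private final int maxRounds;        // bound added so the sketch terminates

    FailFastMonitor(Runnable work, IntConsumer exitHook, int maxRounds) {
        this.work = work;
        this.exitHook = exitHook;
        this.maxRounds = maxRounds;
    }

    @Override
    public void run() {
        try {
            for (int i = 0; i < maxRounds; i++) {
                work.run();
            }
        } catch (RuntimeException e) {
            // Do not let the thread disappear quietly: report and request process exit.
            System.err.println("Monitor thread failed, exiting: " + e);
            exitHook.accept(1);
        }
    }

    /** Runs the monitor on its own thread; returns the requested exit code, or -1 if none. */
    static int runAndCaptureExit(Runnable work, int maxRounds) {
        AtomicInteger code = new AtomicInteger(-1);
        Thread t = new Thread(new FailFastMonitor(work, code::set, maxRounds), "ReplicationMonitor");
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return code.get();
    }
}
```

Injecting the exit hook is a design choice made for testability here; it keeps the failure path observable without terminating the test process.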
  • dhruba borthakur (JIRA) at Jul 10, 2007 at 11:25 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511600 ]

    dhruba borthakur commented on HADOOP-1486:
    ------------------------------------------

    Changed two invocations of throw new RuntimeException() in FSEditLog, because they actually need to exit the JVM. These were introduced by HADOOP-1414.
  • Hadoop QA (JIRA) at Jul 10, 2007 at 11:52 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511612 ]

    Hadoop QA commented on HADOOP-1486:
    -----------------------------------

    +1

    http://issues.apache.org/jira/secure/attachment/12361532/namenodeRestart2.patch applied and successfully tested against trunk revision r555077.

    Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/392/testReport/
    Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/392/console
  • Doug Cutting (JIRA) at Jul 11, 2007 at 9:31 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Doug Cutting updated HADOOP-1486:
    ---------------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Dhruba!
  • Hudson (JIRA) at Jul 12, 2007 at 11:42 am
    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512050 ]

    Hudson commented on HADOOP-1486:
    --------------------------------

    Integrated in Hadoop-Nightly #152 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/152/])
