Hi,

I am running a Hadoop cluster with 20 DataNodes. Yesterday I found that the
NameNode was not responding (no reads from or writes to HDFS were happening).
It stayed stuck for a few hours, so I shut down the NameNode and found the
following error in the NameNode log.

2011-07-21 16:15:31,500 WARN org.apache.hadoop.ipc.Server: IPC Server
Responder, call
getProtocolVersion(org.apache.hadoop.hdfs.protocol.ClientProtocol, 41) from
xx.xx.xx.xx:13568: output error

This error appeared for every DataNode, and the DataNodes were not able to
communicate with the NameNode.

After I restarted the NameNode:

2011-07-21 16:31:54,110 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
2011-07-21 16:31:54,216 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=NameNode, port=9000
2011-07-21 16:31:54,223 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
xx.xx.xx.xx:9000
2011-07-21 16:31:54,225 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2011-07-21 16:31:54,226 INFO
org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing
NameNodeMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
2011-07-21 16:31:54,280 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2011-07-21 16:31:54,280 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2011-07-21 16:31:54,280 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isPermissionEnabled=false
2011-07-21 16:31:54,287 INFO
org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
Initializing FSNamesystemMetrics using context
object:org.apache.hadoop.metrics.spi.NullContext
2011-07-21 16:31:54,289 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
FSNamesystemStatusMBean
2011-07-21 16:31:54,880 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files = 15817482
2011-07-21 16:34:38,463 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files under construction = 82
2011-07-21 16:34:41,177 INFO org.apache.hadoop.hdfs.server.common.Storage:
Image file of size 2042701824 loaded in 166 seconds.
2011-07-21 16:58:07,624 INFO org.apache.hadoop.hdfs.server.common.Storage:
Edits file /home/hadoop/current/edits of size 12751835 edits # 138217 loaded
in 1406 seconds.

And then it halts for a long time. After about an hour it starts working again.

My questions are: when does the "IPC Server Responder" error occur, and is
there a way to deal with it?
Also, if my NameNode is busy doing something, how can I find out what it is
doing?
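(For the second question, one generic way to see what a stuck Hadoop daemon is doing is to dump the JVM's thread stacks. The sketch below only shows picking the NameNode PID out of `jps` output; the `jps_out` value is a made-up illustration, not output from this cluster.)

```shell
# Sketch: identify the NameNode JVM, then dump its thread stacks.
# Made-up jps output for illustration; on a real node use: jps_out=$(jps)
jps_out='12345 NameNode
23456 SecondaryNameNode
34567 Jps'
# Exact match on the class name so SecondaryNameNode is not picked up.
nn_pid=$(printf '%s\n' "$jps_out" | awk '$2 == "NameNode" {print $1}')
echo "$nn_pid"
# On the live node (jstack ships with the JDK):
#   jstack "$nn_pid" > nn-threads.txt
# or send SIGQUIT to dump stacks into the NameNode's .out log:
#   kill -3 "$nn_pid"
```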

Regards,
Rahul


  • Joey Echeverria at Jul 22, 2011 at 12:11 pm
    Do you have an instance of the SecondaryNamenode in your cluster?

    -Joey
    On Fri, Jul 22, 2011 at 3:15 AM, Rahul Das wrote:

    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
  • Rahul Das at Jul 22, 2011 at 12:29 pm
    Yes, I have a SecondaryNameNode running. Here are the logs from the
    SecondaryNameNode:

    2011-07-21 16:02:47,908 INFO org.apache.hadoop.hdfs.server.common.Storage:
    Edits file /home/hadoop/tmp/dfs/namesecondary/current/edits of size 12751835
    edits # 138217 loaded in 1581 seconds.
    2011-07-21 16:03:21,925 INFO org.apache.hadoop.hdfs.server.common.Storage:
    Image file of size 2045516451 saved in 29 seconds.
    2011-07-21 16:03:24,974 INFO
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
    0 Total time for transactions(ms): 0Number of transactions batched in Syncs:
    0 Number of syncs: 0 SyncTimes(ms): 0
    2011-07-21 16:03:25,545 INFO
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
    xx.xx.xx.xx:50070putimage=1&port=50090&machine=xx.xx.xx.xx&token=-18:1554828842:0:1311242583000:1311240481442
    2011-07-21 16:29:24,356 ERROR
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
    doCheckpoint:
    2011-07-21 16:29:24,358 ERROR
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
    java.io.IOException: Call to xx.xx.xx.xx:9000 failed on local exception:
    java.io.IOException: Connection reset by peer

    Regards,
    Rahul
  • Joey Echeverria at Jul 22, 2011 at 4:04 pm
    The long startup time after the restart looks like it was caused by the
    SecondaryNameNode not having been able to roll the edits log for some time.
    Can you post your NameNode log from around the same time as this
    SecondaryNameNode log (2011-07-21 16:00-16:30)?
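    (One way to sanity-check that diagnosis from the startup log itself: pull the replay time out of the "Edits file ... loaded in N seconds" line, copied below from the original post. The sed pattern is an assumption about that line's wording.)

```shell
# Extract the edits replay time from the NameNode startup line quoted
# earlier in this thread.
line='2011-07-21 16:58:07,624 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /home/hadoop/current/edits of size 12751835 edits # 138217 loaded in 1406 seconds.'
secs=$(printf '%s\n' "$line" | sed -n 's/.*loaded in \([0-9][0-9]*\) seconds.*/\1/p')
echo "$secs"
# With a healthy SecondaryNameNode the edits log is rolled roughly every
# fs.checkpoint.period seconds (default 3600 in 0.20-era Hadoop), so replay
# should take seconds, not ~23 minutes as here.
```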

    -Joey
  • Rahul Das at Jul 28, 2011 at 12:07 pm
    Hi Joey,

    The log is too big to attach to a mail. What I found is that there are no
    errors during this time.
    Only a few warnings appear, like:

    2011-07-21 14:13:47,814 WARN
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
    PendingReplicationMonitor timed out block blk_-6058282241824946206_13375223
    ...
    ...
    2011-07-21 14:30:49,511 WARN
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for
    block blk_8615896953045629213_15838442 reported from xx.xx.xx.xx:50010
    current size is 1950720 reported size is 2448907

    I think the edits file was too large; that's why it took a long time.
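    (A back-of-the-envelope check using the numbers from the startup log quoted earlier in the thread; this assumes the reported edits file size of 12751835 is in bytes.)

```shell
# Numbers from the NameNode startup log quoted earlier in the thread.
edits=138217     # transactions in the edits file
secs=1406        # seconds the replay took
bytes=12751835   # reported edits file size (assumed to be bytes)
rate=$((edits / secs))
per_edit=$((bytes / edits))
echo "~$rate edits/s replayed, ~$per_edit bytes per edit"
# At roughly 100 edits/s, an edits file that accumulates long enough between
# successful checkpoints can take many minutes to replay at startup.
```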

    Regards,
    Rahul
  • Joey Echeverria at Jul 28, 2011 at 12:47 pm
    Nothing from around 16:30?

    -Joey


  • Rahul Das at Jul 29, 2011 at 4:42 am
    No, there were no errors; only the following things happened:

    2011-07-21 14:14:30,039 INFO
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
    ugi=hadoop,hadoop ip=/xx.xx.xx.xx cmd=create
    src=/user/hdfs/files/d954x328-85x8-4dfe-b73c-34a7a2c1xb0f
    dst=null perm=hadoop:supergroup:rw-r--r--
    2011-07-21 14:14:30,041 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
    NameSystem.allocateBlock: /user/hdfs/files/d954x328-85x8-4dfe-b73c-34a7a2c1xb0f
    . blk_-3217626427379030207_15834365
    2011-07-21 14:14:30,120 INFO org.apache.hadoop.hdfs.StateChange: DIR*
    NameSystem.completeFile: file
    /user/hdfs/files/d954x328-85x8-4dfe-b73c-34a7a2c1xb0f
    is closed by DFSClient_1277823200

    Is there any way I can find out from the log when safe mode ends?

    Regards,
    Rahul
    On Thu, Jul 28, 2011 at 6:16 PM, Joey Echeverria wrote:

    Nothing from around 1630?

    -Joey



    On Jul 28, 2011, at 5:06, Rahul Das wrote:

    Hi Joey,

    The log is too big to attach into mail. What I found that there is no error
    during this time.
    Only few Warnings are coming like

    2011-07-21 14:13:47,814 WARN
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
    PendingReplicationMonitor timed out block blk_-6058282241824946206_13375223
    ...
    ...
    2011-07-21 14:30:49,511 WARN
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for
    block blk_8615896953045629213_15838442 reported from xx.xx.xx.xx:50010
    current size is 1950720 reported size is 2448907

    I think the edit file size was too huge thats why it took long time.

    Regards,
    Rahul

    On Fri, Jul 22, 2011 at 9:33 PM, Joey Echeverria < <joey@cloudera.com>
    joey@cloudera.com> wrote:
    The long startup time after the restart looks like it was caused because
    the SecondaryNameNode hasn't been able to roll the edits log for some time.
    Can you post your Namenode log from around the same time in this
    SecondaryNameNode log (2011-07-21 16:00-16:30)?

    -Joey


    On Fri, Jul 22, 2011 at 8:29 AM, Rahul Das < <rahul.hdpq@gmail.com>
    rahul.hdpq@gmail.com> wrote:
    Yes I have a secondary Namenode running. Here are the log for
    SecondaryNamenode

    2011-07-21 16:02:47,908 INFO
    org.apache.hadoop.hdfs.server.common.Storage: Edits file
    /home/hadoop/tmp/dfs/namesecondary/current/edits of size 12751835 edits #
    138217 loaded in 1581 seconds.
    2011-07-21 16:03:21,925 INFO
    org.apache.hadoop.hdfs.server.common.Storage: Image file of size
    2045516451 saved in 29 seconds.
    2011-07-21 16:03:24,974 INFO
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
    0 Total time for transactions(ms): 0Number of transactions batched in Syncs:
    0 Number of syncs: 0 SyncTimes(ms): 0
    2011-07-21 16:03:25,545 INFO
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
    xx.xx.xx.xx:50070putimage=1&port=50090&machine=xx.xx.xx.xx&token=-18:1554828842:0:1311242583000:1311240481442
    2011-07-21 16:29:24,356 ERROR
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
    doCheckpoint:
    2011-07-21 16:29:24,358 ERROR
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
    java.io.IOException: Call to xx.xx.xx.xx:9000 failed on local exception:
    java.io.IOException: Connection reset by peer

    Regards,
    Rahul


    On Fri, Jul 22, 2011 at 5:40 PM, Joey Echeverria < <joey@cloudera.com>
    joey@cloudera.com> wrote:
    Do you have an instance of the SecondaryNamenode in your cluster?

    -Joey


    On Fri, Jul 22, 2011 at 3:15 AM, Rahul Das < <rahul.hdpq@gmail.com>
    rahul.hdpq@gmail.com> wrote:
    Hi,

    I am running a Hadoop cluster with 20 Data node. Yesterday I found that
    the Namenode was not responding ( No write/read to HDFS is happening). It
    got stuck for few hours, then I shut down the Namenode and found the
    following error from the Name node log.

    2011-07-21 16:15:31,500 WARN org.apache.hadoop.ipc.Server: IPC Server
    Responder, call
    getProtocolVersion(org.apache.hadoop.hdfs.protocol.ClientProtocol, 41) from
    xx.xx.xx.xx:13568: output error

    This error was coming for every data node and data nodes are not able
    to communicate with the Name node

    After I restart the Namenode

    2011-07-21 16:31:54,110 INFO
    org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
    2011-07-21 16:31:54,216 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
    Initializing RPC Metrics with hostName=NameNode, port=9000
    2011-07-21 16:31:54,223 INFO
    org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
    xx.xx.xx.xx:9000
    2011-07-21 16:31:54,225 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
    Initializing JVM Metrics with processName=NameNode, sessionId=null
    2011-07-21 16:31:54,226 INFO
    org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing
    NameNodeMeterics using context
    object:org.apache.hadoop.metrics.spi.NullContext
    2011-07-21 16:31:54,280 INFO
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
    2011-07-21 16:31:54,280 INFO
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
    2011-07-21 16:31:54,280 INFO
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
    isPermissionEnabled=false
    2011-07-21 16:31:54,287 INFO
    org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
    Initializing FSNamesystemMetrics using context
    object:org.apache.hadoop.metrics.spi.NullContext
    2011-07-21 16:31:54,289 INFO
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
    FSNamesystemStatusMBean
    2011-07-21 16:31:54,880 INFO
    org.apache.hadoop.hdfs.server.common.Storage: Number of files = 15817482
    2011-07-21 16:34:38,463 INFO
    org.apache.hadoop.hdfs.server.common.Storage: Number of files under
    construction = 82
    2011-07-21 16:34:41,177 INFO
    org.apache.hadoop.hdfs.server.common.Storage: Image file of size
    2042701824 loaded in 166 seconds.
    2011-07-21 16:58:07,624 INFO
    org.apache.hadoop.hdfs.server.common.Storage: Edits file
    /home/hadoop/current/edits of size 12751835 edits # 138217 loaded in 1406
    seconds.

    Then it halts for a long time. After about an hour it starts working
    again.

    My questions are: when does the "IPC Server Responder" error occur, and
    is there a way to deal with it? Also, if my Namenode is busy doing
    something, how can I find out what it is doing?
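    (Editor's note: one common way to answer the second question is to dump the NameNode's JVM thread stacks. The sketch below is illustrative, not from the thread; `pick_nn_pid` is a hypothetical helper, and it assumes the JDK's `jps` and `jstack` tools are on PATH and that you run as the user owning the NameNode process.)

    ```shell
    # Pick the NameNode PID out of jps output, skipping the SecondaryNameNode.
    pick_nn_pid() { awk '/NameNode/ && !/Secondary/ {print $1}'; }

    # Demo of the filter on sample jps output:
    printf '1234 NameNode\n5678 SecondaryNameNode\n9012 DataNode\n' | pick_nn_pid
    # prints 1234

    # Against a live cluster you would run something like (commented out here):
    #   NN_PID=$(jps | pick_nn_pid)
    #   jstack "$NN_PID" > /tmp/namenode-threads.txt
    ```

    A stack dump taken while the NameNode is unresponsive usually shows what the handler threads are blocked on.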

    Regards,
    Rahul



    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434

  • Joey Echeverria at Aug 3, 2011 at 7:38 pm
    Yes, it should print something along the lines of:

    The reported blocks 11 has reached the threshold 0.9990 of total
    blocks 11. Safe mode will be turned off automatically in 8 seconds.

    -Joey
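    (Editor's note: the transition can also be found after the fact by filtering the NameNode log for safe-mode messages. A minimal sketch; `safemode_lines` is a hypothetical helper, the sample lines are illustrative, and the exact message wording and log path vary by Hadoop version. `hadoop dfsadmin -safemode get` queries the NameNode directly.)

    ```shell
    # Filter safe-mode state changes out of NameNode log text.
    safemode_lines() { grep 'Safe mode'; }

    # Demo on sample log lines (real wording varies by version):
    printf '%s\n' \
      'INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode ON.' \
      'INFO org.apache.hadoop.hdfs.server.common.Storage: Image file loaded.' \
      'INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is OFF.' \
      | safemode_lines

    # On a live cluster you would instead run something like:
    #   grep 'Safe mode' /path/to/hadoop-hadoop-namenode-HOST.log | tail
    # or ask the NameNode directly:
    #   hadoop dfsadmin -safemode get
    ```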
    On Fri, Jul 29, 2011 at 12:26 AM, Rahul Das wrote:
    No, there was no error; only the following appears.

    2011-07-21 14:14:30,039 INFO
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
    ugi=hadoop,hadoop       ip=/xx.xx.xx.xx  cmd=create
    src=/user/hdfs/files/d954x328-85x8-4dfe-b73c-34a7a2c1xb0f
    dst=null        perm=hadoop:supergroup:rw-r--r--
    2011-07-21 14:14:30,041 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
    NameSystem.allocateBlock:
    /user/hdfs/files/d954x328-85x8-4dfe-b73c-34a7a2c1xb0f .
    blk_-3217626427379030207_15834365
    2011-07-21 14:14:30,120 INFO org.apache.hadoop.hdfs.StateChange: DIR*
    NameSystem.completeFile: file
    /user/hdfs/files/d954x328-85x8-4dfe-b73c-34a7a2c1xb0f is closed by
    DFSClient_1277823200

    Is there any way I can find out from the log when safe mode ends?

    Regards,
    Rahul
    On Thu, Jul 28, 2011 at 6:16 PM, Joey Echeverria wrote:

    Nothing from around 16:30?
    -Joey



    On Jul 28, 2011, at 5:06, Rahul Das wrote:

    Hi Joey,

    The log is too big to attach to the mail. What I found is that there
    are no errors during this time, only a few warnings like:

    2011-07-21 14:13:47,814 WARN
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
    PendingReplicationMonitor timed out block blk_-6058282241824946206_13375223
    ...
    ...
    2011-07-21 14:30:49,511 WARN
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for
    block blk_8615896953045629213_15838442 reported from xx.xx.xx.xx:50010
    current size is 1950720 reported size is 2448907

    I think the edits file was too large; that's why it took a long time.

    Regards,
    Rahul

    On Fri, Jul 22, 2011 at 9:33 PM, Joey Echeverria <joey@cloudera.com>
    wrote:
    The long startup time after the restart looks like it was caused by the
    SecondaryNameNode not having been able to roll the edits log for some time.
    Can you post your Namenode log from around the same time as this
    SecondaryNameNode log (2011-07-21 16:00-16:30)?
    -Joey
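    (Editor's note: a stalled checkpoint lets the edits file grow without bound, and a large edits file is exactly what made the replay above take ~23 minutes. A minimal sketch of a size check; `check_edits_size`, the path, and the 64 MB threshold are illustrative assumptions, and a real deployment would wire this into its monitoring system.)

    ```shell
    # Warn when the NameNode edits log grows past a threshold, which usually
    # means the SecondaryNameNode has stopped checkpointing.
    check_edits_size() {
      local edits="$1" max_bytes="$2" size
      size=$(stat -c %s "$edits" 2>/dev/null || echo 0)
      if [ "$size" -gt "$max_bytes" ]; then
        echo "WARN: $edits is $size bytes; check the SecondaryNameNode"
      else
        echo "OK: $edits is $size bytes"
      fi
    }

    # Example against a scratch file instead of a real edits log:
    tmp=$(mktemp)
    printf 'edit-log-bytes' > "$tmp"
    check_edits_size "$tmp" $((64 * 1024 * 1024))
    rm -f "$tmp"
    ```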
    On Fri, Jul 22, 2011 at 8:29 AM, Rahul Das wrote:

    Yes, I have a SecondaryNamenode running. Here are the logs from the
    SecondaryNamenode:

    2011-07-21 16:02:47,908 INFO
    org.apache.hadoop.hdfs.server.common.Storage: Edits file
    /home/hadoop/tmp/dfs/namesecondary/current/edits of size 12751835 edits #
    138217 loaded in 1581 seconds.
    2011-07-21 16:03:21,925 INFO
    org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2045516451
    saved in 29 seconds.
    2011-07-21 16:03:24,974 INFO
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
    0 Total time for transactions(ms): 0Number of transactions batched in Syncs:
    0 Number of syncs: 0 SyncTimes(ms): 0
    2011-07-21 16:03:25,545 INFO
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
    xx.xx.xx.xx:50070putimage=1&port=50090&machine=xx.xx.xx.xx&token=-18:1554828842:0:1311242583000:1311240481442
    2011-07-21 16:29:24,356 ERROR
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
    doCheckpoint:
    2011-07-21 16:29:24,358 ERROR
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
    java.io.IOException: Call to xx.xx.xx.xx:9000 failed on local exception:
    java.io.IOException: Connection reset by peer

    Regards,
    Rahul

    On Fri, Jul 22, 2011 at 5:40 PM, Joey Echeverria <joey@cloudera.com>
    wrote:
    Do you have an instance of the SecondaryNamenode in your cluster?
    -Joey

    On Fri, Jul 22, 2011 at 3:15 AM, Rahul Das <rahul.hdpq@gmail.com>
    wrote:
    [quoted original message snipped; see the top of the thread]

    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434


    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
  • Florin P at Jul 22, 2011 at 12:35 pm
    Hello!
    I would like to ask how you can obtain the filenames that are processed
    by the Map class when CombineFileInputFormat is used. As far as I know,
    when using CombineFileInputFormat, multiple files will be processed by the
    same mapper. In my case, I would like to know how to obtain these file names.

    I look forward to your answers. Thank you.
    Regards,
    Florin

Discussion overview
group: hdfs-user @ hadoop
posted: Jul 22, 2011 at 7:16a
active: Aug 3, 2011 at 7:38p
posts: 9
users: 3
website: hadoop.apache.org...
irc: #hadoop
