We are looking to the root of the problem that caused us IOException
Hey guys,

We are trying to figure out why many of our Map/Reduce jobs on the cluster are failing.
In the logs we are getting this message in the failing jobs:


org.apache.hadoop.ipc.RemoteException: java.io.IOException: File **a filename*** could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1282)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:469)
    at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)

    at org.apache.hadoop.ipc.Client.call(Client.java:818)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
    at $Proxy1.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2932)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2807)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2087)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2274)



Where should we look?
What are the candidate root causes of this error?

Thanks, Guy
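
One rough place to start (a sketch, assuming the stock Hadoop 0.20 / CDH2 command-line tools are on the path) is to ask the NameNode how much space and how many live datanodes it actually sees, since this error generally means no datanode could accept a new block:

# Per-datanode capacity, DFS used and DFS remaining, plus live/dead node counts.
hadoop dfsadmin -report

# Overall filesystem health: missing, corrupt and under-replicated blocks.
hadoop fsck /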


  • Elton sky at Apr 5, 2011 at 7:18 am
    Check the FAQ entry for this error:
    http://wiki.apache.org/hadoop/FAQ#What_does_.22file_could_only_be_replicated_to_0_nodes.2C_instead_of_1.22_mean.3F

  • Guy Doulberg at Apr 5, 2011 at 9:55 am
    Thanks,
    We think the problem is this:
    We have an unbalanced HDFS cluster; some of the data nodes are more than 90% full, and some are less than 30% full - this happened because the nodes with free space are newer.
    We think that when a task tracker gets a task, it tries to write its map output first to its local data node, and since many of the nodes are full, the task tracker fails.

    Does this diagnosis sound logical?
    Are there workarounds?

    We are running the balancer, but it takes a lot of time... and during that time the cluster is not working.

    We are using Cloudera's CDH2.

    Thanks





  • Ted Dunning at Apr 5, 2011 at 3:46 pm
    You can configure the balancer to use higher bandwidth. That can speed it up by 10x.
    On Tue, Apr 5, 2011 at 2:54 AM, Guy Doulberg wrote:

    We are running the balancer, but it takes a lot of time... and during that time the cluster is not working.
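
    As a concrete sketch of that suggestion (assuming a CDH2 / Hadoop 0.20-style setup; the property name differs in later releases), the balancer's per-datanode transfer rate is capped by dfs.balance.bandwidthPerSec in hdfs-site.xml, which defaults to roughly 1 MB/s:

    <!-- hdfs-site.xml on the datanodes: raise the balancer bandwidth cap to
         10 MB/s (value in bytes per second); datanodes must be restarted
         for the new value to take effect. -->
    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <value>10485760</value>
    </property>

    # Then run the balancer until every node is within 10% of average utilization.
    hadoop balancer -threshold 10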
  • Guy Doulberg at Apr 6, 2011 at 6:35 am
    Thanks,
    That is what we actually did.

    Worked!
    We are back on track...

    Do you think we should always run the balancer, with low bandwidth, and not only after adding new nodes?



  • Ted Dunning at Apr 6, 2011 at 7:56 am
    Yes, at least periodically.

    You now have a situation where the age distribution of blocks in each
    datanode is quite different. This will lead to different evolution of which
    files are retained and that is likely to cause imbalances again. It will
    also cause the performance of your system to be degraded since any given
    program probably will have a non-uniform distribution of content in your
    cluster.

    Over time, this effect will probably decrease unless you keep your old files
    forever.
    On Tue, Apr 5, 2011 at 11:34 PM, Guy Doulberg wrote:

    Do you think we should always run the balancer, with low bandwidth, and not
    only after adding new nodes?
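
    One way to do this periodically (only an illustration; the install path and schedule here are assumptions, not something from this thread) is a cron entry on a machine with the cluster configuration, since the balancer simply exits if another instance is already running:

    # Every Sunday at 02:00, rebalance until nodes are within 10% of the mean utilization.
    0 2 * * 0 /usr/lib/hadoop/bin/start-balancer.sh -threshold 10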
  • Guy Doulberg at Apr 6, 2011 at 8:01 am
    Great,
    Thanks


Discussion Overview
group: common-user
category: hadoop
posted: Apr 5, 2011 at 6:54 am
active: Apr 6, 2011 at 8:01 am
posts: 7
users: 3
website: hadoop.apache.org...
irc: #hadoop
