On 03/31/2011 05:13 PM, W.P. McNeill wrote:
I'm running a big job on my cluster and a handful of attempts are failing
with a "Too many fetch-failures" error message. They're all on the same
node, but that node doesn't appear to be down. Subsequent attempts succeed,
so this looks like a transient stress issue rather than a problem with my
code. I'm guessing it's something like HDFS not being able to keep up, but
I'm not sure, and Googling only turns up people just as confused as I am.
What does this error mean and how do I dig into it more?
We've seen that happen in a number of situations, and it can be a bit tricky to diagnose.
In the general sense it means that a reduce task wasn't able to fetch
map output from the node that ran the map - i.e., a network problem
prevented the reducer's machine from reaching that node and fetching the data.
The reasons why this can happen are numerous. We've seen it in at
least two situations: 1) the node serving the data was under a huge
load spike and didn't respond in time, and 2) we accidentally gave
several nodes the same name, so Hadoop couldn't contact the "real"
node behind that name.
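As a quick sanity check for case (2), you can collect each machine's address and the hostname it claims, then look for a name claimed by more than one address. A minimal sketch (the IPs and names below are made up, not from any real cluster):

```python
# Sketch: detect several cluster nodes accidentally sharing one hostname.
# Input: a mapping of node IP address -> the hostname that node reports
# (e.g. gathered by running `hostname` on each machine). All values here
# are hypothetical placeholders.
from collections import defaultdict


def find_duplicate_names(ip_to_name):
    """Return {hostname: [ips...]} for every hostname claimed by >1 IP."""
    by_name = defaultdict(list)
    for ip, name in ip_to_name.items():
        by_name[name].append(ip)
    return {name: ips for name, ips in by_name.items() if len(ips) > 1}


# Example: two machines both think they are "node-a".
collisions = find_duplicate_names({
    "10.0.0.1": "node-a",
    "10.0.0.2": "node-a",
    "10.0.0.3": "node-b",
})
```

If `collisions` is non-empty, Hadoop may be routing fetches for that name to the wrong machine.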
Your specific issue may be different, though, so you'll need to debug
the network error yourself.
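One starting point is to probe, from the machine reporting the failures, whether the suspect node's fetch port is reachable at all. A minimal sketch (the host name is a placeholder; 50060 was the default TaskTracker HTTP/shuffle port in classic MapReduce, but check your own config):

```python
# Sketch: a bare TCP reachability probe for debugging fetch failures.
import socket


def can_fetch_from(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from the failing node, e.g. `can_fetch_from("worker-3", 50060)` (hypothetical host). A False here points at the network or at a node that isn't serving; a True pushes the investigation toward load or naming problems instead.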