It's a cluster being used for a university course; there are 30 students
all running code which (to be polite) probably tests the limits of
Hadoop's failure recovery logic. :)

The current assignment is PageRank over Wikipedia; a 20 GB input corpus.
Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50
reduce tasks.
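For context, on 0.18-era Hadoop those task counts are per-job settings; a minimal config sketch (the property names are the stock 0.18 ones, and the values just mirror the numbers above -- the map count is only a hint, since the actual count follows the input splits):

```xml
<!-- job-level settings, e.g. in the submitted job's configuration -->
<property>
  <name>mapred.map.tasks</name>
  <value>300</value>  <!-- a hint; actual map count follows input splits -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>50</value>
</property>
```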

I wrote a patch to address the NPE in JobTracker.killJob() and compiled
it against TRUNK. I've put this on the cluster and it's now been holding
steady for the last hour or so, so that plus whatever other differences
there are between 0.18.1 and TRUNK may have fixed things. (I'll submit the
patch to the JIRA as soon as it finishes cranking through the JUnit tests.)
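For anyone curious, the fix is essentially a defensive null check: a job lookup can miss (e.g. the job already completed or the id is unknown), and dereferencing the null result throws the NPE. A minimal self-contained sketch of the pattern -- the class and method names below are illustrative, not the actual JobTracker code:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: shows the defensive-lookup pattern the patch
// applies; this is not the real JobTracker class.
public class KillJobSketch {
    private final Map<String, Object> jobs = new HashMap<String, Object>();

    // Before the patch: a lookup miss returns null, and the unchecked
    // dereference throws NullPointerException inside the RPC handler.
    public void killJobUnsafe(String jobId) {
        Object job = jobs.get(jobId);
        job.toString(); // NPE when jobId is unknown
    }

    // After the patch: check the lookup result and fail with a clean
    // IOException instead of letting an NPE escape into the IPC layer.
    public void killJobSafe(String jobId) throws IOException {
        Object job = jobs.get(jobId);
        if (job == null) {
            throw new IOException("Unknown job: " + jobId);
        }
        job.toString();
    }

    public static void main(String[] args) {
        KillJobSketch jt = new KillJobSketch();
        try {
            jt.killJobSafe("job_200810290855_0025");
        } catch (IOException expected) {
            System.out.println("clean failure: " + expected.getMessage());
        }
    }
}
```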

- Aaron


Devaraj Das wrote:
On 10/30/08 3:13 AM, "Aaron Kimball" wrote:

The system load and memory consumption on the JT are both very close to
"idle" -- I don't think it's overworked.

I may have an idea of the problem, though. Digging back up a ways into the
JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
    at java.lang.reflect.Method.invoke(Method.java:599)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



This exception is then repeated for all the IPC server handlers. So I think
the problem is that all the handler threads are dying one by one due to this
NPE.
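If that theory is right, the failure mode would look something like the sketch below: a handler loop that lets an unchecked exception escape dies permanently on the first bad request, while one that catches Throwable per call keeps serving. This is illustrative only, not the actual ipc.Server code:

```java
// Illustrative only: contrasts a handler loop that lets a RuntimeException
// escape (killing the thread) with one that catches Throwable per call.
public class HandlerSketch {
    interface Call { void invoke(); }

    // Fragile loop: the first NPE propagates and the handler thread is gone.
    static int runFragile(Call call, int requests) {
        int served = 0;
        for (int i = 0; i < requests; i++) {
            call.invoke(); // an NPE here ends the loop (thread death)
            served++;
        }
        return served;
    }

    // Robust loop: each call's Throwable is caught, so one bad request
    // cannot take the handler down.
    static int runRobust(Call call, int requests) {
        int served = 0;
        for (int i = 0; i < requests; i++) {
            try {
                call.invoke();
            } catch (Throwable t) {
                // log and keep serving; in the real server the error would
                // be marshalled back to the client instead
            }
            served++;
        }
        return served;
    }

    public static void main(String[] args) {
        Call bad = new Call() {
            public void invoke() { throw new NullPointerException(); }
        };
        try {
            runFragile(bad, 10);
        } catch (NullPointerException e) {
            System.out.println("fragile handler died on first call");
        }
        System.out.println("robust handler served " + runRobust(bad, 10));
    }
}
```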
This should not happen; the IPC handler catches Throwable and handles it.
Could you give more details, like the kind of jobs (long/short) you are
running, how many tasks they have, etc.?
Is this something I can fix myself, or is a patch available?

- Aaron
On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy wrote:

It's possible that the JobTracker is under duress and unable to respond to
the TaskTrackers... what do the JobTracker logs say?

Arun


On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

Hi all,
I'm working with a 30-node Hadoop cluster that has just started
demonstrating some weird behavior. It's run without incident for a few
weeks... and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs
continually during this time. Then suddenly all 29 TaskTrackers will get
disconnected from the JobTracker. All the tracker daemon processes are
still running on each machine, but the JobTracker will say "0 nodes
available" on the web status screen. Restarting MapReduce fixes this for
another 90--120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
but we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception
    at java.lang.Throwable.<init>(Throwable.java:67)
    at org.apache.hadoop.ipc.Client.call(Client.java:718)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
    at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
    at sun.nio.ch.IOUtil.read(IOUtil.java:207)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
    at java.io.FilterInputStream.read(FilterInputStream.java:127)
    at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
    at java.io.DataInputStream.readInt(DataInputStream.java:381)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


As well as a few of these warnings:
2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: SocketListener0@0.0.0.0:50060
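The "(40-40+0)<1" in that warning matches the 0.18 default of 40 HTTP threads on the TaskTracker's status/shuffle server; shuffle-heavy jobs (300 maps feeding 50 reduces) can exhaust that pool. If that turns out to be a factor, one mitigation is raising the pool size in hadoop-site.xml on each TaskTracker (the value below is just an example, not a recommendation from this thread):

```xml
<!-- hadoop-site.xml on each TaskTracker; 40 is the 0.18 default -->
<property>
  <name>tasktracker.http.threads</name>
  <value>80</value>  <!-- example value; tune to shuffle load -->
</property>
```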



The NameNode and DataNodes are completely fine. It can't be a DNS issue,
because all DNS is served through /etc/hosts files. The NameNode and
JobTracker are on the same machine.

Any help is appreciated.
Thanks,
- Aaron Kimball

Discussion Overview
group: common-user @ hadoop
posted: Oct 29, '08 at 7:34p
active: Oct 31, '08 at 7:57a
posts: 12
users: 6
website: hadoop.apache.org...
irc: #hadoop

People

Translate

site design / logo © 2022 Grokbase