Hi,

I'm running Nutch in a pseudo-distributed cluster, i.e. all daemons are running on the same
server. I'm writing to the Hadoop list, as it looks like a problem related
to Hadoop.

Some of my jobs partially fail, and in the error log I get output like:

2011-06-24 08:45:05,765 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_000000_0 Scheduled 1 outputs (0 slow hosts and 0
dup hosts)

2011-06-24 08:45:05,771 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_000000_0 copy failed:
attempt_201106231520_0190_m_000000_0 from worker1
2011-06-24 08:45:05,772 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.UnknownHostException: worker1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1458)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1452)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1106)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
Caused by: java.net.UnknownHostException: worker1
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:175)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:321)
at sun.net.www.http.HttpClient.New(HttpClient.java:338)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
... 4 more

2011-06-24 08:45:05,772 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201106231520_0190_r_000000_0: Failed fetch #1 from
attempt_201106231520_0190_m_000000_0


The above basically says that my worker is unknown, but I can't really make
any sense of it. Other jobs running before, at the same time, or after
complete fine without any error messages and without any changes on the
server. Other reduce tasks in the same run have also succeeded. So it looks
like my worker sometimes 'disappears' and can't be reached.
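One quick check is to run the same name lookup the reduce-side copier performs, outside of Hadoop; a minimal sketch (the class name ResolveCheck is made up, and the host name is just the one from the log above):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class ResolveCheck {
        public static void main(String[] args) {
            // Host name from the failing fetch; pass a different one as the first argument.
            String host = (args.length > 0) ? args[0] : "worker1";
            try {
                // The copier opens an HTTP connection to this host, which triggers the same lookup.
                InetAddress addr = InetAddress.getByName(host);
                System.out.println(host + " resolves to " + addr.getHostAddress());
            } catch (UnknownHostException e) {
                System.out.println(host + " does not resolve - check /etc/hosts and DNS on this node");
            }
        }
    }

Run on the node hosting the failing reduce task, this should show whether the lookup fails only occasionally (pointing at the DNS server) or every time (pointing at a missing /etc/hosts entry).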

My current theory is that it only happens when there are a couple of jobs
running at the same time. Is that a plausible explanation?

Would anybody have some suggestions on how I could get more information from the
system, or point me in a direction where I should look? (I'm also quite new to
Hadoop.)

Best Regards
Niels

--
BinaryConstructors ApS
Vestergade 10a, 4th
1456 Kbh K
Denmark
phone: +4529722259
web: http://www.binaryconstructors.dk
mail: nb@binaryconstructors.dk
skype: nielsboldt


  • Steve Loughran at Jun 27, 2011 at 11:43 am

    On 24/06/11 18:16, Niels Boldt wrote:
    Hi,

    I'm running Nutch in a pseudo-distributed cluster, i.e. all daemons are running on the same
    server. I'm writing to the Hadoop list, as it looks like a problem related
    to Hadoop.

    Some of my jobs partially fail, and in the error log I get output like:

    2011-06-24 08:45:05,765 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201106231520_0190_r_000000_0 Scheduled 1 outputs (0 slow hosts and 0
    dup hosts)

    2011-06-24 08:45:05,771 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201106231520_0190_r_000000_0 copy failed:
    attempt_201106231520_0190_m_000000_0 from worker1
    2011-06-24 08:45:05,772 WARN org.apache.hadoop.mapred.ReduceTask:
    java.net.UnknownHostException: worker1
    The above basically says that my worker is unknown, but I can't really make
    any sense of it. Other jobs running before, at the same time, or after
    complete fine without any error messages and without any changes on the
    server. Other reduce tasks in the same run have also succeeded. So it looks
    like my worker sometimes 'disappears' and can't be reached.
    If the worker had "disappeared" off the net, you'd be more likely to see
    a NoRouteToHostException.
    My current theory is that it only happens when there are a couple of jobs
    running at the same time. Is that a plausible explanation?

    Would anybody have some suggestions on how I could get more information from the
    system, or point me in a direction where I should look? (I'm also quite new to
    Hadoop.)
    I'd assume that one machine in the cluster doesn't have an /etc/hosts
    entry for worker1, or that the DNS server is suffering under load. If you
    can, put the host list into /etc/hosts instead of relying on
    DNS. If you do it on all machines, it avoids having to work out which
    one is playing up. That said, some better logging of which host is
    trying to make the connection would be nice.
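
    For example (a sketch only; the address below is a placeholder, so
    substitute the worker's real IP), the entry on every node would look
    something like:

        192.168.1.101   worker1

    With the same line present on every machine, resolving worker1 never has
    to touch DNS at all.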
  • Niels Boldt at Jul 1, 2011 at 1:08 pm
    Hi Steve

    I'd assume that one machine in the cluster doesn't have an /etc/hosts entry
    for worker1, or that the DNS server is suffering under load. If you can, put
    the host list into /etc/hosts instead of relying on DNS. If you
    do it on all machines, it avoids having to work out which one is playing up.
    That said, some better logging of which host is trying to make the
    connection would be nice.
    Thanks for your answer. It was indeed a wrongly configured /etc/hosts file
    that was the problem, so you pointed me in exactly the right direction
    :-)

    Best Regards
    Niels

    --
    BinaryConstructors ApS
    Vestergade 10a, 4th
    1456 Kbh K
    Denmark
    phone: +4529722259
    web: http://www.binaryconstructors.dk
    mail: nb@binaryconstructors.dk
    skype: nielsboldt
