[
https://issues.apache.org/jira/browse/HADOOP-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585546#action_12585546 ]
Runping Qi commented on HADOOP-3130:
------------------------------------
A lot of reducers failed with the following message:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
I see a lot of the following exceptions in the log:
2008-04-04 13:50:03,796 WARN org.apache.hadoop.mapred.ReduceTask: task_200804041304_0005_r_000000_2 copy failed: task_200804041304_0005_m_000181_0 from xxxx.com
2008-04-04 13:50:03,823 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: Read timed out
at sun.reflect.GeneratedConstructorAccessor3.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1298)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1292)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:948)
at org.apache.hadoop.mapred.MapOutputLocation.getInputStream(MapOutputLocation.java:125)
at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:165)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:632)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:577)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1004)
... 4 more
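The trace above shows the read phase timing out inside HttpURLConnection.getInputStream. For reference, java.net.HttpURLConnection has separate connect and read timeouts, so raising one does not raise the other. A minimal sketch follows; the class name and both timeout values are hypothetical and are not taken from the patch or from Hadoop's defaults.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class FetchTimeouts {
    // Hypothetical values for illustration only, not Hadoop defaults.
    static final int CONNECT_TIMEOUT_MS = 30_000;
    static final int READ_TIMEOUT_MS = 180_000;

    // Opens (but does not yet connect) an HTTP connection with both timeouts set.
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Limits how long establishing the TCP connection may take.
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
        // Limits how long a single read may block; this is the limit that
        // surfaces as "SocketTimeoutException: Read timed out".
        conn.setReadTimeout(READ_TIMEOUT_MS);
        return conn;
    }
}
```

No network traffic happens until connect() or getInputStream() is called on the returned object.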
Did you also change the timeout for read?
What is the value of MAX_FAILED_UNIQUE_FETCHES?
Should it be some percentage of the total number of maps?
Anyhow, we need to revisit the policy for failing a reducer during shuffling.
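To make the percentage idea concrete, here is a rough sketch of a limit that scales with the number of maps instead of being a fixed constant. This is only an illustration of the proposal, not the current ReduceTask behavior; the class name, method names, the percentage, and the floor value are all assumptions.

```java
public class FetchFailurePolicy {
    // Returns how many distinct map outputs may fail to fetch before the
    // reducer gives up: a percentage of all maps, at least `floor`,
    // and never more than the number of maps itself.
    static int maxFailedUniqueFetches(int totalMaps, int percent, int floor) {
        // Integer ceiling of totalMaps * percent / 100.
        long scaled = ((long) totalMaps * percent + 99) / 100;
        long limit = Math.max(floor, scaled);
        return (int) Math.min(totalMaps, limit);
    }

    // True when the reducer should bail out of the shuffle.
    static boolean shouldFail(int failedUniqueMaps, int totalMaps,
                              int percent, int floor) {
        return failedUniqueMaps >= maxFailedUniqueFetches(totalMaps, percent, floor);
    }
}
```

With a 1 percent setting and a floor of 5, a job with 1000 maps would tolerate 10 distinct failed fetches, while a job with 100 maps would still tolerate 5.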
Shuffling takes too long to get the last map output.
----------------------------------------------------
Key: HADOOP-3130
URL: https://issues.apache.org/jira/browse/HADOOP-3130
Project: Hadoop Core
Issue Type: Bug
Reporter: Runping Qi
Assignee: Amar Kamat
Attachments: HADOOP-3130-v2.patch, HADOOP-3130.patch, shuffling.log
I noticed that towards the end of shuffling, the map output fetcher of the reducer backs off too aggressively.
I have attached a fraction of one reduce log of my job.
Notice that the last map output was not fetched within 2 minutes.
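One common way to keep a retry backoff from growing too large is to cap it. The sketch below shows a capped exponential backoff; it is not the logic in ReduceTask or in the attached patches, and the class name, method name, and parameters are hypothetical.

```java
public class CappedBackoff {
    // Returns the wait before the next fetch attempt from a host that has
    // failed `failures` times in a row. The delay doubles with each failure
    // but never exceeds capMs, so the last few map outputs are retried
    // at a bounded interval.
    static long delayMs(int failures, long baseMs, long capMs) {
        if (failures <= 0) {
            return 0; // no failures yet, fetch immediately
        }
        long delay = baseMs;
        // Double once per additional failure, stopping once the cap is reached.
        for (int i = 1; i < failures && delay < capMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMs);
    }
}
```

For example, with a base of 1000 ms and a cap of 30000 ms, the third consecutive failure waits 4000 ms and the tenth waits 30000 ms.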
--
This message is automatically generated by JIRA.