Todd Lipcon updated HADOOP-8069:
Status: Open (was: Patch Available)
One issue here with nagling off is the following:
In the Server implementation, we issue write() calls of at most 8KB each, to avoid a heap malloc inside the JDK's SocketOutputStream implementation (with less than 8K it uses a stack buffer instead).
So, if we write a 10KB response, we end up doing a write(8KB) followed by a write(2KB).
The problem here, when NODELAY is on, is that the TCP MSS doesn't divide neatly into the 8K buffer size. So we get the following behavior:
sends 5 packets of MSS size (e.g. 1490 bytes)
sends 1 packet of around half MSS (around 750 bytes)
sends 1 packet of MSS
sends 1 packet around 1/3 MSS
So although the response should have fit in 7 packets, we used 8 instead.
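The packet counts above can be checked with a little arithmetic. The following is only a rough sketch: it assumes each write() is segmented on its own when NODELAY is on, and the class and method names are made up for illustration, not taken from the Hadoop code.

```java
public class PacketCount {
    // Packets sent when every write() chunk is segmented independently
    // (the assumed behavior with NODELAY on): each chunk may end in a short packet.
    static int packetsForChunkedWrites(int responseBytes, int chunkBytes, int mss) {
        int packets = 0;
        for (int off = 0; off < responseBytes; off += chunkBytes) {
            int len = Math.min(chunkBytes, responseBytes - off);
            packets += (len + mss - 1) / mss; // ceil(len / mss)
        }
        return packets;
    }

    // Packets needed if the whole response were segmented as one stream.
    static int idealPackets(int responseBytes, int mss) {
        return (responseBytes + mss - 1) / mss;
    }

    public static void main(String[] args) {
        int mss = 1490; // example MSS from the comment above
        System.out.println(packetsForChunkedWrites(10 * 1024, 8 * 1024, mss)); // prints 8
        System.out.println(idealPackets(10 * 1024, mss));                      // prints 7
    }
}
```

With an 8KB chunk the first write leaves 8192 - 5 * 1490 = 742 bytes for its sixth packet, which is the "around half MSS" packet in the list.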
The following thread about postfix discusses a similar issue: http://tech.groups.yahoo.com/group/postfix-users/message/224183
Some options for dealing with this:
1) accept the inefficiency - it's bounded by one extra "small" packet for every 8KB in the response
2) try to set the write buffer size to an exact multiple of MSS. This is difficult because Java doesn't let you call getsockopt(TCP_MAXSEG)
3) use TCP_CORK and TCP_UNCORK to control the packet sending behavior. This is difficult because Java also doesn't expose those
4) in the Server.channelIO loop, turn off NODELAY while writing all but the last buffer's worth, then turn on NODELAY for the last buffer. This should act as a flush of all the remaining buffered data.
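Option 4 could look roughly like the sketch below. This is not the Server.channelIO code: it uses a plain java.net.Socket rather than an NIO channel, the class and method names are hypothetical, and whether toggling the option really coalesces the short packets depends on the OS TCP stack and has not been measured here.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Arrays;

public class NoDelayFlushSketch {
    static final int CHUNK = 8 * 1024; // same 8KB write limit as described above

    // Writes the response in 8KB chunks, leaving Nagle's algorithm enabled for
    // every chunk except the last, then enabling TCP_NODELAY for the final write.
    static void writeResponse(Socket socket, byte[] response) throws IOException {
        OutputStream out = socket.getOutputStream();
        int off = 0;
        socket.setTcpNoDelay(false); // let the kernel hold back short trailing segments
        while (response.length - off > CHUNK) {
            out.write(response, off, CHUNK);
            off += CHUNK;
        }
        socket.setTcpNoDelay(true);  // intended to act as a flush for the last buffer
        out.write(response, off, response.length - off);
    }

    // Small loopback demo: send a 10KB response and read it back.
    public static void main(String[] args) throws IOException {
        try (ServerSocket listener = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", listener.getLocalPort());
             Socket server = listener.accept()) {
            byte[] response = new byte[10 * 1024];
            Arrays.fill(response, (byte) 7);
            writeResponse(server, response);
            byte[] received = new byte[response.length];
            new DataInputStream(client.getInputStream()).readFully(received);
            System.out.println(Arrays.equals(response, received));
        }
    }
}
```

The sketch only shows where the option would be toggled. Confirming the effect on the wire would need a packet capture on the target platform.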
Canceling patch for now to work through this
Enable TCP_NODELAY by default for IPC
Project: Hadoop Common
Issue Type: Improvement
Affects Versions: 0.23.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
I think we should switch the default for the IPC client and server NODELAY options to true. As Wikipedia says:
In general, since Nagle's algorithm is only a defense against careless applications, it will not benefit a carefully written application that takes proper care of buffering; the algorithm has either no effect, or negative effect on the application.
Since our IPC layer is well contained and does its own buffering, we shouldn't be careless.