Use Apache HttpClient for fetching map outputs
----------------------------------------------

Key: HADOOP-4888
URL: https://issues.apache.org/jira/browse/HADOOP-4888
Project: Hadoop Core
Issue Type: Improvement
Components: mapred
Reporter: Chris Douglas


It's worth experimenting with the [HttpClient|http://hc.apache.org/httpclient-3.x/] library to speed up the shuffle.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Chris Douglas (JIRA) at Dec 16, 2008 at 10:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas reassigned HADOOP-4888:
    -------------------------------------

    Assignee: Chris Douglas
  • Chris Douglas (JIRA) at Dec 16, 2008 at 10:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4888:
    ----------------------------------

    Attachment: 4888-0.patch

    Preliminary patch. There are a few choices that are worth revisiting:
    * This adds a HEAD to the fetch, rather than opening a connection and closing it if there's insufficient space. While this adds a round trip, the connection will likely be reused. If the impact is measurable, it is possible to restore the former behavior.
    * The hard-coded timeouts and retries remain, but could easily be made configurable.

    I haven't been able to benchmark this. Once that's possible:
    * The effect of the fetch buffer size on shuffle time should be quantified
    * The number of reduce copier threads should exceed the number of connection threads so the latter may be reused. A good ratio should be arrived at experimentally.
    * We should measure performance with HttpClient 3.1, too. If the data and S3 support it, we should upgrade.
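The HEAD-before-GET choice described in this comment can be sketched roughly as follows. This is a minimal illustration, not the attached patch: it uses the JDK's own `java.net.HttpURLConnection` rather than HttpClient so that it has no dependencies, and the class name, method names, and timeout values are all hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

/** Hypothetical helper: ask for the size with HEAD, then decide whether a GET is worthwhile. */
class HeadThenGetSketch {

    /** True if an output of the advertised size fits in the free space; unknown sizes (negative) are rejected. */
    static boolean fits(long contentLength, long freeBytes) {
        return contentLength >= 0 && contentLength <= freeBytes;
    }

    /** Issues a HEAD request and returns the advertised Content-Length, or -1 if the header is absent. */
    static long headContentLength(URL url, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");
        // Timeouts are parameters rather than constants, so a caller could read them from configuration.
        conn.setConnectTimeout(connectTimeoutMs);
        conn.setReadTimeout(readTimeoutMs);
        // disconnect() is not called here, since it may close an otherwise reusable persistent connection.
        return conn.getContentLengthLong();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: HeadThenGetSketch <url> <freeBytes>");
            return;
        }
        URL url = URI.create(args[0]).toURL();
        long freeBytes = Long.parseLong(args[1]);
        long length = headContentLength(url, 5000, 30000); // assumed values, in milliseconds
        System.out.println(fits(length, freeBytes) ? "fetch with GET" : "defer: insufficient space");
    }
}
```

The extra round trip is the cost; the benefit is that a fetch that would not fit is never started, and the connection used for the HEAD can be kept for the following GET.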
  • Chris Douglas (JIRA) at Dec 17, 2008 at 1:09 am
    [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4888:
    ----------------------------------

    Attachment: (was: 4888-0.patch)
  • Chris Douglas (JIRA) at Dec 17, 2008 at 1:09 am
    [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4888:
    ----------------------------------

    Attachment: 4888-0.patch

    Merge with trunk
  • Chris Douglas (JIRA) at Dec 17, 2008 at 8:07 am
    [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657310#action_12657310 ]

    Chris Douglas commented on HADOOP-4888:
    ---------------------------------------

    HttpClient with the current patch actually degraded performance in five runs of a shuffle benchmark on trunk.

    498 nodes, 256MB/map, 495 maps, no map-side merge, half of reduce input from memory, no intermediate compression.
    || Version || 1 || 2 || 3 || 4 || 5 || avg || std.d ||
    | r727228 | 406 | 485 | 360 | 448 | 411 | 422 | 48 |
    | r727228 + patch | 418 | 357 | 501 | 446 | 442 | 433 | 52 |
    Stragglers were dominant. In both versions, output from the final few maps held up the reduce phase, so neither could distinguish itself with better throughput, connection reuse, protocol efficiency, etc. Larger benchmarks that might compensate for these effects, such as gridmix, cannot be run on available nodes.
  • Zheng Shao (JIRA) at Dec 17, 2008 at 10:09 am
    [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657332#action_12657332 ]

    Zheng Shao commented on HADOOP-4888:
    ------------------------------------

    The difference is not statistically significant.
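One way to check this claim is to compute a two-sample statistic from the five runs per version quoted in the benchmark table. The sketch below uses Welch's t statistic, which is an assumed choice of test (the thread does not say which one was meant), and the class and method names are hypothetical.

```java
/** Descriptive statistics for the two benchmark rows quoted in the thread. */
class ShuffleBenchmarkStats {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample variance (divides by n - 1). */
    static double sampleVariance(double[] xs) {
        double m = mean(xs);
        double sumSquares = 0;
        for (double x : xs) {
            sumSquares += (x - m) * (x - m);
        }
        return sumSquares / (xs.length - 1);
    }

    /** Welch's t statistic for two independent samples with possibly unequal variances. */
    static double welchT(double[] a, double[] b) {
        double standardError = Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(b) - mean(a)) / standardError;
    }

    public static void main(String[] args) {
        double[] trunk = {406, 485, 360, 448, 411};   // r727228
        double[] patched = {418, 357, 501, 446, 442}; // r727228 + patch
        System.out.printf("mean trunk=%.1f patched=%.1f%n", mean(trunk), mean(patched));
        System.out.printf("Welch t=%.3f%n", welchT(trunk, patched));
    }
}
```

For these numbers the statistic comes out at about 0.34, far below the value of roughly 2.3 that a two-sided 5% test would require at about 8 degrees of freedom, which is consistent with the comment above.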
  • Steve Loughran (JIRA) at Dec 24, 2008 at 12:01 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659090#action_12659090 ]

    Steve Loughran commented on HADOOP-4888:
    ----------------------------------------

    I use HttpClient a lot in other projects; it does understand HTTP better than java.net. But if you have implemented both ends of the communication and have no IIS-based proxy in between, the need for it is less urgent.

    If you do use it, here are my ivy.xml settings used to drop all the logging and junit dependencies that the POMs imply are mandatory. You just need to run with the commons-logging JAR on the classpath, nothing else.

    <dependency org="commons-httpclient"
                name="commons-httpclient"
                rev="${commons-httpclient.version}"
                conf="compile->master;httpclient->default">
      <exclude org="commons-logging"/>
      <exclude org="junit"/>
    </dependency>

    HttpClient does work with S3, because Restlet uses HttpClient, and Restlet talks to S3. There is no built-in support in HttpClient for AWS authentication, though no doubt this feature would be welcomed by many.
  • Chris Douglas (JIRA) at Jan 11, 2009 at 2:42 am
    [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4888:
    ----------------------------------

    Attachment: 4888-1.patch

    @Zheng: You're right, I shouldn't have said "degraded."

    @Steve: Thanks for the ivy settings; I hadn't started to consider that, yet. The goal of this is identical to HADOOP-1338, really. Reimplementing the connection pooling in Hadoop could offer some advantages (e.g. more granular progress reporting), but appropriating all the work done in HttpClient seems like a clear win until that work is completed.

    I tried a similar, still preliminary patch, but with max connections per host set to 1 and on a job with different parameters, i.e. mapred.reduce.slowstart.completed.maps=1.0, 38272 maps, 448 reducers, 32MB (generated) per map on ~300 nodes. Times measured are from the start of the reduce (after all maps have finished, so the stragglers are not a factor) to end of the shuffle (avg / std.d):
    || Version || 1 || 2 || 3 || 4 || 5 || avg || avg job ||
    | r732838 | 786.89 / 45.55 | 842.596 / 70.69 | 1458.75 / 83.88 | 1140.93 / 44.22 | 1294.67 / 58.87 | 1104.77 | 2479.8 |
    | r732838 + patch | 803.261 / 73.36 | 783.243 / 93.34 | 792.106 / 78.94 | 917.153 / 52.91 | 776.756 / 113.56 | 814.50 | 1955.2 |
    Many of the parameters need to be adjusted. In particular, the timeouts are worth revisiting, as are the number of connections and threads at the server and client. Whether the HEAD + GET imposes a measurable penalty may also merit consideration before this can be committed. However, the preceding demonstrates that a measurable improvement is possible, and that this part of the pipeline could be mined for performance improvements.
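The "max connections per host set to 1" setting used in this run can be illustrated with a small limiter. This is only a sketch of the idea, assuming one semaphore per host; it is not how HttpClient implements its pool, and the class name and host names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical per-host cap on concurrent connections, in the spirit of "max connections per host". */
class PerHostConnectionLimiter {
    private final int maxPerHost;
    private final ConcurrentHashMap<String, Semaphore> permits = new ConcurrentHashMap<>();

    PerHostConnectionLimiter(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    private Semaphore permitsFor(String host) {
        // One semaphore per host, created lazily on first use.
        return permits.computeIfAbsent(host, h -> new Semaphore(maxPerHost));
    }

    /** Non-blocking: returns true if a connection slot for the host was obtained. */
    boolean tryAcquire(String host) {
        return permitsFor(host).tryAcquire();
    }

    /** Returns a slot; callers must only release what they previously acquired. */
    void release(String host) {
        permitsFor(host).release();
    }

    public static void main(String[] args) {
        PerHostConnectionLimiter limiter = new PerHostConnectionLimiter(1);
        System.out.println("first fetch from host-a allowed: " + limiter.tryAcquire("host-a"));
        System.out.println("second fetch from host-a allowed: " + limiter.tryAcquire("host-a"));
        System.out.println("fetch from host-b allowed: " + limiter.tryAcquire("host-b"));
        limiter.release("host-a");
        System.out.println("host-a allowed after release: " + limiter.tryAcquire("host-a"));
    }
}
```

With a cap of 1, a copier thread that cannot get a slot for one host would move on to another host instead of opening a second connection, which is the behaviour the connection and thread counts mentioned above would need to be tuned around.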
  • Chris Douglas (JIRA) at Feb 23, 2009 at 3:10 am
    [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas resolved HADOOP-4888.
    -----------------------------------

    Resolution: Won't Fix

    Given HADOOP-5223, it's clear this isn't going anywhere.
