FAQ
Hi,
Could someone help me find some real figures (transfer rates) for Hadoop file transfers from a local filesystem to HDFS, S3, etc., and between storage systems (HDFS to S3, etc.)?

Thanks,

Wasim


  • Brian Bockelman at Feb 10, 2009 at 10:47 pm

    What are you looking for? Maximum possible transfer rate? Maximum
    possible transfer rate per client? Generally, if you're using the
    Java client, transfer rate to/from HDFS is limited by the hardware you
    have and the network connection (if you have 1Gbps per client).

    I could give you a graph showing a peak of 9Gbps from our Hadoop
    instance to the WAN, but that's not very interesting if you don't have
    a 10Gbps pipe...

    Brian
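For reference, Brian's link-speed figures translate into simple per-client ceilings. A back-of-the-envelope sketch (decimal units; protocol and replication overhead ignored):

```python
# Upper bounds implied by the network link alone -- the hardware limit
# Brian refers to, before any Hadoop-level overhead.
def link_limit_mb_per_s(gbps):
    """Upper bound on transfer rate, in MB/s, for a link of `gbps` Gbit/s."""
    return gbps * 1000 / 8

print(link_limit_mb_per_s(1))   # one 1 Gbps client: at most 125.0 MB/s
print(link_limit_mb_per_s(10))  # a 10 Gbps pipe: at most 1250.0 MB/s
```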
  • Mark Kerzner at Feb 10, 2009 at 10:54 pm
    Brian, I have a similar question: why does transfer from a local filesystem
    to a SequenceFile take so long (about 1 second per MB)?
    Thank you,
    Mark
  • Brian Bockelman at Feb 10, 2009 at 11:02 pm

    Hey Mark,

    I saw your question about speed the other day ... unfortunately, I
    didn't have any specific advice so I stayed quiet :)

    In a correctly configured cluster, performance is mostly limited by
    available hardware. If it's obvious that performance is well below
    hardware limits (such as in your case), it's usually (a) you're not
    generating files fast enough or (b) something is configured wrong.

    Have you just tried hadoop fs -put .... for some large file hanging
    around locally? If that doesn't go more than 5MB/s or so (when your
    hardware can obviously do such a rate), then there's probably a
    configuration issue.

    Brian
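One way to put a number on that `hadoop fs -put` test, assuming a configured `hadoop` client on the PATH and a reachable cluster (both assumptions, not something from the thread):

```python
import os
import subprocess
import time

def rate_mb_per_s(size_bytes, seconds):
    """Throughput in MB/s (decimal) for `size_bytes` moved in `seconds`."""
    return size_bytes / 1e6 / seconds

def put_and_measure(local_path, hdfs_dir):
    """Time `hadoop fs -put` for one file and return the observed MB/s.
    Requires a working Hadoop client and a running cluster."""
    size = os.path.getsize(local_path)
    start = time.time()
    subprocess.check_call(["hadoop", "fs", "-put", local_path, hdfs_dir])
    return rate_mb_per_s(size, time.time() - start)

# Per Brian's rule of thumb: if this reports well under 5 MB/s on capable
# hardware, suspect configuration rather than the network.
```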
  • Mark Kerzner at Feb 11, 2009 at 5:10 am
    Brian, large files using command-line hadoop go fast, so it is something
    about my computer or network. I won't worry about this now, especially in
    light of Amit reporting fast writes and reads.

    Mark

  • Brian Bockelman at Feb 11, 2009 at 5:15 am

    You're creating files using SequenceFile, right? It might be that the
    creation of the sequence file is the portion which is slow, not the
    network I/O.

    I don't have much knowledge about optimization of SequenceFile
    creation. I assume that you'll want to start by tweaking compression
    on and off. Additionally, Jeff (I think) pointed to a Hadoop Archive
    file, which also might be an alternative for your system. I don't
    know enough to give you a set of pros and cons, just enough to mention
    it as an alternative to experiment with.

    Sorry I'm not useful here...

    Brian


  • Mark Kerzner at Feb 11, 2009 at 5:26 am
    Brian, I saw that Stuart here
    <http://stuartsierra.com/2008/04/24/a-million-little-files> mentions
    slow writes to SequenceFile. If so, I will either use his tar
    approach or try to parallelize it if I can.
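The "parallelize it" option can be sketched with a thread pool. `upload` below is a placeholder (an assumption, not an API from the thread) for whatever writes one file, e.g. a `hadoop fs -put` call or a SequenceFile append:

```python
from concurrent.futures import ThreadPoolExecutor

def upload(path):
    # Placeholder: replace with the real per-file writer.
    return path

def parallel_upload(paths, workers=8):
    """Overlap per-file latency by uploading several files at once;
    results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload, paths))
```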

  • Amit Chandel at Feb 10, 2009 at 11:05 pm
    With my setup, I have been able to get 10 MB/s write speed and 40 MB/s
    read speed while writing multiple files (ranging from a few bytes to
    100 MB) into SequenceFiles, and reading them back. The cluster has a
    1 Gbps backbone.
  • Brian Bockelman at Feb 11, 2009 at 5:38 am
    Just to toss out some numbers.... (and because our users are making
    interesting numbers right now)

    Here's our external network router: http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets

    Here's the application-level transfer graph: http://t2.unl.edu/phedex/graphs/quantity_rates?link=src&no_mss=true&to_node=Nebraska

    In a squeeze, we can move 20-50 TB/day to/from other heterogeneous
    sites. Usually, we run out of free space before we can find the upper
    limit for a 24-hour period.

    We use a protocol called GridFTP to move data back and forth between
    external (non-HDFS) clusters. The other sites we transfer with use
    niche software you probably haven't heard of (Castor, DPM, and dCache)
    because, well, it's niche software. I have no available data on
    HDFS<->S3 systems, but I'd again claim it's mostly a function of the
    amount of hardware you throw at it and the size of your network pipes.

    There are currently 182 datanodes; 180 are "traditional" ones of <3 TB
    and 2 are big honking RAID arrays of 40 TB. Transfers are load-balanced
    among ~7 GridFTP servers, each with a 1 Gbps connection.

    Does that help?

    Brian
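As a sanity check on Brian's numbers: seven load-balanced 1 Gbps doors give a theoretical ceiling of roughly 75 TB/day, which brackets the observed 20-50 TB/day. The arithmetic (decimal units, no protocol overhead):

```python
def daily_tb(n_servers, gbps_each):
    """Theoretical TB/day through n load-balanced servers,
    each with a link of `gbps_each` Gbit/s."""
    bytes_per_s = n_servers * gbps_each * 1e9 / 8
    return bytes_per_s * 86400 / 1e12

print(round(daily_tb(7, 1), 1))  # ~75.6 TB/day ceiling vs 20-50 observed
```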
  • Mark Kerzner at Feb 11, 2009 at 5:44 am
    I say, that's very interesting and useful.
  • Steve Loughran at Feb 11, 2009 at 11:19 am

    GridFTP is optimised for high-bandwidth network connections, with
    negotiated packet sizes and multiple parallel TCP connections, so when
    TCP congestion control backs off after a dropped packet, only a
    fraction of the aggregate transfer slows down. It is probably
    best-in-class for long-haul transfers over the big university
    backbones where someone else pays for your traffic. You would be very
    hard pressed to get even close to that on any other protocol.

    I have no data on S3 transfers other than hearsay:
    * Write time to S3 can be slow, as the call doesn't return until the
    data is persisted "somewhere". That's a better guarantee than a POSIX
    write operation.
    * You have to rely on the other people on your rack not wanting all
    the traffic for themselves. That's an EC2 API issue: you don't get to
    request/buy bandwidth to/from S3.

    One thing to remember is that if you bring up a Hadoop cluster on any
    virtual server farm, disk IO is going to be way below physical IO rates.
    Even when the data is in HDFS, it will be slower to get at than
    dedicated high-RPM SCSI or SATA storage.
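Steve's multiple-TCP-connections point can be put as arithmetic: if a loss event halves the rate of a single stream, an n-stream transfer only loses 1/(2n) of its aggregate rate. A sketch of that idealized model (a deliberate simplification of real congestion control):

```python
def rate_after_one_loss(n_streams):
    """Fraction of aggregate rate kept when one of n equal streams
    halves its rate after a packet loss (idealized model)."""
    per_stream = 1.0 / n_streams
    return (n_streams - 1) * per_stream + per_stream / 2

print(rate_after_one_loss(1))  # single stream keeps 0.5
print(rate_after_one_loss(8))  # eight streams keep 0.9375
```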

Discussion Overview
group: common-user @ hadoop.apache.org
categories: hadoop
posted: Feb 10, '09 at 10:10p
active: Feb 11, '09 at 11:19a
posts: 11
users: 5
website: hadoop.apache.org...
irc: #hadoop
