Hi.

I have 2 questions about HDFS performance:

1) How fast are the read and write operations over the network, in Mbps?

2) If the chunk server is located on the same host as the client, is there any
optimization for read operations?
For example, Kosmos FS describes the following functionality:

"Localhost optimization: One copy of data
is placed on the chunkserver on the same
host as the client doing the write

Helps reduce network traffic"

Regards.


  • Ricky Ho at Apr 10, 2009 at 4:06 am
    I want to analyze the traffic pattern and statistics of a distributed application. I am thinking of having the application write the events as log entries into HDFS, and later I can use a Map/Reduce task to do the analysis in parallel. Is this a good approach?

    In this case, does HDFS support concurrent writes (appends) to a file? Another question is whether the write API is thread-safe.

    Rgds,
    Ricky
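HDFS write streams are not safe for concurrent use by multiple threads, so (as the replies note) events from many producers are usually funneled through a single writer. A minimal sketch of that pattern, with a plain in-memory sink standing in for the HDFS output stream (class and method names here are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Many threads enqueue events; one dedicated thread performs every write,
// so the (non-thread-safe) output stream never sees concurrent callers.
public class SingleWriterLog {
    private static final String EOF = "__EOF__";           // shutdown sentinel
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> sink = new ArrayList<>();   // stands in for an HDFS stream
    private final Thread writer;

    public SingleWriterLog() {
        writer = new Thread(() -> {
            try {
                String line;
                while (!(line = queue.take()).equals(EOF)) {
                    sink.add(line);                        // the only thread that writes
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
    }

    public void log(String line) { queue.add(line); }      // safe from any thread

    public List<String> close() throws InterruptedException {
        queue.add(EOF);
        writer.join();                                     // join makes sink safe to read
        return sink;
    }

    public static void main(String[] args) throws Exception {
        SingleWriterLog log = new SingleWriterLog();
        List<Thread> producers = new ArrayList<>();
        for (int t = 0; t < 4; t++) {
            final int id = t;
            Thread p = new Thread(() -> { for (int i = 0; i < 100; i++) log.log("t" + id); });
            p.start();
            producers.add(p);
        }
        for (Thread p : producers) p.join();
        System.out.println(log.close().size() + " lines written by a single thread"); // 400 lines
    }
}
```

The same shape works for any non-thread-safe sink: producers only touch the queue, and exactly one thread ever touches the stream.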
  • Alex Loddengaard at Apr 10, 2009 at 4:14 am
    This is a great idea and a common application, Ricky. Scribe is probably
    useful for you as well:

    <http://sourceforge.net/projects/scribeserver/>
    <http://www.flickr.com/photos/niallkennedy/2197670659/>

    Scribe is what Facebook uses to get its Apache logs to Hadoop.
    Unfortunately, HDFS doesn't (yet) have append, so you'll have to batch log
    files and load them into HDFS in bulk.

    Alex
    On Thu, Apr 9, 2009 at 9:06 PM, Ricky Ho wrote:

    I want to analyze the traffic pattern and statistics of a distributed
    application. I am thinking of having the application write the events as
    log entries into HDFS and then later I can use a Map/Reduce task to do the
    analysis in parallel. Is this a good approach ?

    In this case, does HDFS support concurrent write (append) to a file ?
    Another question is whether the write API thread-safe ?

    Rgds,
    Ricky
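Alex's batch-and-bulk-load approach can be sketched as a size-triggered roller: accumulate events locally and cut a new batch whenever a threshold is crossed; each finished batch is what you would then hand to `hadoop fs -put` (or `FileSystem.copyFromLocalFile`) in bulk. The threshold and class names are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Cut a new batch every time the current one reaches maxBytes; each finished
// batch is the unit you would bulk-load into HDFS as one file.
public class LogBatcher {
    private final int maxBytes;
    private final List<List<String>> finished = new ArrayList<>();
    private List<String> current = new ArrayList<>();
    private int currentBytes = 0;

    public LogBatcher(int maxBytes) { this.maxBytes = maxBytes; }

    public void append(String line) {
        current.add(line);
        currentBytes += line.length();
        if (currentBytes >= maxBytes) roll();
    }

    private void roll() {
        finished.add(current);        // in real use: close the file, bulk-copy to HDFS
        current = new ArrayList<>();
        currentBytes = 0;
    }

    public List<List<String>> flush() {
        if (!current.isEmpty()) roll();
        return finished;
    }

    public static void main(String[] args) {
        LogBatcher batcher = new LogBatcher(64);    // tiny threshold, for demonstration
        for (int i = 0; i < 20; i++) batcher.append("event-" + i + " payload");
        System.out.println(batcher.flush().size() + " batches ready for bulk load");
    }
}
```

A time-based trigger (roll every N minutes) slots in the same way; production systems typically use both.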
  • Brian Bockelman at Apr 10, 2009 at 4:19 am
    Also, Chukwa (a project already in Hadoop contrib) is designed to do
    something similar with Hadoop directly:

    http://wiki.apache.org/hadoop/Chukwa

    I think some of the examples even mention Apache logs. Haven't used
    it personally, but it looks nice.

    Brian
    On Apr 9, 2009, at 11:14 PM, Alex Loddengaard wrote:

    This is a great idea and a common application, Ricky. Scribe is
    probably
    useful for you as well:

    <http://sourceforge.net/projects/scribeserver/>
    <http://www.flickr.com/photos/niallkennedy/2197670659/>
    Scribe is what Facebook uses to get its Apache logs to Hadoop.
    Unfortunately, HDFS doesn't (yet) have append, so you'll have to
    batch log
    files and load them into HDFS in bulk.

    Alex
    On Thu, Apr 9, 2009 at 9:06 PM, Ricky Ho wrote:

    I want to analyze the traffic pattern and statistics of a distributed
    application. I am thinking of having the application write the
    events as
    log entries into HDFS and then later I can use a Map/Reduce task to
    do the
    analysis in parallel. Is this a good approach ?

    In this case, does HDFS support concurrent write (append) to a file ?
    Another question is whether the write API thread-safe ?

    Rgds,
    Ricky
  • Ariel Rabkin at Apr 13, 2009 at 2:38 pm
    Chukwa is a Hadoop subproject aiming to do something similar, though
    particularly for the case of Hadoop logs. You may find it useful.

    Hadoop unfortunately does not support concurrent appends. As a
    result, the Chukwa project found itself creating a whole new daemon,
    the Chukwa collector, precisely to merge the event streams and write
    them out just once. We're set to do a release within the next week or
    two, but in the meantime you can check it out from SVN at
    https://svn.apache.org/repos/asf/hadoop/chukwa/trunk

    --Ari
    On Fri, Apr 10, 2009 at 12:06 AM, Ricky Ho wrote:
    I want to analyze the traffic pattern and statistics of a distributed application.  I am thinking of having the application write the events as log entries into HDFS and then later I can use a Map/Reduce task to do the analysis in parallel.  Is this a good approach ?

    In this case, does HDFS support concurrent write (append) to a file ?  Another question is whether the write API thread-safe ?

    Rgds,
    Ricky


    --
    Ari Rabkin asrabkin@gmail.com
    UC Berkeley Computer Science Department
  • Ricky Ho at Apr 13, 2009 at 3:04 pm
    Ari, thanks for your note.

    I'd like to understand more about how Chukwa groups log entries. Say I have appA running on machines X and Y, and appB running on machines Y and Z, each of them calling the Chukwa log API.

    Do all the entries go into the same HDFS file, or into 4 separate HDFS files based on the app/machine combination?

    If the answer to the first question is "yes", what if appA and appB have different log-entry formats?
    If the answer to the second question is "yes", are all these HDFS files cut at the same time boundary?

    It looks like in Chukwa the application first logs to a daemon, which buffer-writes the log entries into a local file, and a separate process ships this data to a remote collector daemon, which issues the actual HDFS write. I observe the following overhead ...

    1) The overhead of the extra write to local disk and of shipping the data over to the collector. If HDFS supported append, the application could go directly to HDFS.

    2) The centralized collector creates a bottleneck in the otherwise perfectly parallel HDFS architecture.

    Am I missing something here ?

    Rgds,
    Ricky

    -----Original Message-----
    From: Ariel Rabkin
    Sent: Monday, April 13, 2009 7:38 AM
    To: core-user@hadoop.apache.org
    Subject: Re: HDFS as a logfile ??

    Chukwa is a Hadoop subproject aiming to do something similar, though
    particularly for the case of Hadoop logs. You may find it useful.

    Hadoop unfortunately does not support concurrent appends. As a
    result, the Chukwa project found itself creating a whole new demon,
    the chukwa collector, precisely to merge the event streams and write
    it out, just once. We're set to do a release within the next week or
    two, but in the meantime you can check it out from SVN at
    https://svn.apache.org/repos/asf/hadoop/chukwa/trunk

    --Ari
    On Fri, Apr 10, 2009 at 12:06 AM, Ricky Ho wrote:
    I want to analyze the traffic pattern and statistics of a distributed application.  I am thinking of having the application write the events as log entries into HDFS and then later I can use a Map/Reduce task to do the analysis in parallel.  Is this a good approach ?

    In this case, does HDFS support concurrent write (append) to a file ?  Another question is whether the write API thread-safe ?

    Rgds,
    Ricky


    --
    Ari Rabkin asrabkin@gmail.com
    UC Berkeley Computer Science Department
  • Ariel Rabkin at Apr 14, 2009 at 7:54 pm
    Everything gets dumped into the same files.

    We don't assume anything at all about the format of the input data; it
    gets dumped into Hadoop sequence files, tagged with some metadata to
    say what machine and app it came from, and where it was in the
    original stream.

    There is a slight penalty from the log-to-local disk. In practice, you
    often want a local copy anyway, both for redundancy and because it's
    much more convenient for human inspection. Having a separate
    collector process is indeed inelegant. However, HDFS copes badly with
    many small files, so that pushes you to merge entries across either
    hosts or time partitions. And since HDFS doesn't have a flush(),
    having one log per source means that log entries don't become visible
    quickly enough. Hence, collectors.

    It isn't gorgeous, but it seems to work fine in practice.
    On Mon, Apr 13, 2009 at 8:01 AM, Ricky Ho wrote:
    Ari, thanks for your note.

    I'd like to understand more about how Chukwa groups log entries. Say I have appA running on machines X and Y, and appB running on machines Y and Z, each of them calling the Chukwa log API.

    Do all the entries go into the same HDFS file, or into 4 separate HDFS files based on the app/machine combination?

    If the answer to the first question is "yes", what if appA and appB have different log-entry formats?
    If the answer to the second question is "yes", are all these HDFS files cut at the same time boundary?

    It looks like in Chukwa the application first logs to a daemon, which buffer-writes the log entries into a local file, and a separate process ships this data to a remote collector daemon, which issues the actual HDFS write. I observe the following overhead ...

    1) The overhead of the extra write to local disk and of shipping the data over to the collector. If HDFS supported append, the application could go directly to HDFS.

    2) The centralized collector creates a bottleneck in the otherwise perfectly parallel HDFS architecture.

    Am I missing something here ?
    --
    Ari Rabkin asrabkin@gmail.com
    UC Berkeley Computer Science Department
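Ari's collector design above (merge many per-host streams into one output, tagging each record with its origin and position) can be illustrated with a toy merge. Chukwa actually writes Hadoop sequence files; a plain string record stands in here, and the tag layout is a made-up illustration:

```java
import java.util.ArrayList;
import java.util.List;

// A toy "collector": merge events from several (host, app) sources into one
// output, tagging every record with its origin and position in the original
// stream -- the same kind of metadata Chukwa attaches before writing.
public class ToyCollector {
    // Each source array is {host, app, event1, event2, ...}.
    public static List<String> merge(List<String[]> sources) {
        List<String> out = new ArrayList<>();
        for (String[] src : sources) {
            String host = src[0], app = src[1];
            for (int i = 2; i < src.length; i++) {
                out.add(host + "/" + app + "@" + (i - 2) + ": " + src[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> sources = new ArrayList<>();
        sources.add(new String[]{"X", "appA", "start", "stop"});
        sources.add(new String[]{"Y", "appB", "start"});
        merge(sources).forEach(System.out::println);
    }
}
```

Because every record carries its (host, app, offset) tag, a later MapReduce job can demultiplex the merged file however it likes, which is why mixing formats in one file is acceptable.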
  • Alex Loddengaard at Apr 10, 2009 at 4:07 am
    Answers in-line.

    Alex
    On Thu, Apr 9, 2009 at 3:45 PM, Stas Oskin wrote:

    Hi.

    I have 2 questions about HDFS performance:

    1) How fast are the read and write operations over network, in Mbps per
    second?
    Hypertable (a BigTable implementation) has a good KFS vs. HDFS breakdown:
    <http://code.google.com/p/hypertable/wiki/KFSvsHDFS>

    2) If the chunk server is located on same host as the client, is there any
    optimization in read operations?
    For example, Kosmos FS describe the following functionality:

    "Localhost optimization: One copy of data
    is placed on the chunkserver on the same
    host as the client doing the write

    Helps reduce network traffic"
    In Hadoop-speak, we're interested in DataNodes (storage nodes) and
    TaskTrackers (compute nodes). In terms of MapReduce, Hadoop does try and
    schedule tasks such that the data being processed by a given task on a given
    machine is also on that machine. As for loading data onto a DataNode:
    writing data from a machine that is itself a DataNode will put a replica on
    that node. However, if you're loading data from, say, your local machine,
    Hadoop will choose a DataNode at random.

    Regards.
  • Stas Oskin at Apr 10, 2009 at 2:38 pm
    Hi.
    Hypertable (a BigTable implementation) has a good KFS vs. HDFS breakdown:
    <http://code.google.com/p/hypertable/wiki/KFSvsHDFS>
    From this comparison it seems KFS is quite a bit faster than HDFS for small
    data transfers (for SQL commands).

    Any idea if the same holds true for small-to-medium (20 MB - 150 MB) files?


    2) If the chunk server is located on same host as the client, is there any
    optimization in read operations?
    For example, Kosmos FS describe the following functionality:

    "Localhost optimization: One copy of data
    is placed on the chunkserver on the same
    host as the client doing the write

    Helps reduce network traffic"
    In Hadoop-speak, we're interested in DataNodes (storage nodes) and
    TaskTrackers (compute nodes). In terms of MapReduce, Hadoop does try and
    schedule tasks such that the data being processed by a given task on a
    given
    machine is also on that machine. As for loading data onto a DataNode,
    loading data from a DataNode will put a replica on that node. However, if
    you're loading data from, say, your local machine, Hadoop will choose a
    DataNode at random.
    Ah, so if a DataNode stores a file to HDFS, HDFS will try to place a replica
    on that same DataNode as well? And if that DataNode then tries to read the
    file, HDFS will try to read it from itself first?

    Regards.
  • Brian Bockelman at Apr 10, 2009 at 4:31 am

    On Apr 9, 2009, at 5:45 PM, Stas Oskin wrote:

    Hi.

    I have 2 questions about HDFS performance:

    1) How fast are the read and write operations over network, in Mbps
    per
    second?
    Depends. What hardware? How much hardware? Is the cluster under
    load? What does your I/O load look like? As a rule of thumb, you'll
    probably expect very close to hardware speed.

    At this instant, our cluster is doing about 10k reads / sec, each read
    is 128KB. About 1.2GB / s. The max we've recorded on this cluster is
    8GB/s. http://rcf.unl.edu/ganglia/?c=red-workers However, you
    probably don't care much about that unless you get to run on our
    hardware ;)

    Unfortunately, this question is kind of like asking "How fast is
    Linux?" There's just so many different performance characteristics
    that giving a good answer is always tough. My recommendation is to
    take an afternoon and play around with it.
    2) If the chunk server is located on same host as the client, is
    there any
    optimization in read operations?
    For example, Kosmos FS describe the following functionality:

    "Localhost optimization: One copy of data
    is placed on the chunkserver on the same
    host as the client doing the write

    Helps reduce network traffic"
    Yes, this optimization is performed.

    Brian
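Brian's figures above are internally consistent: 10,000 reads/sec at 128 KB each comes to about the 1.2 GB/s he quotes. A quick check:

```java
// Sanity-check the quoted cluster rate: reads/sec * bytes/read, in GiB/s.
public class ClusterRate {
    public static double gibPerSec(int readsPerSec, int bytesPerRead) {
        return (double) readsPerSec * bytesPerRead / (1L << 30);
    }

    public static void main(String[] args) {
        // 10,000 reads/sec, 128 KiB each
        System.out.printf("%.2f GiB/s%n", gibPerSec(10_000, 128 * 1024)); // ~1.22 GiB/s
    }
}
```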
  • Stas Oskin at Apr 10, 2009 at 2:40 pm
    Hi.

    Depends. What hardware? How much hardware? Is the cluster under load?
    What does your I/O load look like? As a rule of thumb, you'll probably
    expect very close to hardware speed.
    Standard Xeon dual cpu, quad core servers, 4 GB RAM.
    The DataNodes also do some processing, with usual loads about ~4 (from 8
    recommended). The IO load is linear, there are almost no write or read
    peaks.

    By close to hardware speed, you mean results very near the results I get via
    iozone?

    Thanks.
  • Brian Bockelman at Apr 10, 2009 at 5:16 pm

    On Apr 10, 2009, at 9:40 AM, Stas Oskin wrote:

    Hi.

    Depends. What hardware? How much hardware? Is the cluster under
    load?
    What does your I/O load look like? As a rule of thumb, you'll
    probably
    expect very close to hardware speed.
    Standard Xeon dual cpu, quad core servers, 4 GB RAM.
    The DataNodes also do some processing, with usual loads about ~4
    (from 8
    recommended). The IO load is linear, there are almost no write or read
    peaks.
    Interesting -- machines are fairly RAM-poor for data processing ... I
    guess your tasks must be fairly efficient.
    By close to hardware speed, you mean results very near the results I
    get via
    iozone?
    Depends on what kind of I/O you do - are you going to be using
    MapReduce and co-locating jobs and data? If so, it's possible to get
    close to those speeds if you are I/O bound in your job and read right
    through each chunk. If you have multiple disks mounted individually,
    you'll need the number of streams equal to the number of disks. If
    you're going to do I/O that's not through MapReduce, you'll probably
    be bound by the network interface.

    Brian
  • Stas Oskin at Apr 10, 2009 at 6:52 pm
    Hi.
    Depends on what kind of I/O you do - are you going to be using MapReduce
    and co-locating jobs and data? If so, it's possible to get close to those
    speeds if you are I/O bound in your job and read right through each chunk.
    If you have multiple disks mounted individually, you'll need the number of
    streams equal to the number of disks. If you're going to do I/O that's not
    through MapReduce, you'll probably be bound by the network interface.
    Btw, this what I wanted to ask as well:

    Is it more efficient to unify the disks into one volume (RAID or LVM) and
    present them as a single space, or is it better to specify each disk
    separately?

    Reliability-wise, the latter sounds more correct, as a single/several (up to
    3) disks going down won't take the whole node with them. But perhaps there
    is a performance penalty?
  • Konstantin Shvachko at Apr 10, 2009 at 9:33 pm
    I just wanted to add to this one other published benchmark
    http://developer.yahoo.net/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html
    In this example on a very busy cluster of 4000 nodes both read and write throughputs
    were close to the local disk bandwidth.
    This benchmark (called TestDFSIO) uses large consequent write and reads.
    You can run it yourself on your hardware to compare.
    Is it more efficient to unify the disks into one volume (RAID or LVM), and
    then present them as a single space? Or it's better to specify each disk
    separately?
    There was a discussion recently on this list about RAID0 vs separate disks.
    Please search the archives. Separate disks turn out to perform better.
    Reliability-wise, the latter sounds more correct, as a single/several (up to
    3) disks going down won't take the whole node with them. But perhaps there
    is a performance penalty?
    You always have block replicas on other nodes, so one node going down should not be a problem.

    Thanks,
    --Konstantin
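The TestDFSIO benchmark Konstantin mentions is driven from the Hadoop test jar; a typical invocation looks like the following (the jar name varies by Hadoop version, so adjust it to your distribution):

```shell
# Write 10 files of 1000 MB each, then read them back; throughput and
# average I/O rate are appended to TestDFSIO_results.log locally.
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -read  -nrFiles 10 -fileSize 1000
# Remove the benchmark's files from HDFS when done.
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -clean
```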
  • Stas Oskin at Dec 28, 2009 at 7:02 pm
    Hi.

    Going back to the subject, has anyone ever benchmarked small (10 - 20 node)
    HDFS clusters?

    I did my own speed checks, and it seems I can reach ~77Mbps, on a quad-disk
    node. This comes to ~19Mbps per disk, which seems quite low in my opinion.

    Can anyone advise about this?

    Thanks.
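Stas's arithmetic, with the bits-vs-bytes distinction he returns to later in the thread made explicit: 77 Mbit/s spread across four disks is about 19 Mbit/s, i.e. only ~2.4 MB/s, per disk:

```java
// 77 Mbit/s aggregate over 4 disks, converted to per-disk figures.
public class DiskMath {
    public static double mbitPerDisk(double totalMbit, int disks) { return totalMbit / disks; }
    public static double mbytesPerSec(double mbit) { return mbit / 8.0; }

    public static void main(String[] args) {
        double perDisk = mbitPerDisk(77, 4);   // 19.25 Mbit/s per disk
        System.out.println(perDisk + " Mbit/s = " + mbytesPerSec(perDisk) + " MB/s per disk");
    }
}
```

At ~2.4 MB/s per disk this is far below what a single SATA drive can sustain sequentially, which is why the figure drew questions below.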
  • Stas Oskin at Jan 2, 2010 at 10:54 pm
    Hi.

    Can anyone advise on the subject below?

    Thanks!
    On Mon, Dec 28, 2009 at 9:01 PM, Stas Oskin wrote:

    Hi.

    Going back to the subject, has anyone ever bench-marked small (10 - 20
    node) HDFS clusters?

    I did my own speed checks, and it seems I can reach ~77Mbps, on a quad-disk
    node. This comes to ~19Mbps per disk, which seems quite low in my opinion.

    Can anyone advice about this?

    Thanks.
  • Eli Collins at Jan 3, 2010 at 12:03 am
    Hey Stas,

    Can you provide more information about your workload and the
    environment? eg are you running t.o.a.h.h.BenchmarkThroughput,
    TestDFSIO, or timing hadoop fs -put/get to transfer data to hdfs from
    another machine, looking at metrics, etc. What else is running on the
    cluster? Have you profiled? etc. 77Mb/s (<10MB/s) seems low but w/o
    context is not meaningful.

    Thanks,
    Eli
    On Sat, Jan 2, 2010 at 2:54 PM, Stas Oskin wrote:
    Hi.

    Can anyone advice on the subject below?

    Thanks!
    On Mon, Dec 28, 2009 at 9:01 PM, Stas Oskin wrote:

    Hi.

    Going back to the subject, has anyone ever bench-marked small (10 - 20
    node) HDFS clusters?

    I did my own speed checks, and it seems I can reach ~77Mbps, on a quad-disk
    node. This comes to ~19Mbps per disk, which seems quite low in my opinion.

    Can anyone advice about this?

    Thanks.
  • Stas Oskin at Jan 5, 2010 at 1:56 pm
    Hi.

    Can you provide more information about your workload and the
    environment? eg are you running t.o.a.h.h.BenchmarkThroughput,
    TestDFSIO, or timing hadoop fs -put/get to transfer data to hdfs from
    another machine, looking at metrics, etc. What else is running on the
    cluster? Have you profiled? etc. 77Mb/s (<10MB/s) seems low but w/o
    context is not meaningful.

    I actually tested it with a simple Java test loader I quickly put together,
    which ran on each machine and continuously wrote random data to DFS. I tuned
    the writing rate until I got ~77Mb/s - above that, the iowait load on each
    disk (measured by iostat) rose above 50% - 60%, which is quite close to the
    disks' limits.

    If there is a more official / better way to do it, I'll be happy to get some
    pointers to it.
    You mentioned some TestDFSIO, any idea if it's present in 0.18.3?

    Regards.
  • Stas Oskin at Jan 5, 2010 at 2:12 pm
    Hi again.

    By the way, I forgot to mention that I run the tests on the same machines
    that serve as DataNodes, i.e. each machine acts both as a client and as a
    DataNode.

    Regards.
  • Eli Collins at Jan 10, 2010 at 8:53 am

    I actually tested it with a simple Java test loader I quickly put together,
    which ran on each machine and continuously wrote random data to DFS. I tuned
    the writing rate until I got ~77Mb/s - above that, the iowait load on each
    disk (measured by iostat) rose above 50% - 60%, which is quite close to the
    disks' limits.
    How many DNs are you using? How many copies of the benchmark are you
    running? What results do you get just running a single copy of the
    benchmark?

    I see ~46 MB/s hadoop fs put'ing a local 1gb file from one DN, using
    3-way replication. Running the test on three DNs I get around 30 MB/s.
    This is a little less than half the theoretical limit (using three
    hosts each with a single gigabit nic). In these tests I purged the
    buffer cache before running the test, with the input file cached in
    memory (more similar to your test) I get 92 MB/s on one host but about
    the same rate for three hosts (we're network bound). This is about 3x
    faster than what you're seeing so I suspect something's up with your
    test. Would be useful for you to see what results you get running the
    same test I did.
    You mentioned some TestDFSIO, any idea if it's present in 0.18.3?
    It's in 0.18.3. See src/test/org/apache/hadoop/fs/TestDFSIO.java.

    Thanks,
    Eli
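Eli's numbers also reflect how the HDFS write pipeline moves data: with replication r, the block is forwarded along a chain of r datanodes, so each written byte crosses the network up to r times (one hop fewer when the first replica lands on the writing host, as in his datanode-local test). A back-of-envelope helper, with the ~119 MB/s gigabit figure as an assumption:

```java
public class PipelineTraffic {
    // Network crossings per written byte: the write pipeline forwards the
    // stream through `replication` datanodes in a chain; the first hop is
    // free when the client runs on a datanode (first replica is local).
    public static int networkCrossings(int replication, boolean clientOnDatanode) {
        return clientOnDatanode ? replication - 1 : replication;
    }

    public static void main(String[] args) {
        double nicMBps = 119.0;                 // ~1 Gbit/s NIC, an assumed ceiling
        int hops = networkCrossings(3, true);   // Eli's test: put from a datanode
        System.out.println(hops + " network crossings per byte written");
        System.out.println("so a single write stays well under " + nicMBps + " MB/s");
    }
}
```

With each forwarding host both receiving and re-sending the stream, a single replicated write cannot approach raw NIC speed, consistent with Eli seeing "a little less than half the theoretical limit".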
  • Andreas Kostyrka at Jan 3, 2010 at 3:22 am
    Well, that all depends on many details, but:

    -) are you really using 4 discs (configured correctly as data
    directories?)

    -) What hdd/connection technology?

    -) And 77MB/s would match up curiously well with a 1Gbit network card.
    Are you sure that you are testing a completely local setup? Where's your
    name node running, then?

    Andreas


    Am Sonntag, den 03.01.2010, 00:54 +0200 schrieb Stas Oskin:
    Hi.

    Can anyone advice on the subject below?

    Thanks!
    On Mon, Dec 28, 2009 at 9:01 PM, Stas Oskin wrote:

    Hi.

    Going back to the subject, has anyone ever bench-marked small (10 - 20
    node) HDFS clusters?

    I did my own speed checks, and it seems I can reach ~77Mbps, on a quad-disk
    node. This comes to ~19Mbps per disk, which seems quite low in my opinion.

    Can anyone advice about this?

    Thanks.
  • Rajesh Balamohan at Jan 3, 2010 at 1:19 pm
    Also, it would be interesting to know the "data.replication" setting you
    used for this benchmark.
    On Sun, Jan 3, 2010 at 8:51 AM, Andreas Kostyrka wrote:

    Well, that all depends on many details, but:

    -) are you really using 4 discs (configured correctly as data
    directories?)

    -) What hdd/connection technology?

    -) And 77MB/s would match up curiously well with a 1Gbit network card.
    Are you sure that you are testing a completely local setup? Where's your
    name node running, then?

    Andreas


    Am Sonntag, den 03.01.2010, 00:54 +0200 schrieb Stas Oskin:
    Hi.

    Can anyone advice on the subject below?

    Thanks!
    On Mon, Dec 28, 2009 at 9:01 PM, Stas Oskin wrote:

    Hi.

    Going back to the subject, has anyone ever bench-marked small (10 - 20
    node) HDFS clusters?

    I did my own speed checks, and it seems I can reach ~77Mbps, on a
    quad-disk
    node. This comes to ~19Mbps per disk, which seems quite low in my
    opinion.
    Can anyone advice about this?

    Thanks.

    --
    ~Rajesh.B
  • Stas Oskin at Jan 5, 2010 at 2:00 pm
    Hi.

    Also, It would be interesting to know "data.replication" setting you have
    for this benchmark?
    data.replication = 2

    A bit off topic - is it safe to use such a number? About a year ago I heard
    only 3-way replication was fully tested, while 2-way had some issues - was
    this fixed in subsequent versions?

    Regards.
  • Eli Collins at Jan 10, 2010 at 8:56 am

    data.replication = 2

    A bit off topic - is it safe to use such a number? About a year ago I heard
    only 3-way replication was fully tested, while 2-way had some issues - was
    this fixed in subsequent versions?
    I think that's still a relatively untested configuration, though I'm
    not aware of any known bugs with it. I know of at least one cluster
    that uses 2-way replication. Note that 3-way replication is used both
    for availability and performance, though in a write benchmark 2-way
    replication should be faster than 3-way.

    Thanks,
    Eli
  • Brian Bockelman at Jan 16, 2010 at 5:20 pm
    Hi,

    We run with 2-way replication. The wonderful folks at Yahoo! worked through most of the bugs during 0.19.x IIRC. There were never any bugs with 2-way replication per se, but running a cluster with 2 replicas exposed other bugs at a 100x rate compared to running with 3 replicas (due to the fact that a silent corruption + loss of a single data node = file loss).

    I'd estimate we lose files at a rate of about 1 per month for 200TB of actual data. That number would probably go down an order of magnitude or more if we were running with 3 replicas.

    Hope this helps.

    Brian
    On Jan 10, 2010, at 3:55 AM, Eli Collins wrote:

    data.replication = 2

    A bit off topic - is it safe to use such a number? About a year ago I heard
    only 3-way replication was fully tested, while 2-way had some issues - was
    this fixed in subsequent versions?
    I think that's still a relatively untested configuration, though I'm
    not aware of any known bugs with it. I know of at least one cluster
    that uses 2-way replication. Note that 3-way replication is used both
    for availability and performance, though in a write benchmark 2-way
    replication should be faster than 3-way.

    Thanks,
    Eli
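Brian's "100x" observation is roughly what an independent-failure model predicts: losing a block requires losing every replica, so each extra replica multiplies the loss probability by another small per-replica factor. A deliberately crude sketch (the per-replica probability is a made-up illustration, not a measured number):

```java
// Back-of-envelope: a block is lost only if every replica fails, so with an
// independent per-replica failure probability p the loss probability is p^r.
public class ReplicaLoss {
    public static double blockLossProb(double p, int replicas) {
        return Math.pow(p, replicas);
    }

    public static void main(String[] args) {
        double p = 0.01;  // illustrative per-replica failure probability, not measured
        double ratio = blockLossProb(p, 2) / blockLossProb(p, 3);
        System.out.println("2-way vs 3-way loss ratio: " + ratio);  // ~1/p, i.e. ~100x
    }
}
```

Real failures are not fully independent (correlated disk batches, rack outages), so treat this only as intuition for why one extra replica buys orders of magnitude.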
  • Stas Oskin at Jan 17, 2010 at 5:27 pm
    Hi.

    We run with 2-way replication. The wonderful folks at Yahoo! worked through
    most of the bugs during 0.19.x IIRC. There was never any bugs with 2-way
    replication per-se, but running a cluster with 2 replicas exposed other bugs
    at a 100x rate compared to running with 3 replicas (due to the fact that a
    silent corruption + loss of a single data node = file loss).

    I'd estimate we lose files at a rate of about 1 per month for 200TB of
    actual data. That number would probably go down an order of magnitude or
    more if we were running with 3 replicas.

    Hope this helps.
    Thanks for sharing!

    So there is good reason to believe that version 0.19 and higher have the
    file storage / silent corruption issues sorted out?

    Regards.
  • Stas Oskin at Jan 5, 2010 at 1:58 pm
    Hi.

    Well, that all depends on many details, but:
    -) are you really using 4 discs (configured correctly as data
    directories?)
    Yes, 4 directories, one per disk.

    -) What hdd/connection technology?
    SATA 3Gb/s

    -) And 77MB/s would match up curiously well with 1Gbit networking cards?
    So you sure that you are testing a completely local setup? Where's your
    name node running then?
    I actually mixed this up - it's 77Mb/s (bits, not bytes); sorry for the confusion.

    Regards.
  • Owen O'Malley at Apr 10, 2009 at 4:03 pm

    On Thu, Apr 9, 2009 at 9:30 PM, Brian Bockelman wrote:
    On Apr 9, 2009, at 5:45 PM, Stas Oskin wrote:

    Hi.

    I have 2 questions about HDFS performance:

    1) How fast are the read and write operations over network, in Mbps per
    second?
    Depends.  What hardware?  How much hardware?  Is the cluster under load?
    What does your I/O load look like?  As a rule of thumb, you'll probably
    expect very close to hardware speed.
    For comparison, on a 1400 node cluster, I can checksum 100 TB in
    around 10 minutes, which means I'm seeing read averages of roughly 166
    GB/sec. For writes with replication of 3, I see roughly 40-50 minutes
    to write 100TB, so roughly 33 GB/sec average. Of course the peaks are
    much higher. Each node has 4 SATA disks, dual quad core, and 8 GB of
    ram.

    -- Owen
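Owen's averages follow directly from the quoted volumes and times (using decimal TB and GB): 100 TB in 10 minutes is ~167 GB/s, and in 40-50 minutes ~33-42 GB/s:

```java
// Owen's cluster-wide averages, from the quoted volumes and times
// (decimal units: 1 TB = 1000 GB).
public class ChecksumRate {
    public static double gbPerSec(double terabytes, double minutes) {
        return terabytes * 1000.0 / (minutes * 60.0);
    }

    public static void main(String[] args) {
        System.out.printf("read:  %.1f GB/s%n", gbPerSec(100, 10));  // ~166.7
        System.out.printf("write: %.1f GB/s%n", gbPerSec(100, 50));  // ~33.3
    }
}
```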
  • Stas Oskin at Apr 10, 2009 at 4:07 pm
    Thanks for sharing.
    For comparison, on a 1400 node cluster, I can checksum 100 TB in
    around 10 minutes, which means I'm seeing read averages of roughly 166
    GB/sec. For writes with replication of 3, I see roughly 40-50 minutes
    to write 100TB, so roughly 33 GB/sec average. Of course the peaks are
    much higher. Each node has 4 SATA disks, dual quad core, and 8 GB of
    ram.
    From your experience, how RAM-"hungry" is HDFS? Meaning, would an additional
    4GB of RAM (to make it 8GB, as in your case) really change anything?

    Regards.
  • Owen O'Malley at Apr 10, 2009 at 4:25 pm

    On Apr 10, 2009, at 9:07 AM, Stas Oskin wrote:

    From your experience, how RAM-"hungry" is HDFS? Meaning, would an additional
    4GB of RAM (to make it 8GB, as in your case) really change anything?
    I don't think the 4 to 8GB would matter much for HDFS. For Map/Reduce,
    it is very important.

    -- Owen

Discussion Overview
group: common-user
categories: hadoop
posted: Apr 9, '09 at 10:46p
active: Jan 17, '10 at 5:27p
posts: 30
users: 11
website: hadoop.apache.org...
irc: #hadoop
