Hadoop use direct I/O in Linux?

  • Da Zheng at Jan 3, 2011 at 6:49 pm
    Hello,

    I don't know which mailing list is better for this question, so I'd like
    to forward my question to this mailing list.

    If no one is thinking of adding direct I/O to Hadoop, I will do it myself.
    I have located the code, but the thing is that I'm not familiar with the
    Hadoop build environment. I could use jposix, but I don't know how
    to integrate it into Hadoop (jposix uses JNI). Are there any instructions for doing this?

    Thank you,
    Da


    -------- Original Message --------
    Subject: Hadoop use direct I/O in Linux?
    Date: Sun, 02 Jan 2011 15:01:18 -0500
    From: Da Zheng <zhengda1936@gmail.com>
    To: common-user@hadoop.apache.org



    Hello,

    Direct I/O can make a huge performance difference, especially when Atom processors
    are used, but as far as I know, Hadoop doesn't use Linux direct I/O. Does
    anyone know of any unofficial versions that were developed to use direct I/O?

    I googled it and found that FUSE provides an option for direct I/O. If I use FUSE DFS
    and enable direct I/O, will I get what I want? That is, when I write data to HDFS,
    will the data be written to the disk directly (no caching by any file system)? Or
    does this direct I/O option only let me bypass the caching in FUSE, while the data
    is still cached by the underlying FS?

    Best,
    Da
  • Brian Bockelman at Jan 3, 2011 at 7:41 pm
    Hi Da,

    It's not immediately clear to me the size of the benefit versus the costs. Two cases where one normally thinks about direct I/O are:
    1) The usage scenario is a cache anti-pattern. This will be true for some Hadoop use cases (MapReduce), not true for some others.
    - http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
    2) The application manages its own cache. Not applicable.
    Atom processors, which you mention below, will just exacerbate (1) due to the small cache size.

    Since it should help with the MapReduce case, there's probably an overall benefit to the community. However, what is the cost in terms of code complexity? Here's an LKML post from Linus mentioning all the nasty parts of doing O_DIRECT:

    http://lkml.org/lkml/2007/1/11/129

    Other choice quotes from Linus:

    """
    The right way to do it is to just not use O_DIRECT.

    The whole notion of "direct IO" is totally braindamaged. Just say no.

    This is your brain: O
    This is your brain on O_DIRECT: .

    Any questions?

    I should have fought back harder. There really is no valid reason for EVER
    using O_DIRECT. You need a buffer whatever IO you do, and it might as well
    be the page cache. There are better ways to control the page cache than
    play games and think that a page cache isn't necessary.

    So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
    instead.
    """

    It sounds like trying to implement direct I/O is the way of bugs, data loss, and pain for Hadoop.

    However, the OpenJDK list has a few opinions of their own (where I found the above quotes):

    http://markmail.org/message/rmty2xsl45p7klbt

    Using properly aligned NIO is going to get you most of the way (guess what Hadoop does already!). The other thing to try is using posix_fadvise via JNI. In fact, it appears Lucene has considered this:

    http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html

    In their case, fadvise *didn't* work well, but for a reason which shouldn't be true for MapReduce.

    All-in-all, doing this specialization such that you don't hurt the general case is going to be tough.

    Brian
    On Jan 3, 2011, at 12:46 PM, Da Zheng wrote:

    Hello,

    I don't know which mailing list is better for this question, so I like to forward my questions to this mailing list.

    If no one is thinking of doing direct IO in Hadoop, I will do it myself. I have located the code, but the thing is that I'm not familiar with the environment of compiling Hadoop. I can use jposix, but I don't know how to integrate it to Hadoop (jposix uses JNI). Any instructions to do it?

    Thank you,
    Da


    -------- Original Message --------
    Subject: Hadoop use direct I/O in Linux?
    Date: Sun, 02 Jan 2011 15:01:18 -0500
    From: Da Zheng <zhengda1936@gmail.com>
    To: common-user@hadoop.apache.org



    Hello,

    direct IO can make huge performance difference, especially when Atom processors
    are used. but as far as I know, hadoop doesn't enable direct IO of Linux. Does
    anyone know any unofficial versions were developed to use direct IO?

    I googled it, and found FUSE provides an option for direct IO. If I use FUSE DFS
    and enable direct IO, will I get what I want? i.e., when I write data to HDFS,
    the data is written to the disk directly (no caching by any file systems)? or
    this direct IO option only allows me to bypass the caching in FUSE and the data
    is still cached by the underlying FS?

    Best,
    Da
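
    A rough sketch of the "properly aligned NIO" idea Brian mentions: keep the data in a direct (off-heap) ByteBuffer whose start address and length are multiples of the page size, which is the precondition a native O_DIRECT path would need anyway. This is illustrative only, not Hadoop's actual write path, and it assumes a 4KB page size and JDK 9+ for ByteBuffer.alignedSlice:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class AlignedNioWrite {
        public static void main(String[] args) throws IOException {
            final int align = 4096;        // assumed page/sector size
            final int bufSize = 1 << 20;   // ~1MB per write

            // Over-allocate a direct buffer and slice it so the payload starts
            // on a 4KB boundary (ByteBuffer.alignedSlice is JDK 9+).
            ByteBuffer buf = ByteBuffer.allocateDirect(bufSize + 2 * align)
                                       .alignedSlice(align);
            // Keep the write length a multiple of the alignment as well.
            buf.limit(align * (buf.capacity() / align));
            while (buf.hasRemaining())
                buf.put((byte) buf.position());
            buf.flip();

            try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                while (buf.hasRemaining())
                    ch.write(buf);
            }
        }
    }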
  • Christopher Smith at Jan 3, 2011 at 11:18 pm

    On Mon, Jan 3, 2011 at 11:40 AM, Brian Bockelman wrote:

    It's not immediately clear to me the size of the benefit versus the costs.
    Two cases where one normally thinks about direct I/O are:
    1) The usage scenario is a cache anti-pattern. This will be true for some
    Hadoop use cases (MapReduce), not true for some others.
    - http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
    2) The application manages its own cache. Not applicable.
    Atom processors, which you mention below, will just exacerbate (1) due to
    the small cache size.
    Actually, assuming you thrash the cache anyway, having a smaller cache can
    often be a good thing. ;-)

    All-in-all, doing this specialization such that you don't hurt the general
    case is going to be tough.

    For the Hadoop case, the advantages of O_DIRECT would seem to be
    comparatively petty to using O_APPEND and/or MMAP (yes, I realize this is
    not quite the same as what you are proposing, but it seems close enough for
    most cases.. Your best case for a win is when you have reasonably random
    access to a file, and then something else that would benefit from more logve
    --
    Chris
  • Brian Bockelman at Jan 4, 2011 at 1:05 am

    On Jan 3, 2011, at 5:17 PM, Christopher Smith wrote:
    On Mon, Jan 3, 2011 at 11:40 AM, Brian Bockelman wrote:

    It's not immediately clear to me the size of the benefit versus the costs.
    Two cases where one normally thinks about direct I/O are:
    1) The usage scenario is a cache anti-pattern. This will be true for some
    Hadoop use cases (MapReduce), not true for some others.
    - http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
    2) The application manages its own cache. Not applicable.
    Atom processors, which you mention below, will just exacerbate (1) due to
    the small cache size.
    Actually, assuming you thrash the cache anyway, having a smaller cache can
    often be a good thing. ;-)
    Assuming no other thread wants to use that poor cache you are thrashing ;)
    All-in-all, doing this specialization such that you don't hurt the general
    case is going to be tough.

    For the Hadoop case, the advantages of O_DIRECT would seem to be
    comparatively petty to using O_APPEND and/or MMAP (yes, I realize this is
    not quite the same as what you are proposing, but it seems close enough for
    most cases.. Your best case for a win is when you have reasonably random
    access to a file, and then something else that would benefit from more logve
    Actually, our particular site would greatly benefit from O_DIRECT - we have non-MapReduce clients with a highly non-repetitive, random read I/O pattern with an actively managed application-level read-ahead (note: because we're almost guaranteed to wait for a disk seek - 2PB of SSDs are a touch pricey, the latency overheads of Java are not actually too important). The OS page cache is mostly useless for us as the working set size is on the order of a few hundred TB.

    However, I wouldn't actively clamor for O_DIRECT support, but could probably do wonders with a HDFS-equivalent to fadvise. I really don't want to get into the business of managing buffering in my application code any more than we already do.

    Brian

    PS - if there are bored folks wanting to do something beneficial to high-performance HDFS, I'd note that currently it is tough to get >1Gbps performance from a single Hadoop client transferring multiple files. However, HP labs had a clever approach: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf . I'd love to see a generic, easy-to-use API to do this.
  • Christopher Smith at Jan 4, 2011 at 2:48 am

    On Mon, Jan 3, 2011 at 5:05 PM, Brian Bockelman wrote:
    On Jan 3, 2011, at 5:17 PM, Christopher Smith wrote:
    On Mon, Jan 3, 2011 at 11:40 AM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
    It's not immediately clear to me the size of the benefit versus the
    costs.
    Two cases where one normally thinks about direct I/O are:
    1) The usage scenario is a cache anti-pattern. This will be true for
    some
    Hadoop use cases (MapReduce), not true for some others.
    - http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
    2) The application manages its own cache. Not applicable.
    Atom processors, which you mention below, will just exacerbate (1) due
    to
    the small cache size.
    Actually, assuming you thrash the cache anyway, having a smaller cache can
    often be a good thing. ;-)
    Assuming no other thread wants to use that poor cache you are thrashing ;)

    Even then: a small cache can be cleared up more quickly. As in all cases, it
    very much depends on circumstance, but much like O_DIRECT, if you are
    blowing the cache anyway, there is little at stake.
    All-in-all, doing this specialization such that you don't hurt the
    general
    case is going to be tough.
    For the Hadoop case, the advantages of O_DIRECT would seem to be
    comparatively petty to using O_APPEND and/or MMAP (yes, I realize this is
    not quite the same as what you are proposing, but it seems close enough for
    most cases.. Your best case for a win is when you have reasonably random
    access to a file, and then something else that would benefit from more
    logve

    Actually, our particular site would greatly benefit from O_DIRECT - we have
    non-MapReduce clients with a highly non-repetitive, random read I/O pattern
    with an actively managed application-level read-ahead (note: because we're
    almost guaranteed to wait for a disk seek - 2PB of SSDs are a touch pricey,
    the latency overheads of Java are not actually too important). The OS page
    cache is mostly useless for us as the working set size is on the order of a
    few hundred TB.
    Sounds like a lot of fun! Even in a circumstance like the one you describe,
    unless the I/O pattern isn't truly random and some application level insight
    provides a unique advantage, the page cache will often do a better job of
    managing the memory both in terms of caching and read-ahead (it becomes a
    lot like the "building a better TCP using UDP": possible, but not really
    worth the effort). If you can pull off zero-copy I/O, the O_DIRECT can be a
    huge win, but Java makes that very, very difficult, and horribly painful to
    manage.

    However, I wouldn't actively clamor for O_DIRECT support, but could
    probably do wonders with a HDFS-equivalent to fadvise. I really don't want
    to get into the business of managing buffering in my application code any
    more than we already do.

    Yes, I think a few minor simple tweaks to HDFS could help tremendously,
    particularly for Map/Reduce style jobs.

    PS - if there are bored folks wanting to do something beneficial to
    high-performance HDFS, I'd note that currently it is tough to get >1Gbps
    performance from a single Hadoop client transferring multiple files.
    However, HP labs had a clever approach:
    http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf . I'd love to see
    a generic, easy-to-use API to do this.

    Interesting. We haven't tried to push the envelope, but we have achieved
    1Gbps... I can't recall if we ever got over 2Gbps though...
    --
    Chris
  • Brian Bockelman at Jan 4, 2011 at 3:16 am

    On Jan 3, 2011, at 8:47 PM, Christopher Smith wrote:
    On Mon, Jan 3, 2011 at 5:05 PM, Brian Bockelman wrote:
    On Jan 3, 2011, at 5:17 PM, Christopher Smith wrote:
    On Mon, Jan 3, 2011 at 11:40 AM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
    It's not immediately clear to me the size of the benefit versus the
    costs.
    Two cases where one normally thinks about direct I/O are:
    1) The usage scenario is a cache anti-pattern. This will be true for
    some
    Hadoop use cases (MapReduce), not true for some others.
    - http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
    2) The application manages its own cache. Not applicable.
    Atom processors, which you mention below, will just exacerbate (1) due
    to
    the small cache size.
    Actually, assuming you thrash the cache anyway, having a smaller cache can
    often be a good thing. ;-)
    Assuming no other thread wants to use that poor cache you are thrashing ;)

    Even then: a small cache can be cleared up more quickly. As in all cases, it
    very much depends on circumstance, but much like O_DIRECT, if you are
    blowing the cache anyway, there is little at stake.
    All-in-all, doing this specialization such that you don't hurt the
    general
    case is going to be tough.
    For the Hadoop case, the advantages of O_DIRECT would seem to be
    comparatively petty to using O_APPEND and/or MMAP (yes, I realize this is
    not quite the same as what you are proposing, but it seems close enough for
    most cases.. Your best case for a win is when you have reasonably random
    access to a file, and then something else that would benefit from more
    logve

    Actually, our particular site would greatly benefit from O_DIRECT - we have
    non-MapReduce clients with a highly non-repetitive, random read I/O pattern
    with an actively managed application-level read-ahead (note: because we're
    almost guaranteed to wait for a disk seek - 2PB of SSDs are a touch pricey,
    the latency overheads of Java are not actually too important). The OS page
    cache is mostly useless for us as the working set size is on the order of a
    few hundred TB.
    Sounds like a lot of fun! Even in a circumstance like the one you describe,
    unless the I/O pattern isn't truly random and some application level insight
    provides a unique advantage, the page cache will often do a better job of
    managing the memory both in terms of caching and read-ahead (it becomes a
    lot like the "building a better TCP using UDP": possible, but not really
    worth the effort). If you can pull off zero-copy I/O, the O_DIRECT can be a
    huge win, but Java makes that very, very difficult, and horribly painful to
    manage.
    The I/O pattern isn't truly random. To convert from physicist terms to CS terms, the application is iterating through the rows of a column-oriented store, reading out somewhere between 1 and 10% of the columns. The twist is that the columns are compressed, meaning the size of a set of rows on disk is variable. This prevents any sort of OS page cache stride detection from helping - the OS sees everything as random. However, the application also has an index of where each row is located, meaning if it knows the active set of columns, it can predict the reads the client will perform and do a read-ahead.

    Some days, it does feel like "building a better TCP using UDP". However, we got a 3x performance improvement by building it (and multiplying by 10-15k cores for just our LHC experiment, that's real money!), so it's a particular monstrosity we are stuck with.
    However, I wouldn't actively clamor for O_DIRECT support, but could
    probably do wonders with a HDFS-equivalent to fadvise. I really don't want
    to get into the business of managing buffering in my application code any
    more than we already do.

    Yes, I think a few minor simple tweaks to HDFS could help tremendously,
    particularly for Map/Reduce style jobs.

    PS - if there are bored folks wanting to do something beneficial to
    high-performance HDFS, I'd note that currently it is tough to get >1Gbps
    performance from a single Hadoop client transferring multiple files.
    However, HP labs had a clever approach:
    http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf . I'd love to see
    a generic, easy-to-use API to do this.

    Interesting. We haven't tried to push the envelope, but we have achieved
    1Gbps... I can't recall if we ever got over 2Gbps though...
    We hit a real hard wall at 2.5Gbps / server. Hence, to fill our 10Gbps pipe, we've taken the approach of deploying 12 moderate external-facing servers instead of one large, fast server. Unfortunately, buying new servers was much cheaper than finding more time to track down the bottlenecks.

    Brian
  • Christopher Smith at Jan 4, 2011 at 2:56 pm

    On Mon, Jan 3, 2011 at 7:15 PM, Brian Bockelman wrote:

    The I/O pattern isn't truly random. To convert from physicist terms to CS
    terms, the application is iterating through the rows of a column-oriented
    store, reading out somewhere between 1 and 10% of the columns. The twist is
    that the columns are compressed, meaning the size of a set of rows on disk
    is variable.
    We're getting pretty far off topic here, but this is an interesting problem.
    It *sounds* to me like a "compressed bitmap index" problem, possibly with
    bloom filters for joins (basically what HBase/Cassandra/Hypertable get into,
    or in a less distributed case: MonetDB). Is that on the money?

    This prevents any sort of OS page cache stride detection from helping -
    the OS sees everything as random.
    It seems though like if you organized the data a certain way, the OS page
    cache could help.

    However, the application also has an index of where each row is located,
    meaning if it knows the active set of columns, it can predict the reads the
    client will perform and do a read-ahead.
    Yes, this is the kind of advantage where O_DIRECT might help, although I'd
    hope in this kind of circumstance the OS buffer cache would mostly give up
    anyway and just give as much of the available RAM as possible to the app. In
    that case memory mapped files with a thread doing a bit of read ahead would
    seem like not that much slower than using O_DIRECT.

    That said, I have to wonder how often this problem devolves into a straightforward
    column scan. I mean, with a 1-10% hit rate, you need SSD seek times
    for it to make sense to seek to specific records vs. just scanning through
    the whole column, or to put it another way: "disk is the new tape". ;-)

    Some days, it does feel like "building a better TCP using UDP". However,
    we got a 3x performance improvement by building it (and multiplying by
    10-15k cores for just our LHC experiment, that's real money!), so it's a
    particular monstrosity we are stuck with.

    It sure sounds like a problem better suited to C++ than Java, though. What
    benefits do you get from doing all this with a JVM?

    --
    Chris
  • Segel, Mike at Jan 4, 2011 at 3:12 pm
    All,
    While this is an interesting topic for debate, I think it's a moot point.
    A lot of DBAs (especially Informix DBAs) don't agree with Linus. (I'm referring to an earlier post in this thread that referenced a quote from Linus T.) Direct I/O is a good thing. But if Linus is removing it from Linux...

    But with respect to Hadoop... disk I/O shouldn't be a major topic. I mean, if it were, then why isn't anyone pushing the use of SSDs? Or, if those are too expensive for your budget, why not SAS drives that spin at 15K?
    OK, those points are rhetorical. The simple answer is that if you're I/O bound, you add more nodes with more disks to further distribute the load, right?

    Also, I may be wrong, but do all the OSes that Hadoop can run on handle direct I/O? And handle it in a common way? If not, won't you end up with machine/OS-specific classes?

    IMHO there are other features that don't yet exist in Hadoop/HBase that would yield a better ROI.

    OK, so I may be way off base; I'll shut up now... ;-P

    -Mike



  • Da Zheng at Jan 4, 2011 at 6:01 pm
    The most important reason for me to use direct I/O is that the Atom
    processor is too weak. If I write a simple program that writes data to the
    disk, CPU usage is almost 100% but the disk hasn't reached its maximum
    bandwidth. When I write data to an SSD, the difference is even larger. Even
    when the program has saturated both cores of the CPU, it cannot even
    reach half of the maximum bandwidth of the SSD.

    I don't know how much benefit direct I/O can bring to a normal
    processor such as a Xeon, but I have a feeling I have to use direct I/O in
    order to get good performance on Atom processors.

    Best,
    Da
  • Christopher Smith at Jan 4, 2011 at 10:38 pm
    If you use direct I/O to reduce CPU time, that means you are saving CPU via
    DMA. If you are using Java's heap though, you can kiss that goodbye.

    That said, I'm surprised that the Atom can't keep up with magnetic disk
    unless you have a striped array. 100MB/s shouldn't be too taxing. Is it
    possible you're doing something wrong or your CPU is otherwise occupied?

    --
    Chris
  • Da Zheng at Jan 5, 2011 at 5:11 am

    On 1/4/11 5:17 PM, Christopher Smith wrote:
    If you use direct I/O to reduce CPU time, that means you are saving CPU via
    DMA. If you are using Java's heap though, you can kiss that goodbye.
    The buffer for direct I/O cannot be allocated from Java's heap anyway, I don't
    understand what you mean?
    That said, I'm surprised that the Atom can't keep up with magnetic disk
    unless you have a striped array. 100MB/s shouldn't be too taxing. Is it
    possible you're doing something wrong or your CPU is otherwise occupied?
    Yes, my C program can reach 100MB/s or even 110MB/s when writing data to the
    disk sequentially, but with direct I/O enabled, the maximum throughput is about
    140MB/s. But the biggest difference is CPU usage.
    Without direct I/O, the operating system uses a lot of CPU time (the data below
    was obtained with top, and this is a dual-core processor with hyper-threading enabled).
    Cpu(s): 3.4%us, 32.8%sy, 0.0%ni, 50.0%id, 12.1%wa, 0.0%hi, 1.6%si, 0.0%st
    But with direct I/O, the system time can be as little as 3%.

    Best,
    Da
  • Christopher Smith at Jan 5, 2011 at 5:45 am

    On Tue, Jan 4, 2011 at 9:11 PM, Da Zheng wrote:
    On 1/4/11 5:17 PM, Christopher Smith wrote:
    If you use direct I/O to reduce CPU time, that means you are saving CPU via
    DMA. If you are using Java's heap though, you can kiss that goodbye.
    The buffer for direct I/O cannot be allocated from Java's heap anyway, I
    don't
    understand what you mean?

    The DMA buffer cannot be on Java's heap, but in the typical use case (say
    Hadoop), it would certainly have to get copied either into or out of
    Java's heap, and that's going to get the CPU involved whether you like it
    or not. If you stay entirely off the Java heap, you really don't get to use
    much of Java's object model or capabilities, so you have to wonder why use
    Java in the first place.
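
    A minimal sketch of the copy Christopher is describing (illustrative only; the 1MB size is arbitrary). Even if a native O_DIRECT read were to fill a page-aligned off-heap buffer by DMA, handing the bytes to ordinary Java code still costs a bulk copy onto the heap:

    import java.nio.ByteBuffer;

    public class HeapCopyCost {
        public static void main(String[] args) {
            // Off-heap (direct) buffer that a native O_DIRECT read could fill via DMA.
            ByteBuffer direct = ByteBuffer.allocateDirect(1 << 20);

            // The moment ordinary Java code needs a byte[], the CPU copies the
            // data back onto the heap, spending the cycles DMA was meant to save.
            byte[] onHeap = new byte[direct.remaining()];
            direct.get(onHeap);
            System.out.println("copied " + onHeap.length + " bytes onto the heap");
        }
    }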

    That said, I'm surprised that the Atom can't keep up with magnetic disk
    unless you have a striped array. 100MB/s shouldn't be too taxing. Is it
    possible you're doing something wrong or your CPU is otherwise occupied?
    Yes, my C program can reach 100MB/s or even 110MB/s when writing data to
    the
    disk sequentially, but with direct I/O enabled, the maximal throughput is
    about
    140MB/s. But the biggest difference is CPU usage.
    Without direct I/O, operating system uses a lot of CPU time (the data below
    is
    got with top, and this is a dual-core processor with hyperthread enabled).
    Cpu(s): 3.4%us, 32.8%sy, 0.0%ni, 50.0%id, 12.1%wa, 0.0%hi, 1.6%si,
    0.0%st
    But with direct I/O, the system time can be as little as 3%.
    I'm surprised that system time is really that high. We did Atom experiments
    where it wasn't even close to that. Are you using a memory mapped file? If
    not are you buffering your writes? Is there perhaps
    something dysfunctional about the drive controller/driver you are using?

    --
    Chris
  • Da Zheng at Jan 5, 2011 at 3:08 pm

    On 1/5/11 12:44 AM, Christopher Smith wrote:
    On Tue, Jan 4, 2011 at 9:11 PM, Da Zheng wrote:
    On 1/4/11 5:17 PM, Christopher Smith wrote:
    If you use direct I/O to reduce CPU time, that means you are saving CPU via
    DMA. If you are using Java's heap though, you can kiss that goodbye.
    The buffer for direct I/O cannot be allocated from Java's heap anyway, I
    don't
    understand what you mean?

    The DMA buffer cannot be on Java's heap, but in the typical use case (say
    Hadoop), it would certainly have to get copied either in to our out from
    Java's heap, and that's going to get the CPU involved whether you like it
    nor not. If you stay entirely off the Java heap, you really don't get to use
    much of Java's object model or capabilities, so you have to wonder why use
    Java in the first place.
    True. I wrote the code with JNI, and found it's still very close to its best
    performance even when doing one or two memory copies.
    That said, I'm surprised that the Atom can't keep up with magnetic disk
    unless you have a striped array. 100MB/s shouldn't be too taxing. Is it
    possible you're doing something wrong or your CPU is otherwise occupied?
    Yes, my C program can reach 100MB/s or even 110MB/s when writing data to
    the
    disk sequentially, but with direct I/O enabled, the maximal throughput is
    about
    140MB/s. But the biggest difference is CPU usage.
    Without direct I/O, operating system uses a lot of CPU time (the data below
    is
    got with top, and this is a dual-core processor with hyperthread enabled).
    Cpu(s): 3.4%us, 32.8%sy, 0.0%ni, 50.0%id, 12.1%wa, 0.0%hi, 1.6%si,
    0.0%st
    But with direct I/O, the system time can be as little as 3%.
    I'm surprised that system time is really that high. We did Atom experiments
    where it wasn't even close to that. Are you using a memory mapped file? If
    No, I don't. I just fill a large in-memory buffer and write it to disk; the code
    is attached below. Right now the buffer size is 1MB, which I think is big enough to
    get the best performance.
    not are you buffering your writes? Is there perhaps
    something dysfunctional about the drive controller/driver you are using?
    I'm not sure. It seems odd to me too, but I thought that was all I could get from an Atom
    processor. I guess I need to do some profiling.
    Also, which Atom processors did you use? Do you have hyper-threading enabled?

    Best,
    Da

    /* Note: the original post omitted the includes and the definitions of
     * buf, bufsize, start_time, tot_size, fill_data() and sighandler();
     * they are filled in here so the program compiles. */
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    static const size_t bufsize = 1024 * 1024;  /* 1MB write buffer */
    static char *buf;
    static time_t start_time;
    static long tot_size;

    /* Fill the buffer with a simple integer pattern so the writes aren't all zeros. */
    static void fill_data (int *p, size_t size)
    {
        size_t i;
        for (i = 0; i < size / sizeof (int); i++)
            p[i] = (int) i;
    }

    /* On Ctrl-C, print the overall write rate and exit. */
    static void sighandler (int sig)
    {
        time_t elapsed = time (NULL) - start_time;
        printf ("overall rate: %ld\n", (long) (tot_size / (elapsed ? elapsed : 1)));
        exit (0);
    }

    int main (int argc, char *argv[])
    {
        char *out_file;
        int outfd;
        ssize_t size;
        time_t start_time2;
        long size1 = 0;

        out_file = argv[1];

        outfd = open (out_file, O_CREAT | O_WRONLY, S_IWUSR | S_IRUSR);
        if (outfd < 0) {
            perror ("open");
            return -1;
        }

        /* An anonymous mmap gives a page-aligned buffer. */
        buf = mmap (0, bufsize, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS,
                    -1, 0);

        start_time2 = start_time = time (NULL);
        signal (SIGINT, sighandler);
        int offset = 0;

        while (1) {
            fill_data ((int *) buf, bufsize);
            size = write (outfd, buf, bufsize);
            if (size < 0) {
                perror ("write");
                return 1;
            }
            offset += size;
            tot_size += size;
            size1 += size;
            /* if (posix_fadvise (outfd, 0, offset, POSIX_FADV_NOREUSE) < 0) */
            /*     perror ("posix_fadvise"); */

            /* Report the current write rate (bytes/sec) every 5 seconds. */
            time_t end_time = time (NULL);
            if (end_time - start_time2 > 5) {
                printf ("current rate: %ld\n",
                        (long) (size1 / (end_time - start_time2)));
                size1 = 0;
                start_time2 = end_time;
            }
        }
    }
  • Da Zheng at Jan 6, 2011 at 8:12 pm

    On 01/05/2011 12:44 AM, Christopher Smith wrote:
    Yes, my C program can reach 100MB/s or even 110MB/s when writing data to
    the
    disk sequentially, but with direct I/O enabled, the maximal throughput is
    about
    140MB/s. But the biggest difference is CPU usage.
    Without direct I/O, operating system uses a lot of CPU time (the data below
    is
    got with top, and this is a dual-core processor with hyperthread enabled).
    Cpu(s): 3.4%us, 32.8%sy, 0.0%ni, 50.0%id, 12.1%wa, 0.0%hi, 1.6%si,
    0.0%st
    But with direct I/O, the system time can be as little as 3%.
    I'm surprised that system time is really that high. We did Atom experiments
    where it wasn't even close to that. Are you using a memory mapped file? If
    not are you buffering your writes? Is there perhaps
    something dysfunctional about the drive controller/driver you are using?
    Which Atom processor did you use? Could you tell me how you did your
    experiment?

    Best,
    Da
  • Segel, Mike at Jan 5, 2011 at 2:50 pm
    You are mixing a few things up.

    You're testing your I/O using C.
    What do you see if you try testing your direct I/O from Java?
    I'm guessing that you'll keep your I/O piece in place, wrap it within some JNI code, and then re-write the test in Java?

    Also, are you testing large streams or random I/O blocks? (Hopefully both.)

    I think that when you test out the whole system, you'll find that you won't see much, if any, performance improvement.



  • Da Zheng at Jan 5, 2011 at 3:19 pm

    On 1/5/11 9:50 AM, Segel, Mike wrote:
    You are mixing a few things up.

    You're testing your I/O using C.
    What do you see if you try testing your direct I/O from Java?
    I'm guessing that you'll keep your i/o piece in place and wrap it within some JNI code and then re-write the test in Java?
    I tested both.
    Also are you testing large streams or random i/o blocks? (Hopefully both)
    I only tested large streams. For MapReduce, the only random I/O access is in
    between mapping and reducing, right? That's where the output from mappers is sorted,
    spilled to the disk and then merge-sorted. Then reducers need another merge sort
    after they pull the data. None of these operations is completely random. Maybe
    there is some random access for metadata, but it should be small.
    I think that when you test out the system, you'll find that you won't see much, if any performance improvement.



  • Jay Booth at Jan 5, 2011 at 3:02 pm

    On Tue, Jan 4, 2011 at 12:58 PM, Da Zheng wrote:

    The most important reason for me to use direct I/O is that the Atom
    processor is too weak. If I wrote a simple program to write data to the
    disk, CPU is almost 100% but the disk hasn't reached its maximal bandwidth.
    When I write data to SSD, the difference is even larger. Even if the program
    has saturated the two cores of the CPU, it cannot even get to the half of
    the maximal bandwidth of SSD.
    The issue here is most likely checksumming. Hadoop computes a CRC32 for
    every 512 bytes it reads or writes. On most processors, the CPU can easily
    keep up and still saturate the pipe, but your Atom is probably falling behind.
    There's a config option somewhere to disable checksums; I'd suggest trying that.
    Checksums are much more expensive CPU-wise than the simple byte funneling on the
    client side (and the wire protocol, which interleaves checksum data
    with file data, is the reason it would be really difficult to rewrite the
    client to use direct I/O).
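
    Jay doesn't name the setting, but one client-side knob that does exist is FileSystem.setVerifyChecksum(false), which skips CRC verification on reads through that FileSystem instance (io.bytes.per.checksum, discussed below, controls the chunk size instead). A minimal sketch, assuming a reachable HDFS and a path argument:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NoChecksumRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Trade data-integrity checking for CPU: skip client-side CRC
            // verification for subsequent reads through this FileSystem.
            fs.setVerifyChecksum(false);

            byte[] buf = new byte[1 << 20];
            long total = 0;
            InputStream in = fs.open(new Path(args[0]));
            try {
                int n;
                while ((n = in.read(buf)) > 0)
                    total += n;
            } finally {
                in.close();
            }
            System.out.println("read " + total + " bytes without checksum verification");
        }
    }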




  • Milind Bhandarkar at Jan 5, 2011 at 10:03 pm
    I agree with Jay B. Checksumming is usually the culprit for high CPU on clients and datanodes. Plus, a checksum of 4 bytes for every 512 bytes means that for a 64MB block the checksum data will be 512KB, i.e. 128 ext3 blocks. Changing it to generate one ext3 checksum block per DFS block would speed up reads/writes without any loss of reliability.

    - milind

    ---
    Milind Bhandarkar
    (mbhandarkar@linkedin.com)
    (650-776-3236)
  • Brian Bockelman at Jan 5, 2011 at 10:17 pm

    On Jan 5, 2011, at 4:03 PM, Milind Bhandarkar wrote:

    I agree with Jay B. Checksumming is usually the culprit for high CPU on clients and datanodes. Plus, a checksum of 4 bytes for every 512, means for 64MB block, the checksum will be 512KB, i.e. 128 ext3 blocks. Changing it to generate 1 ext3 checksum block per DFS block will speedup read/write without any loss of reliability.
    But (speaking to non-MapReduce users) make sure this doesn't adversely affect your usage patterns. If your checksum chunk size is 64KB, then the minimum read size is 64KB. So, an extremely unlucky 2-byte read that straddles a chunk boundary might cause 128KB plus overhead to travel across the network.

    Know thine usage scenarios.

    Brian
  • Milind Bhandarkar at Jan 5, 2011 at 10:24 pm
    Know thine usage scenarios.

    Yup.

    - milind

    ---
    Milind Bhandarkar
    (mbhandarkar@linkedin.com)
    (650-776-3236)
  • Da Zheng at Jan 5, 2011 at 11:46 pm
    I'm not sure of that. I wrote a small checksum program for testing.
    Once the block size gets larger than 8192 bytes, I don't see
    much performance improvement. See the code below. I don't think 64MB can
    bring us any benefit.
    I did change io.bytes.per.checksum to 131072 in Hadoop, and the program
    ran about 4 or 5 minutes faster (the total time for the reduce phase is about 35
    minutes).

    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    public class Test1 {
        public static void main(String args[]) {
            Checksum sum = new CRC32();
            byte[] bs = new byte[512];
            final int tot_size = 64 * 1024 * 1024;
            long time = System.nanoTime();
            for (int k = 0; k < tot_size / bs.length; k++) {
                for (int i = 0; i < bs.length; i++)
                    bs[i] = (byte) i;
                sum.update(bs, 0, bs.length);
            }
            System.out.println("takes " + (System.nanoTime() - time) / 1000
                    / 1000);
        }
    }

  • Milind Bhandarkar at Jan 6, 2011 at 12:44 am
    Have you tried org.apache.hadoop.util.DataChecksum and org.apache.hadoop.util.PureJavaCrc32?

    - Milind
    ---
    Milind Bhandarkar
    (mbhandarkar@linkedin.com)
    (650-776-3236)
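
    A sketch of the comparison Milind is suggesting, reusing Da's harness for both implementations. It assumes a Hadoop version that ships org.apache.hadoop.util.PureJavaCrc32 on the classpath (as Da notes below, plain 0.20.2 does not):

    import java.util.zip.CRC32;
    import java.util.zip.Checksum;
    import org.apache.hadoop.util.PureJavaCrc32;

    public class CrcCompare {
        // Time one Checksum implementation over 64MB fed in 512-byte chunks.
        static long timeChecksum(Checksum sum) {
            byte[] bs = new byte[512];
            for (int i = 0; i < bs.length; i++)
                bs[i] = (byte) i;
            final int totSize = 64 * 1024 * 1024;
            long start = System.nanoTime();
            for (int k = 0; k < totSize / bs.length; k++)
                sum.update(bs, 0, bs.length);
            return (System.nanoTime() - start) / 1000 / 1000;
        }

        public static void main(String[] args) {
            System.out.println("java.util.zip.CRC32: " + timeChecksum(new CRC32()) + " ms");
            System.out.println("PureJavaCrc32:       " + timeChecksum(new PureJavaCrc32()) + " ms");
        }
    }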
  • Da Zheng at Jan 6, 2011 at 2:03 am
    Isn't DataChecksum just a wrapper around CRC32?
    I'm still using Hadoop 0.20.2; there is no PureJavaCrc32 in it.

    Da
  • Greg Roelofs at Jan 5, 2011 at 8:36 pm

    Da Zheng wrote:

    I already did "ant compile-c++-libhdfs -Dlibhdfs=1", but it seems nothing is
    compiled as it prints the following:
    check-c++-libhdfs:
    check-c++-makefile-libhdfs:
    create-c++-libhdfs-makefile:
    compile-c++-libhdfs:
    BUILD SUCCESSFUL
    Total time: 2 seconds
    You may need to add -Dcompile.native=true in there.

    Switching lists.

    Greg
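
    Putting Greg's suggestion together with Da's original command, the full invocation would presumably be "ant compile-c++-libhdfs -Dlibhdfs=1 -Dcompile.native=true", run from the top of the Hadoop source tree.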
