FAQ
Hi all,

After reading the appenddesign3.pdf in HDFS-256,
and looking at the BlockReceiver.java code in 0.21.0,
I am confused by the following.

The document says that:
*For each packet, a DataNode in the pipeline has to do 3 things.
1. Stream data
a. Receive data from the upstream DataNode or the client
b. Push the data to the downstream DataNode if there is any
2. Write the data/crc to its block file/meta file.
3. Stream ack
a. Receive an ack from the downstream DataNode if there is any
b. Send an ack to the upstream DataNode or the client*

And *"...there is no guarantee on the order of (2) and (3)"*

In BlockReceiver.receivePacket(), after reading the packet buffer,
the DataNode does (sketched in code below):
1) put the packet seqno in the ack queue
2) write the data and checksum to disk
3) flush the data and checksum (to disk)
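
A rough sketch in Java of that ordering (illustrative names only, not
the actual BlockReceiver code):

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Sketch of the 0.21.0 per-packet ordering described above.
    class ReceiverSketch {
        DataOutputStream dataOut;      // stream to the block file
        DataOutputStream checksumOut;  // stream to the meta file
        Queue<Long> ackQueue = new ConcurrentLinkedQueue<Long>();

        void receivePacket(long seqno, byte[] data, byte[] crc)
                throws IOException {
            ackQueue.add(seqno);      // 1) seqno queued for ack first
            dataOut.write(data);      // 2) write data and checksum
            checksumOut.write(crc);
            dataOut.flush();          // 3) flush both streams
            checksumOut.flush();
        }
    }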

What confuses me is that streaming the ack does not
necessarily depend on whether the data has been flushed to disk.
So my question is:
why does the DataNode need to flush the data and checksum
every time it receives a packet? This flush may be costly.
Why can't the DataNode just batch several writes (after receiving
several packets) and flush them all at once?
Is there any particular reason for doing it this way?

Can somebody clarify this for me?

Thanks so much.
Thanh


  • Thanh Do at Nov 11, 2010 at 4:33 am
Or, to rephrase my question:
do data.flush and checksumOut.flush guarantee
that the data is synchronized with the underlying disk,
just like fsync()?

    Thanks
    Thanh

  • Todd Lipcon at Nov 11, 2010 at 5:11 am
Nope, flush just flushes the Java-side buffer to the Linux buffer
cache -- not all the way to the media.

    Hsync is the API that will eventually go all the way to disk, but it
    has not yet been implemented.
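
To see the distinction in plain Java (a minimal sketch, not HDFS
code; the file name is made up):

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class FlushVsSync {
        public static void main(String[] args) throws IOException {
            FileOutputStream fos = new FileOutputStream("blk_example");
            BufferedOutputStream out = new BufferedOutputStream(fos);
            out.write(new byte[]{1, 2, 3});
            out.flush();                  // JVM buffer -> OS page cache only
            fos.getChannel().force(true); // page cache -> media, like fsync()
            out.close();
        }
    }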

    -Todd

    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Thanh Do at Nov 11, 2010 at 3:31 pm
    Thanks Todd,

In HDFS-6313, I see three APIs (sync, hflush, hsync),
and I assume hflush corresponds to:

*"API2: flushes out to all replicas of the block.
The data is in the buffers of the DNs but not on the DN's OS buffers.
New readers will see the data after the call has returned."*

I am still confused: once the client calls hflush,
the client waits for all outstanding packets to be acked
before sending subsequent packets.
But at the DataNode, it is possible that the ack to the client is sent
before the data and checksum are written to the replica. So if
the DataNode crashes just after sending the ack and before
writing to the replica, would the semantics be violated here?

    Thanks
    Thanh
  • Todd Lipcon at Nov 11, 2010 at 7:21 pm

    On Thu, Nov 11, 2010 at 7:31 AM, Thanh Do wrote:

> In HDFS-6313, I see three APIs (sync, hflush, hsync),
> and I assume hflush corresponds to:
>
> *"API2: flushes out to all replicas of the block.
> The data is in the buffers of the DNs but not on the DN's OS buffers.
> New readers will see the data after the call has returned."*
I think the way it got implemented, hflush() actually does flush to the OS
buffers, since BlockReceiver calls flush() before it enqueues the sequence
number in the responder's pending ack queue in receivePacket().
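
In outline, that ordering is (a simplified sketch reusing the
illustrative names from earlier in the thread, not the actual code):

    // 0.20-append: flush first, then make the seqno eligible for an ack.
    void receivePacket(long seqno, byte[] data, byte[] crc)
            throws IOException {
        dataOut.write(data);      // write data to the block file
        checksumOut.write(crc);   // and CRCs to the meta file
        dataOut.flush();          // JVM buffers -> OS buffer cache
        checksumOut.flush();
        ackQueue.add(seqno);      // only now may the responder ack upstream
    }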

> I am still confused: once the client calls hflush,
> the client waits for all outstanding packets to be acked
> before sending subsequent packets.
Currently, yes. HDFS-895, which will hopefully be committed this week, adds
the ability to "pipeline" the packets - e.g. an hflush() only blocks the
caller of hflush() until previously written data has been flushed, but
doesn't stop other writers from appending more on top. This is a big speed
improvement for HBase in particular.
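
From the client side, the behavior looks roughly like this (a sketch;
the path is made up, and hflush() is the Syncable call in 0.21+):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HflushExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/example/wal"));
            out.write(new byte[]{1});
            out.hflush();             // returns once all DNs in the pipeline ack
            out.write(new byte[]{2}); // with HDFS-895, later writes needn't stall
            out.close();
        }
    }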

> But at the DataNode, it is possible that the ack to the client is sent
> before the data and checksum are written to the replica. So if
> the DataNode crashes just after sending the ack and before
> writing to the replica, would the semantics be violated here?
    The DN will forward the packet to its downstream mirror in the pipeline, but
    doesn't actually enqueue the seqno on the pending ack queue until it has
    flushed to disk. So the different replicas may end up writing to disk in
    different orders, but the client won't get the ack until all have flushed.
    If any fails to flush, it will break the pipeline and initiate replica
    recovery -- but the client still has all of the unacked packets in its
    "ackQueue", so after recovery it simply flips those back onto "dataQueue"
    for the new pipeline.
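
In sketch form (names approximate the DFSClient streamer internals,
not exact code; dataQueue and ackQueue are lists of packets):

    // On a pipeline failure, unacked packets move from ackQueue back to
    // the front of dataQueue and get resent on the rebuilt pipeline.
    synchronized (dataQueue) {
        dataQueue.addAll(0, ackQueue); // requeue in original order
        ackQueue.clear();
        dataQueue.notifyAll();         // wake the streamer thread
    }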

    -Todd


    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Hairong Kuang at Nov 11, 2010 at 7:55 pm
A few clarifications on API2 semantics.

1. The ack gets sent back to the client before a packet gets written to
the local files.
2. Data becomes visible to new readers on the condition that at least one
DataNode does not have an error.
3. The reason that the flush is done after a write is mostly for
implementation simplicity. Currently readers do not read from the DataNode's
buffer; they only read from the system buffer. A flush makes the data visible
to readers sooner.
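
For illustration only (made-up file name, not actual code): a reader
opens the replica's block file through the filesystem, so it can only
observe bytes the writer has already flushed out of its JVM buffers:

    import java.io.FileInputStream;
    import java.io.IOException;

    public class VisibleLength {
        public static void main(String[] args) throws IOException {
            FileInputStream in = new FileInputStream("blk_example");
            long visible = in.getChannel().size(); // grows after each flush
            in.close();
            System.out.println(visible + " bytes visible to this reader");
        }
    }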

    Hairong
  • Todd Lipcon at Nov 11, 2010 at 8:27 pm

    On Thu, Nov 11, 2010 at 11:55 AM, Hairong Kuang wrote:

> A few clarifications on API2 semantics.
>
> 1. The ack gets sent back to the client before a packet gets written to
> the local files.
    Ah, I see in trunk this is the case. In 0.20-append, it's the other way
    around - we only enqueue after flush.


    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Thanh Do at Nov 11, 2010 at 8:43 pm
Thank you all for the clarification.
I also looked at the 0.20-append branch and see that the order is
totally different.

One more thing: do you plan to implement hsync(), i.e. API3,
in the near future? Is there any class of application that requires
such a strong guarantee?

    Thanh
  • Todd Lipcon at Nov 11, 2010 at 9:10 pm

    On Thu, Nov 11, 2010 at 12:43 PM, Thanh Do wrote:

> Thank you all for the clarification.
> I also looked at the 0.20-append branch and see that the order is
> totally different.
>
> One more thing: do you plan to implement hsync(), i.e. API3,
> in the near future? Is there any class of application that requires
> such a strong guarantee?
    I don't personally have any plans - everyone I've talked to who cares about
    data durability is OK with potential file truncation if power is lost across
    all DNs simultaneously.

    I'm sure there are some applications where this isn't acceptable, but people
    aren't using HBase for those applications yet :)

    -Todd


    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Thanh Do at Nov 11, 2010 at 9:26 pm
    Got it!

Currently, the model is single writer/multiple readers.
In the GFS paper, I see they have *record append*
semantics, that is, allowing multiple clients to append to the
same file. Do you have any plans to implement this?

    Thanh
  • Todd Lipcon at Nov 11, 2010 at 9:37 pm

    On Thu, Nov 11, 2010 at 1:26 PM, Thanh Do wrote:

> Got it!
>
> Currently, the model is single writer/multiple readers.
> In the GFS paper, I see they have *record append*
> semantics, that is, allowing multiple clients to append to the
> same file. Do you have any plans to implement this?
    Not that I'm aware of - as a community project I can't speak for everyone
    else, though :)

    It's interesting to note that the GFS designers are on record in an ACM
    Queue interview[1] saying that this feature was a mistake. It was too hard
    to implement correctly and it has some really strange semantics that users
found difficult to understand (e.g. different replicas of a block could
contain records in different orders!)

    [1] http://queue.acm.org/detail.cfm?id=1594206

    Todd


    --
    Todd Lipcon
    Software Engineer, Cloudera
