Hi folks,

I have a question regarding HDFS' client-side buffering.
The documentation at
http://hadoop.apache.org/core/docs/r0.19.0/hdfs_design.html#Staging

states that an HDFS client caches one block's worth of data before it contacts the
namenode for a new block.
Is this true?
I can't find the part of the source code that implements this.

Can anyone shed some light on this for me?

I appreciate your help.

-sangmin


  • Ajit Ratnaparkhi at Feb 23, 2009 at 2:55 pm
    Hi,

    I have the same doubt.
    From a code scan, it looks like whenever the client writes data, one packet
    (of size 64 KB) is buffered, and this packet is sent directly to the
    corresponding datanodes. Whenever a block boundary is reached and the first
    packet of a new block is ready, the namenode is contacted to create the new
    block entry and to assign datanodes to it; the new packets are then sent to
    the newly allocated datanodes.

    So it seems that the client does not cache the entire block locally before
    contacting the namenode, as stated in the design doc.

    Can somebody please clarify this?



    On Mon, Feb 23, 2009 at 11:05 AM, Sangmin Lee wrote:

    Hi folks,

    I have a question regarding HDFS' client-side buffering.
    The documentation at

    http://hadoop.apache.org/core/docs/r0.19.0/hdfs_design.html#Staging

    states that an HDFS client caches one block's worth of data before it
    contacts the namenode for a new block.
    Is this true?
    I can't find the part of the source code that implements this.

    The source code for the behavior described above is in
    DFSClient.DFSOutputStream.
    A short explanation:

    1. Whenever the user writes data by calling FSDataOutputStream.write(...),
    DFSClient.DFSOutputStream.writeChunk(...) is called internally; it builds a
    'Packet' in its buffer and enqueues it on the 'dataQueue' maintained by the
    DFSOutputStream object. (The packet size is 64 KB.)

    2. A continuously running thread, 'DataStreamer'
    (DFSClient.DFSOutputStream.DataStreamer), is started when the
    DFSOutputStream object is created.

    3. The DataStreamer continuously watches the dataQueue; as soon as a packet
    is added to the queue, it dequeues that packet and sends it on the stream
    connected to the datanode. When the end of a block is reached, it contacts
    the namenode (namenode.addBlock) and gets the addresses of datanodes for
    the next block.
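    The three steps above amount to a producer/consumer pattern: write calls
    enqueue fixed-size packets, and a background streamer thread drains the
    queue and tracks block boundaries. Below is a minimal, self-contained
    sketch of that pattern; the class, method, and constant names are
    illustrative, not Hadoop's actual internals.

    ```java
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Simplified model of the writeChunk -> dataQueue -> DataStreamer flow.
    public class PacketPipelineSketch {
        static final int PACKET_SIZE = 64 * 1024;         // 64 KB per packet
        static final long BLOCK_SIZE = 64L * 1024 * 1024; // assumed 64 MB block

        private final BlockingQueue<byte[]> dataQueue = new LinkedBlockingQueue<>();
        private long bytesSentInBlock = 0;

        // Producer side: analogous to writeChunk() buffering a Packet
        // and enqueueing it on the dataQueue.
        public void write(byte[] packet) throws InterruptedException {
            dataQueue.put(packet);
        }

        // Consumer side: analogous to one iteration of the DataStreamer loop.
        public void streamOnce() throws InterruptedException {
            byte[] packet = dataQueue.take();
            if (bytesSentInBlock + packet.length > BLOCK_SIZE) {
                // Block boundary: this is where the real client contacts the
                // namenode (addBlock) for a new block and datanode pipeline.
                bytesSentInBlock = 0;
            }
            bytesSentInBlock += packet.length; // "send" to the datanode pipeline
        }

        public long bytesSentInCurrentBlock() {
            return bytesSentInBlock;
        }
    }
    ```

    The point of the sketch is that only one packet at a time crosses from the
    writer to the streamer; nothing accumulates a whole block before the
    namenode is contacted.
    
    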



    thanks,
    - ajit.
  • Dhruba Borthakur at Feb 23, 2009 at 6:35 pm
    Hi Ajit,

    Your assumption is absolutely right. The HDFS client buffers small packets
    of data in memory and then sends them to the pipeline of datanode(s). Each
    in-memory packet is typically 64 KB, and the client uses a sliding-window
    protocol with a maximum window size of 80 packets. The client does *not*
    cache the contents of the entire block on local disk.
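    A sliding window with a fixed maximum of outstanding packets can be
    modeled with a counting semaphore: acquire a permit before sending,
    release it when an ack arrives. This is only a sketch of the idea under
    the 80-packet limit mentioned above; the names are hypothetical, and the
    real bookkeeping lives inside DFSClient.DFSOutputStream.

    ```java
    import java.util.concurrent.Semaphore;

    // Sketch: at most 80 packets may be in flight (sent but unacknowledged).
    public class SlidingWindowSketch {
        static final int MAX_OUTSTANDING_PACKETS = 80;
        private final Semaphore window = new Semaphore(MAX_OUTSTANDING_PACKETS);

        // Called before sending a packet; blocks the writer once 80 packets
        // are outstanding, which is how the window throttles the client.
        public void beforeSend() throws InterruptedException {
            window.acquire();
        }

        // Called when an ack for a packet arrives from the datanode pipeline.
        public void onAck() {
            window.release();
        }

        public int availableSlots() {
            return window.availablePermits();
        }
    }
    ```

    With this shape, memory use on the client is bounded by roughly
    80 packets x 64 KB (about 5 MB) per open stream, regardless of block size.
    
    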

    I will update the design document; the description there is stale.

    thanks,
    dhruba


Discussion Overview
group: common-dev @ hadoop.apache.org
categories: hadoop
posted: Feb 23, '09 at 11:05a
active: Feb 23, '09 at 6:35p
posts: 3
users: 3
website: hadoop.apache.org...
irc: #hadoop
