Block size in HDFS

  • Pedro Costa at Jun 10, 2011 at 3:06 pm
Hi,

If I define HDFS to use blocks of 64 MB, and I store a 1KB
file in HDFS, will this file occupy 64MB in HDFS?

Thanks,


  • Marcos Ortiz at Jun 10, 2011 at 3:30 pm

    On 06/10/2011 10:35 AM, Pedro Costa wrote:
    Hi,

    If I define HDFS to use blocks of 64 MB, and I store a 1KB
    file in HDFS, will this file occupy 64MB in HDFS?

    Thanks,
    HDFS is not very efficient at storing small files, because each file is
    stored in a block (of 64 MB in your case), and the block metadata
    is held in memory by the NN. But you should know that this 1KB file will
    only use 1KB of disk space.

    For small files, you can use Hadoop archives.
    Regards

    --
    Marcos Luís Ortíz Valmaseda
    Software Engineer (UCI)
    http://marcosluis2186.posterous.com
    http://twitter.com/marcosluis2186
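
    To make the archive suggestion concrete, here is a sketch; the paths are
    illustrative, not from this thread. Packing a directory of small files into
    a single Hadoop Archive (HAR) leaves the NN tracking a handful of blocks
    instead of one per file:

        # Pack everything under /user/pedro/small into one archive
        hadoop archive -archiveName small.har -p /user/pedro small /user/pedro/archives

        # The archived files stay readable through the har:// scheme
        hadoop fs -ls har:///user/pedro/archives/small.har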
  • Pedro Costa at Jun 10, 2011 at 3:43 pm
    But how can you say that a 1KB file will only use 1KB of disk space, if
    a block is configured as 64MB? In my view, if a 1KB file uses a block of
    64MB, the file will occupy 64MB on disk.

    How can you disassociate a 64MB HDFS data block from a disk block?
  • Philip Zeyliger at Jun 10, 2011 at 4:01 pm

    On Fri, Jun 10, 2011 at 8:42 AM, Pedro Costa wrote:
    But how can you say that a 1KB file will only use 1KB of disk space, if
    a block is configured as 64MB? In my view, if a 1KB file uses a block of
    64MB, the file will occupy 64MB on disk.
    A block of HDFS is the unit of distribution and replication, not the
    unit of storage. HDFS uses the underlying file systems for physical
    storage.

    -- Philip
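
    One way to check this on a live cluster (file name and output here are
    hypothetical): upload a 1KB file and ask HDFS how much space it accounts
    for.

        # tiny.txt is a 1KB local file
        hadoop fs -put tiny.txt /user/pedro/tiny.txt

        # Reports the file's actual length (~1024 bytes),
        # not the configured 64MB block size
        hadoop fs -du /user/pedro/tiny.txt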
  • Pedro Costa at Jun 10, 2011 at 4:09 pm
    This means that, when HDFS reads a 1KB file from the disk, it will put
    the data in blocks of 64MB?


    --
    ---------------------------
    Pedro Sá da Costa

    @: pcosta@lasige.di.fc.ul.pt
    @: psdc1978@gmail.com
  • Philip Zeyliger at Jun 10, 2011 at 4:15 pm

    On Fri, Jun 10, 2011 at 9:08 AM, Pedro Costa wrote:
    This means that, when HDFS reads a 1KB file from the disk, it will put
    the data in blocks of 64MB?
    No.
  • Pedro Costa at Jun 10, 2011 at 4:48 pm
    So I'm not getting how a 1KB file can cost a 64MB block. Can
    anyone explain this to me?
  • John George at Jun 10, 2011 at 5:18 pm
    I am also relatively new to Hadoop, so others may feel free to correct me
    if I am wrong.

    The NN keeps track of a file by "inode" and the blocks related to that
    inode. In your case, since your file size is smaller than the block size,
    the NN will have only ONE block associated with this inode (assuming a
    replication factor of one). To track the 1KB file, the approximate memory
    cost to the NN is the memory to store the INODE-to-BLOCK relation.

    As far as the DN is concerned, all it knows is that there is data that
    resides on its disk with a specific name (which corresponds to the block
    name that the NN knows). It does not know (or care) what the block size
    is, or where else the block is replicated. Hence, for the DN it is just
    another file, and it consumes however much space is required to store
    that file (1KB in your case).

    So, it does not cost either the NN or DN 64MB to store a 1KB file.

    Regards,
    John George
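
    The NN-side view John describes can be inspected with fsck; the path is
    hypothetical and the output below is abridged:

        # One 1KB file costs the NN exactly one block entry, of length 1024
        hadoop fsck /user/pedro/tiny.txt -files -blocks
        # /user/pedro/tiny.txt 1024 bytes, 1 block(s):  OK
        #  0. blk_-1234567890123456789 len=1024 repl=3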
  • Josh Patterson at Jun 10, 2011 at 7:35 pm
    It will only take up ~1KB of local datanode disk space (plus metadata
    space such as a CRC32 for every 512 bytes, along with replication at
    1KB per replicated block, in this case 2KB), but the real cost is a
    block entry in the Namenode: all block data at the namenode lives
    in memory, which is a much more scarce resource for the cluster in a
    relative sense.


    --
    Twitter: @jpatanooga
    Solution Architect @ Cloudera
    hadoop: http://www.cloudera.com
    blog: http://jpatterson.floe.tv
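
    On the datanode's disk, the block and its checksum metadata are plain
    files. The listing below is illustrative: the .meta file holds a short
    header plus one 4-byte CRC32 per 512-byte chunk (so two CRCs for a 1KB
    block), which is why it is only a few bytes long.

        # Inside a datanode storage directory, e.g. ${dfs.data.dir}/current
        ls -l blk_3692186625179349119*
        # -rw-r--r-- 1 hdfs hdfs 1024 Jun 10 12:00 blk_3692186625179349119
        # -rw-r--r-- 1 hdfs hdfs   15 Jun 10 12:00 blk_3692186625179349119.meta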
  • Allen Wittenauer at Jun 13, 2011 at 5:06 pm
    FYI, I've added this to the FAQ since it comes up every so often.
  • Jain, Prem at Jun 13, 2011 at 8:38 pm
    Where do I see the FAQ?

    -----Original Message-----
    From: Allen Wittenauer
    Sent: Monday, June 13, 2011 10:06 AM
    To: hdfs-user@hadoop.apache.org
    Subject: Re: Block size in HDFS


    FYI, I've added this to the FAQ since it comes up every so often.
  • Allen Wittenauer at Jun 13, 2011 at 9:08 pm

    On Jun 13, 2011, at 1:37 PM, Jain, Prem wrote:

    Where do I see the FAQ ?

    http://wiki.apache.org/hadoop/FAQ
  • Matthew Foley at Jun 10, 2011 at 7:06 pm
    Pedro,
    You need to distinguish between "HDFS" files and blocks, and "Low-level Disk" files
    and blocks.

    Large HDFS files are broken into HDFS blocks and stored in multiple Datanodes.
    On the Datanodes, each HDFS block is stored as a Low-level Disk file.
    So if you have the block size set to 64MB, then a 70MB HDFS file would be split into
    a 64MB block and a 6MB final block. Each of those blocks will become a disk file
    on one or more datanodes (depending on replication settings), and will take up
    however much disk storage is needed for that disk file. Of course we could have
    just said "save a 64MB chunk for each block", but that would have been wasteful.
    Instead, it just uses as much disk as is actually needed for the amount of data in
    that block.

    Obviously, only the last block in an HDFS file can be smaller than the block size.

    It's worth mentioning that Low-level Disk "blocks" are different. Because of the way
    hard drive hardware works, disk blocks are fixed size, typically either 4KB or 8KB.
    It is impossible to allocate less than a full disk block of low-level disk storage.
    But this constraint does not apply to HDFS blocks, which are higher-level constructs.

    --Matt


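    Matt's 70MB example as fsck would show it (path and output are
    illustrative and abridged): 70MB is 73,400,320 bytes, which splits into
    one full 64MB block (67,108,864 bytes) plus a 6MB tail block (6,291,456
    bytes).

        hadoop fsck /user/pedro/seventy_mb.dat -files -blocks
        # /user/pedro/seventy_mb.dat 73400320 bytes, 2 block(s):  OK
        #  0. blk_111... len=67108864 repl=3
        #  1. blk_222... len=6291456 repl=3
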
  • Sridhar basam at Jun 10, 2011 at 3:43 pm

    On Fri, Jun 10, 2011 at 11:05 AM, Pedro Costa wrote:

    Hi,

    If I define HDFS to use blocks of 64 MB, and I store a 1KB
    file in HDFS, will this file occupy 64MB in HDFS?
    No, it will occupy something much closer to 1KB than 64MB. There is some
    small overhead related to metadata about the block, plus the space the
    underlying filesystem (ext2, ext3, etc.) requires to store the 1KB file.
    You can just cd into the datanode data directories and look for the block
    in question.

    HDFS isn't really optimized for small files, though.

    Sridhar
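
    To find the block files Sridhar mentions by hand (the directory below is a
    guess; check the dfs.data.dir property in your configuration for the real
    location):

        # List block files (and their .meta checksum files) under 64KB
        find /data/dfs/data -name 'blk_*' -size -64k -exec ls -l {} \;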
