Maximum number of files in directory? (in hdfs)
Hello,
I'm looking at storing a large number of files under one directory.

I started to break the files into subdirectories out of habit (from working on NTFS, etc.), but it occurred to me that maybe, from a performance perspective, it doesn't really matter on HDFS.

Does it? Is there some recommended limit on the number of files to store in one directory on HDFS? I'm thinking thousands to millions, so we're not talking about INT_MAX or anything, but a lot.

Or is it only limited by my sanity :) ?

I suppose it would come down to the data structure(s) the NameNode uses to track file metadata. But I don't know what those are - I did skim the HDFS architecture document, but didn't see anything conclusive.

Take care,
-stu
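
[On the metadata question above: the NameNode keeps the entire namespace in memory, and a commonly quoted rule of thumb is on the order of 150 bytes of heap per file, directory, or block object. A minimal back-of-the-envelope sketch - the file count, blocks-per-file, and 150-byte figure are illustrative assumptions, not exact numbers:]

    // Rough estimate of NameNode heap consumed by file metadata.
    // The ~150 bytes/object figure is a widely cited rule of thumb; real
    // usage varies with path lengths, replication, and Hadoop version.
    public class NameNodeHeapEstimate {
        public static void main(String[] args) {
            long files = 1_000_000L;      // hypothetical: one million files
            long blocksPerFile = 1L;      // assume each file is a single block
            long bytesPerObject = 150L;   // rule-of-thumb heap cost per object

            long objects = files * (1L + blocksPerFile);  // one inode + its blocks per file
            double heapMb = objects * bytesPerObject / (1024.0 * 1024.0);

            System.out.printf("~%d namespace objects -> ~%.0f MB of NameNode heap%n",
                    objects, heapMb);
        }
    }

[By that estimate the total file count is what drives NameNode memory; how the files are spread across directories barely changes it. The per-directory count matters more on the client side, as the first reply below points out.]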

  • Allen Wittenauer at Aug 18, 2010 at 1:01 am

    On Aug 17, 2010, at 5:44 PM, Stuart Smith wrote:
    I started to break the files into subdirectories out of habit (from working on NTFS, etc.), but it occurred to me that maybe, from a performance perspective, it doesn't really matter on HDFS.

    Does it? Is there some recommended limit on the number of files to store in one directory on HDFS? I'm thinking thousands to millions, so we're not talking about INT_MAX or anything, but a lot.

    Or is it only limited by my sanity :) ?
    We have a directory with several thousand files in it.

    It is always a pain when we hit it because the client heap size needs to be increased to do anything in it: directory listings, web UIs, distcp, etc., etc., etc. Doing any sort of manipulation in that dir is also slower.

    My recommendation: don't do it. Directories, AFAIK, are relatively cheap resource-wise vs. lots of files in one.

    [Hopefully these files are large. Otherwise they should be joined together... if not, you're going to take a performance hit processing them *and* storing them...]

    [Sketches of an iterator-based directory listing, a hash-bucket directory layout, and small-file packing follow at the end of the thread.]
  • Stu24mail at Aug 18, 2010 at 2:02 am
    Thanks!
    I'll go with keeping my sanity then.

    The files will all be >= 64MB

    Take care,
    -stu
    -----Original Message-----
    From: Allen Wittenauer <awittenauer@linkedin.com>
    Date: Wed, 18 Aug 2010 01:00:42
    To: <hdfs-user@hadoop.apache.org>
    Reply-To: hdfs-user@hadoop.apache.org
    Subject: Re: Maximum number of files in directory? (in hdfs)

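
[On Allen's point about the client heap: a plain listStatus() of a huge directory pulls every FileStatus into the client JVM at once, which is what forces the bigger heap; for the stock shell the workaround is usually to raise the client heap via HADOOP_CLIENT_OPTS. Newer clients also expose an iterator-based listing that fetches entries in batches. A minimal sketch, assuming a Hadoop 2.x-era client and a made-up /data/bigdir path:]

    // Stream a very large directory listing instead of materializing it all
    // in client memory. listFiles() returns a RemoteIterator that pages
    // results from the NameNode in batches.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class StreamListing {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up fs.defaultFS from the classpath
            FileSystem fs = FileSystem.get(conf);

            RemoteIterator<LocatedFileStatus> it =
                    fs.listFiles(new Path("/data/bigdir"), false);  // non-recursive

            long count = 0;
            while (it.hasNext()) {
                LocatedFileStatus status = it.next();
                count++;
                // ... process status.getPath(), status.getLen(), etc.
            }
            System.out.println("files seen: " + count);
        }
    }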
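
[On the recommendation to split things up: the usual trick is the same one used on local filesystems - derive a fixed number of bucket directories from a hash of the file name, so no single directory grows without bound. A sketch; the 256-bucket count and the /data/files layout are arbitrary examples, not anything HDFS requires:]

    // Map each file name to one of a fixed number of bucket directories.
    import org.apache.hadoop.fs.Path;

    public class BucketedPath {
        private static final int BUCKETS = 256;   // /data/files/00 .. /data/files/ff

        public static Path bucketFor(String fileName) {
            int bucket = (fileName.hashCode() & 0x7fffffff) % BUCKETS;
            return new Path(String.format("/data/files/%02x/%s", bucket, fileName));
        }

        public static void main(String[] args) {
            System.out.println(bucketFor("sample-00042.bin"));
            // e.g. /data/files/a3/sample-00042.bin (bucket depends on the hash)
        }
    }

[With a million files spread over 256 buckets, each directory holds roughly 4,000 entries, which keeps listings and shell operations comfortable.]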
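
[Allen's bracketed aside about joining small files is moot here, since the files are all >= 64MB, but for the record the classic approach is to pack many small files into one container file, e.g. a SequenceFile keyed by the original name. A minimal sketch using the Hadoop 2.x-style SequenceFile API, with a made-up output path and in-memory payloads; a real version would read the small files from disk or HDFS:]

    // Pack several small payloads into a single SequenceFile so the NameNode
    // tracks one file instead of thousands.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import java.nio.charset.StandardCharsets;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path out = new Path("/data/packed/batch-0001.seq");   // hypothetical output path

            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            try {
                for (int i = 0; i < 3; i++) {                     // stand-in for many small files
                    byte[] payload = ("contents of small file " + i)
                            .getBytes(StandardCharsets.UTF_8);
                    writer.append(new Text("small-" + i + ".txt"),
                            new BytesWritable(payload));
                }
            } finally {
                writer.close();
            }
        }
    }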

Discussion Overview
group: hdfs-user
categories: hadoop
posted: Aug 18, '10 at 12:45a
active: Aug 18, '10 at 2:02a
posts: 3
users: 2 (Stu24mail: 2 posts, Allen Wittenauer: 1 post)
website: hadoop.apache.org...
irc: #hadoop
