Stuart - if Dhruba is giving the per-file and per-block sizes the namenode uses,
you really cannot get a more authoritative number anywhere else :) I would do
the back-of-envelope with ~160 bytes/file and ~150 bytes/block.
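A minimal back-of-envelope sketch of that estimate, assuming only the ~160 bytes/file
and ~150 bytes/block figures above; the exact per-object costs vary by Hadoop version
and JVM, so treat the result as a rough lower bound on namenode heap:

public class NamenodeHeapEstimate {
    // Rough per-object costs suggested in this thread.
    static final long BYTES_PER_FILE = 160;
    static final long BYTES_PER_BLOCK = 150;

    static long estimateBytes(long files, long blocks) {
        return files * BYTES_PER_FILE + blocks * BYTES_PER_BLOCK;
    }

    public static void main(String[] args) {
        long files = 100000000L;   // 100 million files
        long blocks = 200000000L;  // referencing 200 million blocks
        double gb = estimateBytes(files, blocks) / (double) (1L << 30);
        // Prints roughly 43 GB of raw metadata for the case in the Yahoo
        // blog quoted below; the "at least 60 GB" recommendation there
        // leaves headroom for JVM and other namenode overhead.
        System.out.printf("~%.1f GB of namenode heap for metadata%n", gb);
    }
}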
On Wed, Feb 2, 2011 at 9:08 PM, Stuart Smith wrote:


This is the best coverage I've seen from a source that would know:


http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/

One relevant quote:

To store 100 million files (referencing 200 million blocks), a name-node
should have at least 60 GB of RAM.

But, honestly, if you're just building out your cluster, you'll probably
run into a lot of other limits first: hard drive space, regionserver memory,
the infamous ulimit/xcievers :), etc.

Take care,
-stu

--- On Wed, 2/2/11, Dhruba Borthakur wrote:


From: Dhruba Borthakur <dhruba@gmail.com>

Subject: Re: HDFS without Hadoop: Why?
To: hdfs-user@hadoop.apache.org
Date: Wednesday, February 2, 2011, 9:00 PM


The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This
is a very rough calculation.

dhruba

On Wed, Feb 2, 2011 at 5:11 PM, Dhodapkar, Chinmay <chinmayd@qualcomm.com> wrote:
What you describe is pretty much my use case as well. Since I don't know
how big the number of files could get, I am trying to figure out whether there
is a theoretical design limitation in HDFS.



From what I have read, the namenode stores the metadata of all files
in RAM. Assuming (in my case) that each file is smaller than the configured
block size, there should be a very rough formula that can be used to
calculate the maximum number of files HDFS can serve based on the
RAM configured on the namenode?
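A minimal sketch of the inverse calculation being asked for here, assuming the
~160 bytes/file and ~150 bytes/block figures Dhruba gives above and roughly one
block per file (since each file is smaller than the block size); the 32 GB heap
is just an example value:

public class MaxFilesEstimate {
    public static void main(String[] args) {
        long heapBytes = 32L * 1024 * 1024 * 1024;  // e.g. a 32 GB namenode heap
        long bytesPerSmallFile = 160 + 150;         // one file entry plus one block entry
        // Roughly 110 million sub-block-size files for 32 GB, before
        // allowing for JVM and other namenode overhead.
        System.out.println("~" + (heapBytes / bytesPerSmallFile) + " small files");
    }
}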



Can any of the implementers comment on this? Am I even thinking on the
right track…?



Thanks, Ian, for the Haystack link…very informative indeed.



-Chinmay







From: Stuart Smith [mailto:stu24mail@yahoo.com]
Sent: Wednesday, February 02, 2011 4:41 PM
To: hdfs-user@hadoop.apache.org
Subject: RE: HDFS without Hadoop: Why?



Hello,
I'm actually using hbase/hadoop/hdfs for lots of small files (with a
long tail of larger files). Well, millions of small files - I don't know
what you mean by lots :)

Facebook probably knows better, but what I do is:

- store metadata in HBase
- store files smaller than 10 MB or so in HBase
- store larger files in an HDFS directory tree

I started out storing files of 64 MB (the chunk size) and smaller in HBase, but that
causes issues with regionservers when running M/R jobs. This is related to
the fact that I'm running a cobbled-together cluster and my regionservers
don't have that much memory. I would play with the size to see what works for
you.
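
A minimal sketch of the size-based split described above, assuming the 2011-era
HBase client API and the standard Hadoop FileSystem API; the table name, column
families, and HDFS path layout are made up for illustration, and the 10 MB
threshold is the knob to tune to your regionserver memory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HybridFileStore {
    // Payloads at or below this size are stored inline in HBase.
    private static final long INLINE_THRESHOLD = 10L * 1024 * 1024; // ~10 MB

    public static void store(String key, byte[] data) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "files");            // hypothetical table name
        Put put = new Put(Bytes.toBytes(key));
        // Metadata always goes to HBase.
        put.add(Bytes.toBytes("meta"), Bytes.toBytes("size"),
                Bytes.toBytes(Long.toString(data.length)));

        if (data.length <= INLINE_THRESHOLD) {
            // Small file: keep the bytes in HBase next to the metadata.
            put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), data);
        } else {
            // Large file: write it to HDFS and record its location in HBase.
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/bigfiles/" + key);         // hypothetical layout
            FSDataOutputStream out = fs.create(path);
            out.write(data);
            out.close();
            put.add(Bytes.toBytes("content"), Bytes.toBytes("hdfsPath"),
                    Bytes.toBytes(path.toString()));
        }
        table.put(put);
        table.close();
    }
}

Reads would do the reverse: fetch the row from HBase and follow the recorded HDFS
path whenever the payload is not stored inline.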

Take care,
-stu

--- On Wed, 2/2/11, Dhodapkar, Chinmay <chinmayd@qualcomm.com> wrote:

From: Dhodapkar, Chinmay <chinmayd@qualcomm.com>
Subject: RE: HDFS without Hadoop: Why?
To: hdfs-user@hadoop.apache.org
Date: Wednesday, February 2, 2011, 7:28 PM

Hello,



I have been following this thread for some time now. I am very comfortable
with the advantages of HDFS, but I still have lingering questions about using
HDFS for general-purpose storage (no MapReduce/HBase, etc.).



Can somebody shed light on what the limitations are on the number of files
that can be stored? Is it limited in any way by the namenode? The use case I
am interested in is storing a very large number of relatively small files
(1 MB to 25 MB).



Interestingly, I saw a Facebook presentation on how they use HBase/HDFS
internally. They seem to store all metadata in HBase and the actual
images/files/etc. in something called “Haystack” (why not use HDFS, since they
already have it?). Anybody know what “Haystack” is?



Thanks!

Chinmay







From: Jeff Hammerbacher [mailto:hammer@cloudera.com]
Sent: Wednesday, February 02, 2011 3:31 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS without Hadoop: Why?




- Large block size wastes space for small files. The minimum file size
is 1 block.

That's incorrect. If a file is smaller than the block size, it will only
consume as much space as there is data in the file.
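
One way to check this with the standard FileSystem API: ContentSummary.getSpaceConsumed()
reports the file's actual length times its replication factor, not a multiple of the
block size (the path below is just a placeholder for a small file already in HDFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileUsage {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/one-megabyte-file");   // e.g. a 1 MB file in HDFS
        ContentSummary cs = fs.getContentSummary(p);
        // For a 1 MB file with replication 3 this prints ~1 MB and ~3 MB,
        // not a multiple of the 64 MB block size.
        System.out.println("logical length: " + cs.getLength());
        System.out.println("space consumed: " + cs.getSpaceConsumed());
    }
}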


- There are no hardlinks, softlinks, or quotas.

That's incorrect; there are quotas and softlinks.






--
Connect to me at http://www.facebook.com/dhruba
