Apr 8, 2011 at 6:51 pm
On Fri, Apr 8, 2011 at 1:59 PM, Edward Capriolo wrote:
Right. Most inodes are always cached when:
1) small disks
2) light load.
But that is not the case with hadoop.
Making the problem worse:
It seems like hadoop issues 'du -sk' for all disks at the
same time. This pulverises the cache.
All this to calculate a size that is typically within .01% of what a
df estimate would tell us.
Don't know your setup, but I think this is manageable in the short to medium
term. Even with a 20TB node, you are likely looking at much less than a
million files, depending on your configuration and usage. I would much rather
blow 500MB-1GB on keeping these entries in RAM than on the pagecache, where
most of it probably ends up hitting the disks anyway.
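To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. The one million files is the upper bound I guessed above, not a
measured count, and the helper name is made up for illustration:

```python
# Back-of-the-envelope: how much RAM per cached entry does the budget allow?
# Assumption: at most one million files on the node (an upper-bound guess).
MB = 1024 * 1024

def bytes_per_entry(ram_bytes, n_files):
    """RAM available per file entry for a given total budget."""
    return ram_bytes / n_files

n_files = 1_000_000
low = bytes_per_entry(500 * MB, n_files)    # 500MB budget
high = bytes_per_entry(1024 * MB, n_files)  # 1GB budget

print(f"{low:.0f} to {high:.0f} bytes per entry")
```

So the budget works out to somewhere around half a kilobyte to a kilobyte
per file, and more than that if the real file count is well under a million.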
The one case where I think du is needed is when people haven't
dedicated the entire space on a drive to hadoop. Using df in this case
wouldn't accurately reflect usage.
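To make the difference concrete, here is a minimal Python sketch of the two
ways of measuring. It is only an illustration, not what hadoop actually runs:
the function names are made up, and it assumes a Unix-like system where
st_blocks is counted in 512-byte units.

```python
import os

def du_style_bytes(root):
    """Sum allocated blocks under root, roughly what 'du -sk' reports.

    Needs one lstat() per entry, so it touches every inode in the tree.
    Unlike du, this sketch counts hard-linked files once per link.
    Returns (allocated_bytes, entries_visited).
    """
    total = os.lstat(root).st_blocks * 512
    entries = 1
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            st = os.lstat(os.path.join(dirpath, name))
            total += st.st_blocks * 512
            entries += 1
    return total, entries

def df_style_used_bytes(path):
    """Used bytes on the whole filesystem holding path, as df would see it.

    A single statvfs() call, no matter how many files exist.
    """
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize
```

On a drive dedicated to hadoop the two figures should land close together.
On a shared drive the df-style figure also includes everything else stored
on that filesystem, which is the case where it stops reflecting hadoop usage.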