[ https://issues.apache.org/jira/browse/HADOOP-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577719#action_12577719 ]
aw edited comment on HADOOP-2991 at 3/11/08 11:07 PM:
--------------------------------------------------------------------
Ahh file systems. Can't live with them, can't live with them.
First off: I'm not a big fan of percentages when dealing with file systems.
Back in the day, UFS would reserve 10% of the file system for root's usage. So on a 10G disk, it would save 1G for itself. Not a big deal, and when the file system had issues, that reserve worked out well. But a 100G disk would turn into 10G reserved. Ugh. Not cool. Go even bigger and the amounts get insane. So many implementations changed this to a sliding scale rather than a single flat percentage. Some food for thought.
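To put some numbers behind that, here is a tiny sketch of flat-percentage versus sliding-scale reservation. The tiers in slidingReserve are invented purely for illustration; they are not taken from any real UFS or ext implementation.

// Rough sketch of why a flat reserve percentage scales badly and what a
// sliding scale looks like. The tier boundaries here are made up for
// illustration, not taken from any real file system.
public class ReserveScale {
    static long flatReserve(long diskBytes, double pct) {
        return (long) (diskBytes * pct);
    }

    // Hypothetical sliding scale: 10% of the first 10G, 5% of the next 90G,
    // 1% of everything beyond that.
    static long slidingReserve(long diskBytes) {
        long g = 1L << 30;
        long first = Math.min(diskBytes, 10 * g);
        long second = Math.min(Math.max(diskBytes - 10 * g, 0), 90 * g);
        long rest = Math.max(diskBytes - 100 * g, 0);
        return (long) (first * 0.10 + second * 0.05 + rest * 0.01);
    }

    public static void main(String[] args) {
        long g = 1L << 30;
        for (long size : new long[] {10 * g, 100 * g, 1000 * g}) {
            System.out.printf("%5dG  flat 10%%: %5dG   sliding: %5dG%n",
                size / g, flatReserve(size, 0.10) / g, slidingReserve(size) / g);
        }
    }
}

The flat rule keeps growing linearly with the disk, while the sliding scale levels off, which is the behavior most implementations moved to.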
Secondly, df. A great source of cross-platform trouble... Let me throw out one of my favorite real-world examples, this time from one of my home machines:
Filesystem         size   used  avail  capacity  Mounted on
int                165G    28K    21G        1%  /int
int/home           165G    68G    21G       77%  /export/home
int/mii2u          165G  1014K    21G        1%  /int/mii2u
int/squid-cache    5.0G   4.4G   591M       89%  /int/squid-cache
int/local          165G   289M    21G        2%  /usr/local
Stop. Go back and look carefully at those numbers.
In case you haven't guessed, this is (partial) output of df -h from a Solaris machine using ZFS. It is pretty clear that, with the exception of the file system under a hard quota (int/squid-cache), size != used + available. Instead, size = (used by all file systems in the pool) + available. Using "used" in any sort of capacity calculation isn't going to tell you anything about how much space is actually available. This type of output is fairly common for any pool-based storage system.
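Just to illustrate the trap, here is a minimal sketch of what goes wrong if a tool computes free space as size - used. The DfLine class is hypothetical (it is not DataNode code); the numbers are the int/home line above.

// Minimal sketch (not DataNode code): on pooled storage, "size - used" says
// nothing useful, so the only figure worth trusting from df is "avail".
public class DfLine {
    final long size, used, avail;   // bytes

    DfLine(long size, long used, long avail) {
        this.size = size; this.used = used; this.avail = avail;
    }

    long freeBySubtraction() { return size - used; }  // misleading on ZFS-style pools
    long freeByAvail()       { return avail; }        // what df actually reports as free

    public static void main(String[] args) {
        long g = 1L << 30;
        // int/home from the df output above: size 165G, used 68G, avail 21G
        DfLine home = new DfLine(165 * g, 68 * g, 21 * g);
        System.out.println("size - used = " + home.freeBySubtraction() / g + "G");
        System.out.println("avail       = " + home.freeByAvail() / g + "G");
        // 97G vs 21G: trusting size - used would overcommit the pool by ~76G.
    }
}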
Then there are file system quotas, which, depending upon the OS, may or may not show up in df output. The same goes for the reserved-space percentages mentioned above.
Anyway, what does this all mean?
Well, in my mind, it means that all of the above suggestions in the JIRA just don't work out well... and that's just on UNIX. Heck, even a heterogeneous UNIX environment makes me shudder. How does one work with pooled storage *and* traditional file systems if you want to have a single config?
Quite frankly, you can't. As much as I hate to say it, I suspect the answer (as unpopular as it might be) is probably to set a hard limit on how much space the HDFS will use rather than trying to second guess what the operating system is doing. Does this suck? Yes. Does this suck less than all of the gymnastics around trying to figure this out dynamically? I think so.
Let's face it: in order to keep an app like Hadoop from eating more space than you want it to, regardless of how the file system itself is configured, you are essentially looking at partitioning the storage for it. At that point, you might as well just configure the limit in the app and be done with it. In the end, this basically means that HDFS needs to keep track of how much space it is using at all times and never go over that limit. This likely also means that it must implement high and low water marks: hitting the low water mark means writes to the file system get deferred/deprioritized, and hitting the high water mark means start rebalancing the blocks or declaring the file system full.
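As a rough sketch of the bookkeeping I'm describing (the class, thresholds, and method names are all invented here; this is not existing DataNode code):

// Sketch of the proposed bookkeeping: HDFS tracks its own usage against a
// hard cap, defers writes past a low water mark, and treats the high water
// mark as "rebalance or report full".
public class SpaceBudget {
    enum State { OK, DEFER_WRITES, FULL_OR_REBALANCE }

    private final long maxBytes;       // hard limit the admin hands to HDFS
    private final double lowWater;     // e.g. 0.80 of maxBytes
    private final double highWater;    // e.g. 0.95 of maxBytes
    private long usedBytes;            // maintained by HDFS itself, not read from df

    SpaceBudget(long maxBytes, double lowWater, double highWater) {
        this.maxBytes = maxBytes;
        this.lowWater = lowWater;
        this.highWater = highWater;
    }

    boolean tryReserve(long blockBytes) {
        if (usedBytes + blockBytes > maxBytes) return false;  // never exceed the cap
        usedBytes += blockBytes;
        return true;
    }

    void release(long blockBytes) { usedBytes -= blockBytes; }

    State state() {
        double frac = (double) usedBytes / maxBytes;
        if (frac >= highWater) return State.FULL_OR_REBALANCE;
        if (frac >= lowWater)  return State.DEFER_WRITES;
        return State.OK;
    }

    public static void main(String[] args) {
        long g = 1L << 30;
        SpaceBudget budget = new SpaceBudget(50 * g, 0.80, 0.95);
        budget.tryReserve(45 * g);
        System.out.println(budget.state());            // DEFER_WRITES
        System.out.println(budget.tryReserve(10 * g)); // false: would blow the cap
    }
}

The point is that usedBytes is HDFS's own accounting, so the limit holds no matter what df claims about the underlying storage.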
Now, I know that it might be difficult to calculate what the max space should be. On reflection, though, I'm not really sure that's true. If I know what size my slice is and I have an idea of how much of that I want to give to HDFS, then I can calculate that max value. If an admin gets in trouble with the space being allocated, they can lower the high and low water marks, which should trigger a rebalance and thus free space. This is essentially how apps like squid work, and it works quite well. [Interestingly enough, squid's file system structure on disk is quite similar to how the data node stores its blocks.... Hmm... ]
One thing to point out with this solution: if the admin overcommits the space on the drive then, quite frankly, they hung themselves. They know how much space they gave HDFS. If they go over it, oh well. I'd much rather have MapRed blow up than HDFS blow up, since it is much easier to pick up the pieces of a broken job than it is of the file system, especially in the case where there are under-replicated blocks.
Again, I totally admit that this solution is likely to be unpopular. But I can't see a way out of this mess that works with the multiple types of storage systems in use.
P.S., while I'm here, let me throw one more of my own personal prejudices into this: putting something like Hadoop in / or some other file system (but not necessarily device) that is used by the OS is just *begging* for trouble. That's just bad practice for a real, production system. If someone does that, they rightly deserve any pain it causes.
dfs.du.reserved not honored in 0.15/16 (regression from 0.14+patch for 2549)
----------------------------------------------------------------------------
Key: HADOOP-2991
URL: https://issues.apache.org/jira/browse/HADOOP-2991
Project: Hadoop Core
Issue Type: Bug
Components: dfs
Affects Versions: 0.15.0, 0.15.1, 0.15.2, 0.15.3, 0.16.0
Reporter: Joydeep Sen Sarma
Priority: Critical
Changes for https://issues.apache.org/jira/browse/HADOOP-1463 have caused a regression. Earlier:
- we could set dfs.du.reserved to 1G and be *sure* that 1G would not be used.
Now this is no longer true. I am quoting Pete Wyckoff's example:
<example>
Let's look at an example: a 100 GB disk, with /usr using 45 GB and dfs using 50 GB.
df -kh shows:
Capacity = 100 GB
Available = 1 GB (remember, ~4 GB chopped out for metadata and stuff)
Used = 95 GB
remaining = 100 GB - 50 GB - 1 GB = 49 GB
min(remaining, available) = 1 GB
98% of which is usable for DFS, apparently.
So, we're at the limit, but are free to use 98% of the remaining 1GB.
</example>
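To make the regression concrete, here is a small sketch of the arithmetic in the quoted example. It is not the actual datanode code, just the two expressions evaluated with the numbers above; the "fixed" line is one way of honoring the reservation against what df says is really available.

// Sketch of the arithmetic in the example above (not the actual datanode
// code): with reserved = 1G, the formula the report describes still treats
// roughly a gigabyte as usable, while subtracting reserved from what df says
// is actually available reports zero, which is what dfs.du.reserved promises.
public class ReservedRegression {
    public static void main(String[] args) {
        long g = 1L << 30;
        long capacity  = 100 * g;  // df "size"
        long available =   1 * g;  // df "avail"
        long dfsUsed   =  50 * g;  // space the datanode itself is using
        long reserved  =   1 * g;  // dfs.du.reserved

        // Formula the report describes after HADOOP-1463:
        long remaining = capacity - dfsUsed - reserved;        // 49G
        long buggyUsable = Math.min(remaining, available);     // 1G, reservation ignored

        // Honoring the reservation against real availability:
        long fixedUsable = Math.max(available - reserved, 0);  // 0G

        System.out.println("post-1463 usable: " + buggyUsable / g + "G");
        System.out.println("reserved-honoring usable: " + fixedUsable / g + "G");
    }
}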
This is broken. Based on the discussion on HADOOP-1463, it seems like the notion of 'capacity' being the first field of 'df' is problematic. For example, here's what our df output looks like:
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             130G  123G   49M 100% /
As you can see, 'Size' is a misnomer - that much space is not available. Rather, the actual usable space is 123G + 49M ~ 123G. (Not entirely sure what the discrepancy is due to, but I have heard it may be space reserved for file system metadata.) Because of this discrepancy, we end up in a situation where the file system runs out of space.
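As a small illustration (numbers from the /dev/sda3 line above; not actual datanode code), an estimate of effective capacity built from used + avail rather than the Size field:

// Small sketch: treating df's "Size" column as capacity overstates what is
// actually usable; used + avail is a closer estimate of the real ceiling.
public class EffectiveCapacity {
    public static void main(String[] args) {
        long g = 1L << 30, m = 1L << 20;
        long size  = 130 * g;
        long used  = 123 * g;
        long avail =  49 * m;

        long reportedCapacity  = size;          // 130G, never actually reachable
        long effectiveCapacity = used + avail;  // ~123G, what the fs will really hold

        System.out.println("reported:  " + reportedCapacity / g + "G");
        System.out.println("effective: " + effectiveCapacity / g + "G");
    }
}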