Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect
--------------------------------------------------------------------------------------

Key: HADOOP-1170
URL: https://issues.apache.org/jira/browse/HADOOP-1170
Project: Hadoop
Issue Type: Bug
Components: dfs
Affects Versions: 0.11.2
Reporter: Igor Bolotin


While investigating performance issues in our Hadoop DFS/MapReduce cluster I saw very high CPU usage by DataNode processes.

A stack trace showed the following on most of the data nodes:
"org.apache.hadoop.dfs.DataNode$DataXceiveServer@528acf6e" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
at java.io.UnixFileSystem.checkAccess(Native Method)
at java.io.File.canRead(File.java:660)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
- locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
at java.lang.Thread.run(Thread.java:595)

I understand that it would take a while to check the entire data directory, as we have some 180,000 blocks/files in there. But what really bothers me is that, from the code, I can see this check is executed for every client connection to the DataNode, which also means for every task executed in the cluster. Once I commented out the check and restarted the datanodes, performance went up and CPU usage dropped to a reasonable level.
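To put the report in perspective, a back-of-envelope sketch of the per-connection cost, using the 180,000-block figure above. The connection rate is a hypothetical load figure, not measured data; the point is that the number of canRead()/access(2) calls scales as blocks × connections:

```java
// Illustrative cost estimate for the per-connection checkDataDir() scan.
// The 180,000-block count comes from the report; the connection rate is
// an assumed example load.
public class CheckCost {
    public static void main(String[] args) {
        long blocks = 180_000L;          // files under the data directory
        long connectionsPerMin = 600L;   // hypothetical cluster task load
        long accessCallsPerMin = blocks * connectionsPerMin;
        System.out.println(accessCallsPerMin + " canRead() calls per minute");
    }
}
```

Even at a modest connection rate, that is over a hundred million native access checks per minute, which matches the checkAccess()-dominated stack trace above.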


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Igor Bolotin (JIRA) at Mar 28, 2007 at 4:54 am
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Igor Bolotin updated HADOOP-1170:
    ---------------------------------

    Attachment: 1170.patch

    The attached patch removes the checkDataDir() calls from the DataXceiveServer.run() method.
  • Igor Bolotin (JIRA) at Mar 28, 2007 at 4:54 am
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Igor Bolotin updated HADOOP-1170:
    ---------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Mar 28, 2007 at 5:14 am
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484711 ]

    Hadoop QA commented on HADOOP-1170:
    -----------------------------------

    +1, because http://issues.apache.org/jira/secure/attachment/12354393/1170.patch applied and successfully tested against trunk revision http://svn.apache.org/repos/asf/lucene/hadoop/trunk/523072. Results are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch
  • Raghu Angadi (JIRA) at Mar 28, 2007 at 6:32 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484955 ]

    Raghu Angadi commented on HADOOP-1170:
    --------------------------------------


    It is invoked in two more places in DataNode.java, though not this often. Should we remove those as well? It is called once before sending a block report, and when a command is received from the namenode (e.g. a block-invalidate command in response to a heartbeat).


  • Hairong Kuang (JIRA) at Mar 28, 2007 at 6:58 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484961 ]

    Hairong Kuang commented on HADOOP-1170:
    ---------------------------------------

    I agree that it is too costly to call checkDirs on every I/O operation. A background thread that periodically does the sanity check would be nicer.

    The patch should also clean up the code that does the error handling.
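    A minimal sketch of the background checker suggested here: one daemon thread scans the data directories on a fixed period, instead of a scan per client connection. The class and method names, and the check interval, are illustrative assumptions, not the eventual Hadoop implementation; checkDataDirs stands in for FSDataset.checkDataDir().

    ```java
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical periodic data-directory checker. The caller passes the
    // actual check as a Runnable; error handling (e.g. shutting down the
    // datanode on a bad directory) belongs inside that Runnable.
    class PeriodicDirChecker {
        private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "dataDirChecker");
                t.setDaemon(true);   // must not block datanode shutdown
                return t;
            });

        // Run checkDataDirs every `period` units, first run after one period.
        void start(Runnable checkDataDirs, long period, TimeUnit unit) {
            scheduler.scheduleAtFixedRate(checkDataDirs, period, period, unit);
        }

        void stop() {
            scheduler.shutdownNow();
        }
    }
    ```

    With, say, a 15-minute period, the full directory scan happens a few times an hour regardless of connection rate, rather than once per block request.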
  • dhruba borthakur (JIRA) at Mar 28, 2007 at 7:12 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484967 ]

    dhruba borthakur commented on HADOOP-1170:
    ------------------------------------------

    I like the idea of a background thread that periodically checks the data directories. The idea is to detect bad/inaccessible data directories and shutdown the datanode if this occurs, right?
  • Raghu Angadi (JIRA) at Mar 28, 2007 at 7:40 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484971 ]

    Raghu Angadi commented on HADOOP-1170:
    --------------------------------------


    There is going to be a periodic checker for all the blocks. The same thread could check some of these conditions too. For this jira, I vote for removing all calls to checkDirs in DataNode.java.

  • Doug Cutting (JIRA) at Mar 30, 2007 at 5:38 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485578 ]

    Doug Cutting commented on HADOOP-1170:
    --------------------------------------

    Is there a consensus to commit this as-is, or is someone working on an improved version?
  • Igor Bolotin (JIRA) at Mar 30, 2007 at 6:02 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Igor Bolotin updated HADOOP-1170:
    ---------------------------------

    Status: Open (was: Patch Available)
  • Igor Bolotin (JIRA) at Mar 30, 2007 at 6:02 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485583 ]

    Igor Bolotin commented on HADOOP-1170:
    --------------------------------------

    I'll prepare a patch with all the calls removed later today.
  • Igor Bolotin (JIRA) at Mar 30, 2007 at 7:56 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Igor Bolotin updated HADOOP-1170:
    ---------------------------------

    Attachment: 1170-v2.patch

    This patch removes all FSDataset.checkDataDir() calls from DataNode, as well as the DiskErrorException handling in the DataXceiveServer.run() method. I decided not to touch the DiskErrorException handling in DataNode.offerService() - I just don't know whether it's possible to get that exception there.
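    The shape of the change can be sketched as follows. This is a hypothetical simplification, not the actual 0.11 DataNode source: checkDataDir() here is a stand-in counter, and the loop models DataXceiveServer.run() accepting connections. The point is that before the patch every accepted connection triggered a full-tree scan, and after it none do:

    ```java
    public class XceiverLoopSketch {
        static int fullTreeScans = 0;

        // Stand-in for FSDataset.checkDataDir(): in 0.11 this walked every
        // block file on every volume (~180,000 stat calls on the reporter's nodes).
        static void checkDataDir() { fullTreeScans++; }

        // Loop shape BEFORE the patch: one full scan per accepted connection.
        static void runBefore(int connections) {
            for (int i = 0; i < connections; i++) {
                checkDataDir();          // the call removed by 1170-v2.patch
                handleConnection(i);
            }
        }

        // Loop shape AFTER the patch: connections are served without any scan.
        static void runAfter(int connections) {
            for (int i = 0; i < connections; i++) {
                handleConnection(i);
            }
        }

        static void handleConnection(int id) { /* serve a block read/write */ }

        public static void main(String[] args) {
            runBefore(1000);
            int before = fullTreeScans;
            fullTreeScans = 0;
            runAfter(1000);
            System.out.println(before + " " + fullTreeScans);
        }
    }
    ```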

  • Igor Bolotin (JIRA) at Mar 30, 2007 at 7:58 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Igor Bolotin updated HADOOP-1170:
    ---------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Mar 30, 2007 at 8:45 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485636 ]

    Hadoop QA commented on HADOOP-1170:
    -----------------------------------

    +1, because http://issues.apache.org/jira/secure/attachment/12354634/1170-v2.patch applied and successfully tested against trunk revision http://svn.apache.org/repos/asf/lucene/hadoop/trunk/524205. Results are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch
  • Doug Cutting (JIRA) at Mar 30, 2007 at 9:05 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Doug Cutting updated HADOOP-1170:
    ---------------------------------

    Resolution: Fixed
    Fix Version/s: 0.13.0
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Igor.
  • Hadoop QA (JIRA) at Mar 31, 2007 at 11:20 am
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485718 ]

    Hadoop QA commented on HADOOP-1170:
    -----------------------------------

    Integrated in Hadoop-Nightly #43 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/43/)
  • dhruba borthakur (JIRA) at Apr 19, 2007 at 6:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490129 ]

    dhruba borthakur commented on HADOOP-1170:
    ------------------------------------------

    This patch improves the performance situation, but it removes all checkDirs() calls from the datanode. That introduces the problem that disks might not get checked for a long time, which is dangerous for a cluster where disks go bad. I think we should implement a background thread to call checkDirs() before this patch is deployed on a real cluster.
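    The background-thread idea suggested above can be sketched with a scheduled daemon thread that scans the data directories on a fixed interval, decoupled from the connection path. All names below are illustrative assumptions, not the actual Hadoop implementation (that work was tracked separately):

    ```java
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    public class PeriodicDiskChecker {
        static final AtomicInteger scans = new AtomicInteger();

        // Stand-in for FSDataset.checkDataDir(); a real checker would walk
        // the data directories and take failed volumes out of service.
        static void checkDataDir() { scans.incrementAndGet(); }

        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService checker =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "disk-checker");
                    t.setDaemon(true);   // must not block DataNode shutdown
                    return t;
                });
            // Scan on a fixed interval instead of once per client connection;
            // scheduleWithFixedDelay avoids overlapping scans if one runs long.
            checker.scheduleWithFixedDelay(
                PeriodicDiskChecker::checkDataDir, 0, 50, TimeUnit.MILLISECONDS);

            Thread.sleep(300);           // the DataNode keeps serving meanwhile
            checker.shutdown();
            System.out.println(scans.get() >= 2);
        }
    }
    ```

    On a real datanode the interval would be minutes or hours, not milliseconds; the short interval here just keeps the sketch observable.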
  • Doug Cutting (JIRA) at Apr 19, 2007 at 6:34 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490132 ]

    Doug Cutting commented on HADOOP-1170:
    --------------------------------------
    "we should either implement a background thread to call checkDirs() before this patch can be deployed on a real cluster"
    Please file a new issue to be fixed for 0.13 for this.
  • Igor Bolotin (JIRA) at Apr 19, 2007 at 6:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490139 ]

    Igor Bolotin commented on HADOOP-1170:
    --------------------------------------

    There is another issue, HADOOP-1200, that was opened for exactly this.
    Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect
    --------------------------------------------------------------------------------------

    Key: HADOOP-1170
    URL: https://issues.apache.org/jira/browse/HADOOP-1170
    Project: Hadoop
    Issue Type: Bug
    Components: dfs
    Affects Versions: 0.11.2
    Reporter: Igor Bolotin
    Fix For: 0.13.0

    Attachments: 1170-v2.patch, 1170.patch


    While investigating performance issues in our Hadoop DFS/MapReduce cluster I saw very high CPU usage by DataNode processes.
    Stack trace showed following on most of the data nodes:
    "org.apache.hadoop.dfs.DataNode$DataXceiveServer@528acf6e" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
    at java.io.UnixFileSystem.checkAccess(Native Method)
    at java.io.File.canRead(File.java:660)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
    at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
    - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
    at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
    at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
    at java.lang.Thread.run(Thread.java:595)
    I understand that it would take a while to check the entire data directory, as we have some 180,000 blocks/files in there. But what really bothers me is that, from the code, I see that this check is executed for every client connection to the DataNode, which also means for every task executed in the cluster. Once I commented out the check and restarted the datanodes, performance went up and CPU usage dropped to a reasonable level.
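    The cost described above is easy to reproduce outside Hadoop: a recursive walk that calls File.canRead() on every entry performs one native access check per file, so a volume with ~180,000 blocks costs ~180,000 system calls per connection. The sketch below uses hypothetical names (it only mimics the shape of FSDir.checkDirTree, not the actual Hadoop code) to count those checks, and shows one obvious mitigation: skip the scan unless a minimum interval has elapsed since the last one.

    ```java
    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    public class CheckDirSketch {
        static long checks = 0;

        // Mimics the shape of FSDir.checkDirTree: one access check per entry,
        // recursing into subdirectories. Cost is O(total files) per call.
        static void checkDirTree(File dir) throws IOException {
            if (!dir.canRead()) throw new IOException("not readable: " + dir);
            checks++;                        // one check for the directory itself
            File[] children = dir.listFiles();
            if (children == null) return;
            for (File child : children) {
                checks++;                    // one check per child entry
                if (child.isDirectory()) {
                    checkDirTree(child);
                } else if (!child.canRead()) {
                    throw new IOException("not readable: " + child);
                }
            }
        }

        // Hypothetical throttle (not the committed patch): run the full scan at
        // most once per MIN_INTERVAL_MS, so per-connection calls are usually no-ops.
        static final long MIN_INTERVAL_MS = 60_000;
        static long lastCheck = 0;

        static void maybeCheckDirTree(File dir) throws IOException {
            long now = System.currentTimeMillis();
            if (now - lastCheck < MIN_INTERVAL_MS) return;  // cheap fast path
            lastCheck = now;
            checkDirTree(dir);
        }

        public static void main(String[] args) throws IOException {
            File root = Files.createTempDirectory("blocks").toFile();
            for (int i = 0; i < 100; i++) {
                Files.createFile(new File(root, "blk_" + i).toPath());
            }
            checkDirTree(root);        // full scan: one check per entry
            long full = checks;
            maybeCheckDirTree(root);   // first throttled call still scans
            maybeCheckDirTree(root);   // second returns immediately
            System.out.println("full scan touched " + full + " entries");
        }
    }
    ```

    With a flat tree the count is one check for the root plus one per block file, which is why the per-connection cost scales directly with the number of blocks on the volume.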
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • eric baldeschwieler (JIRA) at Apr 19, 2007 at 8:01 pm
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490156 ]

    eric baldeschwieler commented on HADOOP-1170:
    ---------------------------------------------

    The thing to understand is that we cannot upgrade our cluster to HEAD with this patch committed. This patch breaks us. We'll try to move forward in the new issue rather than advocating rolling this back, but this patch did not address the concerns we raised in this bug, so we have a problem. I hope we can avoid this in the future.

    I'm not advocating rolling back because I agree that these checks were not the appropriate solution to the disk problems they solved.

    In case the context isn't clear: we frequently see individual drives go read-only on our machines. This check was inserted so that this problem could be detected early, avoiding failed jobs caused by write failures.
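    Notably, the read-only-drive failure mode described above isn't really what a canRead() tree walk detects, since files on a read-only remount are still readable. A much cheaper and more targeted probe is a single write attempt per volume. This is a hypothetical sketch, not the Hadoop code or the committed patch:

    ```java
    import java.io.File;
    import java.io.IOException;

    public class WritableProbe {
        // One probe file per volume replaces a full per-file tree walk.
        // A read-only remount makes createNewFile()/delete() fail immediately,
        // so the check is O(1) regardless of how many blocks the volume holds.
        static boolean volumeIsWritable(File volumeRoot) {
            File probe = new File(volumeRoot, ".probe_" + System.nanoTime());
            try {
                if (!probe.createNewFile()) return false;
                return probe.delete();
            } catch (IOException e) {
                return false;  // e.g. EROFS on a read-only filesystem
            }
        }

        public static void main(String[] args) throws IOException {
            File tmp = java.nio.file.Files.createTempDirectory("vol").toFile();
            System.out.println(volumeIsWritable(tmp));
        }
    }
    ```

    A probe like this could run periodically per volume instead of per connection, keeping the early-detection property the check was added for without the per-connection directory scan.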
  • Hadoop QA (JIRA) at May 8, 2007 at 11:25 am
    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494258 ]

    Hadoop QA commented on HADOOP-1170:
    -----------------------------------

    Integrated in Hadoop-Nightly #82 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/82/)

Discussion Overview
group: common-dev
categories: hadoop
posted: Mar 28, '07 at 4:46a
active: May 8, '07 at 11:25a
posts: 21
users: 1
website: hadoop.apache.org...
irc: #hadoop