Unable to run jobs when all the nodes in rack are down
------------------------------------------------------

Key: HADOOP-5599
URL: https://issues.apache.org/jira/browse/HADOOP-5599
Project: Hadoop Core
Issue Type: Bug
Components: dfs
Affects Versions: 0.20.0
Reporter: Ramya R
Fix For: 0.20.0


Jobs such as randomwriter, sort, validator fail when all the datanodes in a rack are down.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Ramya R (JIRA) at Mar 31, 2009 at 2:55 pm
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694138#action_12694138 ]

    Ramya R commented on HADOOP-5599:
    ---------------------------------

    Consider the following simple scenario:
    * Generate data
    * Datanodes in a given rack go down (2 replicas of many blocks are lost)
    * Run sort on the generated data
    * Sort job fails
    * The filesystem is declared CORRUPT

    However, the expected behavior would be to sort the data successfully using the third replica of each block.
  • Nigel Daley (JIRA) at Mar 31, 2009 at 3:24 pm
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694147#action_12694147 ]

    Nigel Daley commented on HADOOP-5599:
    -------------------------------------

    Does your randomwriter.cfg lower the default replication from 3 to 2 or 1? Or does the randomwriter code?
  • Ramya R (JIRA) at Mar 31, 2009 at 3:38 pm
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694157#action_12694157 ]

    Ramya R commented on HADOOP-5599:
    ---------------------------------

    bq. Does your randomwriter.cfg lower the default replication from 3 to 2 or 1? Or does the randomwriter code?
    Neither of the two scenarios above applies.
  • Ramya R (JIRA) at Mar 31, 2009 at 4:31 pm
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694173#action_12694173 ]

    Ramya R commented on HADOOP-5599:
    ---------------------------------

    I ran the above job on a 480-node cluster with many racks; the JT was brought up using the fairshare scheduler.

    A similar kind of behavior is observed in the following case as well:
    * Generate data and sort it
    * Datanodes in a given rack go down
    * Run testmapredsort. The job fails
    * The filesystem is declared CORRUPT

  • Koji Noguchi (JIRA) at Mar 31, 2009 at 5:29 pm
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694189#action_12694189 ]

    Koji Noguchi commented on HADOOP-5599:
    --------------------------------------

    The only open case I know of for blocks becoming corrupt due to a lost rack is HADOOP-4477.
  • Iyappan Srinivasan (JIRA) at Apr 3, 2009 at 8:23 am
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Iyappan Srinivasan updated HADOOP-5599:
    ---------------------------------------

    Attachment: 5599log.txt

    Logs of randomwriter console output, fsck console output and namenode logs.
  • Iyappan Srinivasan (JIRA) at Apr 3, 2009 at 8:29 am
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695292#action_12695292 ]

    Iyappan Srinivasan commented on HADOOP-5599:
    --------------------------------------------

    I was able to reproduce the issue described above.

    1) In the cluster, generate data using randomwriter.
    2) Pick one rack and kill all the datanodes in that rack only.
    3) Run the sort job. It fails.
    4) Run fsck from the root. It reports the data as corrupt.

    I have attached the logs of these steps.
  • Owen O'Malley (JIRA) at Apr 6, 2009 at 11:44 pm
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696299#action_12696299 ]

    Owen O'Malley commented on HADOOP-5599:
    ---------------------------------------

    Did you have the topology defined? If not, there is nothing DFS can do. If so, the question becomes why blocks ended up with all the replicas on one rack.
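
The point about topology can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the HDFS placement code: the helper names (`place_replicas`, `blocks_lost`), the 12-node cluster, and the reduced placement rule (first replica anywhere, the other two together on another rack when one is known) are all made up for illustration.

```python
import random

DEFAULT_RACK = "/default-rack"  # rack assigned to every node when no topology is defined


def place_replicas(nodes, rack_of, rng):
    """Simplified placement of 3 replicas, given the racks the namenode knows about."""
    first = rng.choice(nodes)
    off_rack = [n for n in nodes if rack_of[n] != rack_of[first]]
    if off_rack:
        # A second rack is known: put the other two replicas there.
        second = rng.choice(off_rack)
        same_rack = [n for n in off_rack
                     if rack_of[n] == rack_of[second] and n != second]
        pool = same_rack or [n for n in nodes if n not in (first, second)]
        third = rng.choice(pool)
    else:
        # Only one rack is known: any two other nodes will do.
        second, third = rng.sample([n for n in nodes if n != first], 2)
    return {first, second, third}


def blocks_lost(physical_rack, known_rack, dead_rack, blocks=1000, seed=0):
    """Count blocks whose replicas all sit on the physical rack that went down."""
    rng = random.Random(seed)
    nodes = sorted(physical_rack)
    lost = 0
    for _ in range(blocks):
        replicas = place_replicas(nodes, known_rack, rng)
        if all(physical_rack[n] == dead_rack for n in replicas):
            lost += 1
    return lost


# Hypothetical cluster: 3 physical racks of 4 nodes each.
physical = {f"node{i}": f"/rack{i // 4}" for i in range(12)}
no_topology = {n: DEFAULT_RACK for n in physical}

print("lost without topology:", blocks_lost(physical, no_topology, "/rack0"))
print("lost with topology:   ", blocks_lost(physical, physical, "/rack0"))
```

In this toy model, a block can only lose every replica to a single rack failure when the namenode sees all nodes as one rack.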
  • Ramya R (JIRA) at Apr 7, 2009 at 3:47 am
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696387#action_12696387 ]

    Ramya R commented on HADOOP-5599:
    ---------------------------------

    No, the topology was not defined. That's the reason why all the replicas were placed in the "default" rack. After defining the network topology, the jobs completed successfully using the remaining replica. Thanks, Owen.
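
For readers hitting the same problem, one common way to define the topology is a script that maps each host or IP passed on the command line to a rack path, referenced from the `topology.script.file.name` property (the name used in the 0.20 line; later releases are believed to use `net.topology.script.file.name`). The following is a minimal sketch; the addresses, rack names, and the `resolve` helper are hypothetical and not taken from this issue.

```python
#!/usr/bin/env python3
"""Minimal rack-mapping script sketch: prints one rack path per argument."""
import sys

DEFAULT_RACK = "/default-rack"  # fallback for hosts missing from the table

# Hypothetical host-to-rack table; replace with the real cluster layout.
RACK_OF = {
    "10.1.1.11": "/dc1/rack1",
    "10.1.1.12": "/dc1/rack1",
    "10.1.2.11": "/dc1/rack2",
    "10.1.2.12": "/dc1/rack2",
}


def resolve(hosts, table=RACK_OF):
    """Return the rack path for each host, in the same order as the input."""
    return [table.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more hosts as arguments and reads the
    # whitespace-separated rack paths from standard output.
    print(" ".join(resolve(sys.argv[1:])))
```

Once the namenode can tell racks apart, replicas of a block can be spread across racks, which matches the outcome reported here: the jobs completed using the remaining replica.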

  • Ramya R (JIRA) at Apr 7, 2009 at 7:40 am
    [ https://issues.apache.org/jira/browse/HADOOP-5599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ramya R resolved HADOOP-5599.
    -----------------------------

    Resolution: Invalid

Discussion Overview
group: common-dev
categories: hadoop
posted: Mar 31, '09 at 2:53p
active: Apr 7, '09 at 7:40a
posts: 11