Hi,

We are running our cluster on Amazon EC2, using the Cloudera scripts
to set up Hadoop. On the master node, we start the services below.

$AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode'
$AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode'
$AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start jobtracker'

$AS_HADOOP '"$HADOOP_HOME"/bin/hadoop dfsadmin -safemode wait'

On the slave machines, we run the services below.

$AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode'
$AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start tasktracker'

The main problem we are facing is that HDFS safemode recovery takes
more than an hour, which delays our job completion.

Below are the main log messages.

1. domU-12-31-39-0A-34-61.compute-1.internal 10/05/05 20:44:19 INFO
ipc.Client: Retrying connect to server:
ec2-184-73-64-64.compute-1.amazonaws.com/10.192.11.240:8020. Already
tried 21 time(s).
2. The reported blocks 283634 needs additional 322258 blocks to reach
the threshold 0.9990 of total blocks 606499. Safe mode will be turned
off automatically.

The first message appears in the tasktracker logs because the
jobtracker has not started; the jobtracker did not start because of
the HDFS safemode recovery.

The second message is logged during the recovery process.
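
The figures in the second message can be checked by hand. Below is a minimal sketch of that arithmetic, assuming the target is the total block count multiplied by the threshold and truncated to an integer. That rule is inferred from the numbers in the log, not taken from the Hadoop source, and `needed_blocks` is a hypothetical helper name.

```shell
#!/bin/sh
# needed_blocks REPORTED TOTAL THRESHOLD
# Prints how many more blocks must be reported before safe mode can end,
# ASSUMING the target is int(TOTAL * THRESHOLD). The truncation rule is
# inferred from the log figures, not read from the Hadoop source.
needed_blocks() {
  awk -v reported="$1" -v total="$2" -v threshold="$3" \
    'BEGIN { printf "%d\n", int(total * threshold) - reported }'
}

# Figures from the second log message.
needed_blocks 283634 606499 0.9990
```

With the values from the log this prints 322258, matching the "needs additional" figure. At that point the datanodes had reported 283634 of 606499 blocks, or roughly 47%.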

Is there something I am doing wrong?
How long does a normal HDFS safemode recovery take?
Would it speed things up to hold off starting the tasktrackers until
the jobtracker has started?
Are there any known Hadoop problems on Amazon clusters?

Thanks for your help.

Regards
Bala Mudiam


  • Steve Loughran at May 8, 2010 at 7:39 am

    Balanagireddy Mudiam wrote:

    > How long does a normal HDFS safemode recovery take?

    If you don't have a secondary namenode set up, the namenode has to
    replay its logged operations, and the time to recover then depends on
    how long the cluster has been up. 40 minutes is entirely possible.
    Incidentally, don't panic and kill the process; that only makes things
    worse.

    > Would it speed things up to hold off starting the tasktrackers until
    > the jobtracker has started?

    No.

    -steve
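
Rather than killing a namenode that looks stuck, its safemode state can be polled while it recovers. Below is a rough sketch, assuming `hadoop dfsadmin -safemode get` prints a line containing "Safe mode is ON" while safemode is active. `safemode_is_on` and `watch_safemode` are hypothetical helper names, and the 30-second default interval is arbitrary.

```shell
#!/bin/sh
# Returns 0 if the given text (output of `hadoop dfsadmin -safemode get`)
# reports that safe mode is on. ASSUMES the "Safe mode is ON" wording.
safemode_is_on() {
  case "$1" in
    *"Safe mode is ON"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints a timestamped status line every INTERVAL seconds (default 30)
# until the namenode stops reporting safe mode or the command fails.
watch_safemode() {
  interval="${1:-30}"
  while status="$("$HADOOP_HOME"/bin/hadoop dfsadmin -safemode get 2>&1)" \
      && safemode_is_on "$status"; do
    echo "$(date '+%H:%M:%S') $status"
    sleep "$interval"
  done
}

# Usage (on the master node, as the hadoop user):
#   watch_safemode 60
```

The loop only reads the status; it does not force the namenode out of safemode.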
  • Bhupesh Bansal at Jun 12, 2010 at 12:49 am
    Steve,

    I am also seeing similar issues. I am not clear how the secondary
    namenode helps here.
    AFAIK the secondary namenode checkpoints and saves namenode snapshots
    periodically, and the namenode does not check with the secondary
    namenode for any data inconsistencies.

    Best
    Bhupesh
  • Allen Wittenauer at Jun 12, 2010 at 12:57 am
    (removing hadoop-user@lucene)
    On Jun 11, 2010, at 5:08 PM, Bhupesh Bansal wrote:

    > I am also seeing similar issues. I am not clear how the secondary
    > namenode helps here.
    > AFAIK the secondary namenode checkpoints and saves namenode snapshots
    > periodically, and the namenode does not check with the secondary
    > namenode for any data inconsistencies.

    You can copy the checkpoint over to the primary. This is better than no backup at all. :)
  • Bhupesh Bansal at Jun 12, 2010 at 1:04 am
    Allen,

    How are you doing? I heard you are finally moving away from Solaris to
    Linux :) Hope things are going well for you!

    I think I found the source of my problems. On Amazon EC2, when I start
    my cluster (1 namenode, 16 datanodes), the datanodes are not able to talk
    to the namenode at all (I tried telnet from a datanode to the namenode).
    It gets fixed progressively and magically over about 30-40 minutes, until
    all of them are able to talk, hence safemode taking 40 minutes.

    We are running a secondary namenode and do regular scps to safeguard the
    data.

    Best
    Bhupesh


  • Allen Wittenauer at Jun 12, 2010 at 1:14 am

    On Jun 11, 2010, at 6:04 PM, Bhupesh Bansal wrote:

    > How are you doing? I heard you are finally moving away from Solaris to
    > Linux :) Hope things are going well for you!

    HP apparently doesn't want us to eval their hardware (at least, judging by their non-response), so at this rate we aren't. :( Maybe they are afraid I'll make it break. ;) [I'll likely stick to Solaris on the NN and JT due to much more sane large page support. That really needs to get fixed in the Linux kernel.]

    > I think I found the source of my problems. On Amazon EC2, when I start
    > my cluster (1 namenode, 16 datanodes), the datanodes are not able to talk
    > to the namenode at all (I tried telnet from a datanode to the namenode).
    > It gets fixed progressively and magically over about 30-40 minutes, until
    > all of them are able to talk, hence safemode taking 40 minutes.

    Oh, weird. I have no practical experience with EC2, so I can't really offer any guidance. Tom or someone else might be able to, though.
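
The datanode-to-namenode connectivity problem described in this thread can be checked from a script instead of by hand with telnet. Below is a minimal sketch, assuming bash is installed (for its `/dev/tcp` pseudo-device) along with the coreutils `timeout` command. `port_open` and `wait_for_port` are hypothetical helper names, and `NAMENODE_HOST` in the usage comment is a placeholder. Port 8020 is the port shown in the first log message of the original post.

```shell
#!/bin/bash
# port_open HOST PORT
# Returns 0 if a TCP connection to HOST:PORT succeeds within 3 seconds.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# wait_for_port HOST PORT [ATTEMPTS] [DELAY_SECONDS]
# Retries port_open until it succeeds or ATTEMPTS (default 60) run out,
# sleeping DELAY_SECONDS (default 10) between attempts.
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-60}" delay="${4:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if port_open "$host" "$port"; then
      echo "$host:$port reachable after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "$host:$port still unreachable after $attempts attempt(s)" >&2
  return 1
}

# Usage (on a datanode, before starting the datanode daemon):
#   wait_for_port NAMENODE_HOST 8020
```

Logging the attempt count on each datanode would show whether the delay is in the network path, as suspected, and not in the namenode itself.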

Discussion Overview
group: common-user @
category: hadoop
posted: May 7, '10 at 10:17p
active: Jun 12, '10 at 1:14a
posts: 6
users: 4
website: hadoop.apache.org...
irc: #hadoop
