My hadoop jobs don't start
This is configured to use an existing DFS and to unpack a tarball with a
cut-down 0.16.0 config.
I have looked in the pbs_mom logs on the client machines and am not getting
anything meaningful.


The hadoop ports are offset by 1000 to allow another cluster, running an
older version of hadoop, to run on these machines.
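
(For illustration only: a minimal sketch of what such an offset could look
like in hadoop-site.xml. The hostname and base ports below are hypothetical,
not taken from this cluster.)

    <property>
      <name>fs.default.name</name>
      <value>hdimg01:10000</value>   <!-- e.g. the usual 9000, offset by 1000 -->
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>hdimg01:10001</value>   <!-- e.g. the usual 9001, offset by 1000 -->
    </property>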

Using Python: 2.5.1 (r251:54863, Feb 24 2008, 12:00:38)
[GCC 4.1.0 20060304 (Red Hat 4.1.0-3)]

[2008-02-25 21:56:38,611] DEBUG/10 hod:144 - ('hdimg01', 63059)
[2008-02-25 21:56:38,612] INFO/20 hod:216 - Service Registry Started.
[2008-02-25 21:56:38,615] DEBUG/10 hadoop:425 - allocate /tmp/hod 27 27
[2008-02-25 21:56:38,618] DEBUG/10 torque:72 - ringmaster cmd:
/data1/hadoop-0.16.0-dfs/contrib/hod/bin/ringmaster
--hodring.tarball-retry-initial-time 1.0
--hodring.cmd-retry-initial-time 2.0 --$
[2008-02-25 21:56:38,620] DEBUG/10 torque:44 - qsub -> /usr/bin/qsub -l nodes=27 -W x= -l nodes=27 -W x= -N "HOD" -r n -d /tmp/ -q batch
[2008-02-25 21:56:38,822] DEBUG/10 torque:54 - qsub stdin: #!/bin/sh
[2008-02-25 21:56:38,823] DEBUG/10 torque:54 - qsub stdin:
/data1/hadoop-0.16.0-dfs/contrib/hod/bin/ringmaster
--hodring.tarball-retry-initial-time 1.0
--hodring.cmd-retry-initial-time 2.0 --hodr$
[2008-02-25 21:56:38,835] DEBUG/10 torque:76 - qsub jobid: 13.hdimg01
[2008-02-25 21:56:38,837] DEBUG/10 torque:87 - /usr/bin/qstat -f -1
13.hdimg01
[2008-02-25 21:56:39,362] DEBUG/10 torque:87 - /usr/bin/qstat -f -1
13.hdimg01
[2008-02-25 21:56:39,390] INFO/20 hadoop:447 - Hod Job successfully
submitted. JobId : 13.hdimg01.
[2008-02-25 21:56:49,438] DEBUG/10 torque:87 - /usr/bin/qstat -f -1
13.hdimg01
[2008-02-25 21:56:49,463] ERROR/40 torque:96 - qstat error: exit code:
153 | signal: False | core False
[2008-02-25 21:56:49,464] INFO/20 hadoop:451 - Ringmaster at : None.
[2008-02-25 21:56:49,465] INFO/20 hadoop:530 - Cleaning up job id
13.hdimg01, as cluster could not be allocated.
[2008-02-25 21:56:49,467] DEBUG/10 torque:131 - /usr/bin/qdel 13.hdimg01
[2008-02-25 21:56:49,490] CRITICAL/50 hod:253 - Cannot allocate cluster
/tmp/hod
[2008-02-25 21:56:50,434] DEBUG/10 hod:391 - return code: 6


--
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested


  • Hemanth Yamijala at Feb 26, 2008 at 8:07 am

    Jason Venner wrote:
    My hadoop jobs don't start
    This is configured to use an existing DFS and to unpack a tarball with
    a cut-down 0.16.0 config.
    I have looked in the pbs_mom logs on the client machines and am not
    getting anything meaningful.
    What is your hod command line? Specifically, how did you provide the
    tarball option?
    Can you attach the log of the hod command, as you did the hodrc? There
    are some lines in the output that don't seem complete.
    Set your debug option in the [ringmaster] section to 4, and rerun hod.
    Under the log-dir specified in the [ringmaster] section you will be able
    to see a log file corresponding to your jobid. Can you attach that too?
    The ringmaster node is the first one allocated by torque for the job,
    that is, the mother superior for the job.
    How is your tarball built? Can you check that there's no hadoop-env.sh
    with pre-filled values in it. Look at HADOOP-2860.

    Thanks
    Hemanth
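
    (For illustration, a minimal sketch of the things asked about above. The
    allocate syntax follows the HOD user guide; the paths, node count, and
    log directory here are hypothetical.)

        # hod command line: -t names the hadoop tarball HOD should unpack
        hod allocate -d /tmp/hod -n 27 -t /path/to/hadoop-0.16.0.tar.gz

        # hodrc fragment: raise ringmaster verbosity; its per-jobid log
        # files land under log-dir
        [ringmaster]
        debug   = 4
        log-dir = /var/log/hod

        # per HADOOP-2860, also inspect the tarball for a hadoop-env.sh
        # shipped with pre-filled values
        tar tzf /path/to/hadoop-0.16.0.tar.gz | grep hadoop-env.sh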
  • Jason Venner at Feb 26, 2008 at 3:34 pm
    Well, this finally started working after we learned how to debug it.

    There were two issues. First, the torque scp command was passing 3
    arguments instead of 2, and this was causing the error logs to get eaten.

    Second, on our master node the dfs hod is installed in a different place
    than on the child nodes, with a symlink placed at the 'standard
    location'. HOD/torque was forwarding the real location instead of the
    configured location.

    To find out that the scp was failing, we had to raise the debug level on
    the pbs_mom daemons by sending SIGUSR1 signals to them (4 seemed
    sufficient), then look in /var/log/messages for the failure reports, as
    sketched below.
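
    (A minimal sketch of that step, assuming stock Torque behaviour where
    each SIGUSR1 raises pbs_mom's log level by one.)

        # run as root on each compute node: four SIGUSR1s -> loglevel 4
        for i in 1 2 3 4; do kill -USR1 "$(pgrep -x pbs_mom)"; done
        # then watch syslog for the scp failure reports
        tail -f /var/log/messages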

    For the short term, we just made symlinks on the child nodes so that the
    path where the virtual cluster expected to find the dfs configuration
    resolves to the real install.
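
    (Roughly, assuming the forwarded path is the /data1 one from the log and
    /opt/hadoop-0.16.0-dfs stands in for a child node's real install path.)

        # on each child node: make the path the master forwarded resolve
        # to the node's actual hadoop dfs install
        ln -s /opt/hadoop-0.16.0-dfs /data1/hadoop-0.16.0-dfs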



