Hi,

I would appreciate it if anyone could help. When one of the data nodes goes
offline for some reason, such as a memory crash or hardware maintenance, the
roles on that node never return to good health after the node is brought back
online.

Cloudera Manager always complains: "This role's host has been out of contact
with Cloudera Manager for too long." <http://nn-01-sc.nim.com:7180/cmf/services/31/instances/220/advicePopup?timestamp=1392242468642&currentMode=true&healthTestName=JOURNAL_NODE_SCM_HEALTH>


What should I do to bring it back into the pool?


  • Darren Lo at Feb 12, 2014 at 10:12 pm
    Hi Rex,

    Is the Cloudera Manager agent running on that host?
    service cloudera-scm-agent status

    If not, then start it:
    service cloudera-scm-agent start
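
    A one-liner sketch of the same check, assuming the init script exits
    non-zero when the agent is stopped (so the start only runs if needed):
    service cloudera-scm-agent status || service cloudera-scm-agent start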

    Thanks,
    Darren

  • Rex Zhen at Feb 12, 2014 at 10:26 pm
    Hi Darren,

    Thank you for your quick reply.

    # service cloudera-scm-agent status
    cloudera-scm-agent (pid 3433) is running...

    Rex Zhen

  • Darren Lo at Feb 12, 2014 at 10:24 pm
    Hi Rex,

    Do the agent logs (in /var/log/cloudera-scm-agent/) indicate any errors
    connecting to the CM server? Make sure you're checking while ssh'd into the
    host with the problem.
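
    For example, a quick way to scan the agent log for recent errors (a minimal
    sketch; the exact file name may differ on your installation):
    tail -n 200 /var/log/cloudera-scm-agent/cloudera-scm-agent.log | grep -i error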

    Thanks,
    Darren

  • Rex Zhen at Feb 12, 2014 at 10:32 pm
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:08 +0000] 12917 MainThread
    agent ERROR Failed to contact supervisor after 635 attempts.
    Agent will exit.
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:08 +0000] 12917 MainThread
    agent ERROR Caught unexpected exception in main loop.
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:22 +0000] 12917
    Monitor-RegionServerMonitor abstract_monitor ERROR Error fetching
    metrics at 'http://dn-03.nim.com:60030/metrics?format=json'
    ./cloudera-scm-agent.log: raise URLError(err)
    ./cloudera-scm-agent.log:URLError: <urlopen error [Errno 111] Connection
    refused>
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:23 +0000] 12917 MainThread
    agent ERROR Could not contact supervisor.
    ./cloudera-scm-agent.log: raise error, msg
    ./cloudera-scm-agent.log:error: [Errno 111] Connection refused
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:23 +0000] 12917 MainThread
    agent ERROR Failed to contact supervisor after 636 attempts.
    Agent will exit.
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:23 +0000] 12917 MainThread
    agent ERROR Caught unexpected exception in main loop.
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:32 +0000] 12917
    MonitorDaemon-Reporter proc_metrics_utils ERROR Failed to read file
    descriptor max for process 14391: [Errno 2] No such file or directory:
    '/proc/14391/limits'
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:32 +0000] 12917
    MonitorDaemon-Reporter proc_metrics_utils ERROR Failed to get file
    descriptor count for process 14391: [Errno 2] No such file or directory:
    '/proc/14391/fd/'
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:32 +0000] 12917
    MonitorDaemon-Reporter proc_metrics_utils ERROR Failed to get process
    metrics 14391: no process found with pid 14391
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:38 +0000] 12917 MainThread
    agent ERROR Could not contact supervisor.
    ./cloudera-scm-agent.log: raise error, msg
    ./cloudera-scm-agent.log:error: [Errno 111] Connection refused
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:38 +0000] 12917 MainThread
    agent ERROR Failed to contact supervisor after 637 attempts.
    Agent will exit.
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:38 +0000] 12917 MainThread
    agent ERROR Caught unexpected exception in main loop.
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:53 +0000] 12917 MainThread
    agent ERROR Could not contact supervisor.
    ./cloudera-scm-agent.log: raise error, msg
    ./cloudera-scm-agent.log:error: [Errno 111] Connection refused
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:53 +0000] 12917 MainThread
    agent ERROR Failed to contact supervisor after 638 attempts.
    Agent will exit.
    ./cloudera-scm-agent.log:[12/Feb/2014 12:58:53 +0000] 12917 MainThread
    agent ERROR Caught unexpected exception in main loop.
    ./cloudera-scm-agent.log:[12/Feb/2014 12:59:08 +0000] 12917 MainThread
    agent ERROR Could not contact supervisor.
    ./cloudera-scm-agent.log: raise error, msg
    ./cloudera-scm-agent.log:error: [Errno 111] Connection refused
    ./cloudera-scm-agent.log:[12/Feb/2014 12:59:08 +0000] 12917 MainThread
    agent ERROR Failed to contact supervisor after 639 attempts.
    Agent will exit.
    ./cloudera-scm-agent.log:[12/Feb/2014 12:59:08 +0000] 12917 MainThread
    agent ERROR Caught unexpected exception in main loop.
    ./cloudera-scm-agent.log:[12/Feb/2014 14:09:13 +0000] 3433 MainThread
    parcel ERROR Failed to deactivate system symlinks for parcel
    CDH-4.2.0-1.cdh4.2.0.p0.10: 1
    ./cloudera-scm-agent.log:[12/Feb/2014 14:09:13 +0000] 3433 MainThread
    parcel ERROR Failed to deactivate system symlinks for parcel
    CDH-4.2.1-1.cdh4.2.1.p0.5: 1
    ./cloudera-scm-agent.log:[12/Feb/2014 14:45:54 +0000] 3433 MainThread
    abstract_monitor ERROR JournalNodeMonitor for None unable to find
    appropriate process to monitor for pid 11552.
    ./cloudera-scm-agent.out:Error: could not find config file
    /var/run/cloudera-scm-agent/supervisor/supervisord.conf
    ./cmf_listener.log:ERROR:__main__:Error in supervisord listener loop.
    ./cmf_listener.log:OSError: [Errno 32] Broken pipe
    ./cmf_listener.log:ERROR:__main__:Error in supervisord listener loop.
    ./cmf_listener.log:OSError: [Errno 32] Broken pipe
    ./cmf_listener.log:ERROR:__main__:Error in supervisord listener loop.
    ./cmf_listener.log:OSError: [Errno 32] Broken pipe
    ./cmf_listener.log:ERROR:__main__:Error in supervisord listener loop.
    ./cmf_listener.log:OSError: [Errno 32] Broken pipe
    ./cmf_listener.log:ERROR:__main__:Error in supervisord listener loop.
    ./cmf_listener.log:OSError: [Errno 32] Broken pipe
    ./cmf_listener.log:ERROR:__main__:Error in supervisord listener loop.
    ./cmf_listener.log:OSError: [Errno 32] Broken pipe
  • Adar Dembo at Feb 12, 2014 at 10:47 pm
    Rex,

    Are you running CM 4.6.x? There's a bug (fixed in CM 4.7) where the agent
    doesn't properly exit if it can't contact the supervisor. It looks like
    you've been hit by that bug. I think the net result is that you have two
    agents running.

    To fix this, kill all the agents and all the supervisors on the machine.
    Kill the agents with "kill -9", as that'll avoid provoking the above bug.
    You should try killing the supervisors with a regular SIGTERM (not "-9") so
    that any managed processes are shut down properly as well.

    Once you're done, use "service cloudera-scm-agent start" to restart the
    agent.
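
    For example (a minimal sketch; the process names and placeholder PIDs below
    are assumptions, so verify them with ps on your machine first):

    # Find the agent and supervisor processes
    ps -ef | grep cloudera-scm-agent
    ps -ef | grep supervisord

    # SIGKILL the agent processes to avoid the buggy shutdown path
    kill -9 <agent-pid>

    # Plain SIGTERM for the supervisor so its managed processes stop cleanly
    kill <supervisord-pid>

    # Then restart the agent
    service cloudera-scm-agent start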

  • Rex Zhen at Feb 12, 2014 at 10:57 pm
    When I checked the Parcels, the activated CDH is CDH 4.3.0-1.cdh4.3.0.p0.22.
    Does it have the same bug as 4.6.x?

  • Adar Dembo at Feb 12, 2014 at 11:00 pm
    That's the CDH version, not the CM version.

    The easiest way to check the CM version is to look at the version of the
    cloudera-manager-agent package (via rpm or dpkg, depending on your distro).
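
    For example (depending on your package manager):

    # RHEL/CentOS
    rpm -q cloudera-manager-agent

    # Debian/Ubuntu
    dpkg -l cloudera-manager-agent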

  • Rex Zhen at Feb 13, 2014 at 1:31 am
    Yeah, you are right: the CM version is 4.6.3.

    I cleaned up the processes and restarted the agent. Most of the services are
    up except the TaskTracker.

    Here is the error from the TaskTracker log:

    2014-02-12 16:54:44,734 ERROR org.apache.hadoop.mapred.JettyBugMonitor:
    Jetty bug monitor failed
    2014-02-12 16:59:05,418 ERROR org.apache.hadoop.mapred.TaskTracker:
    RECEIVED SIGNAL 15: SIGTERM

    The TaskTracker keeps restarting itself.

  • Adar Dembo at Feb 13, 2014 at 1:44 am
    Hmm, never seen that one before. Maybe try e-mailing the cdh-user list?

  • Philip Zeyliger at Feb 13, 2014 at 4:28 am
    "Jetty bug monitor" jogged my memory. See
    https://github.com/cloudera/hadoop-common/blob/cdh5-2.2.0_5.0.0b2/hadoop-mapreduce1-project/src/mapred/org/apache/hadoop/mapred/JettyBugMonitor.java.
    It was introduced in
    https://issues.apache.org/jira/browse/MAPREDUCE-3184. In short, a bug in
    how the JVM and Jetty interact causes things to go haywire, and the TT
    kills itself. One dead TT is better than one slow TT, because a slow TT
    would slow everyone down disproportionately. CM can be configured to
    restart the TT after it dies.

    It seems a bit unusual that this happens in a loop for your TT. I'd be
    interested in your JVM and CDH combination, but at that point, as Adar
    suggests, the cdh-user@ list is a better bet. There are folks on there who
    know many more details about this.
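
    A quick way to gather that JVM and CDH combination, assuming the java and
    hadoop binaries are on the PATH of the TaskTracker host:

    java -version
    hadoop version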

    -- Philip



