Hi all,

Our 5.0.37 system was running pretty well when I noticed the following in
the ndb node error log:

Time: Wednesday 5 December 2007 - 00:32:43
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the
machine running this node (Internal error, programming error or missing
error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 20177
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.7
Version: Version 5.0.37
***EOM***


Could someone provide some insight into what caused this? We have auto
restart configured, and both nodes appear to be running fine apart from the
temporary outage. Should I be concerned?

Thanks,
Yong.



Yong Lee
Developer
ylee@eqo.com
http://www.eqo.com/
direct: +1.604.273.8173 x113
mobile: +1.604.418.4470
fax: +1.604.273.8172
web: www.EQO.com <http://www.eqo.com/>
EQO ID: yonglee

  • Stewart Smith at Dec 7, 2007 at 1:45 am

    On Thu, 2007-12-06 at 00:22 -0800, Yong Lee wrote:
    > Message: WatchDog terminate, internal error or massive overload on the
    > machine running this node (Internal error, programming error or missing
    > error message, please report a bug)
    > Could someone provide some insight as to what happened to create this ? We
    > have auto restart configured and it looks like both nodes are running fine
    > with just the temp outage. Should I be concerned ?

    Do you have monitoring on the machine? Perhaps you can go and check
    things like load average, disk usage, etc.

    Possibly something hogged resources for a while.
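
A minimal sketch of the kind of check suggested above, assuming a Linux host and only Python's standard library. The names `sample_host` and `collect`, the one-second interval, and the `/` path are illustrative choices, not part of MySQL Cluster. It records load average and disk usage at short intervals, so a brief stall is less likely to fall between samples.

```python
import os
import shutil
import time


def sample_host(path="/"):
    """Return one snapshot of load average and disk usage for `path`."""
    load1, load5, load15 = os.getloadavg()   # available on Unix-like systems
    usage = shutil.disk_usage(path)          # total, used, free (bytes)
    return {
        "time": time.time(),
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "disk_used_pct": 100.0 * usage.used / usage.total,
    }


def collect(count, interval_s=1.0, path="/"):
    """Take `count` snapshots, sleeping `interval_s` seconds between them."""
    samples = []
    for i in range(count):
        samples.append(sample_host(path))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

Calling `collect(60, 1.0, "/var/lib/mysql-cluster")` would, under these assumptions, cover one minute around a suspected stall at one-second resolution.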
    --
    Stewart Smith, Senior Software Engineer (MySQL Cluster)
    MySQL AB, www.mysql.com
    Office: +14082136540 Ext: 6616
    VoIP: 6616@sip.us.mysql.com
    Mobile: +61 4 3 8844 332
  • Yong Lee at Dec 7, 2007 at 5:09 pm
    Thanks for the reply Stewart.

    If it is a resource contention issue it would be nice if the error message
    could indicate what resource it was having difficulty with :)

    Yong.


  • Stewart Smith at Dec 8, 2007 at 10:33 am

    On Fri, 2007-12-07 at 09:08 -0800, Yong Lee wrote:
    > If it is a resource contention issue it would be nice if the error message
    > could indicate what resource it was having difficulty with :)

    Most likely it is CPU, but beyond that we can't really tell.

  • Yong Lee at Dec 10, 2007 at 8:38 pm
    Hi all,

    I am seeing this occur quite regularly on both of our ndb nodes. Any help
    in addressing it before an outage hits both nodes at once and takes down
    my system would be greatly appreciated.

    I started capturing top output at one-minute intervals to get a sense of
    the CPU and memory usage around the outage. We have a dual-core system
    with 8 GB of memory running a mysql node as well as an ndb node on the
    same machine.

    The virtual size of the ndb node is around 3GB.

    I suspect that my problem may be due to my config parameters but I really
    don't know what it could be.


    ======================================================================================
    The results of my top monitoring are as follows (the outage is around 10:16):

    top - 10:14:36 up 58 days, 1:15, 2 users, load average: 0.23, 0.20, 0.18
    Tasks: 60 total, 1 running, 59 sleeping, 0 stopped, 0 zombie
    Cpu(s): 2.7% us, 1.3% sy, 0.0% ni, 93.7% id, 2.2% wa, 0.1% hi, 0.0% si
    Mem: 8310552k total, 8056736k used, 253816k free, 53512k buffers
    Swap: 4192956k total, 168k used, 4192788k free, 5684940k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    28511 mysql 15 0 3047m 1.9g 1988 S 6 24.1 65:37.02 ndbd
    19123 mysql 15 0 199m 161m 3892 S 1 2.0 24:16.99 mysqld
    20703 mysql 16 0 18284 1936 508 S 0 0.0 0:00.00 ndbd


    top - 10:15:36 up 58 days, 1:16, 2 users, load average: 0.18, 0.19, 0.18
    Tasks: 60 total, 1 running, 59 sleeping, 0 stopped, 0 zombie
    Cpu(s): 3.1% us, 1.5% sy, 0.0% ni, 94.0% id, 1.4% wa, 0.1% hi, 0.0% si
    Mem: 8310552k total, 8133936k used, 176616k free, 56672k buffers
    Swap: 4192956k total, 168k used, 4192788k free, 5759520k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    28511 mysql 15 0 3047m 1.9g 1988 S 8 24.1 65:41.68 ndbd
    19123 mysql 16 0 199m 161m 3892 S 1 2.0 24:17.37 mysqld
    20703 mysql 16 0 18284 1936 508 S 0 0.0 0:00.00 ndbd


    top - 10:16:36 up 58 days, 1:17, 2 users, load average: 0.71, 0.35, 0.23
    Tasks: 60 total, 1 running, 59 sleeping, 0 stopped, 0 zombie
    Cpu(s): 2.1% us, 1.4% sy, 0.0% ni, 84.6% id, 11.8% wa, 0.1% hi, 0.0% si
    Mem: 8310552k total, 6777128k used, 1533424k free, 58704k buffers
    Swap: 4192956k total, 168k used, 4192788k free, 5917388k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    25725 mysql 15 0 3026m 476m 1004 S 1 5.9 0:00.44 ndbd
    19123 mysql 15 0 199m 161m 3892 S 1 2.0 24:17.78 mysqld
    20703 mysql 16 0 18284 1940 508 S 0 0.0 0:00.00 ndbd
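
The three snapshots above can be compared more easily by pulling the numbers out of the `Cpu(s):` summary lines. The sketch below is illustrative only; `parse_cpu_line` is a hypothetical helper, and it assumes the `top` output format shown in this post. In these samples, I/O wait rises from 2.2% and 1.4% to 11.8% in the 10:16:36 snapshot while the CPU stays mostly idle. The main ndbd PID also changes from 28511 to 25725, so that last snapshot already shows the restarted process.

```python
import re

# Matches pairs such as "11.8% wa" in a top "Cpu(s):" summary line.
_FIELD = re.compile(r"([\d.]+)%\s*(\w+)")


def parse_cpu_line(line):
    """Turn a top 'Cpu(s):' line into a dict such as {'us': 2.1, 'wa': 11.8}."""
    if not line.lstrip().startswith("Cpu(s):"):
        raise ValueError("not a Cpu(s) summary line")
    return {name: float(value) for value, name in _FIELD.findall(line)}


# Summary lines copied from the three snapshots in this post.
snapshots = {
    "10:14:36": "Cpu(s): 2.7% us, 1.3% sy, 0.0% ni, 93.7% id, 2.2% wa, 0.1% hi, 0.0% si",
    "10:15:36": "Cpu(s): 3.1% us, 1.5% sy, 0.0% ni, 94.0% id, 1.4% wa, 0.1% hi, 0.0% si",
    "10:16:36": "Cpu(s): 2.1% us, 1.4% sy, 0.0% ni, 84.6% id, 11.8% wa, 0.1% hi, 0.0% si",
}

for stamp, line in snapshots.items():
    cpu = parse_cpu_line(line)
    print(stamp, "iowait:", cpu["wa"], "idle:", cpu["id"])
```

One-minute samples are coarse, so a stall lasting only a few seconds may not be visible in them at all.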

    ======================================================================================
    In the mysql cluster log for the ndb node, I see:

    Time: Monday 10 December 2007 - 10:16:27
    Status: Temporary error, restart node
    Message: WatchDog terminate, internal error or massive overload on the
    machine running this node (Internal error, programming error or missing
    error message, please report a bug)
    Error: 6050
    Error data: Job Handling
    Error object: WatchDog.cpp
    Program: /usr/sbin/ndbd
    Pid: 28511
    Trace: /var/lib/mysql-cluster/ndb_3_trace.log.19
    Version: Version 5.0.37

    ======================================================================================
    The cluster log looks like this. Note that we run scheduled backups every 15
    minutes. I thought this might be the problem, but I am also seeing outages
    that occur before a backup is due.

    2007-12-10 10:01:38 [MgmSrvr] INFO -- Node 4: Local checkpoint 10802 started. Keep GCI = 11084379 oldest restorable GCI = 11084507
    2007-12-10 10:11:35 [MgmSrvr] INFO -- Node 4: Local checkpoint 10803 started. Keep GCI = 11084676 oldest restorable GCI = 11084790
    2007-12-10 10:15:53 [MgmSrvr] INFO -- Node 4: Backup 23724 started from node 1
    2007-12-10 10:16:12 [MgmSrvr] WARNING -- Node 4: Node 3 missed heartbeat 2
    2007-12-10 10:16:13 [MgmSrvr] WARNING -- Node 4: Node 3 missed heartbeat 3
    2007-12-10 10:16:13 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
    2007-12-10 10:16:13 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
    2007-12-10 10:16:15 [MgmSrvr] WARNING -- Node 4: Node 3 missed heartbeat 4
    2007-12-10 10:16:15 [MgmSrvr] ALERT -- Node 4: Node 3 declared dead due to missed heartbeat
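
To check whether failures line up with backups, the cluster log can be turned into timestamped events. This is a rough sketch that assumes the line layout shown in the excerpt above; `parse_cluster_log` and the regular expression are illustrative, not an official log parser. For this one incident it measures the gap between the backup starting and node 3 being declared dead. The post notes that other outages happened before a backup was due, so this shows proximity only, not cause.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: "<date> <time> [MgmSrvr] LEVEL -- Node N: message"
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\]\s+(\w+)\s+-- Node (\d+): (.*)$"
)


def parse_cluster_log(lines):
    """Yield (timestamp, level, reporting_node, message) for each matching line."""
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            yield ts, m.group(2), int(m.group(3)), m.group(4)


# Lines copied from the excerpt in this post.
log = [
    "2007-12-10 10:15:53 [MgmSrvr] INFO -- Node 4: Backup 23724 started from node 1",
    "2007-12-10 10:16:12 [MgmSrvr] WARNING -- Node 4: Node 3 missed heartbeat 2",
    "2007-12-10 10:16:13 [MgmSrvr] WARNING -- Node 4: Node 3 missed heartbeat 3",
    "2007-12-10 10:16:15 [MgmSrvr] WARNING -- Node 4: Node 3 missed heartbeat 4",
    "2007-12-10 10:16:15 [MgmSrvr] ALERT -- Node 4: Node 3 declared dead due to missed heartbeat",
]

events = list(parse_cluster_log(log))
backup = next(ts for ts, _, _, msg in events if msg.startswith("Backup"))
dead = next(ts for ts, _, _, msg in events if "declared dead" in msg)
gap_seconds = (dead - backup).total_seconds()
print("seconds from backup start to node declared dead:", gap_seconds)
```

Running the same comparison over every failure in the full log would show how many of them actually fall close to a backup or a local checkpoint.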

    ======================================================================================
    My config.ini parameters look like this:

    [NDBD DEFAULT]
    NoOfReplicas=2
    datadir=/var/lib/mysql-cluster
    DataMemory=2400M
    IndexMemory=150M
    MaxNoOfAttributes=2000
    MaxNoOfTables=128
    MaxNoOfOrderedIndexes=400
    MaxNoOfUniqueHashIndexes=400
    MaxNoOfTriggers=1500
    NoOfDiskPagesToDiskAfterRestartTUP=94
    NoOfDiskPagesToDiskAfterRestartACC=13
    NoOfDiskPagesToDiskDuringRestartTUP=94
    NoOfDiskPagesToDiskDuringRestartACC=13
    NoOfFragmentLogFiles=9
    MaxNoOfConcurrentTransactions=10000
    MaxNoOfConcurrentOperations=100000
    StopOnError=0
    LogLevelStartup=15
    LogLevelNodeRestart=15
    LogLevelError=15
    LogLevelShutdown=15

    StartFailureTimeout=600000
    TransactionInactiveTimeout=5000
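
As a rough aid for reading the configuration above, the sketch below works out two numbers from it. It assumes, as I recall from the MySQL 5.0 reference manual, that the `NoOfDiskPagesToDisk*` settings are expressed in 8KB pages per 100 milliseconds; treat that unit as an assumption to verify against the manual for your version. The helper name `checkpoint_kb_per_sec` is made up for illustration.

```python
PAGE_KB = 8          # assumed page size for the NoOfDiskPagesToDisk* settings
TICKS_PER_SEC = 10   # assumed unit: pages per 100 ms, so 10 ticks per second

# Values copied from the config.ini in this post.
config = {
    "DataMemory_MB": 2400,
    "IndexMemory_MB": 150,
    "NoOfDiskPagesToDiskAfterRestartTUP": 94,
    "NoOfDiskPagesToDiskAfterRestartACC": 13,
}


def checkpoint_kb_per_sec(pages_per_100ms):
    """Convert a pages-per-100ms setting into KB written per second."""
    return pages_per_100ms * PAGE_KB * TICKS_PER_SEC


# Memory the data node is configured to reserve for data and indexes.
memory_mb = config["DataMemory_MB"] + config["IndexMemory_MB"]

tup_kb_s = checkpoint_kb_per_sec(config["NoOfDiskPagesToDiskAfterRestartTUP"])
acc_kb_s = checkpoint_kb_per_sec(config["NoOfDiskPagesToDiskAfterRestartACC"])

print("DataMemory + IndexMemory:", memory_mb, "MB")
print("checkpoint write rate, TUP:", tup_kb_s, "KB/s")
print("checkpoint write rate, ACC:", acc_kb_s, "KB/s")
```

Under those assumptions, DataMemory plus IndexMemory accounts for 2550 MB of the roughly 3047m virtual size that top reports for ndbd, and local checkpoints write data pages at about 7.5 MB/s. Whether that write rate is enough to starve the node on this hardware is not something the figures alone can show.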



