Hello all,

I have a problem diagnosing a "CRS-1615 voting device hang" error.

OS: OEL 5.6 (64-bit)
DB: Oracle 11.1.0.7
Storage: IBM XIV
HW: Blade server 8 cores with Hyper-Threading enabled (64 GB RAM).

This is an active/passive failover cluster with OCFS2 used as the cluster file
system.
The OCR and voting disks are placed on OCFS2 mounts.

The problem occurs during an "impdp" import of a 700 GB database from NFS.
The first part of the import, loading the data, completes without any problem.

During index creation, after an hour or two, the node restarts due to a problem
with the availability of the voting disks:
"[cssd(28868)]CRS-1615:voting device hang at 50% fatal, termination in 99620 ms, disk (0//ocfs2/voting1/votingdisk)"

Impdp parameter file:
--
userid=xxx/yyy
directory=export_dir
dumpfile=exportdb_%U.dmp
logfile=import.log
parallel=16

exclude=statistics
schemas=(...list of schemas...)
exclude=DB_LINK

--

The database is in noarchivelog mode, with "memory_target=7621050368" specified.
I've noticed that server utilization is not significant during the import.

As this error is reproducible, I'm trying to find out how to efficiently diagnose
the problem and trace its cause.
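
One thing worth trying (a rough sketch; the voting-disk path is taken from the CRS-1615 message above, and iflag=direct support in dd should be verified on this platform) is to sample read latency against the voting file while the import runs:

# Time a small direct read of the voting file every 5 seconds.
# A sudden jump in elapsed time points at the I/O path rather than at CSS itself.
while true; do
  date '+%H:%M:%S'
  # iflag=direct bypasses the page cache so the real device path is measured
  time dd if=/ocfs2/voting1/votingdisk of=/dev/null bs=512 count=1 iflag=direct
  sleep 5
done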

If you have any suggestions, I would appreciate the help.

Regards,
Marko

--
Marko Sutic, dipl.ing.rač.
My LinkedIn Profile <http://hr.linkedin.com/in/markosutic>

--
http://www.freelists.org/webpage/oracle-l


  • D'Hooge Freek at Aug 25, 2011 at 7:56 am
    Marco,

    Did you notice high I/O wait during the import?
    Do you see the same error message on the other node (on which the import is not running)?

    Freek D'Hooge
    Uptime
    Oracle Database Administrator
    email: freek.dhooge_at_uptime.be
    tel +32(0)3 451 23 82
    http://www.uptime.be
    disclaimer: www.uptime.be/disclaimer

  • Marko Sutic at Aug 25, 2011 at 8:51 am
    Error messages from the other node:

    2011-08-25 10:38:33.563
    [cssd(18117)]CRS-1612:node l01ora3 (1) at 50% heartbeat fatal, eviction in 14.000 seconds
    2011-08-25 10:38:40.558
    [cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in 7.010 seconds
    2011-08-25 10:38:41.560
    [cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in 6.010 seconds
    2011-08-25 10:38:45.558
    [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 2.010 seconds
    2011-08-25 10:38:46.560
    [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 1.010 seconds
    2011-08-25 10:38:47.562
    [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 0.010 seconds
    2011-08-25 10:38:47.574
    [cssd(18117)]CRS-1607:CSSD evicting node l01ora3. Details in /u01/app/crs/log/l01ora4/cssd/ocssd.log.
    2011-08-25 10:39:01.579
    [cssd(18117)]CRS-1601:CSSD Reconfiguration complete. Active nodes are l01ora4 .

    Regards,
    Marko
    On Thu, Aug 25, 2011 at 10:48 AM, Marko Sutic wrote:

    Statistics during import:

    [root@l01ora3 ~]# sar -u 2 15
    Linux 2.6.18-238.el5 (l01ora3.ot.hr) 08/25/2011

    10:36:55 AM CPU %user %nice %system %iowait %steal %idle
    10:36:57 AM all 5.59 0.00 0.06 3.12 0.00 91.22
    10:36:59 AM all 1.84 0.00 0.09 4.43 0.00 93.63
    10:37:01 AM all 1.50 0.00 0.09 4.81 0.00 93.60
    10:37:03 AM all 1.44 0.00 0.09 4.78 0.00 93.69
    10:37:05 AM all 1.47 0.00 0.41 4.87 0.00 93.25
    10:37:07 AM all 1.47 0.00 0.06 4.75 0.00 93.72
    10:37:09 AM all 1.22 0.00 0.09 5.18 0.00 93.51
    10:37:11 AM all 0.22 0.00 0.03 6.15 0.00 93.60
    10:37:13 AM all 0.28 0.00 0.06 8.72 0.00 90.94
    10:37:15 AM all 1.53 0.00 0.19 4.93 0.00 93.35
    10:37:17 AM all 1.47 0.00 0.09 4.72 0.00 93.72
    10:37:19 AM all 6.28 0.00 0.06 0.00 0.00 93.66
    10:37:21 AM all 0.31 0.00 0.03 6.03 0.00 93.63
    10:37:23 AM all 0.00 0.00 0.03 11.31 0.00 88.66
    10:37:25 AM all 0.06 0.00 0.06 12.48 0.00 87.39
    Average: all 1.65 0.00 0.10 5.75 0.00 92.50


    (I've excluded inactive devices from output)
    $ iostat -xd 5 3
    Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
    sda 0.00 21.40 0.00 20.80 0.00 337.60 16.23 3.47 167.02 3.59 7.46
    sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    sda2 0.00 21.40 0.00 20.80 0.00 337.60 16.23 3.47 167.02 3.59 7.46
    sdb 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.17 1.17 0.14
    sdb1 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.17 1.17 0.14
    sdc 0.00 0.00 23.80 0.20 24371.20 6.40 1015.73 0.17 7.28 4.46 10.70
    sdc1 0.00 0.00 23.80 0.20 24371.20 6.40 1015.73 0.17 7.28 4.46 10.70
    sde 0.00 0.00 0.40 0.60 0.80 0.80 1.60 0.00 1.40 1.40 0.14
    sde1 0.00 0.00 0.40 0.60 0.80 0.80 1.60 0.00 1.40 1.40 0.14
    sdo 0.00 0.00 22.80 0.00 23347.20 0.00 1024.00 0.17 7.36 4.46 10.16
    sdo1 0.00 0.00 22.80 0.00 23347.20 0.00 1024.00 0.17 7.36 4.46 10.16
    sdt 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.50 1.50 0.18
    sdt1 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.50 1.50 0.18
    sdu 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 0.67 0.67 0.08
    sdu1 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 0.67 0.67 0.08
    sdx 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.33 1.33 0.16
    sdx1 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.33 1.33 0.16
    sdaa 0.00 0.00 20.40 0.60 20691.20 19.20 986.21 0.14 6.56 4.00 8.40
    sdaa1 0.00 0.00 20.40 0.60 20691.20 19.20 986.21 0.14 6.56 4.00 8.40
    sdai 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 0.67 0.67 0.08
    sdai1 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 0.67 0.67 0.08
    sdam 0.00 0.00 23.60 0.00 24166.40 0.00 1024.00 0.17 7.22 4.45 10.50
    sdam1 0.00 0.00 23.60 0.00 24166.40 0.00 1024.00 0.17 7.22 4.45 10.50
    sday 0.00 0.00 24.00 0.00 24576.00 0.00 1024.00 0.17 7.19 4.43 10.64
    sday1 0.00 0.00 24.00 0.00 24576.00 0.00 1024.00 0.17 7.19 4.43 10.64
    sdaz 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.04
    sdaz1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.04
    sdbb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.04
    sdbb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.04
    sdbi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02
    sdbi1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02
    sdbk 0.00 0.00 24.00 0.00 24576.00 0.00 1024.00 1.16 6.85 41.68 100.02
    sdbk1 0.00 0.00 24.00 0.00 24576.00 0.00 1024.00 1.16 6.85 41.68 100.02
    sdbr 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 100.02
    sdbr1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 100.02
    sdbw 0.00 0.00 24.00 0.00 24576.00 0.00 1024.00 0.16 6.78 4.24 10.18
    sdbw1 0.00 0.00 24.00 0.00 24576.00 0.00 1024.00 0.16 6.78 4.24 10.18
    sdcd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02
    sdcd1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02
    sdci 0.00 0.00 23.60 0.40 23968.00 12.80 999.20 0.17 7.14 4.38 10.52
    sdci1 0.00 0.00 23.60 0.40 23968.00 12.80 999.20 0.17 7.14 4.38 10.52
    sdcm 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.02
    sdcm1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.02
    dm-0 0.00 0.00 0.00 42.20 0.00 337.60 8.00 4.93 116.84 1.77 7.46
    dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02
    dm-3 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.33 1.33 0.16
    dm-4 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 0.67 0.67 0.08
    dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.02
    dm-6 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 0.67 0.67 0.08
    dm-7 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.50 1.50 0.18
    dm-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.02
    dm-9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02
    dm-10 0.00 0.00 0.60 0.80 1.20 1.00 1.57 0.00 1.29 1.29 0.18
    dm-11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02
    dm-12 0.00 0.00 186.40 1.20 190476.80 38.40 1015.54 2.32 7.04 5.33 100.02
    dm-13 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.17 1.17 0.14
    dm-14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.02
    dm-15 0.00 0.00 186.40 1.20 190476.80 38.40 1015.54 2.32 7.04 5.33 100.02
    dm-16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.02
    dm-17 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.17 1.17 0.14
    dm-18 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.33 1.33 0.16
    dm-19 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 1.50 1.50 0.18
    dm-20 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 0.67 0.67 0.08
    dm-21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02
    dm-22 0.00 0.00 0.60 0.60 1.20 0.60 1.50 0.00 0.67 0.67 0.08
    dm-23 0.00 0.00 0.60 0.80 1.20 1.00 1.57 0.00 1.29 1.29 0.18
    dm-24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02
    dm-25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.02


    Hm... disk utilization is 100% for several devices.

    "dm-12" and "dm-12" are devices with database files.


    FC:
    24:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
    24:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)


    Regards,
    Marko


  • D'Hooge Freek at Aug 25, 2011 at 9:08 am
    Marco,

    I don't know the error timings for the other node, but I think the heartbeat fatal messages are coming after the first node has terminated due to the missing voting disk.

    This would indicate that there is no general problem with the voting disk itself, but that the problem is specific to the first node.
    The cause would then be either the connection itself, the load, or an OCFS2 bug.

    Do you know if, at the time of the failure, the other OCFS2 volumes were still accessible?
    Are your voting disks placed on the same LUNs as your database files, or are they on a separate OCFS2 volume?
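
    For the second question, the configured voting disks and the mounts they live on can be listed like this (a sketch; mounted.ocfs2 comes with ocfs2-tools, and the /ocfs2/voting1 mount point is taken from the CRS message):

    # Voting disks as configured in CSS
    crsctl query css votedisk
    # Which filesystem/LUN each voting mount lives on (repeat per voting mount)
    df -hP /ocfs2/voting1
    # All OCFS2 volumes and the devices backing them
    mounted.ocfs2 -d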

    Regards,

    Freek D'Hooge
    Uptime
    Oracle Database Administrator
    email: freek.dhooge_at_uptime.be
    tel +32(0)3 451 23 82
    http://www.uptime.be
    disclaimer: www.uptime.be/disclaimer

    ---
    From: Marko Sutic
    Sent: donderdag 25 augustus 2011 10:51
    To: D'Hooge Freek
    Cc: oracle-l@freelists.org
    Subject: Re: CRS-1615:voting device hang at 50% fatal, termination in 99620 ms

    Errors messages from another node:

    2011-08-25 10:38:33.563
    [cssd(18117)]CRS-1612:node l01ora3 (1) at 50% heartbeat fatal, eviction in 14.000 seconds
    2011-08-25 10:38:40.558
    [cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in 7.010 seconds
    2011-08-25 10:38:41.560
    [cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in 6.010 seconds
    2011-08-25 10:38:45.558
    [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 2.010 seconds
    2011-08-25 10:38:46.560
    [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 1.010 seconds
    2011-08-25 10:38:47.562
    [cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 0.010 seconds
    2011-08-25 10:38:47.574
    [cssd(18117)]CRS-1607:CSSD evicting node l01ora3. Details in /u01/app/crs/log/l01ora4/cssd/ocssd.log.
    2011-08-25 10:39:01.579
    [cssd(18117)]CRS-1601:CSSD Reconfiguration complete. Active nodes are l01ora4 .

    Regards,
    Marko
  • Marko Sutic at Aug 25, 2011 at 10:42 am
    Freek,

    you are correct - the heartbeat fatal messages are there because of the missing
    voting disk.

    I have another database up and running on the second node, and it uses the same
    OCFS2 volume for Oracle database files as the first one.
    That database is running without any errors, so I assume the other OCFS2
    volumes were accessible at the time of the failure.

    In this configuration there are three voting disk files, located on three
    different LUNs and separate OCFS2 volumes. When the failure occurs, two of the
    three voting devices hang.

    It is also worth mentioning that nothing else is running on that node except
    the import.

    I simply can't figure out why two of three voting disks hang.
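
    To rule out the OCFS2 cluster stack itself on that node, its status and heartbeat threshold can be checked (a sketch, assuming the usual EL5 locations of the o2cb init script and its sysconfig file):

    # OCFS2/o2cb cluster stack status on the node that gets evicted
    /etc/init.d/o2cb status
    # Heartbeat dead threshold; too low a value combined with slow I/O can
    # fence the node on the OCFS2 side as well
    grep -i THRESHOLD /etc/sysconfig/o2cb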

    Regards,
    Marko
    --
    Marko Sutic, dipl.ing.rač.
    My LinkedIn Profile <http://hr.linkedin.com/in/markosutic>

  • David Barbour at Aug 25, 2011 at 10:14 pm
    Anything in /var/log/messages?
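
    A rough filter to pull the storage-related entries out of a busy /var/log/messages around the eviction time (the pattern list is only a starting point; extend it as needed):

    grep -Ei 'I/O error|multipath|device-mapper|ocfs2|o2net|qla2|scsi' /var/log/messages | less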
  • Marko Sutic at Aug 26, 2011 at 8:34 am
    Hi David,

    /var/log/messages is stuffed with all sorts of messages and I cannot tell what
    is important to look for.

    I will attach an excerpt of the log file covering the period during the import
    and when the failure occurred.

    If you notice something odd please let me know.

    Regards,
    Marko

    text/plain attachment: messages.txt
  • D'Hooge Freek at Aug 26, 2011 at 3:07 pm
    Marco,

    Your system is complaining about lost disks and I/O paths.
    I also see that your multipathing is configured to queue I/O when all paths to a disk are lost (queue_if_no_path). When that happens, the clusterware will start reporting inaccessible voting disks, while the other processes will (appear to) hang.

    Can you check if you still get errors like "tur checker reports path is down" or "kernel: end_request: I/O error, ...".
    If not, check whether they start to appear when you put some load on the I/O subsystem.
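
    A quick way to check both points (a sketch; whether no_path_retry is set explicitly in /etc/multipath.conf or comes from the built-in device defaults varies per setup):

    # Is queue_if_no_path in effect for the multipath maps?
    # Look for 'features=1 queue_if_no_path' in the map headers.
    multipath -ll | grep -i features
    # 'no_path_retry queue' in multipath.conf means queue forever
    grep -i no_path_retry /etc/multipath.conf
    # How many path errors have been logged so far
    grep -cE 'tur checker reports path is down|end_request: I/O error' /var/log/messages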

    Regards,

    Freek D'Hooge
    Uptime
    Oracle Database Administrator
    email: freek.dhooge_at_uptime.be
    tel +32(0)3 451 23 82
    http://www.uptime.be
    disclaimer: www.uptime.be/disclaimer

  • Marko Sutic at Aug 26, 2011 at 4:05 pm
    Freek,

    there were lots of "tur checker reports path is down" errors.

    Also when I filtered /var/log/messages for "I/O error" there were lots of
    messages like:
    ...
    Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdcn, sector 2367
    Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdbs, sector 2247
    Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdbk, sector 5103799
    Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdbk, sector 6193375
    Aug 25 22:58:55 l01ora3 kernel: end_request: I/O error, dev sdci, sector 2367
    Aug 25 22:58:56 l01ora3 kernel: end_request: I/O error, dev sdci, sector 9330267
    ...

    The sysadmins replaced the FC switch and rebooted the nodes.
    It seems that not all paths were activated, or something like that. (I will ask
    for a detailed answer next week.)
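
    To confirm that all paths really came back after the switch replacement, the FC port states and the multipath path states can be checked (a sketch; the fc_host sysfs entries assume the QLogic HBAs listed earlier):

    # Every FC port should report Online
    for h in /sys/class/fc_host/host*; do echo "$h: $(cat "$h"/port_state)"; done
    # No path should be failed or faulty
    multipath -ll | grep -iE 'failed|faulty' || echo "no failed paths reported"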

    After that, these errors disappeared and the import has not failed after an
    hour or two (it is still running).

    I've also collected OS statistics and asked Oracle Support to help me.

    Answer from Oracle support:

    Please find my analysis below.

    zzz ***Thu Aug 25 22:09:28 CEST 2011
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    6 7 0 59341556 164072 5250912 0 0 107 20 2 32 0 0 99 1 0
    0 7 0 59338752 164072 5250904 0 0 0 0 1017 5378 0 0 87 12 0
    0 7 0 59338628 164072 5250904 0 0 0 8 1016 5168 0 0 87 13 0

    ++ From the above we can see there might be some disk problems, because b > r

    ++ bi and bo having high values indicates a problem with I/O or storage.

    The vmstat "b" column

    Any count in the "b" column of vmstat is indicative
    of threads blocked via sema_p in the biowait state.

    This indicates that an I/O call has been made and that Solaris
    is waiting on a return response.

    Note: I/O can be blocked prior to waiting on a return response.
    In particular, if the number of commands active on
    a device is greater than or equal to the value of
    sd_max_throttle, all threads requesting I/O through
    that device will block prior to waiting for the I/O.
    This will not be reflected in the "b" column count.

    ++ No heavy CPU usage

    ++ Nothing wrong I can find in the top command

    ++ sda, sda2 devices have heavy writes for seconds.

    Please check with your system admin why there are I/O issues on the system.
    If you would like to raise a new SR with storage regarding this, that's fine.

    My current vmstat statistics are:

    ....
    zzz ***Fri Aug 26 14:18:39 CEST 2011
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    6 5 0 59010308 222532 5545300 0 0 436 90 23 24 1 0 98 2 0
    0 5 0 59008336 222532 5545332 0 0 17 468 1077 4482 0 0 93 6 0
    0 5 0 59008088 222532 5545356 0 0 2 1 1019 4189 0 0 94 6 0
    zzz ***Fri Aug 26 14:18:49 CEST 2011
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    5 5 0 59010852 222540 5545516 0 0 436 90 23 24 1 0 98 2 0
    0 5 0 59007824 222544 5545532 0 0 1 369 1132 3625 0 0 93 6 0
    0 5 0 59007576 222544 5545536 0 0 2 1 1021 3219 0 0 94 6 0
    zzz ***Fri Aug 26 14:18:59 CEST 2011
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    5 5 0 59009780 222548 5545652 0 0 436 90 23 24 1 0 98 2 0
    0 5 0 59006268 222548 5545640 0 0 0 449 1034 4368 0 0 93 6 0
    0 5 0 59006376 222548 5545640 0 0 19 50 1035 3300 0 0 94 6 0
    zzz ***Fri Aug 26 14:19:09 CEST 2011
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    5 6 0 59000056 222580 5545700 0 0 436 90 23 24 1 0 98 2 0
    0 6 0 59001064 222580 5545720 0 0 0 364 1040 5069 0 0 93 6 0
    0 6 0 59006064 222584 5545676 0 0 1 100 1048 4974 0 0 94 6 0
    zzz ***Fri Aug 26 14:19:19 CEST 2011
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    5 0 0 59010232 222668 5545780 0 0 436 90 23 24 1 0 98 2 0
    0 7 0 59005660 222672 5545932 0 0 14337 197 1233 3823 2 1 89 8 0
    1 6 0 59005536 222680 5545984 0 0 28 470 1038 3290 0 0 87 13 0
    zzz ***Fri Aug 26 14:19:29 CEST 2011
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    6 6 0 58998056 222680 5546032 0 0 436 90 23 24 1 0 98 2 0
    0 7 0 58995104 222680 5545996 0 0 16 216 1030 4447 0 0 87 13 0
    0 6 0 58995488 222680 5546056 0 0 3 469 1046 4248 0 0 87 13 0
    ....

    I hope that this is OK now.

    Thank you for your help.

    Regards,
    Marko
