Hoping someone else has seen this before.

I have a few dozen Dell R610 systems running CentOS 5.2 with
kernels from 5.3 and 5.4 (2.6.18-128.1.10.el5 & 2.6.18-164.6.1.el5)
that randomly lose layer 2 network connectivity, either partially
or totally. Running tcpdump on the interface reveals only ARP
broadcasts, no responses. The switch reports no packets being
received on the interface.

Systems can run for days, weeks or even months without an issue and then
drop off the network. At first I thought it was the Dell switches,
which we had lots of problems with, but it has happened on two other
brands of switches as well (Cisco and Extreme), so I no longer believe
it's the switch but rather the systems.

The workaround is to restart the network on the system. I even
configured the bonding driver to do ARP requests and fail over to
the backup link when those fail, but that wasn't successful
either, as both links can go down and/or the system can go into a
"degraded" state where it can reach some systems but not others.

I have ESXi systems running on the same hardware and to date have not
seen any of them drop off the same way.

The system can be under high traffic load at the time or completely
idle; it doesn't seem to make a difference. There are no log entries
indicating what might be going on.
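
The restart workaround mentioned above could be automated while the root cause is being chased. The following is only a rough sketch, not something from the original post: the function names, the `netwatch` log tag and the `TARGET` address (a documentation placeholder) are all assumptions, and it presumes the stock CentOS `service network restart` is the restart that clears the condition.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; TARGET is a placeholder address, not from the post.
TARGET="${TARGET:-192.0.2.1}"

# Succeeds if the target answers at least one of three pings.
link_ok() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1
}

# Apply the manual workaround (network restart) when the target is unreachable,
# and leave a syslog entry so the outage is at least recorded somewhere.
check_and_restart() {
    if ! link_ok "$1"; then
        logger -t netwatch "no reply from $1, restarting network"
        service network restart
    fi
}

# Example invocation, e.g. from cron (commented out so sourcing has no side effects):
#   check_and_restart "$TARGET"
```

Since the failure can be partial (some hosts reachable, others not), a single target may miss the "degraded" case; checking two or three hosts on different switch ports would likely be more reliable.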

I have a case open with Dell but am not expecting a whole lot from
them, maybe I'll get lucky though. They asked me to upgrade the NIC
firmware, which I did on a batch of systems to no avail (the release
notes for the firmware said nothing about any fixes that sounded
like my issue).

Driver versions:
ESXi (vSphere):
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.6.9 (December 8, 2007)

Most Linux systems (5.3 kernel):
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.7.9-1 (July 18, 2008)

Some Linux systems (5.4 kernel):
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.9.3 (March 17, 2009)
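
For comparing driver versions across a fleet like this, something along these lines may help. It is a small sketch, assuming `ethtool` is installed and that the bnx2 ports are named eth0 and eth1 (the interface names and the `driver_version` helper are assumptions, not from the post).

```shell
# Print the driver version that ethtool reports for one interface.
driver_version() {
    ethtool -i "$1" | awk -F': ' '$1 == "version" { print $2 }'
}

# Usage, with assumed interface names:
#   for i in eth0 eth1; do echo "$i $(driver_version "$i")"; done
#   modinfo -F version bnx2     # version of the module on disk, for comparison
```

The running version and the on-disk module version can differ if a newer kernel package was installed without a reboot, so it is worth checking both.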

Happens across at least a dozen systems spread over 4 data centers.

Never seen this sort of behavior before in the hundreds and hundreds
of systems I've run. These systems are all new, the R610 hardware
was released around May 2009, and we've been having issues since
day 1, but only recently have been able to rule the switches out as
the cause.

The latest driver on Broadcom's site is 1.9.20b, which seems odd since
CentOS 5.4 seems to come with 1.9.3 (the date on the Broadcom site is
more recent than the date on the Linux kernel driver in 5.4). Most of
the fixes in the recent driver versions seem to focus on iSCSI,
which I'm not using.

lspci says:
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Dell Unknown device 0236
        Flags: bus master, fast devsel, latency 0, IRQ 114
        Memory at dc000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number c9-dc-93-fe-ff-9b-21-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel

I suppose I could go build the latest driver from their site and see
how it goes..

thanks

nate


  • James Pearson at Dec 15, 2009 at 12:02 pm

    nate wrote:
    > Hoping someone else has seen this before.
    >
    > I have a few dozen Dell R610 systems with CentOS 5.2 that are
    > using kernels from 5.3 and 5.4 (2.6.18-128.1.10.el5 & 2.6.18-164.6.1.el5),
    > that at random lose layer 2 network connectivity either partially
    > or totally. Running tcpdump on the interface reveals only ARP
    > broadcasts, no responses. Switch reports no packets being
    > received on the interface.

    A colleague of mine has seen the exact same issue with Dell R610 systems
    running CentOS 5.x

    It looks like this might be the same issue as:

    <https://bugzilla.redhat.com/show_bug.cgi?id=520888>

    Which seems to suggest disabling MSI - i.e. load the bnx2 module with
    "disable_msi=1"

    We haven't tried this yet (as we went back to using CentOS 4 on these
    boxes - which works OK)

    James Pearson
  • Nate Amsden at Dec 15, 2009 at 5:24 pm

    James Pearson wrote:
    > It looks like this might be the same issue as:
    >
    > <https://bugzilla.redhat.com/show_bug.cgi?id=520888>
    >
    > Which seems to suggest disabling MSI - i.e. load the bnx2 module with
    > "disable_msi=1"

    Wow! That looks interesting, will try it! Thanks!

    nate
  • Akemi Yagi at Dec 15, 2009 at 5:45 pm

    On Tue, Dec 15, 2009 at 9:24 AM, nate wrote:
    > James Pearson wrote:
    >> It looks like this might be the same issue as:
    >>
    >> <https://bugzilla.redhat.com/show_bug.cgi?id=520888>
    >>
    >> Which seems to suggest disabling MSI - i.e. load the bnx2 module with
    >> "disable_msi=1"
    >
    > Wow! that looks interesting, will try it! thanks!

    This is also being tracked in the CentOS bug tracker:

    http://bugs.centos.org/view.php?id=3832

    Akemi
  • James Pearson at Dec 16, 2009 at 10:47 am

    nate wrote:
    > James Pearson wrote:
    >> It looks like this might be the same issue as:
    >>
    >> <https://bugzilla.redhat.com/show_bug.cgi?id=520888>
    >>
    >> Which seems to suggest disabling MSI - i.e. load the bnx2 module with
    >> "disable_msi=1"
    >
    > Wow! that looks interesting, will try it! thanks!

    Also, the pre-5.5 kernel at
    <http://people.redhat.com/dzickus/el5/181.el5> has "bnx2: update to
    version 2.0.2" - no idea if this helps in this case.

    James Pearson
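
For anyone landing here later, the `disable_msi=1` suggestion from the thread might be applied on a CentOS 5 box roughly as follows. This is a hedged sketch rather than a tested procedure: it assumes bnx2 is loaded as a module and that module options live in /etc/modprobe.conf, and the `add_bnx2_option` helper is a hypothetical name introduced here.

```shell
# Append the module option once; the path argument allows a dry run on a copy.
add_bnx2_option() {
    conf="${1:-/etc/modprobe.conf}"
    grep -q '^options bnx2 .*disable_msi=1' "$conf" 2>/dev/null ||
        echo 'options bnx2 disable_msi=1' >> "$conf"
}

# Usage (as root, from the console, since reloading the driver drops the links):
#   modinfo bnx2 | grep disable_msi     # confirm this build has the parameter
#   add_bnx2_option
#   service network stop && modprobe -r bnx2 && modprobe bnx2 && service network start
#   grep eth /proc/interrupts           # the bnx2 ports are expected to stop showing PCI-MSI
```

Whether this actually cures the dropouts is not confirmed anywhere in the thread; nobody had reported back after trying it.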

Posted to the centos group Dec 14, '09 at 10:50p; last active Dec 16, '09 at 10:47a.