Problem

I have a new problem with my Dell PowerEdge 6650 servers. They crash! Well, sort of. Under heavy NFS usage (usually compiling a large program over an NFS mount) the machine seems to hang. I say “seems” because all SSH connections to the machine time out and it does not respond to network traffic. However, a look at the console shows the machine is alive and well.

Platform

  • 2x Dell PowerEdge 6650 Servers (One NFS client, One NFS server)
  • Embedded Broadcom BCM5700 Gigabit NICs (Connected via Cisco Gigabit switch)
  • CentOS 5.2 with a custom 2.6.25 kernel

Evidence

Once on the console, the obvious first step is to check the system logs. If you wait long enough, the network interface begins working again and the following messages appear in the system log. Your interface name may differ from eth1.

kernel: NETDEV WATCHDOG: eth1: transmit timed out
kernel: tg3: eth1: transmit timed out, resetting
kernel: tg3: DEBUG: MAC_TX_STATUS[00000008]  MAC_RX_STATUS[00000008]
kernel: tg3: DEBUG: RDMAC_STATUS[00000000]  WDMAC_STATUS[00000000]
kernel: tg3: tg3_stop_block timed out,  ofs=1800 enable_bit=2
kernel: tg3: tg3_stop_block timed out,  ofs=4800 enable_bit=2
kernel: tg3: eth1: Link is down.

These messages say that the kernel’s network watchdog noticed the interface had stopped transmitting (hung, deadlocked, or otherwise stuck). The remedy? The watchdog triggers a reset of the interface.
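To spot these events without reading the log by hand, a small script can scan for the tg3 timeout line. This is only a sketch: the function name `find_tg3_timeouts` and the regular expression are my own, written against the exact message format shown above, and other kernel versions may word the message differently.

```python
import re

# Matches the tg3 driver's line from the log excerpt above, e.g.
# "kernel: tg3: eth1: transmit timed out, resetting"
TG3_TIMEOUT = re.compile(r"kernel: tg3: (\w+): transmit timed out")


def find_tg3_timeouts(lines):
    """Return the interface name from every tg3 transmit-timeout line, in order."""
    hits = []
    for line in lines:
        match = TG3_TIMEOUT.search(line)
        if match:
            hits.append(match.group(1))
    return hits


# Sample lines copied from the log excerpt above.
sample = [
    "kernel: NETDEV WATCHDOG: eth1: transmit timed out",
    "kernel: tg3: eth1: transmit timed out, resetting",
    "kernel: tg3: eth1: Link is down.",
]
print(find_tg3_timeouts(sample))
```

On a real machine you would pass in the lines of the system log (typically /var/log/messages on CentOS) instead of the sample list.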

So every time the network hangs you have two options:

  1. Wait ~15 minutes for it to fix itself (miraculous, I know!)
  2. Restart the networking service
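Option 2 can be scripted. The sketch below assumes a CentOS-style init script reachable as `/sbin/service network restart` and a Python 3 interpreter on the machine; the function name `restart_networking` and the dry-run default are my own choices, so adjust them for your system.

```python
import subprocess

# Assumption: CentOS-style init script; other distributions use a different command.
RESTART_CMD = ["/sbin/service", "network", "restart"]


def restart_networking(dry_run=True):
    """Return the restart command; execute it only when dry_run is False."""
    if not dry_run:
        # Needs root. Raises CalledProcessError if the restart fails.
        subprocess.check_call(RESTART_CMD)
    return list(RESTART_CMD)


# Dry run: show what would be executed without touching the network.
print(" ".join(restart_networking()))
```

The dry-run default is deliberate: restarting the network from an SSH session drops that session, so this is best run from the console.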

Resolution 1

This was fairly repeatable for me. All I had to do was attempt to compile an application in a remote directory mounted over NFS. The NFS client was always the one to crash. The cheap (well, lazy) fix that I took was to add a spare PCI gigabit NIC to the client machine. This resolved the problem on the client side.

Problem Re-emergence

After a couple of weeks of operating with the client on the PCI NIC and the server on its embedded NIC, the server’s NIC locked up just like the client’s had previously. This time I got a little fed up because I didn’t have a spare gigabit NIC to put in the server.

Resolution 2 – I Hope

This resolution is tentative. I have implemented it and have not had a crash but I do not trust that it is permanently fixed until more time has passed. I’ll post an update if I have any new issues.

On the advice of the local sysadmin I went to Dell’s website and poked around until I found an ISO CD image that contained all of the available firmware updates for the PowerEdge 6650 on one CD. He recommended I give that a try and upgrade every single piece of firmware possible.

The result of the scan from the CD was that I was up to date on everything but the “BMC”, the Baseboard Management Controller. My version was 1.64 and the latest was 1.78. So I let the CD do the firmware upgrade for me.

Since the upgrade and a reboot (28 days ago) I have not had another NIC crash. I don’t consider this conclusive yet because it is very possible that conditions have to be just right to trigger the hang.

In summary, the correct solution appears to be to update all of the server firmware (duh?). The easiest way to do that is to get the update CD for your OS from Dell. The CD is called something like “Dell CD ISO – PowerEdge Updates”. Let this also be a warning: until I knew that update CD existed, I thought that I had upgraded all of the firmware possible in the server via individual floppies. Don’t make the same mistake; try Dell’s update-all CD.

Failure… Again!

Today I have crashed the NIC in the NFS server again… I’m looking for a new fix!

See Comments for updates from me.