Problem
I have a new problem with my Dell PowerEdge 6650 servers. They crash! Well, sort of. Under heavy NFS usage (usually compiling a large program over an NFS mount) the machine seems to hang. I say “seems” because all SSH connections to the machine time out and it does not respond to network traffic. However, a look at the console shows the machine is alive and well.
Platform
- 2x Dell PowerEdge 6650 Servers (One NFS client, One NFS server)
- Embedded Broadcom BCM5700 Gigabit NICs (Connected via Cisco Gigabit switch)
- CentOS 5.2 with a custom 2.6.25 kernel
Evidence
Once on the console, the obvious step is to check the system logs. If you wait long enough, the network interface begins working again and the following messages appear in the system log. Your interface name may differ from eth1.
kernel: NETDEV WATCHDOG: eth1: transmit timed out
kernel: tg3: eth1: transmit timed out, resetting
kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000008]
kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
kernel: tg3: eth1: Link is down.
This says that the ethernet watchdog figured out that the network interface hung or crashed or deadlocked or got stuck. The remedy? The watchdog restarts the interface.
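If you want to spot these events without reading the log by hand, something like the following rough sketch could work. It only assumes the "NETDEV WATCHDOG" line format shown in the log messages above; the function name and the sample list are made up for illustration, and in practice you would feed it lines from wherever your syslog ends up.

```python
import re

# Matches the watchdog line format from the log messages in this post.
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (?P<iface>\w+): transmit timed out")


def find_watchdog_timeouts(lines):
    """Return the interface name for each watchdog timeout line, in order."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(match.group("iface"))
    return hits


# Hypothetical sample input, copied from the messages quoted in this post.
sample = [
    "kernel: NETDEV WATCHDOG: eth1: transmit timed out",
    "kernel: tg3: eth1: transmit timed out, resetting",
    "kernel: tg3: eth1: Link is down.",
]

print(find_watchdog_timeouts(sample))
```

Only the first line matches, because the tg3 lines do not carry the "NETDEV WATCHDOG" prefix, so each hang should be counted once.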
So every time you get the network crash you have two options:
- Wait ~15 minutes for it to fix itself (miraculous, I know!)
- Restart the networking service
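The second option could be scripted. Here is a minimal sketch, not something I have run in production: it assumes the CentOS-style `service network restart` command, and the helper names are hypothetical. It defaults to a dry run so nothing is restarted by accident, and it does not track which log lines it has already seen, so a real version would need to avoid reacting to old messages.

```python
import subprocess

# Assumption: a CentOS-style init script; adjust for your distribution.
RESTART_CMD = ["service", "network", "restart"]


def should_restart(lines, iface):
    """True if any log line reports a watchdog transmit timeout on iface."""
    marker = "NETDEV WATCHDOG: %s: transmit timed out" % iface
    return any(marker in line for line in lines)


def restart_networking(dry_run=True):
    """Return the restart command as a string; run it unless dry_run is set."""
    if not dry_run:
        # Needs root; raises CalledProcessError if the restart fails.
        subprocess.check_call(RESTART_CMD)
    return " ".join(RESTART_CMD)


# Hypothetical log lines for illustration.
recent = ["kernel: NETDEV WATCHDOG: eth1: transmit timed out"]
if should_restart(recent, "eth1"):
    print(restart_networking(dry_run=True))
```

Of course, if the network is down you need to run this from the console or from cron on the affected machine, not over SSH.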
Resolution 1
As I mentioned this was fairly repeatable for me. All I had to do was attempt to compile an application in a remote directory mounted with NFS. The NFS client was always the party to crash. The cheap (well lazy) fix that I took was to add a spare PCI gigabit NIC to the client machine. This resolved the problem on the client side.
Problem Re-emergence
After a couple weeks of operating with the client on PCI NIC and server on embedded NIC, the server’s NIC locked up just like the client’s had previously. This time I got a little fed up because I didn’t have a spare gigabit NIC to put in the server.
Resolution 2 – I Hope
This resolution is tentative. I have implemented it and have not had a crash but I do not trust that it is permanently fixed until more time has passed. I’ll post an update if I have any new issues.
On the advice of the local sysadmin I went to Dell’s website and poked around until I found an ISO CD image that contained all of the available firmware updates for the PowerEdge 6650 on one CD. He recommended I give that a try and upgrade every single piece of firmware possible.
The result of the scan from the CD was that I was up to date on everything but the “BMC”, the Baseboard Management Controller. My version was 1.64 and the latest was 1.78. So I let the CD do the firmware upgrade for me.
Since the upgrade (28 days ago) and a reboot I have not had another NIC crash. I don’t consider this conclusive yet because it is very possible that the crash only happens when conditions are just right.
In summary, the correct solution appears to be to update all of the server firmware (duh?). The easiest way to do that is to get the update CD for your OS from Dell. The CD is called something like “Dell CD ISO – PowerEdge Updates”. Let this also be a warning. Until I knew that update CD existed, I thought that I had upgraded all of the firmware possible in the server via individual floppies. Don’t make the same mistake; try Dell’s update-all CD.
Failure .. Again!
Today I have crashed the NIC in the NFS server again… I’m looking for a new fix!
See Comments for updates from me.
#1 by Ryan on August 5, 2008 - 9:50 pm
Update 1:
Last week the server which has NOT had the firmware upgrade crashed one of its internal Broadcom NICs. Believe it or not, it did it while handing out IP addresses with DHCP! Not exactly a very intensive task. Oh well, needless to say I upgraded the firmware on that server too. Now I’m testing the firmware as a fix for the NIC crashes on both PowerEdge 6650s.
I got a kick out of the change logs for the BMC firmware. For almost every release between 1.64 and 1.78 the Dell comment for the revision is “maintenance”. Oh great, thanks! That really helps me understand if this update might resolve my issue.
#2 by Ryan on August 8, 2008 - 12:16 pm
Update 2:
The fix did not work. Today I have crashed the NIC in the NFS server twice. This is frustrating! Back to the drawing board.
#3 by Justin on February 7, 2009 - 1:11 pm
Oh great! Well, at least it’s not just me! I just started a job setting up CentOS servers. I am running Samba with an NFS-mounted drive. When transferring large amounts of data, the machine just locks up. A restart of the network seems to clear up the problem, but it takes about 7 minutes to see the network drive again.