Archive for category Linux

Virtual server changed

Last April I moved this website to a new virtual server with the same hosting provider. Previously I was on an OpenVZ platform, which was a nightmare for me. I’ll explain the details in a bit. I really liked the hosting company, VPSLink, because of their communication practices, network speed and full-featured control panel. So I stuck with the same company but bought a new server on the Xen virtual platform. Now I’m much happier.

The real problem I had with OpenVZ was the lack of swap space. Swap space is disk space set aside by the operating system to be used as a stand-in for RAM when there is not enough RAM free to run all of your programs. Using swap space has a penalty, and that is access time: program data has to be fetched from the hard drive before it can be used. Leased virtual servers are typically quite limited in the amount of RAM you are given, so swap space is really a must unless your server will only be running one or two applications.
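If you want to check whether a server has any swap at all, the kernel reports it in /proc/meminfo. Here is a minimal sketch; the function name swap_report and its optional file argument are my own inventions for illustration, not a standard tool.

```shell
# Print total and free swap (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo; the optional argument is only there
# so the function can be pointed at a saved copy of the file.
swap_report() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { printf "swap_total_kb=%d swap_free_kb=%d\n", total, free }' "$meminfo"
}

# Only run against the live file where it exists (Linux).
if [ -r /proc/meminfo ]; then
    swap_report
fi
```

On a container with no swap configured, I would expect both numbers to come out as 0.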

For example, my leased server is a one stop shop for website and email. To perform these tasks it needs these daemons running all of the time:

  • Apache webserver
  • Named/BIND DNS server
  • SpamAssassin spam filter
  • Sendmail SMTP server
  • Dovecot IMAP server
  • MySQL database server

I should have known I was in for trouble when I couldn’t even start Apache + Named at the same time with their default configuration without running out of memory. I followed a few guides on the net and got their footprints trimmed down to a workable state. But the penalty was that all of my applications were now so memory constrained that their performance suffered a bit. Furthermore, I was at the threshold of memory usage. Linux would routinely kill my dovecot mail processes to try to reclaim memory. This of course closed IMAP connections, which I noticed as a mail client user. I also could not run yum to update packages without running out of memory.

So one day I got fed up and bought a new server with the same company but the new server was Xen based. I couldn’t be happier now because I have swap space. Most of my applications are still quite fast and my dovecot processes are no longer getting killed.

PowerEdge 6650 NIC Issues

Problem

I have a new problem with my Dell PowerEdge 6650 servers. They crash! Well, sort of. Under heavy NFS usage (usually compiling a large program over an NFS mount) the machine seems to hang. I say “seems” because all SSH connections to the machine time out and it does not respond to network traffic. However, a look at the console proves the machine is alive and well.

Platform

  • 2x Dell PowerEdge 6650 Servers (One NFS client, One NFS server)
  • Embedded Broadcom BCM5700 Gigabit NICs (Connected via Cisco Gigabit switch)
  • CentOS 5.2 custom kernel 2.6.25

Evidence

Once on the console the obvious act is to check the system logs. If you wait long enough, the network interface begins working again and the following messages are in the system log. Of course eth1 may be replaced by your interface.

kernel: NETDEV WATCHDOG: eth1: transmit timed out
kernel: tg3: eth1: transmit timed out, resetting
kernel: tg3: DEBUG: MAC_TX_STATUS[00000008]  MAC_RX_STATUS[00000008]
kernel: tg3: DEBUG: RDMAC_STATUS[00000000]  WDMAC_STATUS[00000000]
kernel: tg3: tg3_stop_block timed out,  ofs=1800 enable_bit=2
kernel: tg3: tg3_stop_block timed out,  ofs=4800 enable_bit=2
kernel: tg3: eth1: Link is down.

This says that the ethernet watchdog figured out that the network interface hung or crashed or deadlocked or got stuck. The remedy? The watchdog restarts the interface.
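To see how often this happens, you can count the watchdog messages per interface. The following is a rough sketch, assuming the log lines look like the excerpt above; the function name count_tx_timeouts is hypothetical, and /var/log/messages as the default path is an assumption about where your syslog puts kernel messages.

```shell
# Count "NETDEV WATCHDOG ... transmit timed out" events per interface.
# The default log path is an assumption; pass another file as the first argument.
count_tx_timeouts() {
    logfile="${1:-/var/log/messages}"
    awk '/NETDEV WATCHDOG: .*transmit timed out/ {
             # The interface name is the field after "WATCHDOG:" and ends with a colon.
             for (i = 1; i < NF; i++)
                 if ($i == "WATCHDOG:") {
                     iface = $(i + 1)
                     sub(/:$/, "", iface)
                     count[iface]++
                 }
         }
         END { for (iface in count) print iface, count[iface] }' "$logfile"
}
```

Calling count_tx_timeouts with no argument reads the default log; each output line is an interface name followed by the number of timeouts seen for it.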

So every time you get the network crash you have two options:

  1. Wait ~15 minutes for it to fix itself (miraculous, I know!)
  2. Restart the networking service

Resolution 1

As I mentioned, this was fairly repeatable for me. All I had to do was attempt to compile an application in a remote directory mounted with NFS. The NFS client was always the party to crash. The cheap (well, lazy) fix that I took was to add a spare PCI gigabit NIC to the client machine. This resolved the problem on the client side.

Problem Re-emergence

After a couple weeks of operating with the client on PCI NIC and server on embedded NIC, the server’s NIC locked up just like the client’s had previously. This time I got a little fed up because I didn’t have a spare gigabit NIC to put in the server.

Resolution 2 – I Hope

This resolution is tentative. I have implemented it and have not had a crash but I do not trust that it is permanently fixed until more time has passed. I’ll post an update if I have any new issues.

On the advice of the local sysadmin I went to Dell’s website and poked around until I found an ISO CD image that contained all of the possible firmware updates for the PowerEdge 6650 on one CD. He recommended I give that a try and upgrade every single piece of firmware possible.

The result of the scan from the CD was that I was up to date on everything but the “BMC”, the Baseboard Management Controller. My version was 1.64 and the latest was 1.78. So I let the CD do the firmware upgrade for me.

Since the upgrade (28 days ago) and a reboot I have not had another NIC crash. I don’t consider this conclusive yet because it is very possible that the situation has to be just right.

In summary, the correct solution appears to be to update all of the server firmware (duh?). The easiest way to do that is to get the update CD for your OS from Dell. The CD is called something like “Dell CD ISO – PowerEdge Updates”. Let this also be a warning. Until I knew that update CD existed, I thought that I had upgraded all of the firmware possible in the server via individual floppies. Don’t make the same mistake; try Dell’s update-all CD.

Failure .. Again!

Today I have crashed the NIC in the NFS server again… I’m looking for a new fix!

See Comments for updates from me.

LDAP SSL doesn’t like me

I enabled LDAP SSL and I’m having problems with user logon. As a reminder, I’m running Fedora Directory Server (FDS). The user gets errors like this:

-bash: [: =: unary operator expected

I figured out that they come from running commands like the following in a shell script or typed by a user:

echo `/usr/bin/id -u`

By process of elimination I found that when I revert my ldap.conf file to not use SSL when connecting to my FDS LDAP server the bash errors go away.
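For what it’s worth, that particular bash message is what you get when a command substitution expands to nothing inside an unquoted test. My guess (and it is only a guess) is that id printed nothing when the LDAP lookup over SSL failed. A small sketch of the mechanism, where empty_lookup is a made-up stand-in for the failing command:

```shell
# Stand-in for a command that prints nothing, as `id -u` apparently did
# when the lookup failed (this is an assumption, not something I traced).
empty_lookup() { :; }

# Unquoted: the test collapses to `[ = 0 ]`, which bash rejects with
# "unary operator expected".
unquoted_test() { [ $(empty_lookup) = 0 ]; }

# Quoted: the test becomes `[ "" = 0 ]`, which is simply false and silent.
quoted_test() { [ "$(empty_lookup)" = 0 ]; }

# Capture whatever each variant writes to stderr.
unquoted_msg=$(unquoted_test 2>&1) || true
quoted_msg=$(quoted_test 2>&1) || true

echo "unquoted test said: ${unquoted_msg:-nothing}"
echo "quoted test said: ${quoted_msg:-nothing}"
```

Quoting the substitution in a script hides the error message, but it does not fix the underlying lookup failure.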

I didn’t want to give up SSL because otherwise passwords are sent in clear text. I actually verified that with Wireshark… ewww.

After more than half a day of hunting I stumbled upon a workaround:

Turn on nscd. NSCD stands for name service cache daemon, which can lower the stress on auth servers by caching lookup data. I have no idea why this fixed my problem… but it did.