Note: I thought I had saved a bunch of documentation, notes, and output from this experience, both for our records and because I figured I would be writing a post. However, I can't seem to find any of it. If I do, there will be an updated version of this post.
While I am not the most experienced sysadmin out there, I have been running servers for 5 or so years at this point. At my current job, we have 40+ hosts, the majority of them being cookie-cutter LAMP stacks. Some of these have been running basically untouched for most of their lives.
One particular server started throwing odd MySQL errors on a Saturday morning. This wasn't a heavily-trafficked server, and MySQL tipping over was not expected. I was awake, and dove in. This was a standard setup for us - a VPS, running a LAMP stack, with a Magento site.
I started by Googling the MySQL error being thrown. It was the "out of storage" disk error, code 19 - the code reported when the disk is out of space. However, when I ran df -h (I'm a sucker for human-readable units), it came back with only 75% of the device in use - high, but not that high, and not enough to cause MySQL to have a fit.
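For context, the check looked roughly like this - the filesystem name and sizes here are illustrative, not the original output:

    $ df -h /
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vda1       200G  150G   50G  75% /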
Here is where I made my first mistake that day - I didn't run man df. Doing so would probably have shown, or at least hinted at, what was going on. Long-time *nix admins can probably guess what it was at this point.
I did start Googling around the nexus of the error, framework, and versions, hoping I would stumble across something. At the same time, I kept trying to figure out what was taking up the space. I cleared out a bit of space, got MySQL to restart and come up, and crossed my fingers. It stayed up, and I didn't see anything in my searches, so I updated my co-workers and made sure to keep an eye on the htop window for that server throughout the day.
As you might have expected, the errors came back. I thought that the restart, rather than the clearing of space, had been the fix the first time. I mean, it was only at 75% usage in the first place - why would clearing out some old log files have helped?
However, this restart resulted in MySQL not coming up. I was frustrated, tired, still not feeling well, and about ready to kick the whole box and see if it just needed a reboot. I decided to wait and spend another ten minutes searching first - I try to always do this before reaching for the sledgehammer, just in case a better idea comes up.
While in Google, it finally hit me to read the df manpage. Doing so led me to df -i, which reports inode usage - inodes being the filesystem structures that track each individual file, which means that on *nix filesystems you have to worry about the number of files, not just their total size. Running df -i on the server in question showed that 99% of the inodes were in use. A quick bit of Googling confirmed that MySQL will report it is out of disk space when the system is out of inodes. This is also why deleting the old log files let MySQL come back up the first time - it freed up enough inodes. When I didn't delete more files before restarting it the second time, there were no free inodes to allocate, and MySQL stayed down.
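What df -i shows in that state looks something like this (again, illustrative numbers rather than the real output) - note the gap between the Use% above and the IUse% here:

    $ df -i /
    Filesystem       Inodes    IUsed   IFree IUse% Mounted on
    /dev/vda1      13107200 12976128  131072   99% /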
So now we knew the immediate cause, but not the RFO (reason for outage) - how does a system with only 75% of its disk in use run out of inodes? Without figuring out what was eating all the inodes, the system wouldn't stay stable. Armed with this new info, I started a find on the system for any directory with over 100k files in it. I figured the files would have to be small, since there was still disk space free, and were likely the result of something automated, which is why the count kept growing. I had checked the inode usage on one of my personal servers, which had similar uptime - it was at 63%, so this was clearly an anomaly. While the find ran, I resumed my searching.
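I no longer have the exact command, but the hunt was along these lines - walk the filesystem, count the entries directly inside each directory, and report anything suspicious:

    # rough sketch: report directories with more than 100k files directly inside them
    # (slow on a filesystem this full, which is why it ran in the background for hours)
    find / -xdev -type d -print0 2>/dev/null |
    while IFS= read -r -d '' dir; do
        count=$(find "$dir" -maxdepth 1 -type f 2>/dev/null | wc -l)
        [ "$count" -gt 100000 ] && echo "$count  $dir"
    done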
I ran across someone who had run into a similar issue - MySQL reporting it was out of space when there was plenty, it turning out to be an inode problem, and so on. Better still, he was also running Magento. I opened a new ssh session to the server and dove into the directory where he had found the problem files. Bingo.
I would later find out, when the find finally finished, that there were over 10 million files in that directory. PHP sessions on that box had been configured to be file-based. The config file even carries a comment that if you choose file-based sessions, you should set up a cron task to clean them up regularly. This server didn't have one, as we rarely use file-based sessions. We would later find out, after auditing our hosts, that it was one of two servers configured that way, out of the 35ish we had active at the time.
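The cleanup job that comment asks for is a one-liner. Something along these lines would have prevented the whole incident - the path and schedule here are hypothetical, so adjust for wherever your stack actually writes its session files:

    # /etc/cron.d/cleanup-php-sessions (hypothetical path and schedule)
    # delete session files that haven't been touched in over 24 hours
    17 */6 * * * root find /var/lib/php/sessions -type f -name 'sess_*' -mmin +1440 -delete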
The files were cleaned up (a challenge in and of itself - anything that wasn't a sledgehammer was too slow, because it built a list of every file before deleting anything), the setting was changed to database-backed sessions, and the servers were brought up to the standard configuration.
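The trick with a directory that large is to delete as you walk it rather than building the full list first. A minimal sketch of that kind of approach, with an illustrative path:

    # rm sess_* fails outright (argument list too long), and ls wants to sort
    # millions of names before printing one; find -delete streams through them instead
    find /var/www/magento/var/session -type f -name 'sess_*' -delete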
To those of you who run company-wide networks, or more advanced setups than a bunch of LAMP stacks with a few multi-tenant setups thrown in, this may seem like small potatoes. I admit I made it worse by not slowing down and checking what else the tools I was using could tell me about the situation. But for the types of servers we run, this is about as far out of left field as it gets.
I have since added inode usage to our monitoring tools, and to my "sick server" checklist, which I run through whenever I take on a new host or one that is exhibiting random or unexplained issues.
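The checklist item itself is nothing fancy - a quick pass over df -i output, flagging anything above a threshold. A sketch of the kind of check I mean (the 90% threshold is arbitrary):

    # warn about any filesystem above 90% inode usage
    df -iP | awk 'NR > 1 { use = $5; sub("%", "", use); if (use + 0 > 90) print "inode usage " $5 " on " $6 }'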
These are the types of outages that are best taught by experience, either first-hand, or through the swapping of war stories. If you have a good one, share it. It might just save a server some day.