Sunday, June 18, 2006

var partition filling up

Recently I had the chance to do some work on a Linux server and ran into problems with the /var partition filling up really quickly.

du vs df
There are two common tools for looking at disk usage: df, which lists all the partitions on your system with the amount of space used, space free, and percentage used; and du, usually run as du -sh *, which reports the total space used by each file and subdirectory under the current directory. df is helpful for spotting a partition that is filling up, and du is helpful for finding out which subdirectories are contributing most to that usage.
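For example (the paths are just illustrative - run du against whatever mount point df shows as nearly full):

```shell
# Which partition is nearly full? Watch the Use% column.
df -h
# Which entries under /var are the biggest? du -sk prints sizes in KB,
# one line per entry; sort numerically and keep the five largest.
du -sk /var/* 2>/dev/null | sort -n | tail -5
```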

MySQL InnoDB
If your /var/lib/mysql is filling up, it's probably the InnoDB files.

I had MySQL 5.0 and am using InnoDB tables. Unless you configure InnoDB to use a separate file for each table (innodb_file_per_table, which is useful for things like putting separate databases on separate partitions), it stores everything in a shared tablespace. There are a couple of fixed-size redo log files (default /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1, ~5M each) plus the shared tablespace file itself (default /var/lib/mysql/ibdata1). The log files stay small, but the ibdata1 file will (note this!) keep growing and will not shrink even if you delete records - it holds undo/rollback history as well as the data, and freed space is reused internally but never returned to the filesystem.

In our case the ibdata1 file grew too large, so we added a new autoextend data file on a larger partition, with a monthly note to dump and reimport all InnoDB data to reclaim the space. The following documentation link has details on both:
http://dev.mysql.com/doc/refman/5.0/en/adding-and-removing.html
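Following that documentation, the my.cnf change looks something like this. The paths, sizes, and the /bigdisk location are illustrative only - and note the size given for the existing ibdata1 must exactly match its actual current size:

```ini
[mysqld]
# Empty home dir lets us give absolute paths per data file.
innodb_data_home_dir =
# Keep the existing tablespace file fixed at its current size, and add
# a second data file on a bigger partition that autoextends as needed.
innodb_data_file_path = /var/lib/mysql/ibdata1:988M;/bigdisk/mysql/ibdata2:50M:autoextend
# For tables created from now on, per-table files sidestep the
# shared-tablespace growth problem entirely.
innodb_file_per_table
```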


sendmail clientmqueue
Another candidate for "var filling up" troubles is /var/spool/clientmqueue . This is a mail queue where sendmail stores, as files, emails that could not be delivered immediately or otherwise need to be held (mostly for retrying). It is separate from mqueue (google "mqueue vs clientmqueue").

Why is clientmqueue filling up? Most likely there are undeliverable messages headed for your users. If you are running a sendmail server, then your sendmail is probably misconfigured and undeliverable spam is being stored in clientmqueue.

If you aren't running a sendmail server, then the only way undeliverable mail ends up on your server is locally - something invoking sendmail directly. Check whether there are automated scripts on your server sending users mail that is not being delivered. One likely cause is cron emailing the root or owning user about cron job status or output. To fix this, you can add MAILTO="" to all the cron files on your system, including user/root crontabs and cron.daily, cron.weekly, etc. I also usually direct cron output to the null device by appending > /dev/null to the cron job's command, just in case.
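For example, at the top of a crontab (the script path below is made up for illustration; note cron mails stderr too, so add 2>&1 if you want that silenced as well):

```shell
# Disable mail for every job in this crontab.
MAILTO=""
# Belt and braces: discard both stdout and stderr of the job itself.
0 3 * * * /usr/local/bin/nightly-backup.sh > /dev/null 2>&1
```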

We also found alternative workarounds, like getting cron to use mail.local instead of sendmail to send emails (again, google is your friend), but did not try to implement them.

Open files

The way Linux treats open files is that if an open file is deleted, the file doesn't actually get deleted until it is closed. This means, for example, that if your application(s) are reading from or writing to a (log) file and you delete the file from the shell, the file is not actually removed until the application(s) close it. The file disappears from the directory listing immediately, though, so it's not apparent to the user that the file is still there and taking up space.

This is where du and df can report different numbers: du, which walks each file, will not count the space taken up by deleted-but-open files; df, which looks at block usage, will. If you see a huge difference between the usage reported by du and df, this is usually the cause.

One useful tool for seeing open files and the processes using them is lsof. e.g. lsof /var lists all the open files under /var along with the process IDs and owning users.
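A quick demonstration of the deleted-but-open behaviour (the file path is illustrative):

```shell
# Create a file, hold it open, then delete it: the blocks stay
# allocated (df still sees them) until the descriptor is closed.
f=/tmp/demo_deleted_log            # illustrative path
dd if=/dev/zero of="$f" bs=1M count=8 2>/dev/null
exec 3<"$f"                        # this shell now holds the file open
rm "$f"                            # gone from ls and du immediately
readlink /proc/$$/fd/3             # kernel still knows: "... (deleted)"
exec 3<&-                          # close it; now the space is freed
```

lsof +L1 /var is a handy variant of the lsof command above: +L1 selects open files whose link count is below 1, i.e. exactly these deleted-but-still-open files.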

In our case, we had rotated the Apache logs for the day but mistakenly left some Apache processes still writing to the old (deleted) log file, which grew to be quite big. Forcing a full stop and start with httpd -k stop; httpd -k start did the trick. httpd -k restart might work as well, since I believe all the child processes are eventually killed and recreated, but I'm not too sure when that would happen.