How bloated is your PostgreSQL database?

When dealing with databases (or, in fact, any data that you need to read from disk), we all know how important it’s to have a lot of memory.  When we’ve a lot of memory, a good portion of data gets nicely cached due to smart operating system caching and most of the data, when requested comes from memory rather then disk which is much, much faster.  Hence trying to keep your dataset size possibly small becomes quite important maintenance task.

One of the things that take quite a bit of space in PostgreSQL which we use across most of the systems here at Mind Candy, are indexes.  And it’s good because they speed up access to data vastly, however they easily get “bloated”, especially if data you store in your tables gets modified often.

However, before we even are able to tackle the actual problem of bloated indexes, first we need to figure out which indexes are bloated.  There’re some tricky SQL queries that you can run against the database to see the index bloat, but in our experience, results we got were not always accurate (and actually quite far off).

Beside having happy databases we also care a lot about the actual data we store so we back it up very often (nightly backups + PITR backups) and once a day we do a fully automatic database restore to make sure backups we take, work.

Now, a restore operation includes building indexes from scratch, what means, those indexes are fresh and free of bloat.

Now, if we only could compare the sizes of indexes from our production databases to the ones from restored backups, we could easily say, very precisely, how much bloat we’ve got in our production database.  To help with that, we wrote a simple python script.

$ ./ -ps csva.csv csvb.csv
Index idx3 size compare to clean import: 117 % (14.49G vs. 12.35G)
Index idx2 size compare to clean import: 279 % (14.49G vs. 5.18G)
Ough!  idx4 index is missing in the csvb.csv file.  Likely a problem with backup!
Total index bloat: 11.46G

The whole process works as following:

  1. At the time of backup, we run a long SQL query which prints all indexes in production database alongside with their sizes in CSV format
  2. After the backup is restored, we run the same SQL query that prints all indexes in “fresh” database alongside with their sizes in CSV format
  3. We then run the aforementioned Python script which parses both CSV files and prints out human-friendly information show exactly how much bloat we’ve got in our indexes

We also added a percent thresold option so it’ll print out only indexes with bloat more then X %.  This is so we won’t get bothered by little differencies.

The aforementioned script, called pgindexbloat, can be found on Mind Candy Github account.  It’s very easy to be run from cronjob or wrapped around into a check script and used to feed Nagios / Sensu.

As a interesting note, I’ll just add that the first go of the script uncovered we had nearly 40GB worth of bloat on our production database.  Much more then we anticipated and getting rid of that bloat, definitelly made our database much happier.