fsyncers and curveballs

[Update: Hi, interwebs! Had to block my blog for a little bit to deploy the full power of wp-super-cache, everything should be fine now.]

A couple of articles have now been written about the unpleasant behaviour that people encounter with Firefox 3 in certain Linux configurations, related to the flushing of I/O and system lag that can see if there is a lot of other disk activity at the same time. There are a lot of moving parts to this issue, and so it’s not surprising that there’s a fair bit of misunderstanding, though some of it seems less well-meaning that others, which makes me a bit sad.

What’s Firefox doing?

Firefox uses a database called sqlite as the underpinning for many kinds of data storage in Firefox 3, including the browsing history and bookmarks data used to provide the Awesomebar’s awesomeness. sqlite is an excellent piece of software, written by people who take both data integrity and performance very seriously, which makes it a great place to put this sort of data. Lots of people use sqlite these days, and we’re proud to be founding members of the consortium that helps support sqlite development.

Databases, perhaps obviously, usually have complex file formats, and require that different parts of their files agree about things like how many records there are, whether a transaction completed successfully, or how indexes match up with the data to which they refer. This makes them more sensitive to data corruption than some simpler formats, like a basic text file. If you get a chunk of a text file corrupted, you can probably edit around it and salvage the rest of the file, but if you get a chunk of a database file corrupted you can often effectively lose all of the data that’s held there. This is one of the tradeoffs for being able to have efficient access to large sets of data, and it’s common to virtually all databases.

One of the things that sqlite does to ensure that the database is not corrupted in the case of a crash is call a function called fsync. fsync tells the operating system to ensure that this file has been safely written to disk, and waits until that’s complete. This provides what’s known as a “barrier”, and makes sure that we don’t get mismatched parts of transactions (groups of related database operations) when we look at the file after a crash. This is very effective: even if the operating system itself crashes or the computer loses power suddenly, we won’t see the database corrupted.

We don’t want to lose the user’s data, because that makes users sad, and we like to make users happy. So up through Firefox RC1, we set sqlite to its recommended setting of “FULL” synchronization. As release-end-game luck would have it, that made some users sad, because they would find Firefox and sometimes other parts of their system would pause unpleasantly, and it was tracked down to Firefox calling fsync. The bug in question is here, but I caution you to not read it piecemeal; there’s a lot of intertwined conversation there, and some comments are not as correct as they sound.

(There have been other bugs along the way that could cause this, ranging from performance problems with certain sqlite versions to bad interactions with the data set used for malware protection, but those were put behind us by RC1. This bug is the one that has people working weekends at this point.)

Why does that hurt so much?

On some rather common Linux configurations, especially using the ext3 filesystem in the “data=ordered” mode, calling fsync doesn’t just flush out the data for the file it’s called on, but rather on all the buffered data for that filesystem. Because writing to disk is so much slower than writing to memory, operating systems can buffer a lot of data, especially if you’re doing something that involves a lot of I/O, like unpacking a zip file or compiling software. It gets written out in the background, giving you vastly, vastly improved performance. It’s no exaggeration to say that without this sort of buffering your computer would be entirely unusable.

I think you can see where this is going: if there’s a lot of data waiting to be written to disk, and Firefox’s (sqlite’s) request to flush the data for one file actually sends all that data out, we could be waiting for a while. Worse, all the other applications that are writing data may end up waiting for it to complete as well. In artificial, but not entirely impossible, test conditions, those delays can be 30 seconds or more. That experience, to coin a phrase, kinda sucks. Does it suck as much as file corruption wiping out your bookmarks after your computer (not Firefox) crashes? As you might imagine, opinions vary.

This problem with ext3 is well-known to Linux kernel developers, and there’s great work underway as part of the “ext4″ project to remedy it. Other filesystems (like reiser4, I have heard) have similar problems, and I presume that their developers are also working on resolving them.

Why doesn’t other (non-sqlite) software do this?

Actually, a lot of software that’s concerned with data integrity does this, including the editors emacs and vim, and mail clients like mutt and Evolution, as well as bigger databases like MySQL and Postgres. In some cases, those programs are in fact adding more calls to fsync to protect user data better.

In fact, so many programs use fsync to ensure data integrity, and actually writing to disk is so expensive, that some operating systems make fsync not be a “real” fsync: the data is scheduled for (hopefully) immediate write-out, but the call doesn’t wait until it’s all the way to the disk, so it’s not really an effective barrier. This may be permitted by various standards like POSIX, but it’s certainly surprising for programs like Firefox that use it to protect against data corruption in the case of a crash!

Here’s what Apple has to say about fsync, sqlite, and data corruption:

fsync on Mac OS X: Since on Mac OS X the fsync command does not make the guarantee that bytes are written, SQLite sends a F_FULLFSYNC request to the kernel to ensures that the bytes are actually written through to the drive platter. This causes the kernel to flush all buffers to the drives and causes the drives to flush their track caches. Without this, there is a significantly large window of time within which data will reside in volatile memory—and in the event of system failure you risk data corruption.


If this is an operating system bug, why is Firefox being patched?

Because we want to make users happy. Whether Linux should have better fsync behaviour or not isn’t really going to matter to our users — we want to support Linux, which means Linux-as-she-is-shipped, not Linux-as-we-would-like-her-to-be. That means that we need to deal with X servers, with font-selection systems, and with filesystem behaviour as we find it, because that’s where our Linux users are. (It’s not like Windows and OS X don’t have their own annoying things to work around either, though they don’t seem to have this specific one!)

So is it always going to be like this?

No! (Yay!)

In the immediate term, there is a patch that controls how aggressively we sync, and defaults to a slightly less-aggressive state that is equivalently safe on modern operating systems. That patch will be in either Firefox 3.0.1 at the latest, and we’ve been in contact with Linux distributors to make sure they’re aware that the patch is fine to take in their builds — desirable, even. It might be in Firefox 3.0 proper, depending on what happens with an RC2, but either way the vast majority of affected users (Linux users, who usually get their Firefoxes from their distributors) will get the fix right away. This patch also lets users, who might decide that their systems are stable enough and their backups good enough that they don’t need the extra protection, turn off the data-integrity fsyncs almost entirely.

In the medium term, we’re going to be making our database use more asynchronous in Firefox, and batching transactions for things like history. You’ll likely see those effects in the major version of Firefox that follows 3 (might be called 3.1, might be called 3.5, might be called Firefox Magenta, who knows?). This will keep Firefox from pausing in these states, but may not entirely keep the fsync calls from affecting other applications on the system. The sqlite developers are also looking at adding support for a newer, Linux-only API that doesn’t have the system-wide effects as fsync, which could help as well.

Longer term, as I mentioned above, the Linux filesystem situation will improve in this respect, and the work is well-underway. It’s probably a year at least before most Linux users are running systems with fixed fsync behaviour, but at that point application developers won’t have to worry about their data-integrity needs causing pain for the whole system. I, for one, am looking forward to it.

I’d like to thank the sqlite developers for their help analyzing the effects of different fsync patterns, various Linux kernel developers (especially my former colleague Andreas Dilger), and the users who have helped test different settings and I/O loads. We’re going to ship a better Firefox 3 on Linux because of it, and we have even more ways to improve that experience on all operating systems in the future. I would not like to thank the people who have substituted vitriol and dogma for analysis and understanding of the release cycle, but we all get frustrated sometimes. I hope they enjoy Firefox 3 too.