fsyncers and curveballs

[Update: Hi, interwebs! Had to block my blog for a little bit to deploy the full power of wp-super-cache, everything should be fine now.]

A couple of articles have now been written about the unpleasant behaviour that people encounter with Firefox 3 in certain Linux configurations, related to the flushing of I/O and the system lag that people can see if there is a lot of other disk activity at the same time. There are a lot of moving parts to this issue, and so it’s not surprising that there’s a fair bit of misunderstanding, though some of it seems less well-meaning than the rest, which makes me a bit sad.

What’s Firefox doing?

Firefox uses a database called sqlite as the underpinning for many kinds of data storage in Firefox 3, including the browsing history and bookmarks data used to provide the Awesomebar’s awesomeness. sqlite is an excellent piece of software, written by people who take both data integrity and performance very seriously, which makes it a great place to put this sort of data. Lots of people use sqlite these days, and we’re proud to be founding members of the consortium that helps support sqlite development.

Databases, perhaps obviously, usually have complex file formats, and require that different parts of their files agree about things like how many records there are, whether a transaction completed successfully, or how indexes match up with the data to which they refer. This makes them more sensitive to data corruption than some simpler formats, like a basic text file. If you get a chunk of a text file corrupted, you can probably edit around it and salvage the rest of the file, but if you get a chunk of a database file corrupted you can often effectively lose all of the data that’s held there. This is one of the tradeoffs for being able to have efficient access to large sets of data, and it’s common to virtually all databases.

One of the things that sqlite does to ensure that the database is not corrupted in the case of a crash is call a function called fsync. fsync tells the operating system to ensure that this file has been safely written to disk, and waits until that’s complete. This provides what’s known as a “barrier”, and makes sure that we don’t get mismatched parts of transactions (groups of related database operations) when we look at the file after a crash. This is very effective: even if the operating system itself crashes or the computer loses power suddenly, we won’t see the database corrupted.
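To make the barrier idea concrete, here is a minimal sketch of the write-then-fsync pattern in Python. This is not sqlite's code; the function name durable_append and the file name journal.log are made up for illustration. The only real calls are flush() and os.fsync().

```python
import os
import tempfile


def durable_append(path: str, data: bytes) -> None:
    """Append data to path and wait until the OS reports it has been written out.

    Hypothetical helper for illustration; not part of sqlite or Firefox.
    """
    with open(path, "ab") as f:
        f.write(data)          # data sits in Python's userspace buffer
        f.flush()              # hand it to the kernel's page cache
        os.fsync(f.fileno())   # ask the kernel to write it out, and block until done


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "journal.log")
        # The second record is only written after the first fsync has returned,
        # so a crash can never leave "commit" on disk without "begin".
        durable_append(log, b"begin transaction\n")
        durable_append(log, b"commit\n")
        with open(log, "rb") as f:
            print(f.read())
```

The ordering guarantee is the whole point: anything written after the fsync returns is known to come after everything written before it.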

We don’t want to lose the user’s data, because that makes users sad, and we like to make users happy. So up through Firefox RC1, we set sqlite to its recommended setting of “FULL” synchronization. As release-end-game luck would have it, that made some users sad, because they would find Firefox and sometimes other parts of their system would pause unpleasantly, and it was tracked down to Firefox calling fsync. The bug in question is here, but I caution you to not read it piecemeal; there’s a lot of intertwined conversation there, and some comments are not as correct as they sound.
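For the curious, the synchronization setting is an ordinary sqlite pragma with three classic levels: OFF, NORMAL and FULL. The sketch below uses Python's built-in sqlite3 module rather than Firefox's storage layer, and the helper name set_synchronous and the file name history.sqlite are my own inventions, so treat it as an illustration of the knob, not of how Firefox sets it.

```python
import os
import sqlite3
import tempfile

# Integer codes that "PRAGMA synchronous" reports for the three classic levels.
SYNC_NAMES = {0: "OFF", 1: "NORMAL", 2: "FULL"}


def set_synchronous(conn: sqlite3.Connection, level: str) -> str:
    """Set the synchronous level and return the level sqlite reports back.

    Hypothetical helper for illustration only.
    """
    level = level.upper()
    if level not in SYNC_NAMES.values():
        raise ValueError(f"unknown synchronous level: {level!r}")
    # The level is validated above, so it is safe to place in the statement.
    conn.execute(f"PRAGMA synchronous = {level}")
    (code,) = conn.execute("PRAGMA synchronous").fetchone()
    return SYNC_NAMES.get(code, str(code))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        conn = sqlite3.connect(os.path.join(d, "history.sqlite"))
        for wanted in ("FULL", "NORMAL", "OFF"):
            print(wanted, "->", set_synchronous(conn, wanted))
        conn.close()
```

Roughly speaking, the higher the level, the more often sqlite calls fsync around a commit, and OFF skips the syncs and leaves the timing to the operating system.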

(There have been other bugs along the way that could cause this, ranging from performance problems with certain sqlite versions to bad interactions with the data set used for malware protection, but those were put behind us by RC1. This bug is the one that has people working weekends at this point.)

Why does that hurt so much?

On some rather common Linux configurations, especially using the ext3 filesystem in the “data=ordered” mode, calling fsync doesn’t just flush out the data for the file it’s called on, but rather on all the buffered data for that filesystem. Because writing to disk is so much slower than writing to memory, operating systems can buffer a lot of data, especially if you’re doing something that involves a lot of I/O, like unpacking a zip file or compiling software. It gets written out in the background, giving you vastly, vastly improved performance. It’s no exaggeration to say that without this sort of buffering your computer would be entirely unusable.

I think you can see where this is going: if there’s a lot of data waiting to be written to disk, and Firefox’s (sqlite’s) request to flush the data for one file actually sends all that data out, we could be waiting for a while. Worse, all the other applications that are writing data may end up waiting for it to complete as well. In artificial, but not entirely impossible, test conditions, those delays can be 30 seconds or more. That experience, to coin a phrase, kinda sucks. Does it suck as much as file corruption wiping out your bookmarks after your computer (not Firefox) crashes? As you might imagine, opinions vary.
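If you want to see what your own system charges for an fsync, a rough measurement is easy to sketch. The helper name timed_fsync and the file name probe.bin below are hypothetical, and the numbers depend entirely on your filesystem, mount options and whatever else is writing at the time, so I am deliberately not quoting any expected output.

```python
import os
import tempfile
import time


def timed_fsync(path: str, payload: bytes) -> float:
    """Write payload to path and return the seconds spent inside os.fsync().

    Hypothetical helper for illustration; it times only the fsync call itself.
    """
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()                      # move the data into the kernel's cache
        start = time.perf_counter()
        os.fsync(f.fileno())           # this is the call that can stall
        return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        elapsed = timed_fsync(os.path.join(d, "probe.bin"), b"x" * 4096)
        print(f"fsync of a 4 KiB file took {elapsed * 1000:.2f} ms")
```

Running it on an idle machine, and again while something else is writing heavily to the same filesystem, is one way to see whether you are affected by the behaviour described above.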

This problem with ext3 is well-known to Linux kernel developers, and there’s great work underway as part of the “ext4” project to remedy it. Other filesystems (like reiser4, I have heard) have similar problems, and I presume that their developers are also working on resolving them.

Why doesn’t other (non-sqlite) software do this?

Actually, a lot of software that’s concerned with data integrity does this, including the editors emacs and vim, and mail clients like mutt and Evolution, as well as bigger databases like MySQL and Postgres. In some cases, those programs are in fact adding more calls to fsync to protect user data better.

In fact, so many programs use fsync to ensure data integrity, and actually writing to disk is so expensive, that some operating systems make fsync not be a “real” fsync: the data is scheduled for (hopefully) immediate write-out, but the call doesn’t wait until it’s all the way to the disk, so it’s not really an effective barrier. This may be permitted by various standards like POSIX, but it’s certainly surprising for programs like Firefox that use it to protect against data corruption in the case of a crash!

Here’s what Apple has to say about fsync, sqlite, and data corruption:

fsync on Mac OS X: Since on Mac OS X the fsync command does not make the guarantee that bytes are written, SQLite sends a F_FULLFSYNC request to the kernel to ensures that the bytes are actually written through to the drive platter. This causes the kernel to flush all buffers to the drives and causes the drives to flush their track caches. Without this, there is a significantly large window of time within which data will reside in volatile memory—and in the event of system failure you risk data corruption.

Exciting!
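Here is a sketch of what asking for the stronger flush can look like from an application. It assumes a Unix-like system, since it uses Python's fcntl module, and it relies on that module exposing F_FULLFSYNC only on platforms that define it (macOS). The function name full_sync and the fallback policy are my own choices, not sqlite's implementation.

```python
import fcntl  # Unix-only module
import os
import tempfile


def full_sync(fd: int) -> str:
    """Flush fd as thoroughly as the platform allows and report which call was used.

    Hypothetical helper for illustration only.
    """
    full = getattr(fcntl, "F_FULLFSYNC", None)  # defined on macOS, absent elsewhere
    if full is not None:
        try:
            fcntl.fcntl(fd, full)   # ask the drive to flush its own cache as well
            return "F_FULLFSYNC"
        except OSError:
            pass                    # not supported for this file; fall back below
    os.fsync(fd)                    # plain fsync everywhere else
    return "fsync"


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"bookmarks")
        f.flush()
        print("synced with", full_sync(f.fileno()))
```

On Linux this always takes the plain fsync branch, because the constant does not exist there.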

If this is an operating system bug, why is Firefox being patched?

Because we want to make users happy. Whether Linux should have better fsync behaviour or not isn’t really going to matter to our users — we want to support Linux, which means Linux-as-she-is-shipped, not Linux-as-we-would-like-her-to-be. That means that we need to deal with X servers, with font-selection systems, and with filesystem behaviour as we find it, because that’s where our Linux users are. (It’s not like Windows and OS X don’t have their own annoying things to work around either, though they don’t seem to have this specific one!)

So is it always going to be like this?

No! (Yay!)

In the immediate term, there is a patch that controls how aggressively we sync, and defaults to a slightly less-aggressive state that is equivalently safe on modern operating systems. That patch will be in Firefox 3.0.1 at the latest, and we’ve been in contact with Linux distributors to make sure they’re aware that the patch is fine to take in their builds (desirable, even). It might be in Firefox 3.0 proper, depending on what happens with an RC2, but either way the vast majority of affected users (Linux users, who usually get their Firefoxes from their distributors) will get the fix right away. This patch also lets users who decide that their systems are stable enough and their backups good enough that they don’t need the extra protection turn off the data-integrity fsyncs almost entirely.

In the medium term, we’re going to be making our database use more asynchronous in Firefox, and batching transactions for things like history. You’ll likely see those effects in the major version of Firefox that follows 3 (might be called 3.1, might be called 3.5, might be called Firefox Magenta, who knows?). This will keep Firefox from pausing in these states, but may not entirely keep the fsync calls from affecting other applications on the system. The sqlite developers are also looking at adding support for a newer, Linux-only API that doesn’t have the same system-wide effects as fsync, which could help as well.
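Batching is simple to illustrate. The sketch below uses Python's sqlite3 module and a made-up visits table; it is not Firefox's history schema or its storage code. It contrasts committing every history entry on its own with committing a whole group at once. Since each commit is a point where sqlite may need to sync, fewer commits means fewer chances to stall.

```python
import sqlite3


def record_visits_one_by_one(conn: sqlite3.Connection, urls: list) -> None:
    """Commit after every single insert: one transaction per visit."""
    for url in urls:
        with conn:  # the connection context manager commits on exit
            conn.execute("INSERT INTO visits(url) VALUES (?)", (url,))


def record_visits_batched(conn: sqlite3.Connection, urls: list) -> None:
    """Insert all visits inside a single transaction: one commit in total."""
    with conn:
        conn.executemany(
            "INSERT INTO visits(url) VALUES (?)", [(url,) for url in urls]
        )


if __name__ == "__main__":
    # Hypothetical schema and URLs, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (id INTEGER PRIMARY KEY, url TEXT)")
    pages = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
    record_visits_one_by_one(conn, pages)   # three commits
    record_visits_batched(conn, pages)      # one commit
    print(conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0])
    conn.close()
```

The tradeoff is that a crash can lose the whole uncommitted batch rather than a single entry, which is usually acceptable for history.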

Longer term, as I mentioned above, the Linux filesystem situation will improve in this respect, and the work is well-underway. It’s probably a year at least before most Linux users are running systems with fixed fsync behaviour, but at that point application developers won’t have to worry about their data-integrity needs causing pain for the whole system. I, for one, am looking forward to it.

I’d like to thank the sqlite developers for their help analyzing the effects of different fsync patterns, various Linux kernel developers (especially my former colleague Andreas Dilger), and the users who have helped test different settings and I/O loads. We’re going to ship a better Firefox 3 on Linux because of it, and we have even more ways to improve that experience on all operating systems in the future. I would not like to thank the people who have substituted vitriol and dogma for analysis and understanding of the release cycle, but we all get frustrated sometimes. I hope they enjoy Firefox 3 too.

72 comments to “fsyncers and curveballs”

  1. entered 25 May 2008 @ 11:03 am

    One thing I’m confused by – why doesn’t this affect OS X, if sqlite uses the F_FULLFSYNC as it says?

    And is there any explanation of how Windows does it and why it doesn’t affect them?

  2. entered 25 May 2008 @ 11:07 am

    OS X and Windows don’t seem to sync the entire filesystem when they’re told to fsync, so it just has to commit the small Firefox transactions before it returns, and the rest of the system isn’t affected. On an otherwise idle system (even one with traffic on another partition on the same disk, to make seeks hurt more), it’s hard to get fsync to take longer than 200ms, and more likely will be 20ms, even on Linux: it’s basically just a few platter rotations. So if we were only being “charged” for Firefox’s own I/O when we called fsync, we’d pretty much be golden.

  3. entered 25 May 2008 @ 11:40 am

    Hmm, but the Apple article you quoted above says: “SQLite sends a F_FULLFSYNC request to the kernel to ensures that the bytes are actually written through to the drive platter. This causes the kernel to flush all buffers to the drives and causes the drives to flush their track caches.”

    That certainly sounds like it would affect the whole system. I wonder if for some reason the #define test in sqlite3.c for F_FULLFSYNC is failing in the Mozilla sqlite build?

    But yeah, certainly does seem that ext3 could be improved in this area; this fsync discussion also parallels one about hard drive write barriers (http://lwn.net/Articles/283161/).

  4. entered 25 May 2008 @ 12:11 pm

    I don’t know enough about how OS X schedules I/O to know whether “all buffers” means “all system buffers” or “all the buffers for that file”, for one thing, but I’m not 100% sure we’re calling with F_FULLFSYNC either since the default is off (ref: http://www.sqlite.org/pragma.html#pragma_fullfsync). We might need to revisit that, but I suspect that OS X users are just more used to dataloss from system crashes anyway, given that HFSJ provides no guarantees about user data (only metadata), and OS X’s fsync is sort of a fake. :)

  5. jwz
    entered 25 May 2008 @ 12:27 pm

    Boy, it sure was a good idea to store browser history in a database. Again.

  6. entered 25 May 2008 @ 1:01 pm

    I found this interesting thread where Dominic Giampaolo explains OS X’s fsync: http://lists.apple.com/archives/Darwin-dev/2005/Feb/msg00087.html

    Basically F_FULLFSYNC is going through the hard drive caches. Which is exactly the same issue as the discussion of barriers from LWN; it sounds like Dominic’s claim “As far as I know, MacOS X is the only OS to provide this feature for apps that need to truly guarantee their data is on disk.” is right.

    And without barriers on Linux, OS X’s fsync() is about as good as Linux’s fsync().

    So I guess maybe the bottom line from this discussion is:

    • Linux: Slow fsync() and could still lose data on system crash
    • OS X: Fast fsync() and could still lose data on system crash
    • Windows: Unknown

    I’d be curious to see someone who actually knows about the details of NTFS etc. explain what the situation is like on Windows.

  7. entered 25 May 2008 @ 1:42 pm

    Then why can i run postgresql and evolution on my desktop without getting stuck for 30s ?

  8. Frank Ch. Eigler
    entered 25 May 2008 @ 6:20 pm

    Mike, thanks for the pointer to those fsync operations in other programs. No wonder my machines seem like they’re hanging more frequently. These will be getting turned off anon.

    In any case, if mozilla’s baby-database transactions occur so frequently (several times during an awesomebar entry, as ISTR), then it’s still showing a whole different level of paranoia.

  9. entered 25 May 2008 @ 7:22 pm

    Thanks for the detailed explanation of the situation. jwz’s sarcasm notwithstanding, it really is a good idea to store browser history in a database, where we can make much better guarantees about data integrity and on top of which we can provide features like the awesome bar, which is a game-changing improvement in history usability.

  10. entered 25 May 2008 @ 9:43 pm

    @colin: I thought that Linux 2.6 did have write barriers for fsync now, though it doesn’t always flush when you think it might (due to it being keyed off of inode changes, which means mtime for data changes, which means single-second granularity).

    @jwz: yawn.

    @frank: yes, which is why I said that we’re going to be batching history transactions in the future, right there in the post to which you replied. Those extra fsyncs “should” be a matter of a few dozen milliseconds, which would make them one of many future optimizations we’ll make to our app; the ext3 design problem gives them a whole new worst-likely-case bound.

    @benoit: dunno; postgres definitely fsyncs, so if you’re doing I/O to the same partition as the postgres DB lives on, and using postgres transactionally, then only you and your system debugger can tell.

  11. Ev
    entered 26 May 2008 @ 1:27 am

    Why doesn’t FireFox include its own filesystem? That will go along nicely with its own embedded SQL server. Why the hell not, they’ve got to store those cookies somewhere.

    Look, it’s not an “OS bug”. It’s “people’s bug” – they’ve added a freakin SQL server (!) into a desktop app and (surprise!) it feels sluggish when it needs to flush the writes. OS is to blame, who else’s fault can it be?

  12. entered 26 May 2008 @ 2:41 am

    @Myk,

    Awesome bar is pretty cool, but in Flock we built perhaps more-awesome functionality (a full-text content index for history) on Lucene which felt a lot more appropriate for this kind of application than a general-purpose relational database. While SQLite makes a lot of sense for certain types of data (cookies, for example), I think it’s probably a relatively bad place to put history for the purposes of searching. You just don’t need the data integrity and relational queries that you’re getting, and paying for in SQLite.

    An inverted index is a really great data structure for storing browser history. Exact key matches on particular fields (like you need to turn your visited links purple) are really fast and you can do full-text content searches without storing full document text. I don’t remember the particulars of the Lucene on-disk format, but I think it can enforce validity without requiring fsyncs all the time. Who cares if you lose a history entry or two in a crash. Really.

    FWIW, we never got around to replacing the platform’s history store entirely with Lucene (at least before I left). We had some stability issues with the particular implementation we were using and lacked the resources to track them down, and there just wasn’t a compelling reason, so as far as I know Flock is still just using Lucene to complement the existing history store with full-text search. When I think about all this it makes me wish we could all just get along…

    Ian

  13. mpt
    entered 26 May 2008 @ 5:19 am

    Myk, bookmarks and history search from the address field was introduced in Internet Explorer 5.0 for Mac in March 2000, and has also been present for years in Opera and Epiphany.

    It is a cool feature in Firefox 3, but I suggest reserving the term “game-changing” for features that Firefox implements earlier than other widely-used browsers, rather than eight years later.

  14. msot
    entered 26 May 2008 @ 6:29 am

    Thanks for working on this, I just hope there is an RC2 to make sure everyone gets the fix :)

    mpt: I think they are different things because when I use Epiphany I don’t get the same effect as FF3. I only found something similar in Opera 9.5 which is still a beta.

    jwz: Go back to your crappy little nightclub while the rest of Mozilla continues not giving up.

  15. entered 26 May 2008 @ 6:59 am

    FTS isn’t a replacement, since you also need the full inverted index, plus you need weighting and ordering on other keys if you’re going to have a decent user experience. (That Opera does FTS but doesn’t appear to have any adaptive behaviour is why it feels clumsier to use, I believe; I tried it out for a while after the 2.3e14th Opera fan mentioned it bitterly in a blog comment. The adaptive learning is what made the location bar go from “handy” to “awesome” for me, and I think a lot of the other people who are raving about it.) And if you have the full inverted index, you need to serialize that index somewhere, so you have a file that’s sensitive to partial-update corruption cases. Avoiding that corruption requires an I/O barrier, and here we are.

    I don’t know any particulars of the Lucene file format, but it didn’t take long to discover that they had exactly the same “sudden power loss is total data corruption” problem until very recently, and that they solved it with some fsync. If you were just losing the last few history entries when your computer crashed, you were very lucky indeed. I sympathize with tracking down the last niggling stability issues in something like a database; that’s one reason I’m very glad we used sqlite instead of rolling our own index, paging, transaction, schema and query language and threadsafety code.

    (sqlite supports full-text search as an extension, and we made sure that the API hooks needed to use it with Firefox’s sqlite are present, so Flock could just hook that in if they want to index content. The performance is quite excellent, from the tests I recalled, even vs. Lucene at the time.)

    sqlite is on Symbian phones, too; people need to get over their “sql in the name means teh bloatz” dogma, and start thinking about what you end up with if you build indexing + paging + data integrity + query API.

  16. Frederik
    entered 26 May 2008 @ 9:30 am

    I think it is rather curious to recommend Linux distributions to include the patch, but not yet include it yourself. That’s like saying: our version sucks, and we don’t want to do anything about it for now; just use anybody’s else’s version, which sucks less.

  17. Dave
    entered 26 May 2008 @ 9:42 am

    I love sqlite but it has its place. Symbian isn’t a good thing to point at for performance. From S60 1st ed. to 3rd ed. the performance went to hell. That’s why I switched to BlackBerry.

  18. entered 26 May 2008 @ 9:42 am

    First of all, I’d like to congratulate the Mozilla devs who have been working on this bug. It’s the kind of bug that drives developers crazy — a leaky abstraction — and the fact that we’re even discussing it speaks to your commitment to cross-platform feature parity.

    But… I’m baffled by the following combination of sentiments:

    1. You admit that virtually every Linux user gets their Firefox from their distribution, AND
    2. You agree that this is a blocking bug that renders Firefox-on-Linux unusable for some people, AND
    3. You think you might have a solution, BUT
    4. You’re not confident enough to put it in your code (yet) because of possible data corruption, BUT
    5. You want Linux distributions to put it in THEIR code

    Hello? Didn’t we all just have a long, heated, acrimonious discussion about Linux distributions adding patches they didn’t understand? If the patch is good, put it in your own damn code; if it’s not (or you’re not sure), don’t tell downstream packagers to put it in theirs.

  19. entered 26 May 2008 @ 9:52 am

    SQLite is pretty cool, but for this kind of task, the SQL language is really not a requirement.

    Why not use something like Tokyo Cabinet / Tokyo Dystopia instead : http://tokyocabinet.sourceforge.net/

    It also fsync() but only when necessary to ensure that the database is never corrupted.

  20. entered 26 May 2008 @ 9:59 am

    Thanks. As a better solution though, should we adjust our usage habits in any way to allow for this feature. And, will having the system on ext4 (at this point in time) be better?

  21. entered 26 May 2008 @ 10:36 am

    I ran into a similar situation with Vim, btw: http://taint.org/2008/03/12/122601a.html

  22. entered 26 May 2008 @ 10:36 am

    Frank: sqlite only fsyncs when necessary to protect against database corruption as well. Nobody likes calling fsync! There’s nothing about the SQL language that necessitates additional fsyncing. See also dovecot, lucene, berkeley db, etc.

    Mark: happily, you’re mistaken this time. We have a patch that reduces the number of fsyncs, which we have taken into the trunk precisely because we’re confident in it being safe; we would not have recommended it to distributors otherwise. If we take an RC2, then it will be in Firefox 3.0, and if not, it will be in 3.0.1. Cracking the case on FF3 requires a lot of revalidation, and since distributors have agreed to take the patch we don’t believe that this patch alone is enough reason to further delay FF3, versus shipping sooner and starting 3.0.1’s clock sooner. We’re looking at the whole set of issues identified in RC1 and making that RC2-or-no call very shortly.

    Distributors also tend to service much more narrow set of configurations than we do, and can know more about the filesystem options deployed by their users. They can run distro-specific commands to ensure that data=journal is set and disable fsyncs entirely, for example, which we also believe is safe. (We can’t set it ourselves, because it’s root-only in stock kernels. :( )

    This is very much not a case of distributors taking patches they don’t understand. This is an example of the sort of direct collaboration between upstream and distributors that probably would have prevented the Debian/OpenSSL problem. (“Hey, we’d like to ship this patch. Would you take this patch in your product, if you only cared about Linux/this-config/etc.?” “Hell, no.” Or maybe they would have said yes, because upstream can certainly make mistakes too, but I like the odds better.) Blizzard working with the Fedora guys to nail down the that-version-of-sqlite-makes-things-slow problem is another such example. We haven’t always worked well with distributors — a two-way problem, but that doesn’t make users any happier — but we’re doing much better these days, and I think this case is one that’s pretty much bang-on.

  23. James
    entered 26 May 2008 @ 10:45 am

    Seriously, jwz, show us the code or STFU.

  24. entered 26 May 2008 @ 10:51 am

    @lingghezi: if you take frequent (and reliable!) backups and don’t mind rolling back to them in the case of a corrupting OS crash, you could disable fsyncs entirely. Or you can set data=journal, and then disable fsyncs. If you don’t do a lot of I/O on the same filesystem as your Firefox profile, you will likely not notice the problem at all. ext4 today doesn’t have the fixes needed to make fsync file-local, so switching over now won’t help.

  25. wayne
    entered 26 May 2008 @ 10:52 am

    When your solution to the problem that maybe 10% of users would ever encounter introduces another problem that 99% of the users would definitely experience, you lose.

  26. Malor
    entered 26 May 2008 @ 11:13 am

    I just don’t see that you gained anything by going with a database for this stuff. You have this long litany of ‘things that can go wrong with databases’, and you are entirely correct about all of them… so what the heck were you people thinking to use one?

    Browsing history is not that important, and probably doesn’t need to be overwhelmingly fast. Bookmarks are critical, but since they’re held in memory, a non-indexed file format shouldn’t be any performance loss. Both cases would be handled fine by a flat text file.

    If, for some reason, you decided that in-memory isn’t fast enough, you could generate a binary index when bookmarks are stored. Most speed-sensitive Unix programs since the beginning of time, practically, have done this. You get fast lookups AND avoid all the inherent I/O problems and fragility of databases. You’d have to regenerate it every time a bookmark was added, but even a very large bookmark file should take no more than a second or so to index.

    If a text file is too slow for history, then do the same thing you do with bookmarks: hold it in memory. On restart, reload the text file and re-index.

    I just don’t see that you gained a damn thing from implementing the database, and you lost a ton. It just wasn’t clear thinking, IMO. The earlier designers, all of whom used flat text files, had it right. They didn’t have the weird performance regressions and requirements of databases, and didn’t lock up user data in a semi-proprietary format.

    Text files are good. Binary blobs are bad. If you need the speed of a binary blob, start with a text file and generate it. Databases are meant to be stores of large volumes of data that needs to be heavily indexed and queried for arbitrary data. Bookmarks and browser history are so far outside this solution domain that it’s almost laughable.

    I just don’t understand what you folks were thinking. Using SQLite for such trivial data storage was a PHB-worthy decision. You’ve taken on all the problems of carrying around a database, and gained nothing you couldn’t accomplish via other methods.

    All the problems, none of the benefits… PHB for sure.

  27. entered 26 May 2008 @ 11:29 am

    Shaver: Not sure about write barriers driven by fsync; the LWN discussion was about how write barriers aren’t used by default for metadata.

    However it also turned out from that discussion that LVM doesn’t support barriers, and at least Fedora defaults to installs on LVM.

  28. entered 26 May 2008 @ 11:43 am

    Browser history needs to be pretty fast, since it’s consulted for link colouring and CSS rule values; we have to be very careful whenever we touch history because it can easily lead to performance regressions. You probably know that from your own work on browsers. The speed of even just URL autocomplete in FF2 (no multi-word, no adaptive, no title searches, no tags, no bookmark inclusion) was a common source of complaints for users with large histories, and we are very glad to be rid of it. I love that you think that bookmarks are a bigger performance issue than browsing history; just the smile I needed today!

    If you have a binary index, you have corruption risk. Indexing + querying + paging + data integrity is what you get with sqlite. That you can use SQL to query it, rather than having to work with a totally novel API is just gravy. There are of course all sorts of other methods to accomplish the same things, which involve writing and testing a lot of cross-platform code just to end up back where we are today with more NIH and almost certainly more bugs. Nobody would have batted an eye if some more-vocal-than-representative folks hadn’t run into a bad interaction with an ext3 design problem. fsync-for-data-integrity is part of Berkeley DB too (whoops, that’s a database; too unfashionable?), which has been used since the beginning of time, and it’s used by flat-file systems as well if they care about avoiding corruption. Only since the widespread advent of journalling filesystems has it really made sense to worry about data integrity at the application level, since you can’t do much if the filesystem itself lacks integrity across a crash, so cracking out the Data Management Principles Of Multics doesn’t always make as much sense as people would like it to.

    Browser history, as used by Firefox 3, is exactly a large set of data that needs to be heavily indexed and queried in a number of different ways. Earlier designers weren’t doing as much with history, and users didn’t have as much of it; likewise with bookmarks.

    It’s fun to throw peanut shells from the armchair, and hand-wave away the engineering problems that motivate a solution. We’re responsible for shipping fast, powerful features to many many millions of people, not theoretical blog comments, so we see things a little differently. There are definitely areas of Firefox that need improvement, and we’ll be improving how we use our indexed storage layer over time, but the only thing that makes this issue more visible than any of the other places where we can get small I/O wins is the fact that ext3 trips over itself when there is an fsync coming up against a lot of outstanding data.

    (If browser history is not that important, who cares if it’s “locked up”? We export bookmarks in a textual format that can be read out-of-the-box by just about any language you’d like to program in. I’m sure you just overlooked that when forming your cogent and insightful opinions about how to manage browsing data; hope this helps.)

  29. Nate Hollingsworth
    entered 26 May 2008 @ 12:02 pm

    Thank you for clearing up the FUD :-). Great overview! Looking forward to Firefox 3 RTM.

  30. John
    entered 26 May 2008 @ 12:10 pm

    It’s slow on my 2001 128MB PC, as well as my spanking Pentium Dual Core 2GB machine. pfft

  31. entered 26 May 2008 @ 12:14 pm

    Getting data integrity right is hard. Using a library like SQLite makes it much easier. For many applications SQLite makes a good replacement for the POSIX file API. Stewart Smith’s “Eat My Data” talk from LCA last year makes the point well:

    http://lca2007.linux.org.au/talk/278.html

  32. entered 26 May 2008 @ 12:33 pm

    Mark: happily, you’re mistaken this time. We have a patch that reduces the number of fsyncs, which we have taken into the trunk

    You’re right; I’m wrong. The last time I looked at the Bugzilla bug, the patch hadn’t yet been approved. But I’m still unconvinced by your later statement:

    Distributors also tend to service much more narrow set of configurations than we do

    That’s not true for two reasons: (1) distributors have to support just as many configurations as you do, and (2) you basically don’t have any Linux users. To clarify, I mean “Linux users who download Firefox from mozilla.com”. I forget the exact statistics that were thrown around in the let’s-work-with-Linux-distributors session at the Firefox 2 release party, but it was somewhere on the order of 10,000 direct downloads vs. 5 million distributor-packaged downloads. This whole song-and-dance of “we’ll look out for our users and let the distros look out for theirs” is a linguistic red herring. Your users are a rounding error, a few lost souls who don’t understand about package management. There was talk (in that session) of removing the direct Linux downloads altogether, and directing Linux users to their distribution’s packages. What happened to that?

  33. entered 26 May 2008 @ 12:58 pm

    @Shaver, I think you’re right about Lucene’s corruptibility, but I’m still unconvinced that it makes sense to put history in SQLite. I don’t think that SQLite is bloat, but it’s optimized for certain things that don’t make sense for all the use cases. Putting everything in SQLite is as much of a mistake as putting everything in Mork or RDF.

    @Mark, the rumor on bugzilla is that release builds for Linux are a thing of the past: https://bugzilla.mozilla.org/show_bug.cgi?id=402892#c5

  34. entered 26 May 2008 @ 1:16 pm

    Mark: RHEL’s packages only need to worry about RHEL users; they can and do produce different packages for different versions of RHEL, which can make different calls or rely on different filesystem and kernel guarantees. Our binaries must and do run on a wider array of distributions than any one distributor’s packages; our dependencies on things like Pango and kernel behaviour and GTK need to reflect the minimum version available in any of our set of supported configurations, while a given distributor usually only cares about a narrower config. That given distributor might have Pango $minversion (which could well mean that their config is the reason we have that as the Pango $minversion), but they’re unlikely to have GTK $minversion and X $minversion and kernel $minversion as well.

    Or they can know that they use LVM and therefore get no barriers anyway, so turn the pref to OFF if they believe that things are likely to be corrupted even in the presence of fsync, and so not take the performance hit because there isn’t much gain. :(

    We’re exactly looking out for distro users here, which is why we contacted distros about the patch, and why Blizzard was helping to find the problem that only manifested in the Fedora packages and not our builds.

    Re: distro primacy, I refer you to mconnor’s post on exactly that topic.

    Ian: what is sqlite optimized for? When researching your answer, note that “data integrity” is the only part that causes us to stub our toe on ext3’s behaviour here, and that tools which manipulate text formats like mbox safely also have this behaviour. “Build all of history on Lucene” seems much more ridiculous than “build all of history on sqlite”, so I suspect I’m misunderstanding part of your position. Lots of people seem excited about being able to use SQL to work with Songbird’s catalogue — a task that I seem to recall being handled with “text files” or “binary indexes” in other players. Is it so surprising that people are as interested in working similarly with their browser history and bookmarks — which change perhaps 1000x as often as a music catalogue, and I suspect are much larger in aggregate for the majority of users? Awesomebar is just the tip of the iceberg, as people following Ed’s work know.

  35. Carlie Cots
    entered 26 May 2008 @ 1:37 pm

    I still don’t see a good reason to use fsync() instead of fdatasync() – the latter flushes data, but not necessarily metadata like access time, with a typical 20% performance edge.

    FWIW

  36. entered 26 May 2008 @ 1:53 pm

    fdatasync == fsync in ext3 as shipped today, so if you’re seeing fdatasync being 20% faster than fsync there then you have some experimental error that needs correcting.
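For anyone who wants to check this claim on their own filesystem, here is a rough timing sketch in Python. It is not taken from the bug or from sqlite; the file name, block size, and write count are arbitrary assumptions, and `time_sync` and `compare` are hypothetical helper names. `os.fsync` and `os.fdatasync` are thin wrappers over the C calls of the same name.

```python
import os
import tempfile
import time


def time_sync(sync_call, path, writes=20, size=4096):
    """Append `writes` blocks of `size` bytes, syncing after each one.

    Returns elapsed seconds. `sync_call` is os.fsync or os.fdatasync.
    """
    block = b"x" * size
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o666)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            sync_call(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare(directory=None):
    """Time both calls against a scratch file in `directory` (default: temp dir)."""
    # fdatasync is not available on every platform; fall back to fsync there.
    fdatasync = getattr(os, "fdatasync", os.fsync)
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "test.dat")
        return {
            "fsync": time_sync(os.fsync, path),
            "fdatasync": time_sync(fdatasync, path),
        }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.3f}s")
```

Pass `directory=` to point the scratch file at the filesystem you care about. Because every write here extends the file, fdatasync still has to flush the size update, so overwriting a preallocated file would be a fairer test of the metadata difference.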

  37. else
    entered 26 May 2008 @ 3:09 pm

    Thanks for clarifying things. I absolutely appreciate your commitment to Linux software.

  38. Paul
    entered 26 May 2008 @ 3:14 pm

    I’m wondering why an application needs to call fsync. If the app crashes then shouldn’t the OS detect that and call fsync for the app as appropriate? This is what happens on Windows, as I understand it. On Windows you need a serious kernel-level blue-screen crash in order for file writes to not be flushed. Am I missing something?

  39. pablo
    entered 26 May 2008 @ 3:17 pm

    In the database world, only the `log’ portion of the database needs to be durable. When a transaction is committed, the transaction is blocked until the data is written to disk.

    Why couldn’t sqlite remove the `fsync’ call, and `open()’ its log with O_SYNC && O_DIRECT?

    -pablo

  40. entered 26 May 2008 @ 3:38 pm

    Ya, I’ve had like 0 problems with Firefox 3 after using 1 and 2 on Linux for over 4 years. Firefox 3 is faster, more full-featured, and crashes a LOT less than the previous versions on Linux for me. I have Ubuntu Hardy Heron with the default Firefox installation.

  41. entered 26 May 2008 @ 4:35 pm

    It has nothing to do with app crashes; it’s about hardware and OS crashes.

    O_SYNC proved slower than fsync, in testing; details in the bug.

  42. entered 26 May 2008 @ 4:40 pm

    Hank: Firefox won’t be 4 years old until this November, but I’m glad you’re enjoying it. :)

  43. pablo
    entered 26 May 2008 @ 5:05 pm

    Any tests with O_DIRECT? Reading the man page, I see there’s no need to call O_SYNC when it’s set.

  44. xhantt
    entered 26 May 2008 @ 5:45 pm

    I’ve used Gigabase (http://www.garret.ru/~knizhnik/gigabase.html) as an embedded database, which is excellent. Perhaps the only downside for FF is that it is programmed in C++, but it includes a C interface.

  45. Gustavo
    entered 26 May 2008 @ 5:46 pm

    It’s nice to read that this behaviour is going to be fixed soon. Thanks!

  46. cdrworm
    entered 26 May 2008 @ 5:47 pm

    Yeah firefox crashes but data loss won’t happen since the data is already written to the buffer.

    But worrying about data loss from an OS crash or power failure is a bit extreme.

  47. entered 26 May 2008 @ 10:06 pm

    O_DIRECT requires writes aligned and sized to specific values and those values, last I checked, were not queryable in a uniform way across kernel versions (2.4 vs 2.6) and filesystems (xfs vs ext3 vs reiser).

    Feel free to run some tests, though, rather than programming via comments on my blog. :)

  48. entered 27 May 2008 @ 12:12 am

    Here’s an interesting LWN article on barriers and journaled filesystems with a healthy discussion on fsync.

    http://lwn.net/SubscriberLink/283161/db8845de76d9095a/

  49. pablo
    entered 27 May 2008 @ 5:46 am

    Sorry, you are right re: programming in your blog. heh!

  50. Chris Sherlock
    entered 27 May 2008 @ 9:15 am

    Shaver, excellent responses.

    I believe that Microsoft flushes only individual files to disk via the _commit() function.

    I found this link for info on the function, and this KB article for an example of how to force flush the file to disk.

    Not sure if this is how the Windows portion of Mozilla’s code works. Would be interesting… could it be that NTFS is (in this respect) superior to ext3? Please tell me it ain’t so!

  51. pablo
    entered 27 May 2008 @ 11:33 am

    Hi,

    I ran some tests as requested. I found some `C’ code which I tweaked slightly …

    Application Differences

    < test_fd = open("./test.dat", O_CREAT|O_WRONLY|O_DIRECT|O_TRUNC, 0666);
    ---
    > test_fd = open("./test.dat", O_CREAT|O_WRONLY|O_SYNC|O_TRUNC, 0666);
    

    Benchmark

    Use `time’ to time writing, sequentially; 10MB in 4KB chunks.

    The disk used has no other access.

    Results

    Normally I’d run three runs but the results below are pretty compelling :

    O_SYNC

    • IOPS: ~70
    • Total time (mm:ss:ms): 02:17.55

    O_DIRECT

    • IOPS:
    • Total time (mm:ss:ms): 00:00.31

  52. pablo
    entered 27 May 2008 @ 11:36 am

    Hi,

    I ran some tests as requested. I found some `C’ code which I tweaked slightly …

    Application Differences

    < test_fd = open("./test.dat", O_CREAT|O_WRONLY|O_DIRECT|O_TRUNC, 0666);
    ---
    > test_fd = open("./test.dat", O_CREAT|O_WRONLY|O_SYNC|O_TRUNC, 0666);
    

    Benchmark

    Use `time’ to time writing, sequentially; 10MB in 4KB chunks.

    The disk used has no other access.

    Results

    Normally I’d run three runs but the results below are pretty compelling :

    O_SYNC

    o IOPS: ~70
    o Total time (mm:ss:ms): 02:17.55

    O_DIRECT

    o IOPS: ~780
    o Total time (mm:ss:ms): 00:00.31

  53. pablo
    entered 27 May 2008 @ 11:38 am

    Hmmm, I can’t seem to post my entire details … let me try synopsis … timings on 10MB of data, written sequentially in 4KB chunks.

    O_SYNC

    o IOPS: ~70
    o Total time (mm:ss:ms): 02:17.55

    O_DIRECT

    o IOPS: ~780
    o Total time (mm:ss:ms): 00:00.31
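Since the source for the test above was not posted, here is a minimal Python sketch of a comparable benchmark, offered as an illustration rather than a reconstruction of that code. It times sequential 4KB writes three ways: opened with O_SYNC, plain writes with an fsync after each one, and plain writes with a single fsync at the end. The total size is scaled down from 10MB so it finishes quickly, and `timed_write` and `run_all` are hypothetical helper names.

```python
import os
import tempfile
import time

CHUNK = 4096          # 4KB writes, as in the comment above
TOTAL = 256 * 1024    # scaled down from 10MB so the sketch runs quickly


def timed_write(path, extra_flags=0, fsync_every=0):
    """Write TOTAL bytes in CHUNK-sized pieces and return (seconds, IOPS).

    extra_flags: additional open() flags, e.g. os.O_SYNC.
    fsync_every: call fsync after every N writes (0 means only once, at the end).
    """
    flags = os.O_CREAT | os.O_WRONLY | os.O_TRUNC | extra_flags
    block = b"\0" * CHUNK
    count = TOTAL // CHUNK
    fd = os.open(path, flags, 0o666)
    try:
        start = time.perf_counter()
        for i in range(1, count + 1):
            os.write(fd, block)
            if fsync_every and i % fsync_every == 0:
                os.fsync(fd)
        os.fsync(fd)  # everything must be on disk before the clock stops
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    iops = count / elapsed if elapsed > 0 else float("inf")
    return elapsed, iops


def run_all():
    """Run the three variants against a scratch file in a temp directory."""
    o_sync = getattr(os, "O_SYNC", 0)  # absent on some platforms
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.dat")
        return {
            "O_SYNC": timed_write(path, extra_flags=o_sync),
            "fsync per write": timed_write(path, fsync_every=1),
            "one fsync at end": timed_write(path),
        }


if __name__ == "__main__":
    for name, (seconds, iops) in run_all().items():
        print(f"{name}: {seconds:.3f}s, ~{iops:.0f} IOPS")
```

O_DIRECT is left out on purpose. It needs aligned buffers, and per the Linux open(2) man page it does not by itself give the guarantees of O_SYNC, so a fast O_DIRECT run is not necessarily measuring the same durability.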

  54. pablo
    entered 27 May 2008 @ 11:39 am

    If you’d like the source code I used, please e-mail me .. I believe you have access to it from your blog?

  55. Tormod Volden
    entered 27 May 2008 @ 5:10 pm

    What’s all the fuss about data loss and integrity? I am using Linux, it’s not like the system is crashing that often. I would prefer to lose all my browsing history and the last minute’s bookmarks every single time this happens – once a year? – rather than suffer from performance problems, sudden freezes, frequent disk spin-ups (more precisely, the disk spinning all the time), battery drain, heat and fan noise all the freaking time!

    This seems to be a developers’ discussion totally out of touch with users’ expectations and normal usage. Grrr. Well, anyway thanks a lot to those of you that make Firefox otherwise rock.

  56. entered 27 May 2008 @ 6:15 pm

    @Tormod: please read the post you replied to. You can turn off fsyncs entirely if you want, and we’re going to be batching more heavily in future versions of FF.

    @pablo: 4K writes aren’t what we’re talking about here; try switching sqlite to use O_DIRECT, and we’ll see if aggregate performance is better?

  57. pablo
    entered 27 May 2008 @ 7:23 pm

    Shaver, you’re confusing me for a C developer. :) I could probably make the changes but it’d take me forever as I normally am not coding in C (more in SQL than anything else).

    The proof-of-concept was simply to show the difference in performance between O_SYNC and O_DIRECT. I chose 4K only because I thought it was a small enough I/O.

  58. pablo
    entered 27 May 2008 @ 7:34 pm

    Would you like me to post the source and my findings in the currently open bug? No worries if not …

  59. entered 27 May 2008 @ 10:29 pm

    Our binaries must and do run on a wider array of distributions than any one distributor’s packages

    Your binaries don’t need to do jack sh*t, because nobody uses them. (Please, somebody inject some facts into this debate — I freely admit that I’m relying on what I remember hearing from an informal session 18 months ago.) Mconnor’s post says nothing relevant about this except to confirm the obvious, that you’re still building binaries for your constituency of none.

    While I acknowledge and wholeheartedly applaud your newfound respect for the subset of downstream distributors that are willing to play by your rules, it just seems weird to tell them to apply a patch that you’re unwilling to apply yourself.

    Except I just read that you’re going to do an RC2 after all, so the patch will go into your code after all. (Right?) If so, then I suppose the entire discussion is moot.

  60. entered 28 May 2008 @ 8:13 am

    People use our binaries, and they must work. If nobody used them, please rest assured that we would happily drop the build and test effort. Please don’t feel that you need to repeatedly comment on things that you admit you know nothing about; I don’t need the traffic, it’s OK.

    The patch was always going into the very-next build of our code, as I’ve said repeatedly; turns out that’s FF3.0 rather than FF3.0.1, but the quality of the patch is not what turned that decision. We didn’t tell them to take anything, we told them that we were going to take the patch in our next spin (across all platforms), and that if they hadn’t yet built final-candidate packages they could take this as well with our blessing. ESPECIALLY, since you seem to not have acknowledged this key issue, because they can specialize their behaviour based on additional knowledge of their narrower set of supported systems. (Apparently the Debian maintainer was mulling taking an unreviewed patch for “places database backup” and turning the pref to “OFF”; not the sort of specialization I had in mind, but I think that was averted by sufficiently strident bug comments that the patch wasn’t safe. I hope so, at least.)

    We’ve always had respect for downstream distributors who were willing to play by the rules, there just didn’t used to be very many of them. :/

  61. Vasilis Vasaitis
    entered 28 May 2008 @ 5:14 pm

    Thanks for the interesting post. I really hope sqlite updates get more asynchronous at some point or something (say, separate thread for example). While it looks like certain issues can be fixed with local filesystems, there are also some of us that work with home directories on network filesystems, and have been hit by this issue pretty hard.

  62. Srdjan
    entered 28 May 2008 @ 11:42 pm

    I’d be happy just to see this option in (maybe) the advanced tab of FF’s settings: one SLIDER where the user can set a value of (for example) 0 to 5 for “speed” versus “data security”, which would do no syncing at all when set to 0 and the heaviest syncing when set to the max (5). 3 would be some kind of default value or something.

    Under this option (under the slider) could be a brief explanation of the consequences of the different settings.

    What do you think of this idea? Greetings from Srdjan Pavlovic from Serbia, keep up the excellent work!

  63. entered 29 May 2008 @ 1:26 pm

    People use our binaries

    [citation needed]

  64. entered 29 May 2008 @ 1:27 pm

    Well it would be nice if there was a config option to run the db unsafely (and maybe there already is) but the risk of data corruption always seems like it’s not worth worrying about until something actually goes wrong. Remember the problem with cached writes is that you might end up with an on disk state that is totally trashed. Many people, myself included, would be most upset to find out that we lost all our bookmarks, cookies and whatever other data to run slightly faster.

    This having been said, it strikes me as mistaken to block on the call to fsync/F_FULLFSYNC. Blocking is the correct semantics for standard database-type operations (submitting orders for sale, inventory, etc.) where it’s very important not to let a transaction appear to complete without storing that data to disk. This isn’t the case with firefox. No further harm will occur if the user goes about his business and bookmarks another site in the second that firefox would otherwise have been stalled waiting for the write to return.

    A well designed database could guarantee that no system corruption had occurred by still occasionally submitting fsync requests, checking when they complete, and ensuring that the information needed to regenerate the prior version of the database in case of a crash is never overwritten until the fsync calls return, telling it that the new metadata blocks have been written.

    Also just to clarify about what OS X is doing. OS X is offering exactly the correct semantics. In OS X fsync guarantees that the file has been written to the disk drive. However, with NCQ the disk drive may choose to reorder that write or leave it in buffer for some time. F_FULLFSYNC does a normal fsync and also flushes the buffers on the disk. I know this is what you meant above but it wasn’t very clear to me until I looked it up so I thought I would clarify.

    The reason this is the correct semantics is that it lets the programmer optimize their interactions with storage if they know that the disk is supported by a battery or otherwise guaranteed to actually flush its internal buffers. Hopefully one day all disks will be so designed.
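The fsync versus F_FULLFSYNC distinction described in the comment above can be sketched in Python as follows. This is an illustration only, not how Firefox or sqlite does it: `full_sync` and `durable_write` are hypothetical helper names, and the sketch assumes a Unix-like system where the `fcntl` module is available. It uses F_FULLFSYNC where the platform defines it (macOS) and plain fsync everywhere else.

```python
import fcntl
import os
import tempfile


def full_sync(fd):
    """Flush `fd` as far down as the platform allows.

    Returns the name of the mechanism that was used.
    """
    full = getattr(fcntl, "F_FULLFSYNC", None)  # only defined on macOS
    if full is not None:
        try:
            fcntl.fcntl(fd, full)  # also asks the drive to flush its own buffers
            return "F_FULLFSYNC"
        except OSError:
            pass  # not supported on this filesystem; fall back to fsync
    os.fsync(fd)
    return "fsync"


def durable_write(path, data):
    """Write `data` to `path` and sync it before returning."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o666)
    try:
        os.write(fd, data)
        return full_sync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(durable_write(os.path.join(tmp, "places.dat"), b"bookmark"))
```

A caller that knows its disk has a battery-backed cache could skip the F_FULLFSYNC branch and pay only for the plain fsync, which is the optimization the comment describes.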

  65. entered 29 May 2008 @ 4:01 pm

    Oh man, why do I get myself into blog flamewars late at night. FWIW, I don’t think SQLite’s a great choice for a lot of things, I mostly dislike the apparent knee-jerk reaction from all quarters to use SQLite. That (bad) attitude sometimes causes me to fly off the handle when I should be sleeping instead.

    In Songbird SQLite has given us a fairly quick, fairly reliable datastore, but we’ve come across a ton of limitations, bugs, and unexpected performance problems. We’ve also had to make compromises when modeling our application’s data as a set of tables, columns and relations that hurt us a little now and will probably hurt us more in the long-term.

    In my experience here I’ve found SQLite’s best feature is the development community. Once we’re able to produce reduced test cases for the problems we’ve had there’s a really quick turn around with fixes, unit tests and releases.

  66. Mike Jeeves
    entered 29 May 2008 @ 10:00 pm

    Hey (Don’t know your name…)

    I read up on this from the bug, and this story, a few days ago. I re-read today and something struck me. You’re writing this off as an OS bug and ‘we didn’t do this’.

    Well… I also ran through your proposed fix for ‘lowering fsync temporarily’.

    So, in the final fix of this: You will fsync the same amount as you do now. Correct?

    Therefore, do you anticipate that Firefox will cause disk activity continually? You’ve just said here that your fsyncing isn’t a problem. You say that when it doesn’t actually write, that is a problem.

    So you think Firefox should write out to disk that often?

    FF 2.0.10 + and FF 3.0.* have been the most unstable releases for me, both on Windows and Linux. I moved to 3.0 when 2.0.10 and above stopped being usable, and had to keep up with each 3.0 release because the number of ‘click link, firefox dies’ bugs was unbelievable.

    I installed Opera. I just want to know – what is causing all these regressions? Adding pingback support? I bet it bloody well is, some stupid feature that nobody wants, and it is causing crashing issues. Anyway.

  67. entered 30 May 2008 @ 5:19 am

    Mark: you can quote me on that, if you want. On what more robust citation do you base your current belief? You’re making much more specific claims.

  68. anon
    entered 31 May 2008 @ 4:57 pm

    Why is firefox the only SQLITE application that does this? Amarok for example, with an absurdly large collection, has no problem whatsoever with its sqlite database pulling my system down into the mud. And it has never lost my data either (and I’ve mishandled it quite a lot, killing it, pulling the power plug while updating the collection, etc.). So, it seems to me the firefox developers are again blaming the outside world for their lack of skill and ingenuity (“oooh, we don’t leak memory! it’s the os! and, oh, opera doesn’t? let’s ignore them, no one uses them anyways!”).

  69. entered 4 June 2008 @ 9:32 pm

    The 3.0 RC2 announcement does not seem to mention the fsync() issue at all. Was it simply not communicated to the release team? The tarballed Linux build refuses to detect my network connection on my Fedora 9 / x86_64 laptop, so I guess I’ll have to rebuild from source. Would be nice to know if the patch has indeed been applied.

  70. Richard Lloyd
    entered 15 June 2008 @ 5:23 am

    Just thought I’d mention that I do use the release version Linux .tar.gz of Firefox (and Thunderbird) from mozilla.com, but only because I have a script to create an RPM from it (using a spec file, icon and the .tar.gz) and I’m using CentOS 5 (which is slow-moving and doesn’t even have Firefox 2 yet!).

    I suspect the reason the Linux release binaries don’t get downloaded a lot is because they’re not shipped in any distro packaging format (RPM and deb being the obvious ones). Let’s also not forget that there are no 64-bit builds of Firefox/Thunderbird on any platform either (another reason Linux distros have to step in and provide them…).

    In an ideal world, there should be Mozilla repos for RPM/deb and 32-bit/64-bit, but the impression I get is that either there aren’t enough resources/willpower to get this done or the “easy” option of leaving the distros to do the builds/packaging takes the load off the Mozilla developers (but isn’t this like saying “Microsoft should build and ship Windows Firefox for us”?!).

  71. Sean
    entered 19 June 2008 @ 9:40 pm

    I’d like to point out that this is not just a Linux, or Linux Filesystem, or even EXT3 only problem. PortableApps.com packages FireFox up for users who run it from their flash drives. FF2 was a good performer on this media. FF3 is a spectacular failure on flash drives. The heavy IO on every page load crushes the flash drives (with typically slow write times) and makes FF3 and any other app running off the drive unusable. Waiting for Linux to speed up the fsync() or for EXT3 to fix their ordered mode is not going to make the problem go away. I’m running off a 4Gig flash drive formatted Fat32, and the IO LED is on constantly while browsing. Every page browsed freezes for 2-4 seconds. Unusable.

  72. entered 30 December 2008 @ 9:52 pm

    Do we have any idea if ext4 has this fixed finally? I haven’t found any information indicating either way after a bit of digging tonight (admittedly, I haven’t looked at source code). It’s no longer considered as under development, and is included in Kernel 2.6.28 (recently released).