approaches to performance

[This post doesn't have links to anything, and it really should. I'm a bit pressed for time, but I'll try to come back later and fix that.]

Important: I no longer work for Mozilla, and I haven’t yet started working for Facebook, so what I write here shouldn’t be taken as being the stance of those organizations.

Platforms always have to get faster, either to move down the hardware stack to lesser devices, or up the application stack to more complex and intensive applications. The web is no exception, and a critical piece of web-stack performance is JavaScript, the web’s language of computation. This means that improvements in JS performance have been the basis of heated competition over the last several years, and — not coincidentally — an area of tremendous improvement.

There’s long been debate about how fast JS can be, and whether it can be fast enough for a given application. We saw it with the chess-playing demo at the Silverlight launch (for which I can’t find a link, sadly), we saw it with the darkroom demo at the Native Client launch, and I’m sure we’ll keep seeing it. Indeed, when those comparisons were made, they were accurate: web technology of the day wasn’t capable of running those things as quickly. There’s a case to be made for ergonomics and universality and multi-vendor support and all sorts of other benefits, but it doesn’t change the result of the specific performance comparison. So the web needs to get faster, and probably always will. A number of approaches to this are being mooted by various parties, specifically around computational performance.

One approach is to move computationally-intensive work off to a non-JS environment, such as Silverlight, Flash, or Google’s Native Client. This can be an appealing approach for a number of reasons. JS can be a hard language to optimize, because of its dynamism and some language features. In some cases there are existing pieces of non-web code that would be candidates for re-use in web-distributed apps. On the other hand, these approaches represent a lot of semantic complexity, which makes it very hard to get multiple interoperating implementations. (Silverlight and Moonlight may be a counter-example here; I’m not sure how much they stay in sync.) They also don’t benefit web developers unless those developers rewrite their code to the new environment.

Another approach is to directly replace JS with a language designed for better optimization. This is the direction proposed by Google’s Dart project. It shares some of the same tradeoffs as the technologies noted above (easier to optimize, but complex semantics and requires code to be rewritten), but is probably better in that interaction with existing JS code can be smoother, and it is being designed to work well with the DOM.

A third approach, which is the one that Mozilla has pursued, is to just make JS faster. This involves implementation optimizations and adding language features (like structured types and binary arrays) that permit more efficient representations. As I mentioned above, we’ve repeatedly seen JS improved to do what was claimed to be impossible in terms of performance, and there are many opportunities to make it faster still. This benefits not only new applications, but also the existing libraries and apps that are on the web today, and the people who use them.
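As a rough sketch of what I mean by the binary-array style of feature, here is a tiny numeric kernel written over a typed array (the names below are the ones that later standardized everywhere, not anything Mozilla-specific): the engine knows every element is an unboxed float in one flat buffer, which is exactly the kind of guarantee that makes optimization easier.

```js
// Sketch: a numeric kernel over a typed array. Every element is an unboxed
// 32-bit float in one contiguous buffer, so the engine never has to handle
// arbitrary values or sparse storage on each access.
function scale(samples, factor) {
  for (let i = 0; i < samples.length; i++) {
    samples[i] *= factor;
  }
  return samples;
}

const samples = new Float32Array(1024).fill(0.5);
scale(samples, 2.0); // every element is now 1.0
```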

Yesterday at the SPLASH conference, the esteemed Brendan Eich demonstrated that another interesting milestone has been reached: a JavaScript decoder for the H.264 video format. Video decoding is a very computationally-intensive process, which is one reason that phones often provide specialized hardware implementations. Being able to decode at 30 frames a second on laptop hardware is a Big Deal, and points to a new target for JS performance: comparable to tightly-written C code.

There’s lots of work left to do before JS is there in the general case: SIMD, perhaps structural types, better access to GPU resources, and many more in-engine optimizations that are underway in several places I’m sure. JS will get faster still, and that means the web gets faster still, without ripping and replacing it with something shinier.

Aside: the demonstration that was shown at SPLASH was based on a C library converted to JS by a tool called “emscripten”. This points towards being able to reuse existing C libraries as well, which has been a selling point for Native Client thus far.
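To make that idea concrete, here is a purely illustrative sketch of the style of code a C-to-JS compiler tends to emit (this is my example, not actual emscripten output): memory becomes a flat typed array, pointers become integer offsets into it, and bitwise coercions keep the arithmetic in integers.

```js
// Illustration only, not real emscripten output: a C-like routine lowered
// to JS. "Memory" is a flat typed array and pointers are integer offsets.
const HEAP32 = new Int32Array(1 << 16);

function sum_ints(ptr, n) {
  ptr = ptr | 0;
  n = n | 0;
  let total = 0;
  for (let i = 0; i < n; i = (i + 1) | 0) {
    total = (total + HEAP32[(ptr >> 2) + i]) | 0; // roughly *((int*)ptr + i)
  }
  return total | 0;
}
```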

As Brendan would say, always bet on JS.

onward, nimble monkey

Busy times, busy times. You guys see this browser from Google? A few people sent me that link today, definitely interesting stuff. They have a spiffy new JS engine, which has some genuinely cool stuff in it; I’m enjoying reading through the code, and I think I’m learning some Smalltalk along the way.

Since we landed TraceMonkey 10 (ahem) days ago, we’ve been working mostly on stability and bug fixes, as is usual for the early-but-promising stage of new tech. Jesse’s fuzzers have been as deliciously useful as always, and we’ve been able to make some small performance improvements along the way. Things are looking pretty good.

Pretty soon we’ll be able to dive back in and work on more optimizations, and perhaps even write a paper or two. (We’ve learned a lot in the past couple of months, and it would be great to share those lessons with others who might be looking to make dynamic languages fast.) I have lots to keep me busy, but I’m hoping I’ll be able to sneak off and dabble here and there. Don’t tell my boss.

the birth of a faster monkey

Over the past year, JavaScript performance on the web has undergone a striking revolution. Virtually every browser has improved its engine to produce significant gains in execution speed; Firefox 3 is about 3 times faster than Firefox 2 in various JavaScript benchmarks, for example. But of course, developer and user demand for performance is insatiable, and at Mozilla we demand it ourselves, since our application itself is largely and increasingly written in JavaScript. In addition to improving the performance of web applications, our work on JS performance in Firefox 3 made our own application snappier and more responsive.

We’re not done. In addition to continuing to work on our existing JavaScript interpreter (some 20% improved over Firefox 3 already), we’re also looking farther into the future of JS performance, and have some early news to share. Permit me, if you will, to set the stage:

[Graph: core primitives improved by 20-40x]

These are the early results from a project we’ve been calling TraceMonkey, which adds native code compilation to Mozilla’s JavaScript engine (“SpiderMonkey”). Based on a technique developed at UC Irvine called “trace trees”, and building on code and ideas shared with the Tamarin Tracing project, a few of us have spent the last 2 months (and most of the last few nights) teaching SpiderMonkey some exciting new tricks.

The goal of the TraceMonkey project — which is still in its early stages — is to take JavaScript performance to another level, where instead of competing against other interpreters, we start to compete against native code. Even with this very, very early version we’re already seeing some promising results: a simple “for loop” is getting close to unoptimized gcc:

[Benchmark chart: for loop, gcc 1070 ms vs. js 910 ms]
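For a sense of the shape of that comparison, here is a reconstruction of the sort of microbenchmark involved (my sketch, not the actual test harness): a tight, type-stable loop that a tracing JIT can compile down to a handful of machine instructions.

```js
// Rough sketch of the kind of loop being timed (not the actual benchmark):
// simple integer arithmetic, no property lookups, nothing polymorphic.
function loopSum(n) {
  let total = 0;
  for (let i = 0; i < n; i++) {
    total += i;
  }
  return total;
}

const start = Date.now();
loopSum(100000000); // 1e8 iterations
console.log("elapsed ms: " + (Date.now() - start));
```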

Yesterday we landed TraceMonkey in the Firefox 3.1 development tree, configured off by default. We have bugs to fix, and an enormous number of optimizations still to choose from, but we’re charging full speed ahead on the work we need to do for this to be a part of Firefox 3.1. Depending on the benchmarks you choose, you might see massive speed-up, minor speed-up, or maybe even some slowdown — those latter cases are definitely bugs, and reporting them through bugzilla will be a big help.

Here are the current speedups on some common and uncommon benchmarks, as compared to Firefox 3: Apple’s SunSpider; the SunSpider “ubench” tests added for SquirrelFish; an image-manipulation demo; and a test of the Sylvester 3D JavaScript library doing matrix multiplication.

[Graph of four benchmarks vs. Firefox 3: SunSpider 1.8x, ubench 22.4x, image manipulation 6.5x, Sylvester 6.2x]

There are many wins left in each one of those benchmarks, and we’ll be working on those through Firefox 3.1 and beyond: better code generation, more efficient guards, improvements to some data structures, parallel compilation, use of specific processor features, new optimization passes, tracing more code patterns, and many more. Right now we write all values back to memory at the end of every loop, for example, so there are some easy wins available in avoiding that — perhaps 2-3x on tight loop performance.
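As a rough illustration of why “more efficient guards” and “tracing more code patterns” matter (my example, not TraceMonkey internals): a trace is compiled for the types it observed while recording, with guards that bail out when those assumptions stop holding, so a type-stable loop stays on one fast native trace while a type-flipping one keeps falling off it.

```js
// Illustration only: the first loop keeps its variables at one type, so a
// recorded trace (and its guards) stays valid on every iteration; the
// second flips a variable between number and string, so guards fail and
// the engine has to abandon or re-record traces.
function stable(n) {
  let acc = 0;
  for (let i = 0; i < n; i++) {
    acc += i * 0.5;            // acc is always a number
  }
  return acc;
}

function unstable(n) {
  let acc = 0;
  for (let i = 0; i < n; i++) {
    acc = (i % 2 === 0) ? acc + 1 : acc + "!"; // number one pass, string the next
  }
  return acc;
}
```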

There’s lots more to write about why we chose tracing as the path to future performance, what to expect in terms of future work, how these sorts of performance gains can translate into new web experiences and capabilities, and what we’ve learned along the way. (One example out of left field: the static analysis tools are written in JavaScript, and apparently they are immensely faster due to even the current JIT work.) Look for posts about that from myself and other TraceMonkey hackers soon.

If you’re the sort of person who reads computer science papers, you may find the Hotpath paper to be of interest: it’s a great paper, and a great introduction to the art and science of tracing.

Thanks are especially due to Brendan, Andreas and David for making it fun to be the dumbest guy on the team; Andreas’ colleagues at UCI (Michael Franz, Mason Chang, Michael Bebenita, Gregor Wagner) for their advice and help; Ed Smith and the Adobe Tamarin team for their tech and wisdom; Rob Sayre, Vlad Vukicevic, Blake Kaplan, Boris Zbarsky and Bob Clary for testing and timely guidance; and the Mozilla developer community for letting us hold the tree closed for more than a day to get it landed.

fsyncers and curveballs

[Update: Hi, interwebs! Had to block my blog for a little bit to deploy the full power of wp-super-cache, everything should be fine now.]

A couple of articles have now been written about the unpleasant behaviour that people encounter with Firefox 3 in certain Linux configurations, related to the flushing of I/O and the system lag that users can see if there is a lot of other disk activity at the same time. There are a lot of moving parts to this issue, and so it’s not surprising that there’s a fair bit of misunderstanding, though some of it seems less well-meaning than others, which makes me a bit sad.

What’s Firefox doing?

Firefox uses a database called sqlite as the underpinning for many kinds of data storage in Firefox 3, including the browsing history and bookmarks data used to provide the Awesomebar’s awesomeness. sqlite is an excellent piece of software, written by people who take both data integrity and performance very seriously, which makes it a great place to put this sort of data. Lots of people use sqlite these days, and we’re proud to be founding members of the consortium that helps support sqlite development.

Databases, perhaps obviously, usually have complex file formats, and require that different parts of their files agree about things like how many records there are, whether a transaction completed successfully, or how indexes match up with the data to which they refer. This makes them more sensitive to data corruption than some simpler formats, like a basic text file. If you get a chunk of a text file corrupted, you can probably edit around it and salvage the rest of the file, but if you get a chunk of a database file corrupted you can often effectively lose all of the data that’s held there. This is one of the tradeoffs for being able to have efficient access to large sets of data, and it’s common to virtually all databases.

One of the things that sqlite does to ensure that the database is not corrupted in the case of a crash is to call a function called fsync. fsync tells the operating system to ensure that this file has been safely written to disk, and waits until that’s complete. This provides what’s known as a “barrier”, and makes sure that we don’t get mismatched parts of transactions (groups of related database operations) when we look at the file after a crash. This is very effective: even if the operating system itself crashes or the computer loses power suddenly, we won’t see the database corrupted.
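To make the barrier idea concrete, here is a minimal sketch of the write-then-fsync pattern in Node.js-flavoured JS (sqlite itself is C; this just shows the shape of the calls, not what Firefox does): only once fsync returns can you claim the data will survive a crash.

```js
// Minimal sketch of the write-then-fsync barrier, in Node.js terms.
const fs = require("fs");

function appendDurably(path, record) {
  const fd = fs.openSync(path, "a");
  try {
    fs.writeSync(fd, record + "\n"); // hands the bytes to the OS page cache
    fs.fsyncSync(fd);                // barrier: wait until they reach the disk
  } finally {
    fs.closeSync(fd);
  }
  // Only after fsync returns is it safe to treat the record as durable.
}
```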

We don’t want to lose the user’s data, because that makes users sad, and we like to make users happy. So up through Firefox RC1, we set sqlite to its recommended setting of “FULL” synchronization. As release-end-game luck would have it, that made some users sad, because they would find that Firefox, and sometimes other parts of their system, would pause unpleasantly, and it was tracked down to Firefox calling fsync. The bug in question is here, but I caution you to not read it piecemeal; there’s a lot of intertwined conversation there, and some comments are not as correct as they sound.

(There have been other bugs along the way that could cause this, ranging from performance problems with certain sqlite versions to bad interactions with the data set used for malware protection, but those were put behind us by RC1. This bug is the one that has people working weekends at this point.)

Why does that hurt so much?

On some rather common Linux configurations, especially using the ext3 filesystem in the “data=ordered” mode, calling fsync doesn’t just flush out the data for the file it’s called on, but rather on all the buffered data for that filesystem. Because writing to disk is so much slower than writing to memory, operating systems can buffer a lot of data, especially if you’re doing something that involves a lot of I/O, like unpacking a zip file or compiling software. It gets written out in the background, giving you vastly, vastly improved performance. It’s no exaggeration to say that without this sort of buffering your computer would be entirely unusable.

I think you can see where this is going: if there’s a lot of data waiting to be written to disk, and Firefox’s (sqlite’s) request to flush the data for one file actually sends all that data out, we could be waiting for a while. Worse, all the other applications that are writing data may end up waiting for it to complete as well. In artificial, but not entirely impossible, test conditions, those delays can be 30 seconds or more. That experience, to coin a phrase, kinda sucks. Does it suck as much as file corruption wiping out your bookmarks after your computer (not Firefox) crashes? As you might imagine, opinions vary.

This problem with ext3 is well-known to Linux kernel developers, and there’s great work underway as part of the “ext4” project to remedy it. Other filesystems (like reiser4, I have heard) have similar problems, and I presume that their developers are also working on resolving them.

Why doesn’t other (non-sqlite) software do this?

Actually, a lot of software that’s concerned with data integrity does this, including the editors emacs and vim, and mail clients like mutt and Evolution, as well as bigger databases like MySQL and Postgres. In some cases, those programs are in fact adding more calls to fsync to protect user data better.

In fact, so many programs use fsync to ensure data integrity, and actually writing to disk is so expensive, that some operating systems make fsync not be a “real” fsync: the data is scheduled for (hopefully) immediate write-out, but the call doesn’t wait until it’s all the way to the disk, so it’s not really an effective barrier. This may be permitted by various standards like POSIX, but it’s certainly surprising for programs like Firefox that use it to protect against data corruption in the case of a crash!

Here’s what Apple has to say about fsync, sqlite, and data corruption:

fsync on Mac OS X: Since on Mac OS X the fsync command does not make the guarantee that bytes are written, SQLite sends a F_FULLFSYNC request to the kernel to ensure that the bytes are actually written through to the drive platter. This causes the kernel to flush all buffers to the drives and causes the drives to flush their track caches. Without this, there is a significantly large window of time within which data will reside in volatile memory—and in the event of system failure you risk data corruption.

Exciting!

If this is an operating system bug, why is Firefox being patched?

Because we want to make users happy. Whether Linux should have better fsync behaviour or not isn’t really going to matter to our users — we want to support Linux, which means Linux-as-she-is-shipped, not Linux-as-we-would-like-her-to-be. That means that we need to deal with X servers, with font-selection systems, and with filesystem behaviour as we find it, because that’s where our Linux users are. (It’s not like Windows and OS X don’t have their own annoying things to work around either, though they don’t seem to have this specific one!)

So is it always going to be like this?

No! (Yay!)

In the immediate term, there is a patch that controls how aggressively we sync, and defaults to a slightly less-aggressive state that is equivalently safe on modern operating systems. That patch will be in Firefox 3.0.1 at the latest, and we’ve been in contact with Linux distributors to make sure they’re aware that the patch is fine to take in their builds — desirable, even. It might be in Firefox 3.0 proper, depending on what happens with an RC2, but either way the vast majority of affected users (Linux users, who usually get their Firefoxes from their distributors) will get the fix right away. This patch also lets users, who might decide that their systems are stable enough and their backups good enough that they don’t need the extra protection, turn off the data-integrity fsyncs almost entirely.
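For the curious, the knob involved is sqlite’s synchronization level (the “FULL” setting mentioned above); here is a hypothetical sketch, using a made-up db handle rather than Firefox’s actual storage code, of what the levels mean.

```js
// Sketch only: `db` stands in for whatever sqlite binding you use; the real
// change lives inside Firefox's storage layer, not in script like this.
function setSyncLevel(db, level) {
  // "FULL"   - fsync at every commit point (what RC1 shipped with)
  // "NORMAL" - sync less aggressively; still safe if the application crashes
  // "OFF"    - no data-integrity fsyncs at all (fast, but you trust your OS)
  db.exec("PRAGMA synchronous = " + level + ";");
}
```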

In the medium term, we’re going to be making our database use more asynchronous in Firefox, and batching transactions for things like history. You’ll likely see those effects in the major version of Firefox that follows 3 (might be called 3.1, might be called 3.5, might be called Firefox Magenta, who knows?). This will keep Firefox from pausing in these states, but may not entirely keep the fsync calls from affecting other applications on the system. The sqlite developers are also looking at adding support for a newer, Linux-only API that doesn’t have the same system-wide effects as fsync, which could help as well.
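The batching part is straightforward to sketch (hypothetical binding and schema; the real work is inside Firefox’s storage code): wrap many history inserts in one transaction so the whole batch pays for a single commit-time sync instead of one per visit.

```js
// Sketch with a hypothetical sqlite binding (db.run takes SQL plus bound
// parameters here) and a hypothetical history table: one transaction means
// a single durability barrier at COMMIT for the whole batch.
function recordVisits(db, visits) {
  db.run("BEGIN TRANSACTION");
  try {
    for (const v of visits) {
      db.run("INSERT INTO history (url, visit_time) VALUES (?, ?)",
             [v.url, v.time]);
    }
    db.run("COMMIT"); // the only point that has to wait on the disk
  } catch (e) {
    db.run("ROLLBACK");
    throw e;
  }
}
```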

Longer term, as I mentioned above, the Linux filesystem situation will improve in this respect, and the work is well-underway. It’s probably a year at least before most Linux users are running systems with fixed fsync behaviour, but at that point application developers won’t have to worry about their data-integrity needs causing pain for the whole system. I, for one, am looking forward to it.

I’d like to thank the sqlite developers for their help analyzing the effects of different fsync patterns, various Linux kernel developers (especially my former colleague Andreas Dilger), and the users who have helped test different settings and I/O loads. We’re going to ship a better Firefox 3 on Linux because of it, and we have even more ways to improve that experience on all operating systems in the future. I would not like to thank the people who have substituted vitriol and dogma for analysis and understanding of the release cycle, but we all get frustrated sometimes. I hope they enjoy Firefox 3 too.

year of the Gecko

Stuart put up a great post today describing the results of our intensive focus on memory use in Firefox 3 (and followed up, after many requests from commenters on his blog and elsewhere, with a graph including Safari and Opera). The memory gains are great, and they cover all sorts of improvements: leak fixes, allocator changes, new facilities to eliminate classes of troublesome entrainment, and better cache management.

It’s a time-honoured programming tradeoff that using more space speeds you up, but that’s not what happened here: our memory-reduction regimen actually made us faster in a lot of cases by making us more cache-friendly and by side-effects like using a better allocator. And we didn’t stop there, dropping the hammer on major performance gains in rendering and JavaScript as well, and leaving us as of today right at the top of tests like Apple’s SunSpider.

Productivity and feature wins in Firefox-the-application are really coming together as well, with the AwesomeBar leading many people’s lists of favourite new features. It really has changed the way I use the web, and I feel like everything I’ve ever seen is right at my fingertips. Add to that the great strides in OS integration and theming for Mac and Linux and it really is shaping up to be the best browser the web has ever known.

I’m obviously excited; this feels like exactly the right sort of everything-coming-together that should be in the air on the cusp of the 10th anniversary of the original source release. It hasn’t been an easy ride, especially pre-Firefox, and nobody on the project takes our success so far for granted — which makes it all the more satisfying to see years of investment pay off in a fantastic product.

Other people are excited too, from users and journalists to extension developers and companies looking to add web tech to their products. In the mobile arena especially we’re seeing a ton of excitement about the gains in speed and size. A lot of people aren’t yet used to thinking of Mozilla as a source of mobile-grade technology, but they weren’t used to thinking of us as a major browser force either. It’s fun to break the model.

Fast, small, cross-platform, industry-leading stability, solid OS integration, excellent standards support, excellent web compatibility, great security, ridiculously extensible, a productive app platform, accessible, localized to heck and back, open source from top to bottom: it’s a great time to be building on top of Gecko, and Firefox 3 is just the beginning. Wait until you see what we have in store for the next release…

Tbeachball

One aspect of our product performance that I’m especially interested in knowing more about is responsiveness: when the user does something, how long does it take for us to act on that stimulus? (We might not complete it instantly, as when loading a new page, but we should endeavour to give feedback pretty much instantly.) One thing that sometimes interferes with our interface responsiveness is spending too long away from the main event loop, perhaps in layout or some other intensive computation. This leads to us not noticing — and thus not acting on — new events from the user such as mouse clicks or keypresses.

I’ve been talking for a while about wanting to measure our UI-thread event service latency as a proxy for that kind of responsiveness, and so this morning I finally just sat down and did it while I was waiting for Tyla to get out of bed. The patch (link fixed now, sorry) is pretty small, though it’s also rough: no control of buffer size or minimum reporting latency without recompiling, that sort of thing.
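The same measurement is easy to sketch as an analogy in page-level JS (the real patch instruments Gecko’s native event queue in C++, not content script): schedule a timer for a known time, see how late it actually fires, and keep the samples that took meaningful time.

```js
// Page-JS analogy of the measurement, not the actual patch: if the main
// thread is stuck in layout or a long computation, the timer fires late,
// and that lag is a rough proxy for how unresponsive we were.
const INTERVAL_MS = 10;
const samples = [];

function tick() {
  const expected = Date.now() + INTERVAL_MS;
  setTimeout(function () {
    const lag = Date.now() - expected; // how late the event loop got to us
    if (lag >= 10) {                   // drop tiny samples, as the patch does
      samples.push(lag);
    }
    tick();
  }, INTERVAL_MS);
}

tick();
```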

I plugged the samples from a short run into Numbers — which then beachballed a little while generating the graph, naturally — and this is the result:

[Graph: Tbeachball latency samples]

The taller portion at the right is from me going to an SVG statistics widget and dragging the sample-picker around like a bandit. On my laptop, we got a little above the 200 ms rule-of-thumb perception threshold for interactivity, and indeed I did start to notice that it was lagging a little behind my motion. I had other tabs open and stuff going on on my computer, etc., so it’s not Very Good Science, but it does give me hope that something like this monitoring will prove a decent proxy for one aspect of responsiveness. (Note that I just drop all samples that are less than 10 milliseconds, so these result sets will tend to give a negative impression of the application, by focusing only on events that actually take meaningful time to process.)