it spreads

Looks like we’re on track to make “Apple customers”:http://clusterfs.com/db/pr/2004-01-07.html kinda sad when “they type ls”:http://www.off.net/diary/2004/01/#20040106. (Actually, there’s some excellent work underway to improve that dramatically for the next release after Lustre 1.0.x, which I would like to think was in some way inspired by “my previous work in this area”:https://bugzilla.lustre.org/show_bug.cgi?id=1645. I don’t _actually_ think I contributed a lot to the rocking “LVB”:http://www-124.ibm.com/developerworks/oss/dlm/reviewbook/maniplvb.html work that Peter and Phil and Zach are doing, but I would like to.)
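(If you haven’t met an LVB: it’s a small blob of data that the lock server hands back along with a granted lock. Here’s a sketch of the general shape, with invented field names rather than the real Lustre structure, to show why it helps ls: attributes like size ride home with the lock grant instead of costing a separate round trip per file.)

#include <linux/types.h>

/*
 * Sketch only: an invented lock value block. The server fills this in
 * and returns it with the granted lock, so a stat needs no extra RPC.
 */
struct example_lvb {
        __u64 lvb_size;    /* object size, as the server last saw it */
        __u64 lvb_mtime;   /* last modification time */
        __u64 lvb_blocks;  /* allocated blocks */
};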

and another thing

I finally found the “other bug”:https://bugzilla.lustre.org/show_bug.cgi?id=2332 today, and more or less fixed it. Nobody in particular will care, I’m sure. (I say “more or less” because I’m not sure if people are going to go for my solution. I’m ready to wash my hands of it, though.)

small victories

So, we’ve been having a little bit of trouble getting our test suite running and passing on a “new cluster”:http://www.lanl.gov/projects/pink/ of late. I should say that Robert has been having trouble, because I don’t exactly have access to the cluster right now, due to a Marx Brothers-grade set of difficulties with my “cryptocard”:http://www.cmf.nrl.navy.mil/CCS/people/kenh/cryptocard-wp.html and whatnot. We also have some new things in the mix this time, like a mostly-righteous new configuration system, a totally “alien cluster organization”:http://bproc.sourceforge.net/, and all the usual fun you want right before an acceptance milestone.

Robert has been describing his troubles to me over the last little while, since I’m the other guy who has an oversized recoverythalmus. I haven’t really been poring over the code the way I’d like to in this situation, but I have been turning it over in the back of my head while I go about my other business. Last night, we got a clue in the form of some atypically cryptic console spew:

[lots of this sort of noise]
fa3cac65 <[mds]mds_destroy_export+1e9/394>
fa1c6adc <[obdclass]g_uuid_lock+18c/24c4f>
fa1c69a0 <[obdclass]g_uuid_lock+50/24c4f>
fa1a9858 <[obdclass]obd_destroy_export+118/186>
fa1bfeb5 <[obdclass]__kstrtab_obdo_to_ioobj+4bbb/b006>
fa1a691b <[obdclass]__class_export_put+7f/250>
fa1bfe15 <[obdclass]__kstrtab_obdo_to_ioobj+4b1b/b006>
fa1c016d <[obdclass]__kstrtab_obdo_to_ioobj+4e73/b006>
fa1a7dd7 <[obdclass]class_disconnect+1b3/5e4>
[and so forth]

From that, the answer is obvious, of course: connect and disconnect are racing, and the bitmap maintenance in mds_client_free isn’t safe against such races. Robert’s whipping up “a patch”:https://bugzilla.lustre.org/show_bug.cgi?id=2417 now, and I am going to treat myself to a tasty apple.
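The general shape of that kind of fix is easy to sketch. To be clear, this is my sketch and not Robert’s patch, and every name in it is invented: the point is just that the connect path’s find-a-free-slot and the disconnect path’s clear-a-slot have to serialize on one lock, or a connect racing a disconnect can land two clients on the same bit.

#include <linux/spinlock.h>
#include <linux/bitops.h>
#include <linux/errno.h>

#define EX_MAX_CLIENTS 1024

static spinlock_t ex_bitmap_lock = SPIN_LOCK_UNLOCKED;
static unsigned long ex_bitmap[EX_MAX_CLIENTS / BITS_PER_LONG];

/* connect path: claim the first free client slot */
int ex_client_add(void)
{
        int idx;

        spin_lock(&ex_bitmap_lock);
        idx = find_first_zero_bit(ex_bitmap, EX_MAX_CLIENTS);
        if (idx < EX_MAX_CLIENTS)
                set_bit(idx, ex_bitmap);
        spin_unlock(&ex_bitmap_lock);

        return idx < EX_MAX_CLIENTS ? idx : -ENOSPC;
}

/* disconnect path: release the slot under the same lock, so the
 * find-and-set above can never observe a half-updated bitmap */
void ex_client_free(int idx)
{
        spin_lock(&ex_bitmap_lock);
        clear_bit(idx, ex_bitmap);
        spin_unlock(&ex_bitmap_lock);
}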

sound and fury

Five computers arrived for me today (three of them mine, for “dogfood”:http://catb.org/~esr/jargon/html/D/dogfood.html work; one for each of Andrei and Vlad), and I set them up. Holy crap, is it loud and hot in this office. My machines seem to be perfectly happy — kudos, indeed, to the people responsible for the VNC installer in Fedora Core and for open-carpet! Andrei’s machine isn’t so happy, so I think he’ll be getting someone to replace part or all of it.

Tomorrow I get to finish setting up Lustre and pdsh and distcc and all the other nice toys for making use of a handful of machines, and then beat the crud out of them. And maybe find a way to make them quieter, good heavens.

come together, right now

The demo is shaping up nicely, and some of the far-flung CFS crew are chipping in to extend the geographical reach of the filesystem in question. I put together a little ball of networking twine and “UML”:http://user-mode-linux.sourceforge.net/ bubblegum to contribute a Canadian mount:

bash-2.05b# df -h /mnt/lustre
Filesystem            Size  Used Avail Use% Mounted on
sc03_dual_eth3         15T  1.3T   12T  10% /mnt/lustre

Robert and Eric also managed to join us from California and Bristol, but the main demo component will be joining the party over the impressive “SC2003 WAN”:http://weathermap.sc03.org/sc03wmap.html.

I fixed some real bugs today, too, so a good show all around.

going the distance, going for speed

A handful of my esteemed co-workers are down at “Supercomputing 2003”:http://www.sc-conference.org/sc2003/ working on, among “a tutorial”:http://www.sc-conference.org/sc2003/intercal/intercaldetail.php?eventid=10678 and other things, “our contribution”:http://www.clusterfs.com/ncsa-111203.html to the crazy “Bandwidth Challenge”:http://www.sc-conference.org/sc2003/infrabwc2.html shenanigans.

Apparently the tutorial — in which a small horde of people is being tutorialized as I write this very entry — is going quite well, so things are off to a good start. It’s pretty exciting, even for those of us who are cheering from megametres away.

Now that people are doing “real sciencey stuff”:http://www.clusterfs.com/llnl-111303.html on top of our baby, I think things are going to pick right up around here. Not that there’s really been a lot of thumb-twiddling recently anyway, but.

Addendum: “zoom”:http://www.clusterfs.com/ncsa-111803.html.

just turn that smile upside-down, mister

Bah:

ASSERTION(rec->ur_fid2->id == inode->i_ino) failed

Not a hard problem to fix — I can already think of not one but two solutions, and I’ve only been awake for an hour! — but the timing of this find truly, truly sucks.
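To make the failure concrete (and this is only an illustration of the shape of the check, not either of the fixes I have in mind): the replayed record’s fid has to agree with the inode it resolved to, and the gentlest conceivable change is to demote the assertion into an error the replay path can hand back to recovery. The field names come straight from the assertion above; everything else here is invented, and it assumes the usual Lustre MDS headers for struct mds_update_record.

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/errno.h>

/* Illustration only: neither of the real fixes. */
static int example_check_replay_fid(struct mds_update_record *rec,
                                    struct inode *inode)
{
        if (rec->ur_fid2->id != inode->i_ino) {
                printk(KERN_ERR "replayed fid id %llu != inode %lu\n",
                       (unsigned long long)rec->ur_fid2->id,
                       (unsigned long)inode->i_ino);
                return -ESTALE;  /* let recovery cope, don't panic */
        }
        return 0;
}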

Furthermore: bah.

culmination

I owe myself more and better entries, and I owe them with usurious interest. They’ll come soon, because I’m coming out of a dark, dank, productive place, in which I’ve spent most of the last month.

This won’t mean anything to any of you, but please rest assured that it means the world to me:

status: COMPLETE
recovered_clients: 714
last_transno: 10081296
replayed_requests: 61

(That’s a Lustre server coming out of recovery in one piece: all 714 clients reconnected and their 61 outstanding requests replayed.)

paroxysms of anti-joy

I lost an entire day today. The pathology is all very geeky and work-specific, but the effects are simple enough: I want to punch the universe until it stops twitching, and then set it on fire.

Thank you for your attention to this matter.
