small victories

So, we’ve been having a little bit of trouble getting our test suite running and passing on a “new cluster”:http://www.lanl.gov/projects/pink/ of late. I should say that Robert has been having trouble, because I don’t exactly have access to the cluster right now, due to a Marx Brothers-grade set of difficulties with my “cryptocard”:http://www.cmf.nrl.navy.mil/CCS/people/kenh/cryptocard-wp.html and whatnot. We also have some new things in the mix this time, like a mostly-righteous new configuration system, a totally “alien cluster organization”:http://bproc.sourceforge.net/, all the usual fun you want right before an acceptance milestone.

Robert has been describing his troubles to me over the last little while, since I’m the other guy who has an oversized recoverythalmus. I haven’t really been poring over the code like I would like to, in this situation, but I have been thinking about it in the back of my head while I go about my other business. Last night, we got a clue in the form of some atypically-cryptic console spew:

[lots of this sort of noise]
fa3cac65 < [mds]mds_destroy_export+1e9/394>
fa1c6adc < [obdclass]g_uuid_lock+18c/24c4f>
fa1c69a0 < [obdclass]g_uuid_lock+50/24c4f>
fa1a9858 < [obdclass]obd_destroy_export+118/186>
fa1bfeb5 < [obdclass]__kstrtab_obdo_to_ioobj+4bbb/b006>
fa1a691b < [obdclass]__class_export_put+7f/250>
fa1bfe15 < [obdclass]__kstrtab_obdo_to_ioobj+4b1b/b006>
fa1c016d < [obdclass]__kstrtab_obdo_to_ioobj+4e73/b006>
fa1a7dd7 < [obdclass]class_disconnect+1b3/5e4>
[and so forth]

From that, the answer is obvious, of course: connect and disconnect are racing, and the bitmap maintenance in mds_client_free isn’t safe against such races. Robert’s whipping up “a patch”:https://bugzilla.lustre.org/show_bug.cgi?id=2417 now, and I am going to treat myself to a tasty apple.
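For anyone who wants the general shape of that kind of bug, here is a rough userspace sketch of a client-slot bitmap whose allocate and free paths take the same lock. To be clear, this is not the Lustre code and not Robert's patch: the struct, the function names, MAX_CLIENTS, and the use of a pthread mutex are all made up for illustration. The point is only that the test-then-modify step on the bitmap has to be atomic with respect to the other path.

```c
#include <limits.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical names and sizes throughout; an illustration, not Lustre's code. */
#define MAX_CLIENTS   128
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS  ((MAX_CLIENTS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct client_bitmap {
    pthread_mutex_t lock;               /* guards every read-modify-write of bits[] */
    unsigned long bits[BITMAP_WORDS];   /* one bit per client slot, 1 = in use */
};

static void client_bitmap_init(struct client_bitmap *cb)
{
    pthread_mutex_init(&cb->lock, NULL);
    memset(cb->bits, 0, sizeof(cb->bits));
}

/* Connect path: claim the lowest free slot, or return -1 if all are taken. */
static int client_slot_alloc(struct client_bitmap *cb)
{
    int slot = -1;

    pthread_mutex_lock(&cb->lock);
    for (int i = 0; i < MAX_CLIENTS; i++) {
        unsigned long mask = 1UL << (i % BITS_PER_WORD);
        if (!(cb->bits[i / BITS_PER_WORD] & mask)) {
            /* Test and set happen under the same lock hold. */
            cb->bits[i / BITS_PER_WORD] |= mask;
            slot = i;
            break;
        }
    }
    pthread_mutex_unlock(&cb->lock);
    return slot;
}

/* Disconnect path: release a slot. Returns 0 on success, -1 if the slot
 * is out of range or was not allocated (for example, a double free). */
static int client_slot_free(struct client_bitmap *cb, int slot)
{
    int rc = -1;

    if (slot < 0 || slot >= MAX_CLIENTS)
        return -1;

    pthread_mutex_lock(&cb->lock);
    unsigned long mask = 1UL << (slot % BITS_PER_WORD);
    if (cb->bits[slot / BITS_PER_WORD] & mask) {
        cb->bits[slot / BITS_PER_WORD] &= ~mask;
        rc = 0;
    }
    pthread_mutex_unlock(&cb->lock);
    return rc;
}
```

Without the lock, a connect and a disconnect running at the same time could each read the same word, modify their own bit, and write back, with one update silently clobbering the other. That is the general flavor of race described above, though the real fix may well look different.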
