Model run lost due to extreme reaction to minor network problems

old_user26469 Joined: 23 Oct 04 Posts: 13 Credit: 70,493 RAC: 0	Message 6675 - Posted: 7 Dec 2004, 6:49:29 UTC
I just lost a 90%-completed model run (my first) because BOINC's response to a temporary NFS-failure-induced I/O error caused by a machine reboot is to kill the model and try to download another one. The original model is perfectly OK, but BOINC refuses to process it anymore. Might I suggest that this is a slightly excessive reaction to one and a half minutes without network service? CPDN requires vast amounts of disk space, and there are probably many sites where the BOINC directories must be NFS-mounted if BOINC is to run at all: but guaranteeing zero downtime of NFS servers over a period of months is entirely impractical. BOINC recovers from local systems failures: it should recover from remote ones as well.

old_user3 Joined: 5 Aug 04 Posts: 173 Credit: 1,843,046 RAC: 0	Message 6694 - Posted: 7 Dec 2004, 10:43:46 UTC
My initial guess is that the model was at a critical point or in the middle of writing to a file. NFS I/O is expensive, and this would have been more obvious when writing a large amount of data. Fault recovery is not dictated by the BOINC core client but by the worker process. The original model might appear to be OK, but it's probably in an inconsistent state by now. The partial results (P1 & p2) are available to you and us, and you'll still earn credits for the model years acquired until the crash. Sorry about this.
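
To illustrate the point about a model being left inconsistent when it dies in the middle of a write, here is a minimal sketch of one common defence: write the checkpoint to a temporary file, sync it, rename it over the old one, and retry a few times on I/O errors. This is not how the CPDN worker or BOINC actually handles its files; the function name `write_checkpoint` and its `retries` and `delay` parameters are hypothetical, and rename behaviour on an NFS mount may differ from a local disk.

```python
import os
import tempfile
import time


def write_checkpoint(path, data, retries=5, delay=1.0):
    """Replace `path` with `data` via a temp file, retrying on I/O errors.

    Hypothetical helper for illustration only; assumes retries >= 1.
    """
    directory = os.path.dirname(os.path.abspath(path))
    last_error = None
    for attempt in range(retries):
        tmp_path = None
        try:
            # Create the temp file in the same directory so the final
            # rename stays on one filesystem.
            fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
                tmp.flush()
                os.fsync(tmp.fileno())
            # The old checkpoint stays intact until this rename succeeds.
            os.replace(tmp_path, path)
            return
        except OSError as exc:
            last_error = exc
            if tmp_path is not None:
                try:
                    os.unlink(tmp_path)
                except OSError:
                    pass
            if attempt < retries - 1:
                # Back off before retrying: 1x, 2x, 4x, ... the base delay.
                time.sleep(delay * (2 ** attempt))
    raise last_error
```

With this pattern a crash or outage part-way through leaves either the previous checkpoint or the new one on disk, never a half-written file, so a restart has something consistent to resume from.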

old_user1 Joined: 5 Aug 04 Posts: 907 Credit: 299,864 RAC: 0	Message 6722 - Posted: 7 Dec 2004, 20:58:53 UTC
I would think that, unfortunately, NFS disk space is not a good idea with the climate model. The file I/O is beyond our normal means to control, and even on a local disk it can be very flaky and sensitive to disk errors. This will be greatly magnified with NFS volumes.

old_user26469 Joined: 23 Oct 04 Posts: 13 Credit: 70,493 RAC: 0	Message 6723 - Posted: 7 Dec 2004, 21:04:23 UTC - in response to Message 6694.

> My initial guess is that the model was at a critical point or in the middle of writing to a file.
Curses. What bad luck.

> NFS I/O is expensive, and this would have been more obvious when writing a large amount of data.
It didn't die at a day-end or a month-end, as near as I can tell: it was probably one of the big I/O spikes during the lengthy (radiosity?) computations that got interrupted. I've cleared up enough space to avoid using NFS for the BOINC state.

> Fault recovery is not dictated by the BOINC core client but by the worker process.
Ah. OK, the worker died. Very well; there's not much any of us can do about that. (The error response from BOINC itself was amusing: something about the disk possibly being full, followed by, er, downloading a new model; surely if the disk were full this wouldn't be a very good thing. `The disk is full.' WHAM `Well, it's fuller now!')

> The original model might appear to be OK, but it's probably in an inconsistent state by now.
Curses. :(

> The partial results (P1 & p2) are available to you and us, and you'll still earn credits for the model years acquired until the crash.
OK, so I'll archive the intermediate model data just as if it were OK then. (I take it the dead model directory can be removed from the CPDN tree after archiving... nothing seems to be referencing it anymore, after all.)