Model run lost due to extreme reaction to minor network problems

old_user26469

Joined: 23 Oct 04
Posts: 13
Credit: 70,493
RAC: 0
Message 6675 - Posted: 7 Dec 2004, 6:49:29 UTC

I just lost a 90%-completed model run (my first) because BOINC's response to a temporary NFS-failure-induced I/O error caused by a machine reboot is to kill the model and try to download another one. The original model is perfectly OK, but BOINC refuses to process it anymore.

Might I suggest that this is a slightly excessive reaction to one and a half minutes without network service? CPDN requires vast amounts of disk space, and there are probably many sites where the BOINC directories must be NFS-mounted if BOINC is to run at all: but guaranteeing zero downtime of NFS servers over a period of months is entirely impractical.

BOINC recovers from local systems failures: it should recover from remote ones as well.
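
To illustrate the kind of recovery being asked for, here is a minimal sketch that retries an I/O operation when it fails with an error that often indicates a temporary outage, instead of giving up on the first failure. This is not BOINC or CPDN code: `retry_io`, its parameters, and the choice of which errno values count as transient are all assumptions made for the example.

```python
import errno
import time

# Assumption: these errno values are treated as "probably temporary".
# ESTALE is looked up defensively in case a platform does not define it.
TRANSIENT_ERRNOS = {errno.EIO, errno.ETIMEDOUT}
if hasattr(errno, "ESTALE"):
    TRANSIENT_ERRNOS.add(errno.ESTALE)


def retry_io(operation, attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on transient OSErrors.

    Hypothetical helper: re-raises immediately on non-transient errors,
    and re-raises the last error once all attempts are used up.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            if exc.errno not in TRANSIENT_ERRNOS or attempt == attempts:
                raise
            sleep(delay)   # wait for the server to come back
            delay *= 2     # back off: 1s, 2s, 4s, ...
```

With the defaults above the caller waits 1 + 2 + 4 + 8 = 15 seconds in total before the final attempt, so riding out a 90-second outage would need more attempts or a longer initial delay.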
old_user3

Joined: 5 Aug 04
Posts: 173
Credit: 1,843,046
RAC: 0
Message 6694 - Posted: 7 Dec 2004, 10:43:46 UTC

My initial guess is that the model was at a critical point or in the middle of writing to a file.
NFS I/O is expensive, and the outage would have been more noticeable while the model was writing a large amount of data.
Fault recovery is dictated not by the BOINC core client but by the worker process.
The original model might appear to be OK, but it's probably in an inconsistent state by now.
The partial results (P1 & P2) are available to you and to us, and you'll still earn credit for the model
years completed before the crash.
Sorry about this.
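
As a general illustration of how an interrupted write can leave state inconsistent, and one common way programs guard against it, here is a minimal sketch of an atomic checkpoint write: the data goes to a temporary file first and is renamed over the real file only once it is fully on disk. This is not how the climate model itself writes its files; `write_checkpoint_atomically` and the `.ckpt-` prefix are hypothetical names for the example, and the atomic-rename guarantee assumed here is the one a local POSIX filesystem gives, which may be weaker over NFS.

```python
import os
import tempfile


def write_checkpoint_atomically(path, data):
    """Write bytes to path so that readers see either the old or the new file.

    Hypothetical helper: a crash partway through leaves the previous
    checkpoint untouched rather than a half-written one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem)
    # so that the final rename does not turn into a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the data to disk before renaming
        os.replace(tmp_path, path)   # swap the new file into place
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the write fails midway, the exception propagates and the existing checkpoint at `path` is left as it was.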
old_user1
Joined: 5 Aug 04
Posts: 907
Credit: 299,864
RAC: 0
Message 6722 - Posted: 7 Dec 2004, 20:58:53 UTC

Unfortunately, I think NFS disk space is not a good idea with the climate model. The file I/O is beyond our normal means to control, and even on a local disk it can be very flaky and sensitive to disk errors; this will be greatly magnified on NFS volumes.
old_user26469

Joined: 23 Oct 04
Posts: 13
Credit: 70,493
RAC: 0
Message 6723 - Posted: 7 Dec 2004, 21:04:23 UTC - in response to Message 6694.  

> My initial guess is that the model was at a critical point or in the middle of
> writing to a file.

Curses. What bad luck.

> NFS I/O is expensive and this would have been more obvious when writing large
> amt. of data.

It didn't die at a day-end or a month-end, as near as I can tell: it was probably one of the big I/O spikes during the lengthy (radiosity?) computations that got interrupted.

I've cleared up enough local space to avoid using NFS for the BOINC state.

> Fault recovery is not dictated by the BOINC core client but by the worker
> process.

Ah. OK, the worker died. Very well; there's not much any of us can do about that. (The error response from BOINC itself was amusing: something about the disk possibly being full, followed by, er, downloading a new model; surely if the disk were full this wouldn't be a very good thing.)

`The disk is full.' *WHAM* `Well, it's fuller now!'

> The original model might appear to be ok but its probably in an inconsistent
> state by now.

Curses. :(

> The partial results ( P1 & p2 ) are available to you & us and you'll
> still earn credits for model
> years acquired until the crash..

OK, so I'll archive the intermediate model data just as if it were OK then.

(I take it the dead model directory can be removed from the CPDN tree after archiving... nothing seems to be referencing it anymore, after all.)
