climateprediction.net (CPDN)
Thread 'Computation error'


Message boards : Cafe CPDN : Computation error

Ron Voss

Joined: 15 Jun 07
Posts: 3
Credit: 985,430
RAC: 967
Message 58984 - Posted: 9 Nov 2018, 20:59:24 UTC

Most of my BOINC projects take a few minutes to a few hours, but CP takes days to run. So I was bummed when, after two days (during which CP blocked my other projects, because "Switch between tasks every N minutes" isn't working), both of my CP tasks failed at the same time with "Computation error" after a restart, despite checkpointing. I'm sorry, but I'll have to abandon CP; I don't want it wasting more cycles. These were my first two tasks after rejoining CP following a few years away; I don't remember why I left, perhaps for the same reason. System: 16 GB iMac, macOS 10.13.6.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58985 - Posted: 9 Nov 2018, 21:36:54 UTC

BOINC has to be Suspended, and then Exited, BEFORE any computer restart.
And the only model type currently available for the Mac is very touchy anyway.

Also, see the thread "Why Macs are on the way out at cpdn" at the top of the Macintosh section.

Thanks for trying. This isn't an easy project to handle.
Ron Voss

Joined: 15 Jun 07
Posts: 3
Credit: 985,430
RAC: 967
Message 58986 - Posted: 9 Nov 2018, 22:30:24 UTC - in response to Message 58985.  

I meant a restart of BOINC; the Mac wasn't rebooted. But I would think (naively?) that the last CP checkpoint should *always* survive *any* kind of restart. Thanks for your efforts; I'm a volunteer board mod elsewhere.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58987 - Posted: 10 Nov 2018, 6:15:50 UTC

That's another of the many problems with this project: the large number of files that are open.
I don't recall anyone having looked into it, but "checkpointing" may not be the same as "all files saved". Stopping or shutting down parts of it may just happen to occur while some of the files are still waiting to be saved.
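To illustrate the difference between "checkpointed" and "all files saved", here is a minimal Python sketch under assumed conditions. It is not how CPDN models actually write their state, and every name in it (`atomic_write`, `write_checkpoint`, `load_latest`, the `LATEST` marker, the `step_N` directories) is hypothetical. Each checkpoint goes into its own directory, and the marker is updated only after every file in that directory has been written and flushed, so a restart that interrupts a checkpoint falls back to the previous complete one rather than a mix of old and new files.

```python
import os
import tempfile
from typing import Dict, Optional, Tuple

TMP_PREFIX = ".tmp-"  # hypothetical prefix for files still being written


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`, writing to a temporary file first."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=TMP_PREFIX)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before publishing them
        os.replace(tmp_path, path)  # swap the new file in with a single rename
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def write_checkpoint(root: str, step: int, files: Dict[str, bytes]) -> None:
    """Write all state files for `step`, then point the LATEST marker at it."""
    step_dir = os.path.join(root, f"step_{step}")
    os.makedirs(step_dir, exist_ok=True)
    for name, data in files.items():
        atomic_write(os.path.join(step_dir, name), data)
    # The checkpoint only "counts" once this marker is updated, after every file is saved.
    atomic_write(os.path.join(root, "LATEST"), str(step).encode())


def load_latest(root: str) -> Optional[Tuple[int, Dict[str, bytes]]]:
    """Return (step, files) for the last complete checkpoint, or None if there is none."""
    marker = os.path.join(root, "LATEST")
    if not os.path.exists(marker):
        return None
    with open(marker, "rb") as f:
        step = int(f.read().decode())
    step_dir = os.path.join(root, f"step_{step}")
    files: Dict[str, bytes] = {}
    for name in sorted(os.listdir(step_dir)):
        if name.startswith(TMP_PREFIX):
            continue  # leftover from an interrupted write
        with open(os.path.join(step_dir, name), "rb") as f:
            files[name] = f.read()
    return step, files
```

A real program would also delete old `step_N` directories; that is left out to keep the sketch short.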

Back when there were still graphics showing some information about the model's state, I used to wait until the countdown timer (to the next checkpoint) showed zero, and then a little longer, before I Suspended that model. Each model was Suspended individually before I Suspended BOINC and then Exited it.
I don't know how much of this was overkill, but it worked, and it didn't take long.
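The shutdown order described above (Suspend each model, then Suspend BOINC, then Exit) can be scripted with BOINC's command-line tool `boinccmd`. This is a minimal Python sketch under assumptions: it presumes `boinccmd` is on the PATH and can reach the local client (depending on the setup it may need the GUI RPC password), that you supply the project URL and task names yourself (for example from `boinccmd --get_tasks`), and that `--set_run_mode never` is a reasonable stand-in for "Suspending BOINC". The function names and the fixed pause are hypothetical; the pause knows nothing about where the next checkpoint is, so waiting for the checkpoint countdown still applies.

```python
import subprocess
import time
from typing import List


def shutdown_commands(project_url: str, task_names: List[str]) -> List[List[str]]:
    """Build the boinccmd calls, in order: suspend each task, suspend BOINC, exit BOINC."""
    cmds = [["boinccmd", "--task", project_url, name, "suspend"] for name in task_names]
    cmds.append(["boinccmd", "--set_run_mode", "never"])  # stop all computation
    cmds.append(["boinccmd", "--quit"])  # ask the client to exit
    return cmds


def run_shutdown(project_url: str, task_names: List[str], pause_s: float = 10.0) -> None:
    """Run each command in turn, waiting `pause_s` seconds after each one."""
    for cmd in shutdown_commands(project_url, task_names):
        subprocess.run(cmd, check=True)  # raise if boinccmd exits with a non-zero status
        time.sleep(pause_s)
```

Keeping the command list separate from the code that runs it makes it easy to print and check the sequence before actually stopping anything.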

The new modelling programs seem to need a lot of TLC on certain operating systems, and on certain versions of them.
For example, Windows 10 may be behind many of the failures of the South American models (sas25), a lot of which fail after about 3 minutes.
But on my Linux Mint computers, running the Windows version of BOINC under the latest version of WINE, I don't have that problem.


©2024 cpdn.org