Thread 'Computation Error'

Message boards : Number crunching : Computation Error

old_user675821

Joined: 27 Mar 12
Posts: 1
Credit: 3,191
RAC: 0
Message 43996 - Posted: 11 Apr 2012, 21:59:46 UTC

Both of my climate prediction tasks just failed with the error above. Can they be recovered and restarted, or do new tasks have to be downloaded?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 43999 - Posted: 11 Apr 2012, 23:15:48 UTC - in response to Message 43996.  
Last modified: 11 Apr 2012, 23:15:59 UTC

Yes, failed models can often be recovered, using backups made in anticipation BEFORE the failure.
See my sig for details.
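
A minimal sketch of what such a backup could look like, assuming BOINC has been shut down first so that no files change during the copy. The function name `backup_boinc_data` and both example paths are assumptions for illustration; substitute the data directory your own installation uses.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Copy the whole BOINC data directory into a new timestamped folder.

    Run this only while BOINC is NOT running, otherwise the copy may
    catch model files in the middle of being written.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC data directory not found: {data_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    # copytree creates the target (and any missing parent folders) and
    # refuses to run if the target already exists, so an older backup
    # is never overwritten.
    shutil.copytree(data_dir, target)
    return target


# Example call (hypothetical paths, adjust to your machine):
# backup_boinc_data(Path("/var/lib/boinc-client"), Path("/home/me/boinc-backups"))
```

To recover a failed model from a copy like this, the usual approach is to exit BOINC and put the saved directory back in place of the current one. Doing that before the failed tasks have been reported is an assumption on my part, not something stated in this thread.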

Otherwise, you'll have to try and get new work from the VERY little that's currently available.
See my post here about the availability of work.
Backups: Here

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 44001 - Posted: 12 Apr 2012, 9:31:20 UTC - in response to Message 43996.  

As all three models on that computer have failed, it may be worth checking some of the known causes of crashes before downloading any more. Run memtest to rule out faulty memory that only shows up under the intensive load CPDN puts on some machines, and make sure the BOINC data directory is excluded from any virus scans, as a scanner can put a lock on a file just when BOINC needs to write to it. Also, on my machine the odds improve if I suspend work units and exit BOINC before turning the machine off. I'm not quite sure whether this last one is relevant under Windows or just Linux.

Dave

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 44004 - Posted: 12 Apr 2012, 15:49:05 UTC - in response to Message 44001.  

Suspending and exiting from BOINC before shutdown is relevant in Windows. You can lose a model if the shutdown catches it at a crucial moment, such as when it is writing to the disk.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 44005 - Posted: 12 Apr 2012, 21:01:50 UTC - in response to Message 44004.  

Oh for an intelligent OS that will shut everything down cleanly before closing itself down!

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44006 - Posted: 12 Apr 2012, 21:45:05 UTC - in response to Message 44005.  

Unfortunately, that would require "slow and careful", and people want "Faster! Faster!", which is what is happening with each new version of Windows. :(
Sometimes people just have to include themselves in the loop.

The problem with this project is that there are a LOT of ancillary files open, all of which need to be closed. I think that the problem mainly occurs if the shutdown happens while the files are being saved at a checkpoint, and only some have been saved. Then some of them are out of sync with the others, and the program can't restart.
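
To illustrate that failure mode, here is a sketch of one common way to keep a multi-file checkpoint in sync: write the whole set into a staging folder, then publish it with a rename. This is NOT a description of how the CPDN models actually checkpoint; `save_checkpoint` and the folder names are hypothetical.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Mapping


def save_checkpoint(files: Mapping[str, bytes], checkpoint_dir: Path) -> None:
    """Write all checkpoint files, then swap the complete set into place."""
    parent = checkpoint_dir.parent
    parent.mkdir(parents=True, exist_ok=True)

    # 1. Write every file into a private staging folder. An interruption
    #    here leaves the existing checkpoint folder untouched.
    staging = Path(tempfile.mkdtemp(prefix="ckpt-staging-", dir=str(parent)))
    for name, data in files.items():
        with open(staging / name, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push the data to disk

    # 2. Move the previous set aside, then rename the new set into place.
    #    Files from the old and new checkpoints are never mixed in one folder.
    old = checkpoint_dir.with_name(checkpoint_dir.name + ".old")
    if old.exists():
        shutil.rmtree(old)
    if checkpoint_dir.exists():
        os.replace(checkpoint_dir, old)
    os.replace(staging, checkpoint_dir)
    shutil.rmtree(old, ignore_errors=True)
```

There is still a short window between the two renames in which only the `.old` folder exists, so a restart routine would have to look there as well. Writing the files one by one directly into the live folder, by contrast, is exactly the situation in which an ill-timed shutdown leaves some files newer than others.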


Backups: Here

Randi
Joined: 28 Jun 07
Posts: 31
Credit: 4,341,796
RAC: 624
Message 44034 - Posted: 17 Apr 2012, 13:00:01 UTC - in response to Message 44006.  
Last modified: 17 Apr 2012, 13:35:59 UTC

I normally do 'File | Exit BOINC' and tell it that I do want to stop running science applications. That seems to work OK, but maybe I have been lucky.

Do I need to do something else before doing that (suspend the CPDN tasks in the task tab or the CPDN project in the project tab)?

Thanks

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 44035 - Posted: 17 Apr 2012, 15:27:06 UTC

That's prudent and proper, so boinc should "do right" every time.

As an extra measure of caution, I developed the habit, years ago, of clicking "Suspend" in boinc "Activity" before "File/Exit." (As you are aware, having been around since 2007, we used to run some really long tasks and there was no such thing as being too safe.)
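
For anyone who prefers a script to the menus, the same two steps can be sketched with BOINC's command-line tool `boinccmd`, using its `--set_run_mode` and `--quit` options. The wrapper functions below are hypothetical, and the sketch assumes `boinccmd` is on the PATH and is allowed to talk to the local client (for example, by being run from the BOINC data directory).

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def shutdown_commands(boinccmd: str = "boinccmd") -> List[List[str]]:
    """Return the commands for 'suspend first, then exit', in order."""
    return [
        [boinccmd, "--set_run_mode", "never"],  # like Activity > Suspend
        [boinccmd, "--quit"],                   # ask the client to exit
    ]


def suspend_and_quit(
    boinccmd: str = "boinccmd",
    run: Optional[Callable[[Sequence[str]], object]] = None,
) -> None:
    """Run the two commands one after the other, stopping on the first error."""
    if run is None:
        # Default runner: execute the command and raise if it fails.
        def run(cmd: Sequence[str]) -> object:
            return subprocess.run(cmd, check=True)
    for cmd in shutdown_commands(boinccmd):
        run(cmd)


# Example (only does something useful on a machine with BOINC running):
# suspend_and_quit()
```

Setting the run mode to "never" persists, so after the next start computing has to be switched back on, either from the Activity menu or with `boinccmd --set_run_mode auto`.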

Cheers, Randi.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

Randi
Joined: 28 Jun 07
Posts: 31
Credit: 4,341,796
RAC: 624
Message 44036 - Posted: 17 Apr 2012, 17:44:10 UTC - in response to Message 44035.  

Thank you!

©2024 cpdn.org