climateprediction.net (CPDN) home page
Thread '"Model crash detected, will try to restart..."'

Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 55744 - Posted: 20 Feb 2017, 7:16:00 UTC

I get the above message in at least half of my stderr out files, even though the work units complete successfully.
https://www.cpdn.org/cpdnboinc/result.php?resultid=20245071
https://www.cpdn.org/cpdnboinc/result.php?resultid=20225149
https://www.cpdn.org/cpdnboinc/result.php?resultid=20226338
https://www.cpdn.org/cpdnboinc/result.php?resultid=20227217
https://www.cpdn.org/cpdnboinc/result.php?resultid=20198597
https://www.cpdn.org/cpdnboinc/result.php?resultid=20221188

I see it on at least hadcm3s, wah2_pnw25, wah2_wus25 and wah2_eas50, and there may be others. Is it anything to worry about?
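For anyone wanting to check their own results the same way, here is a minimal sketch of counting how often the message appears in a saved copy of the stderr text. The function name and the sample text are hypothetical; the placeholder lines are not real model output, and only the marker string comes from this thread.

```python
# The message quoted in the thread title.
CRASH_MARKER = "Model crash detected, will try to restart..."


def count_crash_messages(stderr_text: str) -> int:
    """Count the lines of stderr text that contain the crash marker."""
    return sum(1 for line in stderr_text.splitlines() if CRASH_MARKER in line)


# Hypothetical sample: placeholder lines stand in for the rest of the output.
sample_stderr = "\n".join([
    "placeholder line (not real model output)",
    "Model crash detected, will try to restart...",
    "placeholder line (not real model output)",
    "Model crash detected, will try to restart...",
])

print(count_crash_messages(sample_stderr))  # 2
```

A count above zero on a task that still finished only says the model restarted at least once; it does not by itself say whether any zip was lost.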
ID: 55744
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 55748 - Posted: 20 Feb 2017, 12:42:22 UTC - in response to Message 55744.  

There have been models from other batches that crashed after the last Zip had been generated, in which case waiting for that Zip and then suspending the model saved the Zip upload. That is rather more effort than can reasonably be expected on a multi-core machine.

If this is a general problem rather than one specific to your machine, then it would be a particular problem for Mac users, whose Zips are not uploaded until right at the end (because WAH2 8.12 does not make trickles on the Mac). However, the application versions you report are 8.24 and 8.29.

One of your crashed models is WUS25/25 batch #508: I've run one of those to completion without that error (here), so my guess is that it's a problem on your machine.
ID: 55748
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 55750 - Posted: 20 Feb 2017, 14:07:01 UTC - in response to Message 55748.  
Last modified: 20 Feb 2017, 14:30:40 UTC

Thanks. Does it affect the science? I don't want to return bad results, or fail the last zip if that is what it is doing.

EDIT: I am changing my ramdisk a little to try to eliminate it. It appears something is hanging up the save right at the end, but I don't know what it is.
ID: 55750
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,808,726
RAC: 5,192
Message 55752 - Posted: 20 Feb 2017, 17:50:24 UTC - in response to Message 55750.  

It would affect the science if the last Zip was missing. People have pressed for many years for shorter models, but a downside of requiring the project to reassemble chains of separately run models is that each link in the chain must be complete; as a minimum, the restart file at the end must be uploaded. However, there's been some weird behaviour with restart files on recent models, such as the restart file being generated after Zip 3 for a 10-Zip EAS50 model, so I'm not at all sure what now constitutes an acceptable model.

Apart from disk space restrictions and the usual problems with virus checkers interfering with disk writes I don't know what would cause a problem only at the end of a run.
ID: 55752
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55755 - Posted: 20 Feb 2017, 19:36:31 UTC

The new(ish) w@h programs are quite versatile; both the area and the run length can be changed using parameters, without requiring a recompile.
And the restart file can be placed anywhere in the length of the run, and not just at the end. I don't know why this is used, and I don't want to get that deep into climate modelling.
Plus there's also an out file, which contains diagnostics to help pinpoint problems with the program.

But yes, ALL zips need to be returned to make up the complete set if you want all of your crunching to get used.
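As a rough illustration of the "complete set" point, here is a minimal sketch that lists which zips are still missing from a run. It assumes zips are numbered 1 to N, which is my assumption for the example and not a statement about how the project tracks them; the function name is hypothetical too.

```python
def missing_zips(expected_count: int, uploaded: set[int]) -> list[int]:
    """Return the zip numbers in 1..expected_count that are not in `uploaded`."""
    return [n for n in range(1, expected_count + 1) if n not in uploaded]


# Hypothetical 10-zip run where only the last zip failed to upload.
uploaded = {1, 2, 3, 4, 5, 6, 7, 8, 9}
print(missing_zips(10, uploaded))  # [10]
```

An empty list would mean the set is complete under that numbering assumption.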
ID: 55755
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 55757 - Posted: 20 Feb 2017, 21:31:45 UTC - in response to Message 55755.  
Last modified: 20 Feb 2017, 21:32:19 UTC

Les Bayliss wrote: "And the restart file can be placed anywhere in the length of the run, and not just at the end. I don't know why this is used, and I don't want to get that deep into climate modelling."

I was wondering about that. The problem may not be at the end. But I have been resizing the ramdisk over the last week (since 15 Feb), which is when the problems started. I first exit BOINC, change (i.e., increase) the ramdisk size, and then reboot. That causes BOINC to re-check the size of the disk, which may interfere with the restart of the CPDN job. Among other things, I will just leave it alone for a while and see if the problem goes away. As far as I know, I am not losing any zips over it anyway.
ID: 55757
