Thread '"Model crash detected, will try to restart..."'

Author	Message
Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653
Message 55744 - Posted: 20 Feb 2017, 7:16:00 UTC

I get the above message in at least half of my stderr out files, even though the work units complete successfully.

https://www.cpdn.org/cpdnboinc/result.php?resultid=20245071
https://www.cpdn.org/cpdnboinc/result.php?resultid=20225149
https://www.cpdn.org/cpdnboinc/result.php?resultid=20226338
https://www.cpdn.org/cpdnboinc/result.php?resultid=20227217
https://www.cpdn.org/cpdnboinc/result.php?resultid=20198597
https://www.cpdn.org/cpdnboinc/result.php?resultid=20221188

I see at least hadcm3s, wah2_pnw25, wah2_wus25 and wah2_eas50 and there may be others. Is it anything to worry about?
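
For anyone wanting to tally this across their own tasks, the check can be sketched in Python. This is a minimal sketch, not a CPDN or BOINC tool: it assumes the stderr text of each result has already been saved into a string, and the function names `count_crash_restarts` and `fraction_affected` are hypothetical. The marker string is the message quoted in the thread title.

```python
# Marker text taken from the thread title (without the trailing dots).
CRASH_MARKER = "Model crash detected, will try to restart"


def count_crash_restarts(stderr_text: str) -> int:
    """Count lines in one task's stderr text that contain the crash marker."""
    return sum(1 for line in stderr_text.splitlines() if CRASH_MARKER in line)


def fraction_affected(stderr_texts: list[str]) -> float:
    """Return the fraction of tasks whose stderr contains the marker at least once."""
    if not stderr_texts:
        return 0.0
    affected = sum(1 for text in stderr_texts if count_crash_restarts(text) > 0)
    return affected / len(stderr_texts)
```

A count above zero on a task that still finished only says the model restarted at least once; it does not by itself say whether any zip was lost.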

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,824,485 RAC: 4,956
Message 55748 - Posted: 20 Feb 2017, 12:42:22 UTC - in response to Message 55744.

There have been models from other batches that crashed after the last Zip had been generated, in which case waiting for that Zip and then suspending saved the Zip upload. Rather more effort than can reasonably be expected for a multi-core machine.

If this is a general problem rather than specific to your machine then it would be a particular problem for Mac users, whose Zips are not uploaded until right at the end (because WAH2 8.12 does not make trickles on the Mac): however, the application versions you report are 8.24 and 8.29.

One of your crashed models is WUS25/25 batch #508: I've run one of those to completion without that error (here), so my guess is that it's a problem on your machine.

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653
Message 55750 - Posted: 20 Feb 2017, 14:07:01 UTC - in response to Message 55748. Last modified: 20 Feb 2017, 14:30:40 UTC

Thanks. Does it affect the science? I don't want to return bad results, or fail the last zip if that is what it is doing.

EDIT: I am changing my ramdisk a little to try to eliminate it. It appears something is hanging up the save right at the end, but I don't know what it is.

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,824,485 RAC: 4,956
Message 55752 - Posted: 20 Feb 2017, 17:50:24 UTC - in response to Message 55750.

It would affect the science if the last Zip was missing. People have pressed for many years for shorter models, but a downside of requiring the project to reassemble chains of separately run models is that each link in the chain must be complete; as a minimum, the restart file at the end must be uploaded. However, there's been some weird behaviour with restart files on recent models, such as the restart file being generated after Zip 3 for a 10-Zip EAS50 model, so I'm not at all sure what now constitutes an acceptable model.

Apart from disk space restrictions and the usual problems with virus checkers interfering with disk writes, I don't know what would cause a problem only at the end of a run.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 55755 - Posted: 20 Feb 2017, 19:36:31 UTC

The new(ish) w@h programs are quite versatile; both the area and the run length can be changed using parameters, without requiring a recompile. And the restart file can be placed anywhere in the length of the run, and not just at the end. I don't know why this is used, and I don't want to get that deep into climate modelling.

Plus there's also an out file, which contains diagnostics to help pinpoint problems with the program.

But yes, ALL zips need to be returned to make up the complete set, if you want all of your crunching to get used.

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653
Message 55757 - Posted: 20 Feb 2017, 21:31:45 UTC - in response to Message 55755. Last modified: 20 Feb 2017, 21:32:19 UTC

Les Bayliss wrote: "And the restart file can be placed anywhere in the length of the run, and not just at the end. I don't know why this is used, and I don't want to get that deep into climate modelling."

I was wondering about that. The problem may not be at the end. But I have been resizing the ramdisk over the last week (since 15 Feb), which is when the problems started. I first exit BOINC, change (i.e., increase) the ramdisk size, and then reboot. That causes BOINC to re-check the size of the disk, which may interfere with the restart of the CPDN job.

I will, among other things, just leave it alone for a while and see if the problem goes away. As far as I know, I am not losing any zips over it anyway.
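
Since disk space restrictions were mentioned earlier in the thread as one possible cause, a quick check of the free space on the ramdisk before restarting BOINC can be sketched as follows. This is only an illustration: the path and the 1 GiB threshold are assumptions, not values from CPDN or BOINC, and the helper name `has_free_space` is hypothetical. `shutil.disk_usage` is the Python standard library call.

```python
import shutil


def has_free_space(path: str, min_free_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least min_free_bytes free."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (in bytes)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    # Hypothetical location and threshold; point this at the real ramdisk mount.
    ramdisk_path = "."
    one_gib = 1024 ** 3
    print(has_free_space(ramdisk_path, one_gib))
```

This only reports free space at the moment it runs; it says nothing about whether BOINC's own disk limits in its preferences would allow the task to continue.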