Stalled regional (PNW) model, is this a known behavior?
Author | Message |
---|---|
Joined: 9 Jan 05 Posts: 30 Credit: 434,469 RAC: 0 |
Just started running CPDN again, regional PNW models only. I run only 1 CPDN task at a time; the other cores run other projects, primarily WCG. This combination has not had any problems in the past and nothing else has changed on this rig (6.10.58 on XP Pro, 4GB; VM usage is staying within a tolerable range). The 2nd model accumulated over 4 hours of clock time before I noticed it had zero CPU time. After I suspended and then restarted the client, the model restarted at zero elapsed time and so far seems to be running normally. I've seen ice worlds but have not had a model stall in this way before. If it's an occasional but normal occurrence then never mind; I was just curious about this behavior. Apologies in advance if this is old news; I didn't see it in the READMEs I looked at. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
This problem has never been reported before with any of the regional model types; in fact I can't remember anyone reporting this problem for any model type. Of course there are hundreds of people whose models have problems (or, more accurately in most cases, whose misconfigured computers have problems with models) but who never report it, so there's no guarantee that this has never happened before. There are AFAIK no known bugs in any of the regional model types. They are very memory-intensive, so running a full load of them on a multicore means they will probably slow each other down (we have warned members about this several times in the News thread), but that won't apply to your situation. It will be interesting to see whether your PNW now progresses normally. Thanks for reporting this. |
Joined: 10 Sep 09 Posts: 9 Credit: 253,090 RAC: 0 |
10/19/2010 10:20:09 AM climateprediction.net task hadsm3dhet2_u6ac_006725419_1 aborted by user This WU was "in progress", stuck at 33.3333% for over a week. Rebooted PC, suspended, restarted, etc. It wouldn't progress past 33.3333%. After I aborted it and did an "update" it reported it as a completed task, so hopefully someone there will be able to figure out why it crashed. Pulling up the Graphics on this WU showed a completely blue-covered planet, so maybe there was a flood (didn't see an Ark though). |
Joined: 9 Jan 05 Posts: 30 Credit: 434,469 RAC: 0 |
Just FYI, this model completed fine. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Larry, this thread is about hadam3p models. The one that you posted about is a hadsm3 (slab ocean) model. Totally different. These are prone to "ice world" behaviour, for which there's a thread just below this one, and an information sticky further up near the top of the list. |
Joined: 10 Sep 09 Posts: 9 Credit: 253,090 RAC: 0 |
Thanks for checking on it. |
Joined: 10 Sep 09 Posts: 9 Credit: 253,090 RAC: 0 |
And Les, sorry I posted in the wrong thread. Didn't notice the other one. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Posting in the wrong place isn't a problem. I was concerned that you'd missed all of the info about ice worlds, but I guess that you know about them now. :) The 'slab' models are the only ones that do this, by the way. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Larry, I've looked at your HadSM's web page and noticed a few things. * It crashed at 33.3%, which is the end of Phase 1, but there's no graph for Phase 1. HadSM produces 24 trickles for each phase plus a file at the end of each. Your model only sent 23 trickles. * This means it probably crashed while it was post-processing Phase 1 and generating its file. HadSMs hate to be disturbed during post-processing, and this job takes them quite a while. If the computer is shut down or the owner exits from Boinc during this job, or during the next whole countdown when progress resumes, the model's likely to go wrong. It may crash, or go back to the beginning of the phase, or, as seems to be the case with yours, just stay stuck there, constantly trying but failing. * So this isn't a case of the typical iceworld, which is caused by an as yet undiagnosed flaw within some models. * Within a workunit one expects the model to behave the same way on all computers with the same CPU type (AMD or Intel) and operating system (Win, Linux or Mac). Your computer has AMD + Windows. Here's the workunit. * Computers #7 and #10 in the list also have AMD + Windows. They completed the model. This means that the model itself was almost certainly OK. * Even if models crash they then say 'Completed' in the Boinc manager Status column. I think crashed tasks should say something like 'Finished prematurely' if Boinc can reliably detect this. I suggested this some months ago on a Boinc email list but the Boinc programmers mustn't have been very keen on my idea. Anyway, I wouldn't worry about it. You did the right thing to abort a model that wouldn't advance. Everybody crashes a model or two or more from time to time. But if you're running HadSMs it's worth looking to see what point they've reached before exiting from Boinc. |
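
The trickle arithmetic in the last post can be sketched in a few lines of Python. This is only an illustration, not anything taken from BOINC or CPDN: it assumes three equal phases (which is what a stall at 33.3333% suggests) and uses the 24 trickles per phase quoted in the post. The function name `phase_status` and its return format are made up for the example.

```python
TRICKLES_PER_PHASE = 24  # figure quoted in the post above
PHASES = 3               # assumption: 33.3% marks the end of Phase 1 of 3


def phase_status(trickles_sent: int) -> tuple[int, int]:
    """Hypothetical helper: split a trickle count into
    (phases fully reported, trickles sent in the current phase)."""
    if not 0 <= trickles_sent <= TRICKLES_PER_PHASE * PHASES:
        raise ValueError("trickle count out of range for this model")
    # Whole phases go in the quotient, the remainder is the partial phase.
    return divmod(trickles_sent, TRICKLES_PER_PHASE)


if __name__ == "__main__":
    phases_done, in_current = phase_status(23)
    print(f"phases fully reported: {phases_done}")
    print(f"trickles in current phase: {in_current} of {TRICKLES_PER_PHASE}")
```

With 23 trickles the helper reports no fully reported phase and 23 of 24 trickles in the current one, which matches the reading that the model stopped right at the end of Phase 1, before its final trickle and end-of-phase file.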