climateprediction.net (CPDN) home page
Thread 'Bug in Hadcm3n'

metalius
Message 42893 - Posted: 14 Sep 2011, 11:40:30 UTC

Hello!

This task stopped at time step 529992 and is showing no signs of further progress.
At the moment the task is suspended, but not yet aborted.
Any recommendations?

P.S. Such a situation (or bug) is the worst-case scenario for remote hosts...

Thank You!
Iain Inglis
Volunteer moderator

Message 42894 - Posted: 14 Sep 2011, 14:32:51 UTC

A Windows/Intel computer in that work unit got farther than the one on your computer, so there doesn't appear to be a bug in the model.

The model on your computer has stopped at trickle #20, which may suggest a problem during preparation of the 10-year Zip file upload. Was the model doing anything at all before you suspended it? Or had the percentage progress reverted to zero or something similar?
metalius
Message 42895 - Posted: 14 Sep 2011, 20:10:45 UTC - in response to Message 42894.  
Last modified: 14 Sep 2011, 20:14:13 UTC

> A Windows/Intel computer in that work unit got farther than the one on your computer, so there doesn't appear to be a bug in the model.

Are both models identical?

> The model on your computer has stopped at trickle #20, which may suggest a problem during preparation of the 10-year Zip file upload.

Trickle #20 is at time step 518,400; the model stopped later, at time step 529,992, 72 steps before the next checkpoint.

> Was the model doing anything at all before you suspended it? Or had the percentage progress reverted to zero or something similar?

The model was running normally and the progress was normal; at least, I saw no anomalies before. I suspended it after I found that it had stopped.
Iain Inglis
Volunteer moderator

Message 42896 - Posted: 14 Sep 2011, 22:13:16 UTC - in response to Message 42895.  

> Are both models identical?
The model specification is identical, but the model may develop differently on different hardware. If both platforms are Windows/Intel then the model development will usually be identical.

> Trickle #20 is at time step 518,400; the model stopped later, at time step 529,992, 72 steps before the next checkpoint.
At ~3.7 s/timestep that's ~12 hours after the decade trickle. The model should be well clear of anything Zip-related.
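
As a rough check of that estimate, here is a minimal sketch. The time steps are the ones reported in this thread and 3.7 s/timestep is only the approximate rate quoted above; the function name is made up for illustration.

```python
def hours_since_trickle(stopped_step: int, trickle_step: int, secs_per_step: float) -> float:
    """Wall-clock hours of computing between a trickle and the point where the model stopped."""
    steps = stopped_step - trickle_step
    return steps * secs_per_step / 3600.0

# Time steps from this thread; the rate is an approximation, not a measured value.
steps_past_trickle = 529_992 - 518_400
hours = hours_since_trickle(529_992, 518_400, 3.7)
print(f"{steps_past_trickle} steps, about {hours:.1f} hours")
```

So the model ran 11,592 time steps past the decade trickle before it stopped.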

> The model was running normally and the progress was normal; at least, I saw no anomalies before. I suspended it after I found that it had stopped.
What happens when it is restarted?
metalius
Message 42900 - Posted: 15 Sep 2011, 7:58:02 UTC - in response to Message 42896.  
Last modified: 15 Sep 2011, 8:06:09 UTC

> What happens when it is restarted?

Nothing positive, unfortunately, and I tried every kind of restart, from "suspend / resume" to "turn the computer off and on". :-)
I found a second stopped model, hadcm3n_yms2_1940_40_007432202_1, on another host.
Its last trickle was sent at time step 259200; the model stopped at time step 259488, again 72 steps before the next checkpoint. The screen saver shows a constant 130+ hours elapsed, while BOINC Manager already shows 176+ hours.
Details:
1. Both hosts are running BOINC 6.12.33. With this version of BOINC I sometimes see messages similar to "BOINC screen saver diagnostics error" when the screen saver starts.
2. BOINC Manager is "pulling the wool over my eyes" :-) : elapsed time keeps going up and remaining time keeps going down. So in the Manager it looks as if the task is progressing normally. Maybe this is normal behavior for BOINC Manager when contact with the task has been lost.
What could be wrong?
IMHO, it does not look as if only BOINC Manager's progress bar is frozen. It looks as if the models are really dead, because hadcm3n_yms2_1940_40_007432202_1 sent its last trickle on 8 September, one week ago.
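
The difference between a clock that keeps running and a model that no longer advances can be sketched as a simple check. This is a hypothetical illustration only: the `is_stalled` helper and the sample format are assumptions, not part of BOINC or the CPDN model. It assumes someone notes (elapsed hours, time step) pairs by hand from the Manager and the model graphics.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (elapsed wall-clock hours, model time step)

def is_stalled(samples: List[Sample], min_hours: float = 6.0) -> bool:
    """True if the latest time step has stayed unchanged for at least min_hours."""
    if len(samples) < 2:
        return False  # not enough observations to judge
    last_hours, last_step = samples[-1]
    first_hours = last_hours
    # Walk backwards through the run of samples that share the latest time step.
    for hours, step in reversed(samples):
        if step != last_step:
            break
        first_hours = hours
    return last_hours - first_hours >= min_hours

# Illustrative readings: the time steps come from this thread, the hours are made up.
stuck = [(120.0, 259_200), (130.0, 259_488), (150.0, 259_488), (176.0, 259_488)]
healthy = [(10.0, 1_000), (20.0, 11_000)]
print(is_stalled(stuck), is_stalled(healthy))
```

The 6-hour threshold is arbitrary; it only needs to be comfortably longer than the time the model normally takes between visible time step updates.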
Iain Inglis
Volunteer moderator

Message 42901 - Posted: 15 Sep 2011, 9:25:04 UTC

This looks like something new: I'll pass it on and see if anyone else knows what's going on.

The "BOINC screen saver diagnostics error" is fixed in 6.12.34.
mo.v
Volunteer moderator
Message 42902 - Posted: 15 Sep 2011, 9:28:34 UTC

If you open the model's graphics and look at the timesteps and countdown, do you see the model repeating the same timesteps again and again, i.e. a sort of looping behaviour?
metalius
Message 42903 - Posted: 15 Sep 2011, 10:16:35 UTC - in response to Message 42902.  

> If you open the model's graphics and look at the timesteps and countdown, do you see the model repeating the same timesteps again and again, i.e. a sort of looping behaviour?

All the scenes and pictures look normal, but they are absolutely STATIC: no signs of life at all.
ID: 42903 · Report as offensive     Reply Quote
metalius
Message 42904 - Posted: 15 Sep 2011, 10:19:21 UTC - in response to Message 42901.  
Last modified: 15 Sep 2011, 10:21:59 UTC

> The "BOINC screen saver diagnostics error" is fixed in 6.12.34.

Thank you for the info!