climateprediction.net (CPDN) home page
Thread 'Hadcm3n crash'

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42101 - Posted: 3 May 2011, 15:25:14 UTC

Hadcm3n_o53i_1900_40_007201937_0 failed at 52% just after restarting the program immediately after making backup. Restored backup also fails immediately at startup.

OS is Windows 7 64 bit SP1 running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.


Warped
Joined: 12 Sep 04
Posts: 34
Credit: 1,017,702
RAC: 0
Message 42103 - Posted: 3 May 2011, 16:49:30 UTC

There could be any of a number of causes.

Check out the links in This Post.

Is it possible that it was busy saving (i.e. at the checkpoint) when you exited? Did you suspend the model before backing-up?
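As a rough illustration of the suspend-then-back-up order asked about above, here is a minimal Python sketch. It is not a CPDN or BOINC tool: the function name `backup_data_dir`, the `client_stopped` flag, and the folder naming are all hypothetical, and it assumes you have already suspended the model and exited the client yourself before calling it.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_data_dir(data_dir, backup_root, client_stopped):
    """Copy data_dir into a new timestamped folder under backup_root.

    client_stopped is a confirmation supplied by the caller that the model
    was suspended and the client has exited. This sketch does not detect
    that by itself (hypothetical flag, not a BOINC feature).
    """
    if not client_stopped:
        # Copying while the model may be writing a checkpoint risks
        # capturing a half-written state, so refuse to proceed.
        raise RuntimeError("suspend the model and exit the client first")

    # A new folder per backup, so earlier backups are never overwritten.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"backup-{stamp}"

    # copytree creates the target folder and raises if it already exists.
    shutil.copytree(data_dir, target)
    return target
```

Because each copy lands in its own timestamped folder, an older backup stays available if the newest one turns out to be unusable.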
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42115 - Posted: 5 May 2011, 4:01:31 UTC

You're right, I should have suspended before exiting. The WU probably crashed as I shut it down. That's why the backup didn't work: it was made after the crash. Fortunately, I found an older backup, made 2 days ago, on my external backup drive. I only lost about 48 hours of crunching, not 600+ hours. Frequent backups do pay off.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 42118 - Posted: 5 May 2011, 5:38:11 UTC

Roger that!

Good on ya, JIM.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42131 - Posted: 6 May 2011, 19:45:35 UTC - in response to Message 42101.  

Hadcm3n_o53i_1900_40_007201937_0 failed at 52% just after restarting the program immediately after making backup. Restored backup also fails immediately at startup.

Unfortunately, the restored WU crashed again at the same point. That means that there must be something wrong with the WU.

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42133 - Posted: 6 May 2011, 20:24:53 UTC - in response to Message 42131.  

Or with your backup/restore procedure.

Else this would make it the first one of these models reported as failing.


Backups: Here
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42139 - Posted: 8 May 2011, 2:46:32 UTC - in response to Message 42133.  
Last modified: 8 May 2011, 2:47:49 UTC

The restored WU ran for 36 hours, up to the point where it failed the first time, before failing again. That doesn't sound like something wrong with the backup and restore procedures. I have successfully restored CM WUs in the past and completed them.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42140 - Posted: 8 May 2011, 4:36:36 UTC - in response to Message 42139.  
Last modified: 8 May 2011, 4:37:27 UTC

That's slightly different from what was said last time:
Restored backup also fails immediately at startup.


So my second comment applies:
Else this would make it the first one of these models reported as failing.

Backups: Here
