Thread 'after 333 hours, computation error! why? any way to fix it?'

llagos

Joined: 21 Jul 05
Posts: 9
Credit: 163,228
RAC: 0
Message 47794 - Posted: 16 Dec 2013, 20:21:34 UTC

Hi,

I don't want to lose these 333 hours of work.

This is the output:
12/16/2013 1:29:53 PM | climateprediction.net | Sending scheduler request: To send trickle-up message.
12/16/2013 1:29:53 PM | climateprediction.net | Not requesting tasks: "no new tasks" requested via Manager
12/16/2013 1:29:58 PM | climateprediction.net | Scheduler request completed
12/16/2013 1:30:05 PM | climateprediction.net | Computation for task hadcm3n_ob2t_1900_40_008469480_0 finished
12/16/2013 1:30:05 PM | climateprediction.net | Output file hadcm3n_ob2t_1900_40_008469480_0_3.zip for task hadcm3n_ob2t_1900_40_008469480_0 absent
12/16/2013 1:30:05 PM | climateprediction.net | Output file hadcm3n_ob2t_1900_40_008469480_0_4.zip for task hadcm3n_ob2t_1900_40_008469480_0 absent

I will suspend the project to keep the files.

Thanks
ID: 47794
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 47795 - Posted: 16 Dec 2013, 20:32:46 UTC
Last modified: 16 Dec 2013, 20:48:14 UTC

The only possible way is to restore from a complete backup of the CPDN files and folders. Sorry.

[Edit] You received credit for all Trickles returned, so not much is lost there. Even failed tasks can contain valuable information for the scientists -- largely by defining the envelope of valid parameter sets.

[Edit2] The task in question isn't 'Reported' yet, so the error code isn't known. However, it shows 20 Trickles received -- suggesting it was interrupted at the 50% mark, while generating the Restart Dump. Any interruption while that is being done is fatal.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 47795
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 47796 - Posted: 16 Dec 2013, 21:30:45 UTC

And there's no point in keeping the files; the model has failed.

There's lots of info on the zip creation problem in Number Crunching if you want more.


Backups: Here
ID: 47796
llagos

Joined: 21 Jul 05
Posts: 9
Credit: 163,228
RAC: 0
Message 47803 - Posted: 17 Dec 2013, 21:32:25 UTC - in response to Message 47795.  

Thanks!

So how do I know whether I can interrupt it or not? Let's say I want to shut down my laptop. Is there any (easy to read) message, so I can wait a few seconds before actually shutting it down?

Regards,
ID: 47803
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 47804 - Posted: 17 Dec 2013, 21:59:56 UTC - in response to Message 47803.  

Just use the Show Graphics button to look at the model.
In the bottom left corner is some info about the current state of the processing, including how many more time steps to the next check point.

After the number reaches zero it jumps back to a high number (either 72 or 360 for the current types of model), at which point it starts saving all of the open files.
Let the number run down a bit to give it time, and then Suspend the model.
(Perhaps 65 or 350, till you work out something for yourself. The closer it is to zero, the more of the model will have to be re-run when it's re-started.)
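
For anyone who prefers the rule of thumb written down, here is a minimal sketch of it in Python. The function names are made up for illustration, and the thresholds (65 of 72, 350 of 360) are just the rough starting points suggested above, not documented limits. You would still read the counter off the graphics window yourself; nothing here talks to BOINC.

```python
# Suggested "safe at or below" counter values from the post above.
# These are rough starting points, not project-documented limits.
SUGGESTED_THRESHOLD = {72: 65, 360: 350}


def suspend_advice(steps_to_checkpoint: int, cycle_length: int) -> str:
    """Return 'wait' while the checkpoint save may still be running, else 'ok'."""
    if cycle_length not in SUGGESTED_THRESHOLD:
        raise ValueError(f"unknown cycle length: {cycle_length}")
    if not 0 <= steps_to_checkpoint <= cycle_length:
        raise ValueError("counter out of range")
    # Just after the counter resets to its high value, files are being saved.
    if steps_to_checkpoint > SUGGESTED_THRESHOLD[cycle_length]:
        return "wait"
    return "ok"


def steps_to_rerun(steps_to_checkpoint: int, cycle_length: int) -> int:
    """Timesteps done since the last check point, which are redone on restart."""
    return cycle_length - steps_to_checkpoint


if __name__ == "__main__":
    print(suspend_advice(70, 72), steps_to_rerun(70, 72))
    print(suspend_advice(60, 72), steps_to_rerun(60, 72))
```

The second function shows the trade-off: suspending at 60 on a 72-step cycle means 12 timesteps get repeated, so suspending soon after the threshold costs the least.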

However, the usual 'catch' applies: if you've got other models waiting to run they'll start, so any models waiting to start should be Suspended first.
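
If you would rather script the order than click through the Manager, something like the following sketch may help. It only builds the command lines and does not run them. It assumes the `boinccmd` tool that ships with BOINC and its `--task <project_url> <task_name> suspend` operation; the function name, the project URL and the task names are placeholders, so substitute the values your own client shows.

```python
from typing import List, Sequence


def build_suspend_commands(
    project_url: str,
    waiting_tasks: Sequence[str],
    running_task: str,
) -> List[List[str]]:
    """Build boinccmd argument lists: waiting tasks first, running task last.

    Suspending the waiting tasks first means none of them can start
    when the running model is finally suspended.
    """
    order = list(waiting_tasks) + [running_task]
    return [["boinccmd", "--task", project_url, name, "suspend"] for name in order]


if __name__ == "__main__":
    # Placeholder URL and task names, for illustration only.
    commands = build_suspend_commands(
        "PROJECT_URL", ["waiting_task_1", "waiting_task_2"], "running_task"
    )
    for argv in commands:
        print(" ".join(argv))
```

Each printed line could then be run by hand, or passed to `subprocess.run`, once you are happy with the order.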

There's another thing to watch for: when the date gets close to the start of December, a long pause is not far off. During that pause the model converts lots of small files from one data type to another and then adds them to a zip file for uploading to the server. While this is going on, a message below the others says so.
Do NOT interrupt it from the start of December until the zip has been created and the next check point after that has passed (when a new set of files will exist on the disk).
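
The December rule can be put as a small check, sketched below. This is only an illustration: the class and field names are invented, the three values are things you would observe yourself from the graphics window and the messages, and the sketch assumes the zip step finishes while the date is still in December.

```python
from dataclasses import dataclass


@dataclass
class ModelStatus:
    model_month: int            # month of the date shown in the graphics (1-12)
    zip_created: bool           # True once the upload zip has been written
    checkpoint_since_zip: bool  # True once a check point has completed after the zip


def in_no_interrupt_window(status: ModelStatus) -> bool:
    """Return True while the model should not be interrupted."""
    if status.model_month != 12:
        # Assumption: the risky period only occurs once the date reaches December.
        return False
    # The window stays open until the zip exists AND a later check point has passed.
    return not (status.zip_created and status.checkpoint_since_zip)


if __name__ == "__main__":
    print(in_no_interrupt_window(ModelStatus(12, True, False)))
    print(in_no_interrupt_window(ModelStatus(12, True, True)))
```

Note that a finished zip alone is not enough; the check only clears once a check point after the zip has also completed.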


Backups: Here
ID: 47804


©2024 cpdn.org