climateprediction.net (CPDN) home page
Thread 'HADCM3PN DEAD???'

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 44114 - Posted: 29 Apr 2012, 16:59:56 UTC
Last modified: 29 Apr 2012, 17:57:41 UTC

I think I have a problem. Hadcm3n_yfok_1980_40_00784442_0 reached 100% a few hours ago, but there is no sign of the final zip file. The BOINC Manager says that the WU is still "running" instead of uploading, and the elapsed time indicator is still going up. According to the graphics, the model is stuck at 99.97%, at timestep 1038232. The messages are a bit confusing because of all the backed-up zip files from the Hadam3p_eu that can't upload due to the server problem.

Messages below:

4/29/2012 11:44:50 AM | | Resuming network activity
4/29/2012 11:44:50 AM | climateprediction.net | Started upload of hadam3p_eu_8fap_2001_1_007868816_0_4.zip
4/29/2012 11:44:50 AM | climateprediction.net | Started upload of hadam3p_eu_8fap_2001_1_007868816_0_5.zip
4/29/2012 11:44:50 AM | climateprediction.net | Sending scheduler request: To send trickle-up message.
4/29/2012 11:44:50 AM | climateprediction.net | Requesting new tasks for CPU
4/29/2012 11:44:58 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_8fap_2001_1_007868816_0_4.zip: transient HTTP error
4/29/2012 11:44:58 AM | climateprediction.net | Backing off 4 hr 9 min 36 sec on upload of hadam3p_eu_8fap_2001_1_007868816_0_4.zip
4/29/2012 11:44:58 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_8fap_2001_1_007868816_0_5.zip: transient HTTP error
4/29/2012 11:44:58 AM | climateprediction.net | Backing off 5 hr 34 min 14 sec on upload of hadam3p_eu_8fap_2001_1_007868816_0_5.zip
4/29/2012 11:44:58 AM | climateprediction.net | Started upload of hadam3p_pnw_c6g3_1970_1_007941314_0_2.zip
4/29/2012 11:44:58 AM | climateprediction.net | Started upload of hadam3p_pnw_c6g3_1970_1_007941314_0_3.zip
4/29/2012 11:45:01 AM | climateprediction.net | Scheduler request completed: got 1 new tasks
4/29/2012 11:45:03 AM | climateprediction.net | Started download of hadam3p_pnw_cbzw_1963_1_007948507.zip
4/29/2012 11:45:05 AM | climateprediction.net | Finished download of hadam3p_pnw_cbzw_1963_1_007948507.zip
4/29/2012 11:46:00 AM | climateprediction.net | Finished upload of hadam3p_pnw_c6g3_1970_1_007941314_0_2.zip
4/29/2012 11:46:00 AM | climateprediction.net | Finished upload of hadam3p_pnw_c6g3_1970_1_007941314_0_3.zip
4/29/2012 11:46:01 AM | climateprediction.net | Started upload of hadam3p_eu_8fap_2001_1_007868816_0_6.zip
4/29/2012 11:46:01 AM | climateprediction.net | Started upload of hadam3p_eu_8fap_2001_1_007868816_0_7.zip
4/29/2012 11:46:06 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_8fap_2001_1_007868816_0_6.zip: transient HTTP error
4/29/2012 11:46:06 AM | climateprediction.net | Backing off 2 hr 34 min 9 sec on upload of hadam3p_eu_8fap_2001_1_007868816_0_6.zip
4/29/2012 11:46:06 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_8fap_2001_1_007868816_0_7.zip: transient HTTP error
4/29/2012 11:46:06 AM | climateprediction.net | Backing off 30 min 7 sec on upload of hadam3p_eu_8fap_2001_1_007868816_0_7.zip
4/29/2012 11:46:08 AM | climateprediction.net | Started upload of hadam3p_eu_8fap_2001_1_007868816_0_8.zip
4/29/2012 11:46:08 AM | climateprediction.net | Started upload of hadam3p_eu_8fap_2001_1_007868816_0_9.zip
4/29/2012 11:46:09 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_8fap_2001_1_007868816_0_8.zip: transient HTTP error
4/29/2012 11:46:09 AM | climateprediction.net | Backing off 9 min 14 sec on upload of hadam3p_eu_8fap_2001_1_007868816_0_8.zip
4/29/2012 11:46:09 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_8fap_2001_1_007868816_0_9.zip: transient HTTP error
4/29/2012 11:46:09 AM | climateprediction.net | Backing off 14 min 44 sec on upload of hadam3p_eu_8fap_2001_1_007868816_0_9.zip
4/29/2012 11:46:31 AM | | Suspending network activity - user request
4/29/2012 11:48:06 AM | | Resuming network activity
4/29/2012 11:48:19 AM | climateprediction.net | update requested by user
4/29/2012 11:48:22 AM | climateprediction.net | Sending scheduler request: Requested by user.
4/29/2012 11:48:22 AM | climateprediction.net | Not reporting or requesting tasks
4/29/2012 11:48:24 AM | climateprediction.net | Scheduler request completed
4/29/2012 12:05:33 PM | climateprediction.net | Started upload of hadam3p_eu_8fap_2001_1_007868816_0_8.zip
4/29/2012 12:05:33 PM | climateprediction.net | Started upload of hadam3p_eu_8fap_2001_1_007868816_0_9.zip
4/29/2012 12:05:35 PM | climateprediction.net | Temporarily failed upload of hadam3p_eu_8fap_2001_1_007868816_0_8.zip: transient HTTP error
4/29/2012 12:05:35 PM | climateprediction.net | Backing off 21 min 16 sec on upload of hadam3p_eu_8fap_2001_1_007868816_0_8.zip
4/29/2012 12:05:35 PM | climateprediction.net | Temporarily failed upload of hadam3p_eu_8fap_2001_1_007868816_0_9.zip: transient HTTP error
4/29/2012 12:05:35 PM | climateprediction.net | Backing off 18 min 18 sec on upload of hadam3p_eu_8fap_2001_1_007868816_0_9.zip


IS THE MODEL DEAD? Should I try again? I have a backup from 2 days ago.

UPDATE: The WU crashed. I am now running the restored backup, with 40 hours left.
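
When failed uploads from one task bury the messages for another, as in the event log in this post, it can help to group the upload messages by task. The following is only a minimal sketch, not part of BOINC: the function name `summarize_uploads` is hypothetical, and it assumes log lines of the form `date time | project | message` and that a task name is the zip file name without its trailing `_<n>.zip`.

```python
import re
from collections import defaultdict

# Matches the upload status messages seen in the log in this post.
# "Backing off ..." and download lines are deliberately not matched.
UPLOAD_RE = re.compile(
    r"(Started|Finished|Temporarily failed) upload of (\S+?\.zip)"
)


def summarize_uploads(log_text):
    """Return {task_name: {zip_file: last_seen_status}} for upload messages."""
    summary = defaultdict(dict)
    for line in log_text.splitlines():
        # Assumed layout: "date time | project | message"
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue
        match = UPLOAD_RE.search(parts[2])
        if not match:
            continue
        status, zip_name = match.groups()
        # Assumption: task name = zip file name minus the "_<n>.zip" suffix.
        task = re.sub(r"_\d+\.zip$", "", zip_name)
        # Later lines overwrite earlier ones, so the last status wins.
        summary[task][zip_name] = status
    return {task: dict(files) for task, files in summary.items()}


sample = """\
4/29/2012 11:44:58 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_8fap_2001_1_007868816_0_4.zip: transient HTTP error
4/29/2012 11:44:58 AM | climateprediction.net | Started upload of hadam3p_pnw_c6g3_1970_1_007941314_0_2.zip
4/29/2012 11:46:00 AM | climateprediction.net | Finished upload of hadam3p_pnw_c6g3_1970_1_007941314_0_2.zip
"""

for task, files in summarize_uploads(sample).items():
    print(task, files)
```

If a hadcm3n task never appears in such a summary, no upload was attempted for it, which would fit a model that stopped before producing its final zip rather than one blocked by the upload server.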
ID: 44114
BeBiMaGe
Joined: 5 Aug 04
Posts: 6
Credit: 184,430
RAC: 0
Message 44117 - Posted: 30 Apr 2012, 10:09:31 UTC

As you may have noticed, some upload servers are out of service.
ID: 44117
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 44118 - Posted: 30 Apr 2012, 10:35:52 UTC - in response to Message 44117.  

According to the posts and my own experience, the hadcm3n models are unaffected by uploader1.atm being out of action, as they use a different server.
ID: 44118
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 44131 - Posted: 1 May 2012, 16:01:02 UTC

Bad news to report. The restored WU progressed to the exact same spot at 99.97% and hung again. It has been aborted. There is one thing that I was wondering. I recently upgraded to the new version [7.0.25 (x64)] of BOINC from 6.10.58. Could upgrading while the hadcm3n was running have caused this? Has anyone else finished a CM model with the new BOINC Manager?

I still have the WU backup and a copy of the 6.10.58 manager stored on my computer if you think it might help. The CMs are such a big commitment of time (about 60 days) that I hate to just give up on this one.

ID: 44131
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44133 - Posted: 1 May 2012, 22:25:32 UTC - in response to Message 44131.  

There is a condition, of unknown cause, whereby the Coupled Ocean models will get to a point where the data is usually gathered up, zipped, and sent back to the server, and then the model just stops. I've had one or two that didn't even produce the zip. The model doesn't even self-abort, it just sits there doing nothing.

It's been discussed at the project level, and something may come of it in time.

However, lots of models do complete successfully, although I don't know how many, what percentage, etc.


Backups: Here
ID: 44133
wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,374,828
RAC: 10,749
Message 44136 - Posted: 2 May 2012, 6:44:37 UTC - in response to Message 44133.  

However, lots of models do complete successfully, although I don't know how many, what percentage, etc.

From four PCs, running XP or Linux with BM 6.n.n.
Results: 85 CM3n started, 66 completed, 19 failed at the 25/50/75/100% points.

I.e. about 78% of these CM3n models complete successfully, and about 22% fail at the zip points.
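
The percentages follow directly from the counts given in this post (85 started, 66 completed, 19 failed). A quick check, as a sketch only: the helper name `completion_rates` is made up for illustration, and four PCs are a small sample, so these figures may not reflect the project-wide rate.

```python
def completion_rates(started, completed, failed):
    """Return (success_fraction, failure_fraction) for a batch of models."""
    if started <= 0:
        raise ValueError("started must be positive")
    if completed + failed != started:
        raise ValueError("completed + failed must equal started")
    return completed / started, failed / started


# Counts reported in this post.
success, failure = completion_rates(started=85, completed=66, failed=19)
print(f"completed: {success:.1%}, failed: {failure:.1%}")
# prints "completed: 77.6%, failed: 22.4%"
```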

ID: 44136
wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,374,828
RAC: 10,749
Message 44137 - Posted: 2 May 2012, 7:18:10 UTC - in response to Message 44131.  

There is one thing that I was wondering. I recently upgraded to the new version [7.0.25 (x64)] of BOINC from 6.10.58. Could upgrading while the hadcm3n was running have caused this? Has anyone else finished a CM model with the new BOINC Manager?

Jim, as one example, CM task 14363087 completed on BM 7.0.25. Note that, as per the release notes, BM 7 makes changes to client_state.xml, and that there is a V7 to V6 downgrade incompatibility.

On balance, I'd be more suspicious of the empirically high fail rate of these CM3n models at the zip points.

ID: 44137
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 44143 - Posted: 2 May 2012, 19:01:52 UTC

Thanks everyone. I guess that the WU is just dead. It is hard to give up on it when you have 700+ hours of crunching invested.

ID: 44143
