climateprediction.net (CPDN) home page
Thread 'HadCM3n release'

Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 53379 - Posted: 3 Feb 2016, 5:50:02 UTC

I've got only one of these still running, re-issued for the 3rd time in late January and looking to finish in 2-3 days (maybe with the infamous "INVALID THETA", as happened to one of the wingmen very near the end of the model).
The next-to-last of this batch on my machines finished a few hours ago. It was issued in December, but unluckily at about 90% the host died of a PSU failure. Weeks later I was able to copy the BOINC folder from the surviving disk into a VirtualBox VM, and it completed OK.
ID: 53379
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 53382 - Posted: 3 Feb 2016, 8:35:40 UTC - in response to Message 53379.  

I too have just one of these grinding along: 63% complete, 463 hours elapsed, 658 hours remaining! My second one failed when I had to reboot the machine for something; I forget exactly what now.
ID: 53382
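
A quick way to sanity-check a figure like the 658 hours above is a straight-line extrapolation from progress so far. The sketch below is only that: it assumes progress is linear in elapsed time (which these models need not obey), and the function name `linear_remaining_hours` is invented here for illustration.

```python
def linear_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Estimate hours left, assuming progress is linear in elapsed time."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total run time if the rate so far holds to the end.
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


if __name__ == "__main__":
    # Figures quoted in the post above: 63% complete after 463 hours elapsed.
    remaining = linear_remaining_hours(463.0, 0.63)
    print(f"Straight-line estimate: about {remaining:.0f} hours remaining")
```

With the numbers from the post above, the straight-line estimate comes out at roughly 272 hours, well short of the 658 hours BOINC was showing.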
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 53395 - Posted: 4 Feb 2016, 2:01:57 UTC - in response to Message 53379.  
Last modified: 4 Feb 2016, 3:05:30 UTC

Hi, Eirik,

My i5-3350 (Desktop) has a third copy of:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10218540
running well beyond its two predecessors, which died at different percentages and for different reasons.
So far: 24 Trickles, 60.7% complete at timestep (T.S.) 630,950, running at 0.74 s/TS (it seems to thrive in Win10).

I suspect it will fail at the end, failing to upload #12 .zip file. However, it should send what is (based on earlier experience with this release) a full #13 .zip file (Restart Dump) -- at least I hope it will be complete.

[EDIT: Corrected monthly Trickle count; I forgot that only 20 are shown unless one 'clicks' for the full list ...]
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 53395
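
The timestep figures in the post above allow a similar back-of-envelope estimate. This is a minimal sketch, assuming that 60.7% is the fraction of total timesteps completed and that 0.74 s/TS stays constant for the rest of the run. The function name is hypothetical, and the projected total is only as accurate as the rounded percentage.

```python
from typing import Tuple


def remaining_from_timesteps(timesteps_done: int, fraction_done: float,
                             seconds_per_timestep: float) -> Tuple[int, float]:
    """Return (estimated timesteps left, estimated hours left)."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total timesteps, inferred from the (rounded) percentage.
    total_timesteps = round(timesteps_done / fraction_done)
    timesteps_left = total_timesteps - timesteps_done
    hours_left = timesteps_left * seconds_per_timestep / 3600.0
    return timesteps_left, hours_left


if __name__ == "__main__":
    # Figures quoted in the post above: 60.7% at timestep 630,950, 0.74 s/TS.
    steps, hours = remaining_from_timesteps(630_950, 0.607, 0.74)
    print(f"About {steps:,} timesteps left, roughly {hours:.0f} hours")
```

On those assumptions roughly 408,500 timesteps remain, or about 84 hours of run time.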
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 53531 - Posted: 27 Feb 2016, 1:32:59 UTC
Last modified: 27 Feb 2016, 2:41:09 UTC

That task finished OK, as did another after it. The bad part is that CPU time and Time Remaining don't work. The good news is that graphics work, so progress can be followed easily. Unfortunately, BOINC can't decrement Time Remaining, so it can't properly manage downloading new work according to your settings.


Three new HadCM3n batches released today: 350/351/352. Some details:
352 HadCM3N perturbed physics low sensitivity resubmissions for control experiment
351 HadCM3N perturbed physics low sensitivity resubmissions for step experiment
350 HadCM3N perturbed physics low sensitivity resubmissions for ramp experiment


A few are running on my machines. One from #352 crashed in less than six minutes with "INVALID THETA" error. That isn't surprising for the array of tests covered by this set. I won't post to scientists unless more meet the same early fate.

By the way, CPU time and Time remaining work on some of these tasks.


[Edit] So far, three of four tasks from Batch 352 crashed, in seconds, with "INVALID THETA" -- the other nears two hours as this is written. I'll advise staff.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 53531
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 53536 - Posted: 27 Feb 2016, 17:28:47 UTC

So far, three of four tasks from Batch 352 crashed, in seconds, with "INVALID THETA" -- the other nears two hours as this is written. I'll advise staff.


Sarah has come back on this with,

This is a batch that is pushing the limits of parameter space so we would expect higher than normal failures with this so hopefully this is nothing unexpected. Looking at the statistics now it doesn't look like all are failing but there are still a number in the queue!


Though looking at the numbers there won't be any left in the queue shortly.
ID: 53536
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 53537 - Posted: 27 Feb 2016, 21:16:58 UTC
Last modified: 27 Feb 2016, 21:24:31 UTC

My one linux machine that hasn't switched to wine -
Just got 4 wu's of the 351 and 352 batch. (those batches are already all issued)
These are now 1-3 hours running no errors yet.
And yes, the cpu time and the "time remaining" numbers that BOINC estimates are - well within an order of magnitude :) --
"Remaining time (estimated) 700 hours" or so -- not so. More like a week or two
C'est la software.
ID: 53537
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53547 - Posted: 1 Mar 2016, 23:56:56 UTC

I haven't seen the HadCM3n for a while, so with my larger cache and working memory I was eager to give them a try. Unfortunately, two days into the run, I had to shut down the machine to change the UPS. It was just a normal software shutdown of Win7 64-bit, with the contents of the large (20 GB) write cache being written to the SSD. However, upon startup, all five of the tasks had errored out.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10333673
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10332728
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10332668
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10335105
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10334857

That seems even more fragile than I remembered them to be. But they have errored out on other machines as well. Maybe the shutdown of my machine just accelerated the inevitable?
ID: 53547
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53548 - Posted: 2 Mar 2016, 4:07:33 UTC - in response to Message 53547.  

Jim

They're the ones that Astro and Dave mentioned just above, that are "pushing the limits of stability", so that may indeed have exacerbated the failures.

ID: 53548
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 53549 - Posted: 2 Mar 2016, 5:04:27 UTC

I haven't had the same problem with the hadcm3n's. Earlier today I rebooted my system (after first suspending the running models and exiting boinc manager) with no problem. Everything started right back up on reboot and unsuspend. So the problem is not general.

ID: 53549
Lockleys

Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 53550 - Posted: 2 Mar 2016, 7:25:01 UTC - in response to Message 53549.  
Last modified: 2 Mar 2016, 7:25:24 UTC

Like JIM, I haven't had the same problems with hadcm3n. I am closing my system down twice a day and taking backups. (Got builders in and I am anticipating them crashing the power supply!) Perhaps it's in the method? I suspend all Tasks, wait 10 seconds, suspend the cpdn Project, wait 10 seconds, resume all tasks (but keep the project suspended), then exit BOINC Manager. Then, on opening BOINC Manager again, it all just carries on smoothly. Or perhaps I've just hit lucky with my allocated hadcm3n (351 type).
ID: 53550
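
For anyone who would rather script that routine than click through BOINC Manager, here is a rough sketch using `boinccmd`, the command-line tool that ships with BOINC. Assumptions, all of them: the project URL shown is a placeholder for whatever URL your client lists for CPDN, the task names are made up, `--quit` (which stops the client) stands in for exiting BOINC Manager, and the script only prints the commands unless `dry_run` is set to `False`.

```python
import subprocess
import time
from typing import List

# Assumption: replace with the project URL exactly as your BOINC client shows it.
PROJECT_URL = "http://climateprediction.net/"


def build_shutdown_phases(project_url: str, task_names: List[str]) -> List[List[List[str]]]:
    """Build boinccmd calls in three phases, mirroring the routine described above."""
    suspend_tasks = [["boinccmd", "--task", project_url, name, "suspend"]
                     for name in task_names]
    suspend_project = [["boinccmd", "--project", project_url, "suspend"]]
    # Tasks are resumed while the project stays suspended, then the client quits.
    resume_and_quit = [["boinccmd", "--task", project_url, name, "resume"]
                       for name in task_names]
    resume_and_quit.append(["boinccmd", "--quit"])
    return [suspend_tasks, suspend_project, resume_and_quit]


def run_phases(phases: List[List[List[str]]], pause_seconds: float = 10.0,
               dry_run: bool = True) -> None:
    """Run (or, in dry-run mode, just print) each phase with a pause between phases."""
    for index, phase in enumerate(phases):
        if index > 0:
            if dry_run:
                print(f"(wait {pause_seconds:g} s)")
            else:
                time.sleep(pause_seconds)
        for command in phase:
            if dry_run:
                print(" ".join(command))
            else:
                subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical task names, for illustration only.
    example_tasks = ["hadcm3n_example_task_1", "hadcm3n_example_task_2"]
    run_phases(build_shutdown_phases(PROJECT_URL, example_tasks))
```

Real task names can be read from `boinccmd --get_tasks`. Since the project is left suspended, it would presumably need resuming after the restart, for example with `boinccmd --project <URL> resume`.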
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53551 - Posted: 2 Mar 2016, 8:47:47 UTC

Thanks for the input. A reboot should be possible, but somehow between the shutting down of BOINC and the writing of the cache to the SSD a few bits are being lost. I will try Lockleys' technique of shutting down BOINC first, to see if I can make it work more reliably.

I was running another work unit at the time, a hadam3p_eu, which started just before the reboot and which did not error out, so it seems that the HadCM3n are the most vulnerable (no surprise there, but interesting).
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=19321778

That accounts for all six cores of my i7-4790, plus the two that are supporting GTX 960s on POEM. It is all I run on that machine; not even an AV, and it has plenty of memory, stable power, etc. However, the SSD, a Samsung 840 Pro, is not above reproach. Once in a while I have seen a few bad blocks, which may call for some remedial action or replacement, as the case may be. I don't want to give up the HadCM3n work, but need to get it more reliable to justify the long times spent on it. But you have convinced me that it should be possible.
ID: 53551
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 53552 - Posted: 2 Mar 2016, 8:55:56 UTC

Like y'all say, no problems here with this batch being more vulnerable to shutdowns and reboots.
I do the drill when I need to reboot for security upgrades and such - suspend all tasks, suspend network, for good measure suspend the project - sync sync sync - shutdown - - - -
Reboot - all is well. As for hardware problems -- who knows?

BUT - there was a recent stretch with one of my hosts, where, dunno why? -
Even normally safe shutdowns crashed some models on restart. Tried Windows update, restore saved files from backup, kept crashing new WU's --
Now magically, problem gone. No clue.

BUT also, the uploads are going so slow now, maybe the server just lost track? :)


ID: 53552
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 53553 - Posted: 2 Mar 2016, 9:03:51 UTC

My experience with these tasks (not this batch; I don't have any of them) is that on my Linux boxes, something like 3/4 fall over following a restart after a kernel update. If the shutdown and restart is for any other reason except power failure, over 9/10 survive.

Clearly I need to get a UPS!
ID: 53553
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 53554 - Posted: 2 Mar 2016, 9:27:17 UTC - in response to Message 53553.  

My experience with these tasks (not this batch; I don't have any of them) is that on my Linux boxes, something like 3/4 fall over following a restart after a kernel update. If the shutdown and restart is for any other reason except power failure, over 9/10 survive.

Clearly I need to get a UPS!


Strange -- I've had few problems with kernel updates and the necessary reboots, as long as I've got a clean suspend tasks before reboot.
Both on linux, linux with wine, and Virtualbox Windows 10.
Maybe 1 in 10 tasks fails after reboot after clean shutdown, whatever reason for shutdown.
But - I've had a few clusters of "every model craps out" after the cleanest shutdown,
And -- I've had all models survive and complete OK after power failures.
I've no clue?

UPS can help, my 2 out of 7 UPS protected hosts do better, but not perfect.

Me, no clue why some models more vulnerable -- or maybe accumulated errors show up after restart?
ID: 53554
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 53555 - Posted: 2 Mar 2016, 9:40:57 UTC

Strange -- I've had few problems with kernel updates and the necessary reboots, as long as I've got a clean suspend tasks before reboot.


It may just be that my sample size isn't big enough for the results to be significant.
ID: 53555
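
The sample-size point can be made concrete with a confidence interval on a failure rate. This is a minimal sketch using the Wilson score interval. The counts (3 failures in 4 restarts, 1 failure in 10) are hypothetical stand-ins for the "something like 3/4" and "over 9/10 survive" figures quoted earlier in the thread, not real tallies.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= failures <= trials:
        raise ValueError("need trials > 0 and 0 <= failures <= trials")
    p = failures / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials ** 2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical counts, chosen only to match the rough fractions in the thread.
    for label, failures, trials in [("after kernel update", 3, 4),
                                    ("other clean restarts", 1, 10)]:
        low, high = wilson_interval(failures, trials)
        print(f"{label}: failure rate between {low:.2f} and {high:.2f}")
```

With those made-up counts the 95% intervals are roughly 0.30 to 0.95 and 0.02 to 0.40. They overlap, which is a hint that samples this small leave plenty of room for chance.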
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53556 - Posted: 2 Mar 2016, 10:17:30 UTC

I have just noticed the obvious: each of my five errors above was "INVALID THETA DETECTED" (I am the 1349694 machine by the way). Whether that absolves my machine is not clear, and why it would take a reboot to bring that out is also not clear. Someone who knows more about the models than I do will have to puzzle that one out.
ID: 53556
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 53557 - Posted: 2 Mar 2016, 11:24:32 UTC - in response to Message 53555.  

Strange -- I've had few problems with kernel updates and the necessary reboots, as long as I've got a clean suspend tasks before reboot.


It may just be that my sample size isn't big enough for the results to be significant.


:) You got the idea!


ID: 53557
KWSN - Sir Frank of the Wood

Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 53573 - Posted: 5 Mar 2016, 11:11:25 UTC

i have 3 of these WUs (hadcm3n) running on my machine 1317408...clock running but NO cpu usage visible...progress still going up at 7% to 13%...

don't remember seeing this behavior before...are these units doing anything ??? or just shoveling sand uphill ???

frank

ID: 53573
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 53574 - Posted: 5 Mar 2016, 15:25:01 UTC - in response to Message 53573.  

Are they sending trickles and zip files?
ID: 53574
KWSN - Sir Frank of the Wood

Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 53575 - Posted: 5 Mar 2016, 15:32:39 UTC - in response to Message 53573.  

hello jim

haven't seen any...should the trickles be at 8% or 12.5% or some other interval ???

frank
ID: 53575
