Message boards : Number crunching : HadCM3n release
Author | Message |
---|---|
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I've got only one of these still running. It was re-issued for the third time in late January and looks set to finish in 2-3 days (maybe with the infamous "INVALID THETA", as happened to one of the wingmen very near the end of the model). The next-to-last of this batch on my machines finished a few hours ago. It was issued in December, but unluckily the host died of a PSU failure at about 90%. Weeks later I was able to copy the BOINC folder from the surviving disk into a VirtualBox VM, and it completed OK. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I too have just one of these grinding along: 63% complete, 463 hours elapsed, 658 hours remaining! My second one failed when I had to reboot the machine for something, I forget exactly what now. |
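For what it's worth, the figures in the post above can be sanity-checked with a simple linear extrapolation. This is only a rough sketch: it assumes the model keeps running at the same average rate as it has so far, which the thread does not confirm, and the function name is made up for illustration. On that assumption, 63% done after 463 hours points to roughly 272 hours left, well short of the 658 hours BOINC reports.

```python
def estimate_remaining_hours(fraction_done: float, elapsed_hours: float) -> float:
    """Linear extrapolation of remaining run time.

    Assumption (not from the thread): the rest of the task runs at the
    same average rate as the part already completed.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total run time at the average rate observed so far.
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


if __name__ == "__main__":
    # Figures quoted in the post: 63% complete after 463 hours.
    remaining = estimate_remaining_hours(0.63, 463.0)
    print(f"about {remaining:.0f} hours ({remaining / 24:.1f} days) remaining")
```

The gap between this figure and BOINC's own estimate fits the later remarks in the thread that "Time Remaining" is unreliable for these tasks.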
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Hi, Eirik, My i5-3350 (Desktop) has a third copy of: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10218540 running well beyond its two predecessors, which died at different percentages and for different reasons. So far, 24 Trickles and 60.7% @ T.S. 630,950 (0.74 s/TS [it seems to thrive in Win10]). I suspect it will fail at the end, failing to upload #12 .zip file. However, it should send what is (based on earlier experience with this release) a full #13 .zip file (Restart Dump) -- at least I hope it will be complete. [EDIT: Corrected monthly Trickle count; I forgot that only 20 are shown unless one 'clicks' for the full list ...] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
That task finished OK, as did another after it. The bad part is that CPU time and Time Remaining don't work. The good news is that graphics work, so progress can be followed easily. Unfortunately, BOINC can't decrement Time Remaining to allow proper management of downloading new work according to your settings. Three new HadCM3n batches released today: 350/351/352. Some details: 352 HadCM3N perturbed physics low sensitivity resubmissions for control experiment. A few are running on my machines. One from #352 crashed in less than six minutes with an "INVALID THETA" error. That isn't surprising for the array of tests covered by this set. I won't post to scientists unless more meet the same early fate. By the way, CPU time and Time Remaining work on some of these tasks. [Edit] So far, three of four tasks from Batch 352 crashed, in seconds, with "INVALID THETA" -- the other nears two hours as this is written. I'll advise staff. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
So far, three of four tasks from Batch 352 crashed, in seconds, with "INVALID THETA" -- the other nears two hours as this is written. I'll advise staff. Sarah has come back on this with, This is a batch that is pushing the limits of parameter space so we would expect higher than normal failures with this so hopefully this is nothing unexpected. Looking at the statistics now it doesn't look like all are failing but there are still a number in the queue! Though looking at the numbers there won't be any left in the queue shortly. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
My one Linux machine that hasn't switched to Wine just got 4 WUs from the 351 and 352 batches (those batches are already all issued). They have now been running 1-3 hours with no errors yet. And yes, the CPU time and the "time remaining" numbers that BOINC estimates are, well, within an order of magnitude :) -- "Remaining time (estimated) 700 hours" or so -- not so. More like a week or two. C'est la software. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I haven't seen the HadCM3n for a while, so with my larger cache and working memory I was eager to give them a try. Unfortunately, two days into the run, I had to shut down the machine to change the UPS. It was just a normal software shutdown of Win7 64-bit, with the contents of the large (20 GB) write-cache being written to the SSD. However, upon startup, all five of the tasks had errored out. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10333673 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10332728 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10332668 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10335105 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10334857 That seems even more fragile than I remembered them to be. But they have errored out on other machines as well. Maybe the shutdown of my machine just accelerated the inevitable? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Jim They're the ones that Astro and Dave mentioned just below, that are "pushing the limits of stability", so that may indeed have exacerbated the failures. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I haven't had the same problem with the hadcm3n's. Earlier today I rebooted my system (after first suspending the running models and exiting BOINC Manager) with no problem. Everything started right back up on reboot and unsuspend. So the problem is not general. |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Like JIM, I haven't had the same problems with hadcm3n. I am closing my system down twice a day and taking backups. (Got builders in and I am anticipating them crashing the power supply!) Perhaps it's in the method? I suspend all Tasks, wait 10 seconds, suspend the cpdn Project, wait 10 seconds, resume all tasks (but keep the project suspended), then exit BOINC Manager. Then, on opening BOINC Manager again, it all just carries on smoothly. Or perhaps I've just hit lucky with my allocated hadcm3n (351 type). |
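The sequence described above (suspend tasks, wait, suspend the project, wait, resume tasks, exit) could be scripted roughly as follows. This is a sketch under assumptions, not the poster's method: it swaps the BOINC Manager clicks for the `boinccmd` command-line tool, the project URL is a placeholder, the task names are hypothetical, and the final `--quit` stops the BOINC client itself rather than just closing the Manager window. By default it only prints the commands.

```python
import subprocess
import time

# Assumption: placeholder only. Use the project URL your own BOINC client
# lists for CPDN; this value is not taken from the thread.
PROJECT_URL = "https://example.invalid/cpdn/"


def build_shutdown_steps(task_names, project_url=PROJECT_URL, pause_s=10):
    """Return the ordered steps: boinccmd argument lists and ("sleep", n) pauses."""
    steps = []
    # 1. Suspend every running task.
    for name in task_names:
        steps.append(["boinccmd", "--task", project_url, name, "suspend"])
    steps.append(("sleep", pause_s))
    # 2. Suspend the whole project.
    steps.append(["boinccmd", "--project", project_url, "suspend"])
    steps.append(("sleep", pause_s))
    # 3. Resume the tasks; they stay idle because the project is suspended.
    for name in task_names:
        steps.append(["boinccmd", "--task", project_url, name, "resume"])
    # 4. Ask the client to exit (stand-in for exiting BOINC Manager).
    steps.append(["boinccmd", "--quit"])
    return steps


def run_steps(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for step in steps:
        if isinstance(step, tuple):
            print(f"wait {step[1]} s")
            if not dry_run:
                time.sleep(step[1])
        else:
            print(" ".join(step))
            if not dry_run:
                subprocess.run(step, check=True)


if __name__ == "__main__":
    # Hypothetical task name, for illustration only.
    run_steps(build_shutdown_steps(["hadcm3n_example_task_0"]), dry_run=True)
```

After restarting, the project would still need to be resumed (for example with `boinccmd --project <url> resume`) before the models pick up again.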
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
Thanks for the input. A reboot should be possible, but somehow between the shutting down of BOINC and the writing of the cache to the SSD a few bits are being lost. I will try Lockley's technique of shutting down BOINC first, to see if I can make it work more reliably. I was running another work unit at the time, a hadam3p_eu, which started just before the reboot and which did not error out, so it seems that the HadCM3n are the most vulnerable (no surprise there, but interesting). http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=19321778 That accounts for all six cores of my i7-4790, plus the two that are supporting GTX 960s on POEM. It is all I run on that machine; not even an AV, and it has plenty of memory, stable power, etc. However, the SSD, a Samsung 840 Pro, is not above reproach. Once in a while I have seen a few bad blocks, which may call for some remedial action or replacement, as the case may be. I don't want to give up the HadCM3n work, but I need to get it more reliable to justify the long times spent on it. But you have convinced me that it should be possible. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Like y'all say, no problems here with this batch being more vulnerable to shutdowns and reboots. I do the drill when I need to reboot for security upgrades and such - suspend all tasks , suspend network, for good measure suspend the project - sync sync sync - shutdown - - - - Reboot - all is well. As for hardware problems -- who knows? BUT - there was a recent stretch with one of my hosts, where, dunno why? - Even normally safe shutdowns crashed some models on restart. Tried Windows update, restore saved files from backup, kept crashing new WU's -- Now magically, problem gone. No clue. BUT also, the uploads are going so slow now, maybe the server just lost track? :) |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
My experience with these tasks (not this batch - don't have any of them.) is that on my Linux boxes, something like 3/4 fall over following restart after kernel update. If shut down and restart is for any other reason except power failure over 9/10 survive. Clearly I need to get a UPS! |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
My experience with these tasks (not this batch - don't have any of them.) is that on my Linux boxes, something like 3/4 fall over following restart after kernel update. If shut down and restart is for any other reason except power failure over 9/10 survive. Strange -- I've had few problems with kernel updates and the necessary reboots, as long as I've got a clean suspend tasks before reboot. Both on linux, linux with wine, and Virtualbox Windows 10. Maybe 1 in 10 tasks fails after reboot after clean shutdown, whatever reason for shutdown. But - I've had a few clusters of "every model craps out" after the cleanest shutdown, And -- I've had all models survive and complete OK after power failures. I've no clue? UPS can help, my 2 out of 7 UPS protected hosts do better, but not perfect. Me, no clue why some models more vulnerable -- or maybe accumulated errors show up after restart? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Strange -- I've had few problems with kernel updates and the necessary reboots, as long as I've got a clean suspend tasks before reboot. It may just be that my sample size isn't big enough for the results to be significant. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I have just noticed the obvious: each of my five errors above was "INVALID THETA DETECTED" (I am the 1349694 machine, by the way). Whether that absolves my machine is not clear, and why it would take a reboot to bring that out is also not clear. Someone who knows more about the models than I do will have to puzzle that one out. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Strange -- I've had few problems with kernel updates and the necessary reboots, as long as I've got a clean suspend tasks before reboot. :) You got the idea! |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
i have 3 of these WUs (hadcm3n) running on my machine 1317408...clock running but NO cpu usage visible...progress still going up at 7% to 13%... don't remember seeing this behavior before...are these units doing anything ??? or just shoveling sand uphill ??? frank |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Are they sending trickles and zip files? |
Send message Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0 |
hello jim haven't seen any...should the trickles be at 8% or 12.5% or some other interval ??? frank |
©2024 cpdn.org