Message boards : Number crunching : HadCM3 short errors
Author | Message |
---|---|
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860 |
Just been checking some of my other tasks which have dates 1994 and 2004 which are giving compute errors on other machines. Either no heartbeat or invalid theta errors. I guess these are going to fail at some point. These are not "no resubmission" tasks. 'No heartbeat' is generally an operational problem with the computer the model is running on. You should let those run, please - they should run on your (different) computer. 'Invalid theta' may cause the task to fail, but still provides valuable information to the scientific researchers. Again, they're worth running, even if they eventually fail. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,977,555 RAC: 14,225 |
Thanks Richard - I'll let them run. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Yes, I notice I too have a number of tasks that are on their 4th attempt but not marked "No resubmission". They also have all the other tasks in the work unit down as computer error. So far I have only been deleting the "No resubmission" tasks. Once I am sure that they are all gone I might abort the ones on their 4th attempt. |
Send message Joined: 31 Aug 04 Posts: 10 Credit: 2,538,005 RAC: 0 |
Alan, While most of my "No Resubmission" tasks are the 1980s batch, a few are not, so it is necessary to check them all. I don't know if I am referring to the same problem here, but I recently had a number of tasks that failed with an 'out of memory' message. Same message reported by other tasks for the same workunit. Examples: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9406431 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9413766 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9408774 The last one in this list was sent out again yesterday. Seems like a waste of resources ... Tom |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
Those 3 tasks were all 1980 "No Resubmission" tasks. In the last few days I have aborted around 20 of those. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Looking at the current batch of short models in more detail, I decided to do a little more delving. As far as I can see, on every one of the short tasks I am running currently, and on those completed successfully, all my wingmen have failed. Where these are Linux computers, the failures are all down to missing libraries. I note that none of them seem to complete on Windows machines, mostly falling over within 15 seconds with lots of suspend requests and sometimes also with the "no heartbeat" message. There are also a handful with invalid theta. Don't know if this is of any help but thought it worth noting. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I note that none of them seem to complete on Windows machines, mostly falling over within 15 seconds with lots of suspend requests and sometimes also with the "no heartbeat" message. There are also a handful with invalid theta. Then what OS are you running? They complete OK on my Win7 64-bit machine if I abort all the "no resubmissions". http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1349694 I am using a write-cache as noted previously, which seems to be quite necessary to bring down the error rate. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7978&nowrap=true#51395 But I got tired of having to watch over the machine, and have stopped doing the shorts entirely, since if I miss too many bad work units it seems to have the ability to crash the machine, though that may be a rare event. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Then what OS are you running? They complete OK on my Win7 64-bit machine if I abort all the "no resubmissions". I am running Linux, having only run that OS this century! I am using a write-cache as noted previously, which seems to be quite necessary to bring down the error rate. That might explain how few of these tasks are completing. Most crunchers install and forget, which may be fine for some other projects but often doesn't work for CPDN. I am just running the short tasks at the moment because the others available for my machine seem to have problems if I shut down at night. It just strikes me that when the vast majority of tasks, even excluding the "no resubmission" ones (which I seem to have stopped getting now), are falling over, it isn't the best use of computing time, either the project's or the crunchers'. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I suppose this project started out on mainframes years ago, and no one ever took a close look at the hardware differences with PCs, with disk drives and how operating systems interact with them being perhaps the leading example. We are left to fend for ourselves, and I am about fended out on this one, though the others are going nicely. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I am just running the short tasks at the moment because the others available for my machine seem to have problems if I shut down at night. I can't speak for your experiences, but the MOSES EU models seem to do alright with shutdowns and reboots on my PCs. Even starting up after power outages due to recent lightning strikes hasn't caused any problems. It's those MOSES "global-only" models that inevitably fail if they are removed from memory. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I suppose this project started out on mainframes years ago, and no one ever took a close look at the hardware differences with PCs, with disk drives and how operating systems interact with them being perhaps the leading example. Yes, the Met Office DID write these for their supercomputers, and do still run them that way. The desktop versions were a collaboration between the Met Office and several people in the Atmospheric, Oceanic and Planetary Physics sub-dept of the Department of Physics, in the early part of this century. Carl said "way back", that it took them the better part of two years to get the program set that was the start of it, (the "slab ocean" model), working, and stable enough to use on desktops. Currently, the Met Office releases Linux versions (32 bit only, I think), for professional climate physicists around the world. We just happen to be lucky that we're able to tag along with them. We are left to fend for ourselves, and I am about fended out on this one, though the others are going nicely. Ongoing problems with existing models ARE passed on to the project people. It probably depends on which model type and research project gets the attention from time to time. And there are no "short" models left in the queue. Perhaps we can get them made Linux only for any future runs. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
but the MOSES EU models seem to do alright with shutdowns and reboots on my PCs. Thanks, will check back to see if the ones that failed were all global-only ones, though by the time I have cleared what I have in the queue there may well be more short tasks going. Perhaps we can get them made Linux only for any future runs. Just have to sort out all the Linux machines with missing libraries and then even large batches of them would get finished quite quickly! |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,290,001 RAC: 13,288 |
Les said Perhaps we can get them made Linux only for any future runs. Hey, my 64 bit Windows 8.1 computer runs them well. In fact, many complete on my computer that have failed elsewhere. I just wish I did not have so many 1980 No Resubmission tasks sent to me! |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Les said Yeah, the shorts work fine for me on Win7 too. Likewise about having to check and abort the No Resubmission ones. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Oh dear, not the OS debate again. Hopefully the project managers can do a proper analysis of whether the models are failing because of the OS, as to me it does not seem to be simply an OS fail/complete. Last year ALL my short models were failing (over 180), and Les's comment at the time was "Luck of the draw, I think." He could well be correct for now over the last month or so I've had well over 100 complete (1980 models excluded), and the only failures were 1980 No Resubmissions run in error. I'm running Win7 with BOINC as a service and BOINC Service gets stopped and started automatically every night while the PC does a backup. Then there are all the other times it gets shut down when doing Win updates, stopping to get intensive graphics work done etc. Once I accidentally hit the power button and when rebooted the tasks resumed just fine. So the model (all models actually) now seems pretty stable on my Win PC. Had a quick look at some of my short tasks, and most were new with just my PC having run them through. Of those where I succeeded and others failed, of the failures 7 were Linux, 20 were Win. About even stevens I would have thought. I didn't look at the reason for failure - leave that for someone else. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,703,308 RAC: 9,860 |
I'm running Win7 with BOINC as a service and BOINC Service gets stopped and started automatically every night while the PC does a backup. Then there are all the other times it gets shut down when doing Win updates, stopping to get intensive graphics work done etc. Once I accidentally hit the power button and when rebooted the tasks resumed just fine. You are, of course, still running the older v7.0.36 BOINC which avoids triggering the service mode bug with CPDN's 2014 Windows builds - which still applies to the HadCM3 short app. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
That is of course true Richard, and it poses a couple of interesting points. 1. The number of PCs running as a service is pretty low as I understand it, and I understood from Les that particular error was triggered by a combination of 'Windows + BOINC 7 [>7.0.36] + a service install'. 2. I looked through a few of my completed short tasks to see if they had wingmen with errors. Out of 17 workunits, the task failures by BOINC version were as follows ("L" = Linux, the rest Windows): 6.10.58 - 1, 1L; 7.4.23 - 0, 2L; 7.4.27 - 2; 7.4.28 - 1; 7.4.36 - 4; 7.4.42 - 22. I then checked a few of those failed PCs and there was a mix of those that failed all shorts and those that failed a lot (excluding the 1980 runs). I didn't notice any where short model failures were low. But I only checked a couple of PCs. 3. When I was failing tasks last year using the v7.2.42 of BOINC (yes that is 7.2.42, not 7.4.42), many tasks had Invalid Theta errors that are generally explained as model errors, not computer errors. When I went back from the v7.2.42 to v7.0.36, I went from 100% failure to 100% success. Interesting. Could it be that something else in the later versions of BOINC is triggering errors in the PCs not running as a service? It would be interesting to see one of those other PCs go back to an earlier version of BOINC to see what happens. But then of course we would need more short tasks :-( Anyway, I leave that question to better minds than mine to figure out. |
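For anyone wanting to total up the per-version failure counts in point 2 above, here is a minimal Python sketch. It assumes each entry reads as "Windows failures, then Linux failures marked L" (so "6.10.58 - 1, 1L" is taken as 1 Windows and 1 Linux), which is one reading of the post and not something the poster confirmed. The names `failures` and `summarise` are made up for illustration.

```python
# Failure counts transcribed from the post above.
# Assumed layout: BOINC client version -> (Windows failures, Linux failures).
failures = {
    "6.10.58": (1, 1),
    "7.4.23": (0, 2),
    "7.4.27": (2, 0),
    "7.4.28": (1, 0),
    "7.4.36": (4, 0),
    "7.4.42": (22, 0),
}


def summarise(counts):
    """Total the failures per OS and find the version with the most failures."""
    windows = sum(win for win, _ in counts.values())
    linux = sum(lin for _, lin in counts.values())
    total = windows + linux
    per_version = {version: win + lin for version, (win, lin) in counts.items()}
    worst = max(per_version, key=per_version.get)
    return {
        "windows": windows,
        "linux": linux,
        "total": total,
        "worst_version": worst,
        "worst_share": per_version[worst] / total,
    }


summary = summarise(failures)
print(summary)
```

These are raw counts only. They are not divided by the number of hosts running each BOINC version, so a large figure for one version may simply mean that more machines run it, not that the version itself causes failures.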
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Just got five of the new short models. 4 failed downloads with permanent HTTP errors, fifth computation error at 25 seconds. Edit: Computation error was a 1980 model, all others 1991. All have been around the block three times already. edit2: looks like all are resubmissions - aborting. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
And some more 1991 tasks failing to download today. The 1994 one is running fine despite being third time round the block but that doesn't really mean much with so many Linux boxes missing libraries etc. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
I would be delighted to try the ordinary failures, but it is the "no resubmissions" that get me. You would think that they could run a script to get rid of them, instead of relying on us to babysit them. |
©2024 cpdn.org