climateprediction.net (CPDN) home page
Thread 'HadCM3 short errors'

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,703,308
RAC: 9,860
Message 51984 - Posted: 24 May 2015, 11:15:56 UTC - in response to Message 51983.  

Just been checking some of my other tasks which have dates 1994 and 2004 which are giving compute errors on other machines. Either no heartbeat or invalid theta errors. I guess these are going to fail at some point. These are not "no resubmission" tasks.

'No heartbeat' is generally an operational problem with the computer the model is running on. You should let those run, please - they should run on your (different) computer.

'Invalid theta' may cause the task to fail, but still provides valuable information to the scientific researchers. Again, they're worth running, even if they eventually fail.
ID: 51984
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 30,979,211
RAC: 14,216
Message 51985 - Posted: 24 May 2015, 20:27:38 UTC - in response to Message 51984.  

Thanks Richard - I'll let them run.
ID: 51985
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,009,815
RAC: 21,293
Message 52004 - Posted: 29 May 2015, 11:17:56 UTC

Yes, I notice that I too have a number of tasks that are on their 4th attempt but not marked "No resubmission". In those work units, all the other tasks are also down as computer errors.

So far I have only been deleting the no resubmission tasks. Once I am sure that they are all gone I might abort the ones on 4th attempt.
ID: 52004
BetelgeuseFive

Joined: 31 Aug 04
Posts: 10
Credit: 2,538,005
RAC: 0
Message 52007 - Posted: 30 May 2015, 8:19:05 UTC - in response to Message 51982.  

Alan, While most of my "No Resubmission" tasks are the 1980s batch, a few are not, so it is necessary to check them all.


I don't know if I am referring to the same problem here, but I recently had a number of tasks that failed with an 'out of memory' message. Same message reported by other tasks for the same workunit.
Examples:

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9406431
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9413766
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=9408774

The last one in this list was sent out again yesterday. Seems like a waste of resources ...

Tom

ID: 52007
ed2353

Joined: 15 Feb 06
Posts: 137
Credit: 35,290,829
RAC: 13,097
Message 52008 - Posted: 30 May 2015, 13:41:00 UTC - in response to Message 52007.  

Those 3 tasks were all 1980 "No Resubmission" tasks.

In the last few days I have aborted around 20 of those.
ID: 52008
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,009,815
RAC: 21,293
Message 52016 - Posted: 4 Jun 2015, 9:29:22 UTC

Looking at the current batch of short models in more detail.

I decided to do a little more delving.
As far as I can see, on every one of the short tasks I am currently running, and on those I have completed successfully, all my wingmen have failed. Where the wingmen are Linux computers, the failures are all down to missing libraries.

I note that none of them seem to complete on Windows machines, mostly falling over within 15 seconds with lots of suspend requests and sometimes also with the "no heartbeat" message. There are also a handful with invalid theta.

Don't know if this is of any help but thought it worth noting.
ID: 52016
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 52017 - Posted: 4 Jun 2015, 9:49:28 UTC - in response to Message 52016.  
Last modified: 4 Jun 2015, 9:52:16 UTC

I note that none of them seem to complete on Windows machines, mostly falling over within 15 seconds with lots of suspend requests and sometimes also with the "no heartbeat" message. There are also a handful with invalid theta.

Then what OS are you running? They complete OK on my Win7 64-bit machine if I abort all the "no resubmissions".
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1349694

I am using a write-cache as noted previously, which seems to be quite necessary to bring down the error rate.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7978&nowrap=true#51395

But I got tired of having to watch over the machine, and have stopped doing the shorts entirely, since if I miss too many bad work units they seem to be able to crash the machine, though that may be a rare event.
ID: 52017
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,009,815
RAC: 21,293
Message 52018 - Posted: 4 Jun 2015, 10:59:23 UTC - in response to Message 52017.  

Then what OS are you running? They complete OK on my Win7 64-bit machine if I abort all the "no resubmissions".


I am running Linux, having only run that OS this century!

I am using a write-cache as noted previously, which seems to be quite necessary to bring down the error rate.


That might explain how few of these tasks are completing. Most crunchers install and forget, which may be fine for some other projects but often doesn't work for CPDN.

I am just running the short tasks at the moment because the others available for my machine seem to have problems if I shut down at night.

It just strikes me that when the vast majority of tasks, even excluding the "no resubmission" ones (which I seem to have stopped getting now), are falling over, it isn't the best use of computing time, either the project's or the crunchers'.
ID: 52018
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 52019 - Posted: 4 Jun 2015, 11:31:43 UTC - in response to Message 52018.  
Last modified: 4 Jun 2015, 11:46:45 UTC

I suppose this project started out on mainframes years ago, and no one ever took a close look at the hardware differences with PCs, with disk drives and how operating systems interact with them being perhaps the leading example.

We are left to fend for ourselves, and I am about fended out on this one, though the others are going nicely.
ID: 52019
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 52023 - Posted: 6 Jun 2015, 0:13:22 UTC - in response to Message 52018.  

I am just running the short tasks at the moment because the others available for my machine seem to have problems if I shut down at night.


I can't speak for your experiences, but the MOSES EU models seem to do alright with shutdowns and reboots on my PCs. Even starting up again after power outages due to recent lightning strikes hasn't caused any problems. It's those MOSES "global-only" models that inevitably fail if they are removed from memory.
ID: 52023
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52024 - Posted: 6 Jun 2015, 4:22:45 UTC - in response to Message 52019.  

I suppose this project started out on mainframes years ago, and no one ever took a close look at the hardware differences with PCs, with disk drives and how operating systems interact with them being perhaps the leading example.

Yes, the Met Office DID write these for their supercomputers, and do still run them that way.

The desktop versions were a collaboration between the Met Office and several people in the Atmospheric, Oceanic and Planetary Physics sub-dept of the Department of Physics, in the early part of this century.
Carl said, way back, that it took them the better part of two years to get the program set that was the start of it (the "slab ocean" model) working and stable enough to use on desktops.

Currently, the Met Office releases Linux versions (32 bit only, I think), for professional climate physicists around the world.
We just happen to be lucky that we're able to tag along with them.

We are left to fend for ourselves, and I am about fended out on this one, though the others are going nicely.

Ongoing problems with existing models ARE passed on to the project people.
It probably depends on which model type and research project gets the attention from time to time.

And there are no "short" models left in the queue.
Perhaps we can get them made Linux only for any future runs.

ID: 52024
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,009,815
RAC: 21,293
Message 52025 - Posted: 6 Jun 2015, 8:47:03 UTC

but the MOSES EU models seem to do alright with shutdowns and reboots on my PCs.


Thanks, I will check back to see if the ones that failed were all global-only ones, though by the time I have cleared what I have in the queue there may well be more short tasks going.

Perhaps we can get them made Linux only for any future runs.


Just have to sort out all the Linux machines with missing libraries and then even large batches of them would get finished quite quickly!
ID: 52025
ed2353

Joined: 15 Feb 06
Posts: 137
Credit: 35,290,829
RAC: 13,097
Message 52026 - Posted: 6 Jun 2015, 9:28:45 UTC - in response to Message 52024.  

Les said
Perhaps we can get them made Linux only for any future runs.


Hey, my 64 bit Windows 8.1 computer runs them well. In fact, many complete on my computer that have failed elsewhere.

I just wish I did not have so many 1980 No Resubmission tasks sent to me!
ID: 52026
Lockleys

Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 52027 - Posted: 6 Jun 2015, 9:58:46 UTC - in response to Message 52026.  

Les said
Perhaps we can get them made Linux only for any future runs.


Hey, my 64 bit Windows 8.1 computer runs them well. In fact, many complete on my computer that have failed elsewhere.

I just wish I did not have so many 1980 No Resubmission tasks sent to me!


Yeah, the shorts work fine for me on Win7 too. Likewise about having to check and abort the No Resubmission ones.
ID: 52027
MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 52028 - Posted: 7 Jun 2015, 11:40:04 UTC - in response to Message 52027.  

Oh dear, not the OS debate again. Hopefully the project managers can do a proper analysis of whether the models are failing because of the OS, as to me it does not seem to be simply an OS fail/complete.

Last year ALL my short models were failing (over 180), and Les's comment at the time was "Luck of the draw, I think." He could well be correct, for over the last month or so I've had well over 100 complete (1980 models excluded), and the only failures were 1980 No Resubmissions run in error.

I'm running Win7 with BOINC as a service and BOINC Service gets stopped and started automatically every night while the PC does a backup. Then there are all the other times it gets shut down when doing Win updates, stopping to get intensive graphics work done etc. Once I accidentally hit the power button and when rebooted the tasks resumed just fine.

So the model (all models actually) now seems pretty stable on my Win PC. Had a quick look at some of my short tasks, and most were new with just my PC having run them through. Of those where I succeeded and others failed, of the failures 7 were Linux, 20 were Win. About even stevens I would have thought. I didn't look at the reason for failure - leave that for someone else.

ID: 52028
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,703,308
RAC: 9,860
Message 52029 - Posted: 7 Jun 2015, 11:56:53 UTC - in response to Message 52028.  

I'm running Win7 with BOINC as a service and BOINC Service gets stopped and started automatically every night while the PC does a backup. Then there are all the other times it gets shut down when doing Win updates, stopping to get intensive graphics work done etc. Once I accidentally hit the power button and when rebooted the tasks resumed just fine.

You are, of course, still running the older v7.0.36 BOINC which avoids triggering the service mode bug with CPDN's 2014 Windows builds - which still applies to the HadCM3 short app.
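For anyone wanting to check whether a given machine is even a candidate for that bug, here is a rough sketch. It assumes the trigger is the combination usually quoted for it (Windows, a BOINC 7 client newer than 7.0.36, and a service install). The function names and the version parsing are purely illustrative and are not part of BOINC or CPDN.

```python
def parse_version(text):
    """Turn a string such as 'v7.0.36' or '7.4.42' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def may_trigger_service_bug(os_name, boinc_version, installed_as_service):
    """Return True if a host matches the assumed trigger combination.

    Assumption: the bug needs Windows, a service install, and a BOINC
    client newer than 7.0.36. This only encodes that rule of thumb; it
    does not inspect the machine or talk to BOINC.
    """
    return (
        os_name.lower().startswith("win")
        and installed_as_service
        and parse_version(boinc_version) > (7, 0, 36)
    )


# Example hosts (hypothetical): (OS, BOINC version, service install?)
for host in [("Windows 7", "7.0.36", True), ("Windows 7", "7.4.42", True)]:
    print(host, may_trigger_service_bug(*host))
```

Tuple comparison keeps the version check simple: (7, 4, 42) sorts after (7, 0, 36), while a 6.x client sorts before it. A non-service install or a Linux host never matches under this assumption.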
ID: 52029
MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 52044 - Posted: 10 Jun 2015, 5:48:37 UTC - in response to Message 52029.  

That is of course true Richard, and it poses a couple of interesting points.

1. The number of PCs running as a service is pretty low as I understand it, and I understood from Les that particular error was triggered by a combination of 'Windows + BOINC 7 [>7.0.36] + a service install'.
2. I looked through a few of my completed short tasks to see if they had wingmen with errors. Out of 17 workunits, the task failures broke down as follows by BOINC version and OS:

BOINC version   Windows   Linux
6.10.58         1         1
7.4.23          0         2
7.4.27          2         0
7.4.28          1         0
7.4.36          4         0
7.4.42          22        0

I then checked a few of those failed PCs and there was a mix of those that failed all shorts and those that failed a lot (excluding the 1980 runs). I didn't notice any where short model failures were low. But I only checked a couple of PCs.

3. When I was failing tasks last year using v7.2.42 of BOINC (yes, that is 7.2.42, not 7.4.42), many tasks had Invalid Theta errors, which are generally explained as model errors, not computer errors. When I went back from v7.2.42 to v7.0.36, I went from 100% failure to 100% success. Interesting.

Could it be that something else in the later versions of BOINC is triggering errors in the PCs not running as a service?

It would be interesting to see one of those other PCs go back to an earlier version of BOINC to see what happens. But then of course we would need more short tasks :-(

Anyway, I leave that question to better minds than mine to figure out.
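
For anyone who wants to repeat or extend the tally in point 2, here is a minimal sketch. The counts are copied from the list above; the variable names are just for illustration. Raw failure counts say nothing about failure rates, because the number of hosts running each BOINC version is not known here.

```python
# Failure counts from point 2: (BOINC version, Windows failures, Linux failures).
FAILURES = [
    ("6.10.58", 1, 1),
    ("7.4.23", 0, 2),
    ("7.4.27", 2, 0),
    ("7.4.28", 1, 0),
    ("7.4.36", 4, 0),
    ("7.4.42", 22, 0),
]

# Totals per operating system across all client versions.
total_windows = sum(win for _, win, _ in FAILURES)
total_linux = sum(lin for _, _, lin in FAILURES)
total = total_windows + total_linux

print(f"Windows failures: {total_windows}, Linux failures: {total_linux}")

# Share of all failures contributed by each client version.
for version, win, lin in FAILURES:
    share = 100 * (win + lin) / total
    print(f"{version:8s} {win + lin:3d} failures ({share:.0f}% of all failures)")
```

That gives 30 Windows failures and 3 Linux failures, with 22 of the 33 coming from hosts on 7.4.42.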
ID: 52044
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,009,815
RAC: 21,293
Message 52047 - Posted: 10 Jun 2015, 13:06:51 UTC
Last modified: 10 Jun 2015, 13:14:12 UTC

Just got five of the new short models. 4 failed downloads with permanent HTTP errors, fifth computation error at 25 seconds.

Edit: Computation error was a 1980 model, all others 1991. All have been around the block three times already.

edit2: looks like all are resubmissions - aborting.
ID: 52047
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,009,815
RAC: 21,293
Message 52053 - Posted: 12 Jun 2015, 11:54:37 UTC
Last modified: 12 Jun 2015, 11:55:13 UTC

And some more 1991 tasks failing to download today. The 1994 one is running fine despite being third time round the block but that doesn't really mean much with so many Linux boxes missing libraries etc.
ID: 52053
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 52054 - Posted: 12 Jun 2015, 13:50:23 UTC - in response to Message 52053.  

I would be delighted to try the ordinary failures, but it is the "no resubmissions" that get me. You would think that they could run a script to get rid of them, instead of relying on us to babysit them.
ID: 52054

©2024 cpdn.org