climateprediction.net (CPDN) home page
Thread 'A blue world for aborting...?'

RichardRodd
Joined: 16 Mar 06
Posts: 28
Credit: 3,219,100
RAC: 0
Message 36000 - Posted: 24 Jan 2009, 9:39:52 UTC

Hi.

I seem to have picked up a clutch of blue worlds on my small family of machines.

The first one is hadsm3fub_k78a_005974157_9 which I started late December. I only just spotted it's blue! It's never trickled, is already running at 90 s/TS and the graphics display tells me it's only got to model date 25/03/1811 (a cold day for March...!).

Can you advise me whether this one is for aborting? I'm very happy to keep the faith - I have had blue worlds run for months and eventually complete, but I don't think I've ever seen one this early in the cycle.

Thanks for your help - Richard

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 36004 - Posted: 24 Jan 2009, 12:36:23 UTC - in response to Message 36000.  
Last modified: 24 Jan 2009, 12:37:50 UTC

Richard,

I think that's some other kind of problem. Other Windows/Intel machines in that work unit have got further without showing any slowdown.

So, if you have a backup, then restore it - there's every chance that it will progress normally. However, failing that, the advice is the same - abort it.

The whole iceworld business makes running the shorter models a bit more work than it should be, particularly with a collection of machines, such as yours. I log on to my BOINC account here from time to time and check that the RAC for each machine is 'reasonable': if it declines for no good reason, then I get to the machine somehow and check for a crash or an iceworld.
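A minimal sketch of the RAC check described above, assuming you jot down each machine's RAC by hand every so often. The host names, the RAC values and the 25% threshold are all hypothetical, and nothing here talks to the CPDN or BOINC servers.

```python
def flag_declining_hosts(previous_rac, current_rac, max_drop=0.25):
    """Return the hosts whose RAC fell by more than max_drop (a fraction)."""
    flagged = []
    for host, old in previous_rac.items():
        # A host missing from the newer snapshot is treated as having RAC 0.
        new = current_rac.get(host, 0.0)
        if old > 0 and (old - new) / old > max_drop:
            flagged.append(host)
    return sorted(flagged)


# Hypothetical RAC snapshots for a small family of machines, taken some days apart.
last_week = {"office-pc": 320.0, "packard-bell": 150.0, "laptop": 90.0}
today = {"office-pc": 315.0, "packard-bell": 60.0, "laptop": 88.0}

for host in flag_declining_hosts(last_week, today):
    print(f"{host}: RAC has dropped sharply, check for a crash or an iceworld")
```

A falling RAC only says that something changed; it cannot tell a crash from an iceworld or from a machine that was simply switched off, so a look at the machine is still needed.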

Thanks for your efforts!

Iain

RichardRodd
Joined: 16 Mar 06
Posts: 28
Credit: 3,219,100
RAC: 0
Message 36008 - Posted: 24 Jan 2009, 19:09:33 UTC - in response to Message 36004.  

Thanks Iain.

Sadly no backup - so I've consigned this one to the bin...

I'll watch the progress of the next one with interest to see if there's some other problem there as you suspect. After which probably a BOINC reinstall... sigh!

All the best.
Richard

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36010 - Posted: 25 Jan 2009, 3:25:03 UTC
Last modified: 25 Jan 2009, 3:33:58 UTC

The problem is unlikely to be the BOINC installation. It's more likely to be some sort of instability in the computer if other computers with the same CPU type and OS are crunching past the point where your model from the same workunit became an iceworld. But in the case of a computer that usually crunches very stably it could be some momentary event or glitch. I assume the computer isn't overclocked?

Task 7672956 on computer 787649 also looks as if it became an iceworld a few trickles ago judging from the sec/timestep. There isn't a suitable Intel/Windows computer in the same workunit to judge yet whether the model's faulty. If its graphics are blue, abort it. That computer has done a lot of work for CPDN!

Task 7712087 on computer 829623 seems to have become an iceworld after 5 trickles of phase 1. This is its workunit. In this case a Mac and the Linux computers have got past the iceworld point but there's a whole bunch of Intel/Windows machines all stuck at the same point as you since New Year. So your computer is definitely not at fault; it's a defective workunit. Abort this model ASAP and I'll send private messages to the other people stuck in the same iceworld.

Don't ever let an iceworld continue to crunch unless you're fairly sure it's the fault of the computer and can restore a backup made before the iceworld started. As soon as an iceworld develops the model stops processing its data correctly. Usually the precipitation graph becomes empty. From the iceworld point onwards the model's data can't be used.

You've completed so many slabs that you pretty well deserve to be mentioned by name in the next CPDN publication based on these models!
Cpdn news

RichardRodd
Joined: 16 Mar 06
Posts: 28
Credit: 3,219,100
RAC: 0
Message 36028 - Posted: 26 Jan 2009, 14:45:13 UTC - in response to Message 36010.  

Hi there mo.v and thanks for your comprehensive advice.

When I hit my first ice world a couple of years ago, the advice then was just let it run because it *will* run to completion: I think I had a couple back then ... and they did, but only after some 1000+ cpu hours!

The tasks you suggest I abort are both only running with an s/TS of less than 6 (how high that number was used to be a factor in advising whether or not to abort, I recall), and one has only approx 100 hours to go (the other 300). It seems a shame to kill them at this late stage IF IF IF ice world results DO contribute useful information, as was implied to me the first time around. But I gather from your penultimate paragraph that wisdom has moved on and they're now considered of no value.

I'd appreciate your final view on that...

Anyhow, the task that sparked this thread was the first one on a new computer I've just added to my 'family'. But it's an ancient Packard Bell, so may not be up to the task, even though the raw spec suggested it'd be okay - a 2.66GHz chip with 2GB RAM, not used for anything other than basic MS Office stuff - no hungry graphics games! So I've got the next task running and will watch progress with interest.

Once again, many thanks for your interest.

Kind regards
Richard

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 36029 - Posted: 26 Jan 2009, 16:39:22 UTC
Last modified: 26 Jan 2009, 16:55:40 UTC

I'm not sure we know whether a completed ice world has no value. What we do know is:

1. The data post-freeze is incomplete, but it's still possible that the pre-freeze data might be of some use, even in the phase where the freeze occurs.

2. That an ice world on one platform - e.g. Windows/Intel - doesn't mean that there'll be an ice world on other platforms. So the work unit may produce a complete model despite the ice worlds being incomplete or discarded.

3. That the "value" created by completing an ice world is likely to be much less than the value created by using that completion time to run another model, or several models.

It's item #3 that motivates my advice to abort ice worlds. As we've encountered more perhaps we're getting less tolerant! However, I'll put a question to the project staff to get an explicit answer ...

Iain

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 36030 - Posted: 26 Jan 2009, 17:24:05 UTC - in response to Message 36029.  

I'll put a question to the project staff to get an explicit answer ...

The very rapid response is "abort them". It's official.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36031 - Posted: 26 Jan 2009, 20:17:13 UTC
Last modified: 26 Jan 2009, 20:32:04 UTC

Richard, in the past year or so we've discovered more about what happens to iceworlds. A complicating factor is that HADSM and HADSM MH models don't produce graphs until after each whole phase has been completed. So we can't see what data has been generated until after each end-of-phase zip file has been sent to the server.

One member, John Hunt, whose model became an iceworld just 2 or 3 trickles before the end of its last phase, very kindly ran it to completion for us. We then saw that it had stopped producing its precipitation data as soon as the iceworld developed. Since then we've seen several similar examples. As far as I know, no model that's completed the phase in which it became an iceworld has continued after the iceworld point to produce data for both its graphs.

In every case I've seen, the precipitation graph suddenly becomes empty while the temperature graph continues to be generated.

In the time it takes to complete a model that becomes an iceworld shortly before the end, the computer could usually crunch another complete and perfect model. This is a much better use of our computers, time and electricity.

Edit: Richard, here is an iceworld you completed a year ago. Look at its precipitation graph for phase 3.
Cpdn news

RichardRodd
Joined: 16 Mar 06
Posts: 28
Credit: 3,219,100
RAC: 0
Message 36248 - Posted: 28 Feb 2009, 14:34:47 UTC - in response to Message 36031.  

Greetings all.

I'm posting back here after a week or three to ask for more advice, but I don't want to go over the top with this...!

I'm wondering if the problem I reported above may well be this particular machine - the 'elderly' Packard Bell. Each work unit loaded and run here seems to rapidly turn into a Blue World. I have uninstalled and reinstalled BOINC (now at Ver 6.4.5) but it seems to have made no difference.

I've had a look at how the current task 6191307 is running elsewhere. The three users with the highest credit are still in Phase 1 with no reported precipitation. I now understand this is a warning sign mentioned above for a Blue World, but I'm not sure at what stage: the last note above suggests at stage-end - which implies I should keep the faith until end of Stage 1. Anyway, the s/TS of these other users are reasonable and/or steady; mine is already 8.33 and rising. But that might have something to do with the computer's general use - I don't know?

So here's my question. Is this actually indicative of (another) Blue World - so do I abort and try again? Or is this telling me I have a m/c that can't hack it?! Some of the posts above suggested the latter... This model is running steadily (after a fashion!) but at 74 CPU hours and 4% complete something is clearly not happy - and, as you'd expect, the 'To completion' figure just rises steadily, if imperceptibly - now on 650 hours. On better behaved, albeit more powerful, members of my little 'family' I expect to complete a model in some 400-500 hours, and do.
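As a rough cross-check on those figures, here is a sketch of a straight-line projection from CPU time and percent complete. It assumes the model keeps going at its average rate so far, which probably flatters a model whose s/TS is still rising. The function name is made up for illustration; the 74 hours, 4% and 400-500 hours are the numbers quoted in the post.

```python
def projected_total_hours(cpu_hours_so_far, fraction_complete):
    """Linear projection of total CPU hours from progress so far."""
    if not 0 < fraction_complete <= 1:
        raise ValueError("fraction_complete must be in (0, 1]")
    return cpu_hours_so_far / fraction_complete


total = projected_total_hours(74, 0.04)
typical_low, typical_high = 400, 500  # hours for a normal model, per the post

print(f"projected total: about {total:.0f} CPU hours")
print(f"that is {total / typical_high:.1f}x to {total / typical_low:.1f}x a typical run")
```

On this reckoning the model is heading for several times the usual run length, which fits the suspicion that something is wrong even though the 'To completion' figure is creeping up only slowly.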

When I look at the m/c it seems unremarkable - XP (Home edition, admittedly) SP3. A few years old, a chip that's neither fast nor slow. BIOS...do we really care? I put 2GB RAM in it, so it's not that. Plenty of space on the hard drive. No interference from other apps.

Any ideas anyone?

Many thanks as always - Richard

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 36251 - Posted: 28 Feb 2009, 14:54:37 UTC

To make it easier for others, that's this model.


Backups: Here

wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,373,171
RAC: 10,684
Message 36253 - Posted: 28 Feb 2009, 16:50:08 UTC - in response to Message 36248.  

Any ideas ...
Richard, based on the timings/progress of the model, this should trickle every 8-9 hours or so. At 4% for 74 hours, extrapolating gives its current progress as about TS 31,109. The two trickles are 1) 2.86 s/TS 2) 3.152 s/TS and extrapolating to 3) it's now running at about 14.36 s/TS. Your other model 7736251 also showed quite variable s/TS for the four trickles: 1) 3.466 2) 2.841 3) 3.101 4) 2.72.
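The kind of arithmetic behind figures like these can be sketched as follows. The overall average uses only the numbers quoted above (74 CPU hours, about TS 31,109). The estimate of the rate since the last trickle also needs the number of timesteps between trickles, which the post doesn't state, so the 10,000 used below is purely a placeholder assumption and the result is not expected to reproduce the 14.36 s/TS figure. Both function names are hypothetical.

```python
def average_sec_per_timestep(cpu_hours, timesteps_done):
    """Average seconds per timestep over the whole run so far."""
    return cpu_hours * 3600 / timesteps_done


def recent_sec_per_timestep(cpu_hours, timesteps_done, trickle_rates,
                            timesteps_per_trickle):
    """Estimate s/TS since the last trickle.

    trickle_rates holds the s/TS reported for each completed trickle;
    timesteps_per_trickle is an assumed constant interval between trickles.
    """
    trickled_steps = len(trickle_rates) * timesteps_per_trickle
    trickled_seconds = sum(rate * timesteps_per_trickle for rate in trickle_rates)
    remaining_steps = timesteps_done - trickled_steps
    if remaining_steps <= 0:
        raise ValueError("no timesteps recorded since the last trickle")
    return (cpu_hours * 3600 - trickled_seconds) / remaining_steps


average = average_sec_per_timestep(74, 31109)
recent = recent_sec_per_timestep(74, 31109, [2.86, 3.152],
                                 timesteps_per_trickle=10_000)  # assumed interval
print(f"average so far: {average:.2f} s/TS")
print(f"since last trickle (assumed interval): {recent:.2f} s/TS")
```

Whatever interval is plugged in, the point is the same: the recent rate comes out several times higher than the 2.86 and 3.152 s/TS of the first two trickles, so the slowdown is recent rather than a property of the whole run.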

You may just have been unlucky with the models, however the timing behaviour seems more symptomatic of something adverse in the overall set-up, rather than erratic model behaviour.

If you wish to persevere with it I'd suggest you check out the HW first:
A) run PRIME95 for 24 hours to check the cpu/cache/memory,
B) a full run of MEMTEST86. That should thoroughly check out the memory.
Other possibilities:
C) it's gone into thermal shutdown/slowdown (it won't be the first PB cpu/mobo to do that). You may be able to check the temperature by using e.g. SiSandra, although on the last PB mobo I played with the cpu temperature was only visible in the BIOS. There may also be an over-temperature shutdown setting to adjust in the BIOS,
D) Dodgy power supply.

If the tests don’t show anything conclusive, then maybe try a different (short) model type, e.g. HADAM.

Otherwise it seems a candidate to retire to the WEEE directive.

HTH. hagar

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36255 - Posted: 28 Feb 2009, 18:27:39 UTC
Last modified: 28 Feb 2009, 18:31:50 UTC

It's the Intel/Windows computers that are likely to produce iceworlds if there's anything wrong with the model. If the model's faulty, all the Windows/Intel computers can be expected to develop an iceworld at the same point. Two Intel/Windows computers in the same workunit have got past your iceworld point, so we have to assume, as Hagar has done, that the problem lies within your computer.

If you have a backup of the model made before it turned into an iceworld you could run it again. If it gets past the sticking point or turns into an iceworld at a different point, that would be an extra diagnosis of computer instability.
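The reasoning in the two paragraphs above can be sketched as a small decision rule. This is only an illustration under stated assumptions: the `PeerTask` record, its fields and the three verdict strings are made up here, and the trickle counts would have to be read off the workunit page by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PeerTask:
    platform: str                    # e.g. "windows/intel", "linux", "mac"
    trickles: int                    # trickles returned so far
    froze_at: Optional[int] = None   # trickle count at which it became an iceworld


def diagnose(my_platform: str, my_froze_at: int, peers: List[PeerTask]) -> str:
    """Guess whether an iceworld points at the computer or at the workunit."""
    same = [p for p in peers if p.platform == my_platform]
    # A same-platform peer that ran cleanly past our freeze point
    # suggests the model is fine and our machine is the problem.
    if any(p.froze_at is None and p.trickles > my_froze_at for p in same):
        return "suspect computer"
    # Same-platform peers frozen at the same point suggest a faulty workunit.
    if any(p.froze_at == my_froze_at for p in same):
        return "suspect workunit"
    # No same-platform peer has reached the point yet, so we cannot tell.
    return "undetermined"


peers = [PeerTask("windows/intel", trickles=12), PeerTask("linux", trickles=20)]
print(diagnose("windows/intel", 5, peers))
```

Peers on other platforms are deliberately ignored, since a Mac or Linux machine getting past the point says nothing about whether Windows/Intel machines will.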

Does this computer ever shut itself down or do anything else unexpected?

One of my computers produced an iceworld in a Beta HADSM but a restore ran normally so I then knew there was something wrong with the computer. A check of the BIOS settings showed it was overclocked by 2½% unbeknown to me and to my son who'd bought the CPU and mobo as part of a package. I returned it to factory settings and it's worked stably ever since.

Anyway, run the tests Hagar's suggested and let us know how they go. After that if necessary someone here can explain how to check the BIOS settings using CPUZ.

There are links to the hardware tests Hagar mentioned in the CPDN READMEs; get to them through my signature.

If you don't want to do the testing you could select HADAM or HADCM for that computer. These models are less likely to go wrong. On the other hand, if the computer's memory is flaky that could affect any type of model.
Cpdn news

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 36257 - Posted: 28 Feb 2009, 19:01:29 UTC

If you haven't cleaned the dust out of your PC in the last several months or more, you could start there. Shut down and unplug the system, open up the case, and use a can of compressed air to thoroughly clean the dust from the CPU heatsink, the area around the RAM/memory, the case intake/outflow fans and vents (if any), and the ventilation intake/outflow vents on your power supply.

Be cautious of static electricity, as UK_Nick said here, and wait a little while after you're done with the compressed air to let the PC go back to room temperature (and let any condensation evaporate).

RichardRodd
Joined: 16 Mar 06
Posts: 28
Credit: 3,219,100
RAC: 0
Message 36258 - Posted: 28 Feb 2009, 19:15:48 UTC

Thanks everyone for all those comments and suggestions. I'll start on the task list suggested and keep you posted, but it's a 'remote' m/c, so it'll take a week or three...

Bye for now.