Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion
Author | Message |
---|---|
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
... WU finished and returned at lunchtime UTC today 19/02/2008. Looks very interesting: I've not seen one decay like that ... |
Joined: 9 Jan 05 Posts: 30 Credit: 434,469 RAC: 0 |
I think this qualifies, though it's not a hadsm3; it's a hadcm3 optimised I/O. Result 7006672. Timestamp 09:30 20/12/2056. s/TS shown as 2.83. Blue globe. Intel P4 (Prescott) HT 3.0, not overclocked. This box has been my CPDN workhorse, though it's the first time it's tried one of these optimised I/O coupled models. This is an 80-year model, 70% finished. I only noticed after a day or so that no trickles had been sent. When I looked at it, the graphics window showed only black. I suspended, quit, zipped a backup of the entire boinc directory and restarted. The display began as normal but was unresponsive and eventually went from a black world to a blue one. The 2.83 sec/TS it reports is no longer real; it's more like 10 minutes/TS. Any reason not to abort it, or anything else I should be looking at? I've no indications the hardware is ailing, haven't changed hardware or software recently, etc. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
The HadCM3 generally can't turn into an iceworld due to the way the ocean is modelled. I'd therefore suspect that something else is involved (perhaps a rewind). Could you try rebooting the PC and then checking the %complete and so forth? I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 9 Jan 05 Posts: 30 Credit: 434,469 RAC: 0 |
Sure. Data after reboot: Timestamp 00:30 19-Dec-2056, 70.05% done, 2.82 s/TS, 1140 hours elapsed. Globe is orange again. So far it's now counting down normally, but there's about a model-day to go before it gets to where it was stuck before. I'm curious what rebooting changed that restarting boinc did not. Anyway, since this is apparently the wrong thread, I'll look for a better location for subsequent questions (clues happily accepted). -edit- It's stuck again at the same place, 09:30 20-Dec-2056, blue globe and crawling. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I'm not sure if there IS a better thread. And I seem to remember getting a blue world in a HadCM3 model, but it was in another lifetime, and I don't remember the details. Or late last year, which is much the same thing with my memory. :) |
Joined: 9 Jan 05 Posts: 30 Credit: 434,469 RAC: 0 |
OK, well, this one is repeatably stuck (see the last edit on my previous post). Is there anything useful I can do with it, or should I just abort it? |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Sounds like aborting it is the only option, sorry... I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 23 Nov 07 Posts: 9 Credit: 325,662 RAC: 0 |
It seems I got an iceworld at quite an early stage of modelling a hadsm3 slab. ResultID: 7291577. Current timestep of the model: 131087 of 259248. The s/TS value: 3.26. The temperature display of the globe graphic is uniformly blue; P gives uniform black, R uniform pale blue, no clouds at all. Processor: 1 GenuineIntel Intel(R) Celeron(R) D CPU 3.06GHz [x86 Family 15 Model 6 Stepping 4]. Processor features: fpu tsc sse sse2 mmx. Overclocking: apparently not. Should I try to proceed or abort? |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
If the s/TS is getting worse and worse, then yes, abort it. Running a stability check (such as Prime95 for 24 hours) helps to identify whether it is the hardware or the software causing the trouble, although if it happens just a few times it is usually the model. If it happens a lot, and you also get unexplained model crashes, then it is worthwhile running the test. Note that it is only the HadSM3 model which suffers from slow-running iceworlds, so perhaps running one of the other model types may be less frustrating if you keep getting issues with slabs :-) I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 27 Feb 08 Posts: 41 Credit: 1,402,356 RAC: 0 |
My hardware ran Einstein@home just fine, but with these climate models I have had some equipment freezes (but no ice world freezes, thank goodness! ;), and have had to throttle back my overclocking several times. The freezes are less and less frequent with each throttle back, and hopefully my latest larger throttle back will work on a long-term basis! ;) Regards, Bob P. |
Joined: 23 Nov 07 Posts: 9 Credit: 325,662 RAC: 0 |
Really interesting effect: s/TS is now 3.43 and the timestep is counting DOWN! (131068 vs 131087) Can I do anything else to investigate it? It's the first time this has happened, so a hardware problem is unlikely. |
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
[ask_spb wrote:]... Can I do anything else to investigate it? ... One other thing is to watch the progress of similar PCs in that work unit: 6143336. So far, one PC has gone further but it's Linux. When these slabs slow down, a trickle may take as long as a week - I suspect a few more people in that work unit will start to notice something wrong soon. [Edit: The guy above you in the work unit list is the one to watch as he's got the fastest machine.] |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Sometimes restoring a backup from before the slowdown started is worthwhile. If you do this you need to keep an eye on the graphics to see whether the same thing happens again. If it does, the only sensible solution is to abort. If the timestep goes back, the model could also perhaps be looping. Re overclocking: when CPDN ran its first Classic slabs, one of the then programmers in Oxford posted on the forum that about 2½% of completed models were failing quality control, i.e. could not be used by the researchers even though the models had earned their credits. He added that one of the main causes was over-enthusiastic overclocking. Quality control is still carried out. (Irrelevant to this thread but interestingly, he also said that differences between AMD/Intel and OSs led to only insignificant differences in model outcomes.) Cpdn news |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
John Hunt persevered and completed a slow iceworld. Look at the graphs: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7209853 Chris Beaugrand also persevered with the same workunit. Again, look at the graphs: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7209854 Another member abandoned the same model after the slowdown started. It must be concluded that the model itself was unviable. But if you find that somebody else has got beyond your slowdown point without a problem, you need to consider whether your computer could perhaps be slightly unstable. Cpdn news |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
... For overclocking, I'd recommend a full 24 hours of Prime95 before running the climate model (one copy running for each core you have - so on a quad core you'd have 4 copies, using -A0, -A1, -A2, and -A3). It took a lot of work before I could get my Q6600 working properly overclocked. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 1 Feb 07 Posts: 26 Credit: 885,216 RAC: 0 |
... @Mike: V25.5 of Prime95 automatically loads all the cores that you have - so no need to run multiple instances. F. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
... True, but the front page of http://www.mersenne.org only offers 24.14 (that being the last official release), and I'm reluctant to point people towards the alpha / 'pre-beta' versions in the forums... I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 23 Nov 07 Posts: 9 Credit: 325,662 RAC: 0 |
[ask_spb wrote:]... Can I do anything else to investigate it? ... My model definitely got into a loop, as it now shows timestep 131070 again (it was 131087, 131069, 131068 before). s/TS is growing: 4.06. It seems all the Windows simulations looped at the same timestep, somewhere after 129,624, while the Linux simulation passed it successfully. So I'd better try another task :) |
Joined: 7 Apr 08 Posts: 4 Credit: 28,086 RAC: 0 |
I haven't a clue if this is the place to ask this question, so here goes anyway. I downloaded and installed the BOINC software and attached the climate prediction stuff about a week ago. The computer has accumulated about 23 hours of CPU time but the progress percentage is only about 0.56%. At this rate it will take years to finish the work package and I doubt that the computer will be running 6 months from now. It's a Windows 2000 system, 2.8 GHz, 2 GB main memory, with an ATI x1300 graphics card. In the graphics view I get what looks like a reasonable (to my uninformed eye) distribution of temperature, pressure and the like. I don't use the screensaver and don't usually have the graphics screen active. The graphics display updates smoothly. Is this a typical CPU usage vs. progress percentage for this speed of computer? If it is, I'll have to terminate the climate prediction task as it will be all wasted analysis anyway. Any comments will be appreciated. |
Joined: 9 Jan 07 Posts: 497 Credit: 342,899 RAC: 0 |
I'm no expert, Richard, but it looks as if you've downloaded two models instead of one - that would probably slow things up quite a lot? http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=854815 Visit the Scotland team |
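
For anyone wanting to sanity-check the "it will take years" estimate in the question above, here is a rough sketch of the arithmetic in Python. It only extrapolates linearly from the figures quoted in that post (23 CPU hours, 0.56% done). The 7-day wall-clock figure is an assumption standing in for "about a week", and the function name is made up for illustration. It also assumes a steady rate on a single model, which may not hold if the host really is sharing its time between two models.

```python
def extrapolate_total(amount_so_far: float, percent_done: float) -> float:
    """Linearly extrapolate a total from the amount used at percent_done."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return amount_so_far * 100.0 / percent_done


# Figures quoted in the post.
CPU_HOURS_SO_FAR = 23.0
PERCENT_DONE = 0.56
# Assumption: "about a week" of wall-clock time is taken as 7 days.
WALL_DAYS_SO_FAR = 7.0

total_cpu_hours = extrapolate_total(CPU_HOURS_SO_FAR, PERCENT_DONE)
wall_days = extrapolate_total(WALL_DAYS_SO_FAR, PERCENT_DONE)

print(f"Estimated total CPU time: {total_cpu_hours:.0f} hours "
      f"({total_cpu_hours / 24:.0f} days of non-stop crunching)")
print(f"Estimated wall-clock time at the current duty cycle: "
      f"{wall_days:.0f} days ({wall_days / 365:.1f} years)")
```

On those numbers the model needs roughly 4,100 CPU hours in total. The "years" figure comes mostly from the machine only crunching about 3 hours a day, so the duty cycle matters as much as the raw speed.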
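
The looping slab reported earlier in the thread (timestep going backwards while s/TS creeps up) can be spotted from a handful of hand-noted readings. Below is a minimal Python sketch, not a CPDN or BOINC tool: the function names are hypothetical, the readings are the ones quoted in the posts, and the order of the timestep readings is an assumption, since the posts do not give exact times for each one.

```python
from typing import List, Sequence, Tuple


def find_rewinds(timesteps: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (previous, current) pairs where the timestep went backwards."""
    return [(prev, cur) for prev, cur in zip(timesteps, timesteps[1:]) if cur < prev]


def is_slowing(sec_per_ts: Sequence[float]) -> bool:
    """True if every s/TS reading is worse (larger) than the one before it."""
    return all(b > a for a, b in zip(sec_per_ts, sec_per_ts[1:]))


# Readings quoted in the thread; the ordering is an assumption.
timesteps = [131087, 131069, 131068, 131070]
sec_per_ts = [3.26, 3.43, 4.06]

rewinds = find_rewinds(timesteps)
below_peak = timesteps[-1] < max(timesteps)

print("Rewinds seen:", rewinds)
print("Latest timestep still below the earlier peak:", below_peak)
print("s/TS getting steadily worse:", is_slowing(sec_per_ts))
```

A single rewind on its own can happen after a restart, so the more telling sign is the combination: the timestep never gets back past its earlier peak while s/TS keeps growing, which matches the advice in the thread to abort in that case.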