Message boards : Number crunching : Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion
Author | Message |
---|---|
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
/Placeholder/ Please use this thread if you have the symptoms of a problem described in this sticky on iceworlds and associated slow model progress in hadsm3/hadsm3h. |
Send message Joined: 10 Jun 05 Posts: 10 Credit: 4,863 RAC: 0 |
I have had that happen to me with 3 models in a row, and not only with the SM ones but also with a CM model. Easing back on my overclock did seem to help. The jury is still out on that question: I haven't completed a model since lowering my overclock, but it is looking good. |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
You didn't indicate how drastic your O/C is. How well does it play with 24 hours of Prime95 (one copy per core)? "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 10 Jun 05 Posts: 10 Credit: 4,863 RAC: 0 |
I had it clocked at something like 3.07 GHz; now it is at 3.0 GHz (stock 2.4 GHz). Hearing/reading about all the achieved overclocks on a Q6600, 3 GHz does not seem excessive to me. I haven't run Prime95 for 24 hours, only for something like 6 hours (on all 4 cores), and just one pass of memtest86+. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
6 hours isn't really enough for Prime95; my Q6600 was passing at 8 hours but failing before 24 hours. I'm now awaiting an RMA from OCZ to return a dodgy memory stick... Make sure that the Prime95 test is using a large proportion of the system memory; otherwise it will detect CPU errors but not memory errors. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
"Hearing/reading about all the achieved overclocks on a Q6600, 3 GHz does not seem excessive to me." It depends on the context in which you heard/read this. These climate programs are based on huge Fortran programs running on supercomputers, and are rather touchy when running on mere desktop computers. Extreme stability is far more important than 'extreme' overclocking. |
Send message Joined: 10 Jun 05 Posts: 10 Credit: 4,863 RAC: 0 |
I tested my Q6600 at the maximum power consumption/heat production setting. I'd say the test with memtest86+ should have taken care of the memory testing. For Climateprediction all my testing may have been too short, but in my opinion it should have been enough for normal BOINC usage. The core temperatures were acceptable, I think: about 60C at 100% load, and about 50C to 55C for a 'normal' 100% load. I haven't had any problems with CPDN since reducing my overclock. Would you recommend more testing despite this? BTW, I'm not crunching 24/7. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Sounds like you've solved the problem, but yes, it's worthwhile doing the full test. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 29 Mar 06 Posts: 8 Credit: 2,793,692 RAC: 0 |
I had one of these, I think. I hope this link works: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6732317 At that time I was running two slab models, and this one just stopped. Not quite stopped, but it went from two trickles a day to not managing to get out of June 2075 or wherever it was. I was frustrated because it was at about 95% complete. Killing it and restoring from a backup made no difference: it got to the same place and just 'stopped' again. It was still using all the CPU, and did move between timesteps, but so slowly that it never even got to the next checkpoint. It wasn't looping as far as I could see. The PC is standard issue, an Intel Core 2 6320, not overclocked, running on 1GB RAM at that stage. It's the first one I have aborted :( |
Send message Joined: 10 Jun 05 Posts: 10 Credit: 4,863 RAC: 0 |
I'll certainly consider it very seriously. I suppose I can stand to lose 24 hours of crunching. ;) |
Send message Joined: 21 Dec 05 Posts: 3 Credit: 1,168,435 RAC: 0 |
I seem to have an example of this problem: Result 6817821. I run BOINC as a service, so unfortunately I can't provide the timestep, temperature display or s/TS from the graphics because I can't display them (unless someone can tell me how). Looking at the trickle history, s/TS increased from approx 1.35 to 3.14 between 18 Sep (last "fast" trickle) and 01 Oct (last trickle for this result). The processor is an Intel Core 2 6420, not overclocked, with 2GB RAM. Progress on this model has slowed from 6 trickles per day to 1 trickle per 8 days, i.e. by a factor of approx 50. There are no known reliability issues with this host on any BOINC project (it runs CPDN/Einstein/Rosetta, with one core dedicated to CPDN). This is a server, so it runs 24x7. Current progress is 75.934%. I believe the slowdown started just prior to 75%. Does it serve any useful science purpose to allow this result to finish? |
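For anyone who wants to check the "factor of approx 50" quoted above, here is a minimal sketch of the arithmetic. The function name is made up for illustration; only the two trickle rates (6 per day before, 1 per 8 days after) come from the post.

```python
def slowdown_factor(trickles_per_day_before: float, days_per_trickle_after: float) -> float:
    """Ratio of the old trickle rate to the new one (hypothetical helper)."""
    trickles_per_day_after = 1.0 / days_per_trickle_after
    return trickles_per_day_before / trickles_per_day_after


# Figures from the post: 6 trickles/day before, 1 trickle per 8 days after.
factor = slowdown_factor(6, 8)
print(f"Model is running about {factor:.0f}x slower")
```

The result is 48, which matches the "approx 50" estimate. Note that the rise in s/TS from 1.35 to 3.14 looks much smaller than this because, as mentioned elsewhere in the thread, s/TS is a cumulative figure.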
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Jim, there is/was an option in BOINC to "Interact with the desktop", which allowed for graphics. I think this had to be selected at install, and it may also not be there now. Various options/features/facilities have come and gone over the years, as BOINC has changed. With such a large slowdown, I think that it would be fair enough to abort it. Backups: Here |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
There's another member processing the same workunit, a bit less advanced than Jim. Interestingly, his model seems to have slowed down at exactly the same point: 4 trickles into phase 3. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6817821 I think I'll send this member a private message advising them to abort it as well. (If anyone thinks this isn't a good idea, please post or send me a PM.) Not sure what to do about a third cruncher still crunching phase 1 of the model. We could let this person continue for the time being and, if they hit the same slowdown at the same point, send a PM then. Abort it, Jim. Cpdn news |
Send message Joined: 21 Dec 05 Posts: 3 Credit: 1,168,435 RAC: 0 |
Before I do abort it, is it worthwhile: a) waiting for the next trickle (should be within 48 hours)? Is there any useful (science) data in a trickle other than progress stats? Or b) saving any files from the model run that might help analyse/troubleshoot the reason for the slowdown? If yes, which ones? If you need time to consult the developers etc, that's fine. I can leave the model running until you have a consensus. |
Send message Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
Jim, unfortunately, the science data for the slab models is in the Zip file uploaded at the end of a phase. It's different for the coupled models, which upload useful data every year, more at decades and even more every 40 years. Given that the slab models are relatively short, I suspect that the project would debug the models by restarting from the beginning, so your offer isn't likely to be taken up. There are quite a few of these odd slab models, so the project won't be short of candidates. Iain |
Send message Joined: 21 Dec 05 Posts: 3 Credit: 1,168,435 RAC: 0 |
OK ... aborted. CPU Time: 647:32:42 Progress: 76.067% I have saved a copy of the complete BOINC directory subtree and will keep it for 30 days in case anyone cares to revisit this. |
Send message Joined: 27 Aug 06 Posts: 26 Credit: 162,685 RAC: 0 |
It has to be done manually. Instructions are here. Kathryn :o) The BOINC FAQ Service The Unofficial BOINC Wiki The Trac System More BOINC information than you can shake a stick of RAM at. |
Send message Joined: 8 Aug 05 Posts: 9 Credit: 46,744 RAC: 0 |
I've got an iceworld. Result ID: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6725943 Current timestep: 149254 of 259248; s/TS: 1.78; Colour: Blue iceworld; CPU: Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz [x86 Family 6 Model 15 Stepping 6]; Overclocking: No, bog standard. So I guess I should abort the task? I've got another one running on the other core but it isn't as far advanced. It's OK so far. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I suspect that a lot of these "iceworlds" aren't. It depends on the temperatures, rather than the colour of the globe display. For instance, my current 80 year TCM is very hot (here), but has started displaying a white globe, sometimes changing to a blue globe. And the processing has stopped, even though the CPU time is still increasing. I've found that I can coax it along one timestep (a half hour) at a time by opening and closing the globe display (and waiting), and by other 'fiddling'. I think there may be a bug of some sort, and I'm going to try to see what makes it 'move', and also whether I can get it going again by itself. All of this after it slowed way down a couple of weeks ago, and now with 66 hours to go. :( What you do with your model, though, is another matter. There are millions of others to run if you want to abort. Backups: Here |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
DKR, it looks as if your model has slowed right down, so much that it hasn't trickled for days, since 16 Oct. The 1.78 s/TS figure is a cumulative average over the whole run, so the current rate is probably much slower than that. If this really is the case, I would abort it as it still has quite a few years left to crunch. Cpdn news |
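To illustrate why a cumulative s/TS figure can hide a recent slowdown, here is a rough sketch. It assumes that total CPU seconds can be approximated as timestep × cumulative s/TS. The earlier reading (140000 timesteps at 1.40 s/TS) is a made-up value for illustration only; the later reading (149254 timesteps at 1.78 s/TS, out of 259248) is the one posted above. The function name is hypothetical too.

```python
def recent_s_per_ts(ts_prev: int, cum_prev: float, ts_now: int, cum_now: float) -> float:
    """Average seconds per timestep between two readings.

    Each reading is (timestep, cumulative s/TS). Total CPU seconds so far
    is assumed to be timestep * cumulative s/TS.
    """
    cpu_prev = ts_prev * cum_prev
    cpu_now = ts_now * cum_now
    return (cpu_now - cpu_prev) / (ts_now - ts_prev)


# Earlier reading is HYPOTHETICAL; later reading is from the post above.
rate = recent_s_per_ts(140000, 1.40, 149254, 1.78)
print(f"Recent rate: {rate:.2f} s/TS (cumulative figure shows only 1.78)")

# Rough time left if the recent rate holds for the remaining timesteps.
remaining_steps = 259248 - 149254
print(f"Estimated hours remaining: {remaining_steps * rate / 3600:.0f}")
```

With these assumed numbers the recent rate comes out several times slower than the cumulative 1.78 s/TS, which is the effect described above. The real figure depends on readings that aren't available in this thread.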
©2024 cpdn.org