Questions and Answers : Windows : LOOPING IN 2040
Author | Message |
---|---|
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I have a problem at 74.9%. I have been running a CM model for several months. It had been doing fine until now, sending regular trickles. Yesterday it crashed. I restored it from a backup made that morning. Now it is looping: every time it reaches 03/12/2040 it loops back to 01/12/2040. It has done this at least 3 or 4 times (probably more). Could the fact that 2040 is a 40-year mega-trickle year have anything to do with the problem? I plan to try moving the backup to my other machine (from the AMD processor to an Intel one) and seeing if I can get it past the sticking point. If that doesn’t work I will try restoring from an earlier backup. I hate to abort it with so much time invested. Wish me luck. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
EUREKA!!! Changing machines worked! The WU is now past the loop point and crunching on. The messages confirm that the 2040 trickle was sent and received. Apparently there really is a difference between the way the model runs on AMD and Intel processors. The model is now crunching its way through March of 2041. Maybe I shouldn't crow about it until it has trickled at least once more in 2041? I will post in this thread if there are any further problems. |
Send message Joined: 14 Jan 07 Posts: 52 Credit: 284,001 RAC: 0 |
|
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Well done! "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Well done, Jim! Do the two computers crunch at a fairly similar speed? I'm asking because if you move a model from a slow machine to a much faster one (more than twice as fast), there's a potential problem that can crop up later on. But regular backups allow a solution. If this is the case, let us know. Cpdn news |
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
I moved a Spinup from a Pentium III to an E8500, with no noticeable problems so far. Does this "later" refer to <rsc_fpops_bound> somehow? In that case I guess I'll have to edit init_data.xml and reduce <wu_cpu_time> a bit. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
It's the rsc_fpops_bound value that's the problem, so that's the number that needs to be changed. There are instructions here, intended to be detailed enough to enable almost any member to edit the file. Members who think their model may hit this problem would probably do well to edit the file soon after the move, just in case. But you can also do nothing and wait to see whether the problem occurs, as long as you back up regularly. If a model is moved to another computer that is roughly the same speed, slightly faster, or slower, there's no problem. Cpdn news |
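As a rough illustration of why a much faster host can trip this limit, the sketch below assumes that the client aborts a task once its accumulated CPU time exceeds rsc_fpops_bound divided by the host's benchmarked speed. That rule, the function name, and every number here are assumptions chosen for illustration; none of them are taken from this thread.

```python
def cpu_time_limit_seconds(rsc_fpops_bound: float, benchmark_flops: float) -> float:
    """Assumed rule: allowed CPU time = fpops bound / benchmarked speed."""
    return rsc_fpops_bound / benchmark_flops


# Hypothetical values, picked only to show the effect of a host change.
bound = 4.0e16          # rsc_fpops_bound of the workunit
slow_flops = 1.0e9      # benchmark of the original (slow) host
fast_flops = 2.5e9      # benchmark of the new host, 2.5x faster
cpu_time_used = 2.0e7   # CPU seconds already accumulated on the slow host

slow_limit = cpu_time_limit_seconds(bound, slow_flops)
fast_limit = cpu_time_limit_seconds(bound, fast_flops)

# The CPU time carried over from the slow host fits under the slow host's
# limit but is already above the (shorter) limit computed on the fast host.
exceeds_on_slow = cpu_time_used > slow_limit
exceeds_on_fast = cpu_time_used > fast_limit
```

Under this assumption, raising rsc_fpops_bound raises the limit on the new host in proportion, which is consistent with the advice to edit that value after a move.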
Send message Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
Increasing the allowed fpops or reducing the time that the model has already used up should amount to about the same thing. rsc_fpops_bound is part of client_state.xml and has a copy in init_data.xml; wu_cpu_time is only contained in init_data.xml. I will make a backup and see what happens if I change the CPU time in init_data.xml. Edit: Result = it restored the original WU time from somewhere, so editing the time doesn't help. I guess BOINC does this as cheat prevention. I have now increased rsc_fpops_bound instead; that should help. Thanks for the information. I would not have wanted to lose that model; it is already almost half done (~40%). |
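For anyone who would rather script the edit than do it by hand, here is a minimal sketch of scaling the rsc_fpops_bound value in a state file. The helper names `scale_fpops_bound` and `patch_file` are made up for this example, and the sketch assumes the value is stored as a plain number between `<rsc_fpops_bound>` tags. It scales every such tag in the file, so it only suits a host running a single model, and the BOINC client should presumably be fully stopped first so it does not overwrite the change.

```python
import re
from pathlib import Path

# Matches <rsc_fpops_bound>VALUE</rsc_fpops_bound>, allowing stray whitespace.
TAG_RE = re.compile(r"(<rsc_fpops_bound>)\s*([^<\s]+)\s*(</rsc_fpops_bound>)")


def scale_fpops_bound(xml_text: str, factor: float) -> str:
    """Return xml_text with every rsc_fpops_bound value multiplied by factor."""
    def _scaled(match):
        new_value = float(match.group(2)) * factor
        # Write the number back in exponent notation with six decimals.
        return f"{match.group(1)}{new_value:.6e}{match.group(3)}"
    return TAG_RE.sub(_scaled, xml_text)


def patch_file(path: Path, factor: float) -> None:
    """Save a .bak copy beside the file, then rewrite it with the scaled bound."""
    original = path.read_text()
    path.with_name(path.name + ".bak").write_text(original)
    path.write_text(scale_fpops_bound(original, factor))
```

A call such as `patch_file(Path("client_state.xml"), 10)` would multiply the bound by ten; the factor is a guess, not a recommended value.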
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Yes, they are roughly the same speed. The AMD machine has a 1.7 GHz single-core processor. The Intel machine has a Core 2 Duo with two 1.5 GHz cores. Both are laptops. I transferred the WU back to the AMD when it reached March 2041 and it has since trickled successfully. It is now crunching happily through Sept. of 2042 and should trickle again about 2am local time. I wish I had a machine that was more than twice the speed of the 1.7 GHz. :) |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I only had 1.33GHz for several years of crunching but the computer still did lots of useful CPDN processing, though not very fast! I kept it working for nearly seven years until so many things went wrong that it eventually couldn't be used. But some parts from it are still inside my current AMD, which is a hybrid of bits recovered from other computers. Long may it last. The day you do replace it you may find that the price of fast 4 or even 8 cores has come right down. Cpdn news |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Hi, everyone. This is an update on the earlier post in this thread. The CM model that I am running on my HP machine with the AMD processor still loops occasionally when it gets to the trickle. It will loop from 4/12 back to 1/12 for hours (real time) and then crash. This has happened 5 times now. The looping seems to happen about every 4 model years. To fix this I make a backup on 1/12 and move it to my Acer machine with the Intel Core 2 Duo processor. I then let it run past 12/7 (to set a save point after the problem) and move it back to the HP. It works like a charm. This workaround has worked very well on BOINC Manager 5.10.45. I was concerned that it might not work as well on the new 6.2.18, but I am happy to report that it works equally well there. The model is now in 2061 and 87% complete, and I am determined to ride this nag across the finish line if I have to whip it every step of the way. |
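The backup step of a workaround like this can be scripted. The following is only a sketch: `backup_data_dir` is a made-up helper, the folder naming is arbitrary, and it assumes BOINC has been shut down before the copy so the files are in a consistent state.

```python
import shutil
import time
from pathlib import Path


def backup_data_dir(data_dir: Path, backup_root: Path) -> Path:
    """Copy data_dir into a new timestamped folder under backup_root."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    # copytree creates the target (and missing parents) and fails if the
    # target already exists, so an older backup is never overwritten.
    shutil.copytree(data_dir, target)
    return target
```

Restoring on the other machine is the same copy in reverse. Two backups taken within the same second would collide on the folder name, which is acceptable for a manual routine like the one described.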
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Good for you, Jim. (The satisfaction in herding one of these large beasts to the finish line is worth the effort.) "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Jim, I hate to say this, but I think there's something seriously wrong with this model and it may have been flawed from the outset. Like the Greek tragedies where, given the starting situation, a tragic ending is inevitable. But with the difference that while Greek audiences knew in advance whether they were going to see a tragedy or a comedy, on CPDN the suspense is greater because we never know beforehand how a model will turn out. Here are the AMD's tasks. Here's Task 7202686, which is of course a 160-year HADCM. The graph only shows years to 2019 though Jim says it's reached 2061. Here is its graph as far as it goes. That seems to have been abnormal from the start. I've looked at the whole workunit 6133966. Two other members made progress with their tasks from this WU but both models crashed. Here are Indefual's model and graph. That temperature rise from 1920-1945 looks crazy to me. The rise in total precipitation early in the model looks equally bad. Nixniz ran the model on his Intel. Here it is and here's its graph. I don't like all those ups and downs and it shows the same early extreme temperature rise. I think that in spite of all Jim's efforts he should abort this model but I'd appreciate some other opinions. Cpdn news |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Hi Mo, this is Jim. So you think the model is abnormal and should be aborted? I hate to give up on this one. I have been running it for 7 months! I’m no expert at reading the graphs, but the steep rise in temp between 1920 and 1945 does seem somewhat extreme. Did the model stop transmitting temp data in 2019? That would mean that I have been running it for the last 5 weeks for nothing. There is no hurry to make a decision. Right now the WU is sitting inactive in a backup file and I am running a slab model in its place. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
You could continue on and spend months more on it. But the researchers could then run their statistical analysis programs on it and throw it out, and you'd never know. Given the results so far, I'd be inclined to abort it. And if it takes you that long to only get as far as you have with that model, I'd suggest that you stick to the shorter slab models. Backups: Here |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Jim, I've looked at about 8 of the model's graphs and none look normal. There's missing data at the beginning and something strange happened about 1933. I think you need to merge the records for your AMD. I think you have a phantom record, probably as a result of restoring a backup. It could be that after merging the records, the missing years after 2019 will show up on the graphs. Trickles may have been going to the phantom computer. The computer time hasn't been completely wasted because the researchers also need to know what parameter sets are unviable. I know it really hurts to abandon a model after looking after it so carefully and for so long, but I think you're going to have to make yourself abort it. I think you should keep a close eye on the models your AMD runs because a slab HADSM model you ran earlier has missing precipitation data for the last few years of phase 3, whereas the other two members who completed it on an AMD and an Intel produced normal graphs. Usually, though not always, a slab model that goes wrong produces the same abnormality on all the computers that run it. Here is that slab. It speeded up for its last few trickles because it wasn't processing all the data. Keeping an eye on models means: looking at the globe every few days to check it still looks normal with all the colours, not a monochrome display; noticing the crunching speed (sec/timestep), which shouldn't slow down or speed up very much as the model progresses; and looking at the model's graphs on its web page to check that data is there. Jim, is your AMD overclocked, or have the settings and timings been altered in any way? I'm asking this because a beta HADSM slab I ran on my AMD speeded up massively early on and the globe blanked out. Another mod, PeteB, got me to check the settings using CPU-Z. It turned out that the settings had been speeded up by 2½%, probably at the shop that had sold the CPU & motherboard to my son a couple of years earlier as part of a barebones package. As soon as I got it back to factory settings the computer behaved perfectly and I was able to rerun the same slab with normal results. I'm not suggesting that this problem is anything to do with AMD as opposed to Intel. But if there's any instability in a computer, it's likely to affect climate models. It may be significant that your HADCM is much more abnormal than the two run by other crunchers. If you want to download CPU-Z, which is a freebie, to the AMD, and check the actual settings against stock settings for your CPU and RAM, one of us should be able to advise you how to use the tool. Cpdn news |
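The second check mentioned above, watching sec/timestep for unusual changes, could be roughed out as follows if you note the figure down from time to time. This is only a sketch: the function name and the 25% tolerance are arbitrary assumptions, not a CPDN guideline, and the readings would have to be recorded by hand.

```python
from statistics import median


def flag_speed_drift(sec_per_timestep, tolerance=0.25):
    """Return the indices of readings that differ from the median reading
    by more than `tolerance`, expressed as a fraction of the median."""
    if not sec_per_timestep:
        return []
    baseline = median(sec_per_timestep)
    return [
        index
        for index, value in enumerate(sec_per_timestep)
        if abs(value - baseline) > tolerance * baseline
    ]
```

For example, with readings of 2.0, 2.1, 1.9, 2.0 and 0.9 sec/timestep, only the last one is flagged, which is the sort of sudden speed-up described for the slab that stopped processing all its data.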
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Hi Mo and Les. Thanks for the advice. I will abandon the defective CM WU. Since it is not presently installed on my machine (it's sitting in a backup file), do I need to reinstall it and formally abort it? I know that one of my past slab models went iceball at 92%. The AMD computer is a laptop with factory settings as far as I know. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Hi, guys. This is Jim again. I took your advice and merged the 2 versions of my HP computer. Did it make the temp results for 2019 to 2060 appear? |
©2025 cpdn.org