Message boards : Number crunching : Iceworld (HadSM and HadSM MH) discussion
Author | Message |
---|---|
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,901,585 RAC: 2,106 |
[Urglab wrote:] Hi, I just noticed one of my tasks turned snowball too. Progress is at 29.43%... another Windows/Intel machine in the work unit has got further with that model (here), which suggests that your model might be recoverable if you have a backup. If not and the graphics stay blue for a while then the model should be aborted as something has gone seriously wrong. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,901,585 RAC: 2,106 |
[Hans-Henrik Husen wrote:] I'm running the hadsm3fub_jwxd_006453195_1 ... Does anybody have an explanation? First of all, welcome to the message board - and there is indeed an explanation. That model has become an 'iceworld', which on a Windows/Intel machine means that it processes very slowly and the temperature graphical display shows all blue. The model will eventually finish (another user in that work unit has finished - after 6,345,735 s [i.e. 73 days!]), but our advice is to abort such models and get on with something a bit more productive. Ideally, CPDN models could be run without any user intervention but unfortunately the user does sometimes have to get involved. This particular problem affects only models in the HADSM3 family. The other model currently running, FAMOUS, operates at the other extreme: if it finds something wrong it stops, which can be frustrating as well - but at least no time is lost. |
Send message Joined: 7 Sep 09 Posts: 2 Credit: 13,113,974 RAC: 0 |
Thank you for your answer! I learned something new today - Iceworld! I'll abort the model (and another Iceworld model I also have running). Regards HH Husen |
Send message Joined: 15 Feb 06 Posts: 18 Credit: 131,262 RAC: 0 |
My project hadsm3dhet2_k2xl_006613451 has developed an Iceworld, but it has reached over 97% (237238/259248), so should I let it run to completion? |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,901,585 RAC: 2,106 |
"My project hadsm3dhet2_k2xl_006613451 has developed an Iceworld, but it has reached over 97% (237238/259248), so should I let it run to completion?" It will eventually finish, so you could finish it if you want to; however, it will possibly take a month or so to do it! Since someone else has already finished that model on Windows/Intel - complete with iceworld - my advice would be to abort it and run a fresh model. |
Send message Joined: 15 Feb 06 Posts: 18 Credit: 131,262 RAC: 0 |
OK, thanks! It seems to have run backwards in the last hour! |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,901,585 RAC: 2,106 |
"OK, thanks! It seems to have run backwards in the last hour!" In that case you must certainly abort it. Some iceworlds get into a looping state in which they endlessly repeat the same "checkpoint" (i.e. up to 144 timesteps). I've not been able to work out a pattern for which processor versions do and don't suffer that fate (an old P4 of mine and a laptop did that) - but stopping the model is the only option. |
Send message Joined: 8 May 05 Posts: 2 Credit: 1,373,627 RAC: 0 |
Please tell me what's up with this one: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6702471 It has been stalled at 83% for several days and the "to completion" time keeps climbing. I can't obtain any information from the graphic because I can't display graphics for this or any project (due to protected application installation of BOINC?) It's running under BOINC 6.10.58, 32-bit WinXP, on a dual-boot MacBook (Core2 Duo P7350). |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
This is the model in question. There's no error information on that page, and the s/TS looks constant. Before the final upload, the only place where info about the condition of a model exists is on people's computers, and a lot of that is in the various graphics displays. So, your guess is as good as ours. :( Sorry. About the only advice that I could give you would be to keep running it until you get bored with the lack of progress, and then abort it. This will provide visible info on what has gone wrong. Backups: Here |
Send message Joined: 8 May 05 Posts: 2 Credit: 1,373,627 RAC: 0 |
"This is the model in question." Killing it. It ran for 482 hours, which is about 3x what these things normally run. I'm getting rid of this machine next week, so I don't have time to wait for it anymore. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
It definitely was an iceworld in Intel + Windows. Here's an Intel/Win model from the same workunit. Look at the sec/timestep. The slowdown is really much worse than the numbers suggest because we see the cumulative average, not the current speed. So you did the right thing to abort it. What a pity that more members with an iceworld don't report the problem on the forum. If that member had reported the iceworld in February 2010 you would have received an email warning you about this probability. Another member with Intel/Win is still running the model but it's less advanced. She needs an email. Cpdn news |
Send message Joined: 23 Jan 10 Posts: 1 Credit: 3,321,873 RAC: 0 |
If you would like another data point for debugging, it looks like I have an ice world in the Slab 6.07 model. It's been running a while, and a week or two ago I noticed that it looked like it was going backwards. Today I realized that at the rate it was going it would never make the deadline and decided to look into it. Do you want me to abort it now, or leave it running for a while longer? It is currently suspended. # A link to the model/ResultID webpage: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10955184 # A current timestep of that model (on the globe graphic): 16/03/1815 05:00 # The s/TS value (on the globe graphic. Remember, you can hit the Z key while viewing the globe and it will give you this additional text/status information.): 70.41 # Whether the temperature display of the globe graphic is blue: All blue, I would say it's all -42 (hard to tell, those blues are pretty close) # What your processor/CPU and Operating System is (i.e. Intel or AMD on Windows or Linux): Dual Nehalem Xeon 2.26GHz quad core with hyperthreading enabled, Windows 7 # Whether you are overclocking: No overclocking (SuperMicro server motherboard doesn't allow it) |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,901,585 RAC: 2,106 |
Thanks for reporting that, Mark. It's definitely an iceworld, as all the other Windows/Intel machines have run into the same problem. That one didn't take long before misbehaving! The only option for a phase-1 iceworld is to abort it, as it will never recover. Fortunately, the current FAMOUS and HADAM3P models don't become slow-processing iceworlds. |
Send message Joined: 25 Nov 09 Posts: 1 Credit: 204,092 RAC: 0 |
Hi guys, So two possible units that seem to meet these criteria, as follows: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10079626 Timestep: 154983 of 259248, s/TS: 7.4, Temp: All blue. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10079752 Timestep: 247718 of 259248, s/TS: 4.92, Temp: All blue. Intel I7 960 3.2GHz quad core - Windows 7. It's been running for over a year and I've only just noticed these two haven't completed yet! |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,901,585 RAC: 2,106 |
[Ben Smith wrote:] It's been running for over a year and I've only just noticed these two haven't completed yet! Two finishes and two aborts in those work units after a colossal amount of time. Unfortunately, there's nothing we can do other than advertise the problem here; the slab model has been retired now so it won't be fixed. Abort them! |
Send message Joined: 18 Feb 05 Posts: 5 Credit: 2,654,795 RAC: 654 |
OK, I'm killing this one: Task 10941637, Name hadsm3dhet2_jjad_006587991_2, Workunit 6791364, Computer ID 1047305. I'm not sure exactly when it went bad, but I started trying to figure out what was going on when the database work started and I couldn't access the forums - definitely an iceball. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Crystalsys, you definitely did the right thing to abort that iceworld. Here it is. You can see from the seconds per timestep column on the right that the slowdown started over a month ago. With this type of model this problem doesn't correct itself. Other computers with a model from the same workunit are managing to finish it if they have an AMD processor (i.e. not Intel like yours) or are running Linux or Mac rather than Windows. However, there's a member with Intel + Win like you who's hit the same 'iceworld' at the same point as you. Here's the model, truly stuck in this loop for ages and probably unnoticed by its owner. I shall ask our new acting sysadmin, Jonathan, if he can send one of the special iceworld emails to this hapless member. So thanks for informing us, Crystal. Cpdn news |
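For anyone wanting to check their own model, the point made earlier in this thread (that the displayed s/TS is a cumulative average and so understates the current slowdown) can be put into numbers. The following is only a rough sketch in Python, not anything from CPDN or BOINC: it assumes the displayed s/TS is total elapsed seconds divided by timesteps completed, the function names are made up for illustration, and the first pair of readings is hypothetical. The second calculation uses the timestep and s/TS figures quoted above by Ben Smith.

```python
SECONDS_PER_DAY = 86_400


def recent_seconds_per_timestep(ts_a, avg_a, ts_b, avg_b):
    """Estimate the average s/TS between two readings.

    Each reading is (timesteps completed, cumulative average s/TS).
    Assumption: cumulative average = elapsed seconds / timesteps,
    so elapsed seconds can be recovered as timesteps * average.
    """
    if ts_b <= ts_a:
        # No forward progress between readings (e.g. a looping model).
        raise ValueError("second reading must be at a later timestep")
    elapsed_a = ts_a * avg_a
    elapsed_b = ts_b * avg_b
    return (elapsed_b - elapsed_a) / (ts_b - ts_a)


def days_remaining(current_ts, total_ts, seconds_per_ts):
    """Days left if the model kept running at the given s/TS."""
    return (total_ts - current_ts) * seconds_per_ts / SECONDS_PER_DAY


# Hypothetical readings: the cumulative average only doubles (2.0 -> 4.0),
# yet the speed over the last 10,000 timesteps is far worse.
recent = recent_seconds_per_timestep(100_000, 2.0, 110_000, 4.0)
print(recent)  # 24.0

# Figures quoted in the thread: timestep 154983 of 259248 at s/TS 7.4.
# Because 7.4 is a cumulative average, this is at best a lower bound.
print(round(days_remaining(154_983, 259_248, 7.4), 1))  # 8.9
```

Taking two readings a day or so apart and feeding them to the first function gives a much better idea of the current speed than the single number on the globe graphic. If the second reading is not ahead of the first, the model is not progressing at all, which matches the looping behaviour described above.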