Thread 'hadAm looping'

old_user219190

Message 34841 - Posted: 30 Aug 2008, 22:59:43 UTC
Last modified: 30 Aug 2008, 23:04:17 UTC

Hi
This WU is looping, not something expected from a "hadam" job!
It crunches from 72.855% to 72.93% over and over.
The checkpoints are 143 and 100. Each time it reverts, the globe turns blue and it trickles. The date on the model is 24/12/2000.
I have tried restarting BOINC, pausing, etc., to no avail.
I do not have an AMD computer to try it on, so I think it will have to be aborted.
That would be a shame; it is the only work unit I have participated in where none of the crunchers have crashed. It was going along very nicely.
Chris.
Edit: It just killed itself :(


mo.v
Volunteer moderator
Message 34843 - Posted: 31 Aug 2008, 12:31:14 UTC

Hi Chris

None of the other people running the same workunit have reached the point where yours looped, so I think we'll have to keep an eye on what happens to this HADAM on the other computers. I think a model is only allowed to loop/restart 99 or 100 times and then it automatically aborts, which is a good idea because it means the model can't get stuck in the loop indefinitely. At least with these HADAMs it seems to be a fairly short loop.
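To illustrate the restart limit described above, here is a minimal Python sketch of the general idea. It is not CPDN's actual code: the function name, the `step` callable and the limit of 100 are all assumptions made for illustration (the figure mentioned above is only "99 or 100").

```python
MAX_RESTARTS = 100  # assumed value; the real limit may be 99 or 100


def run_with_restart_limit(step, max_restarts=MAX_RESTARTS):
    """Retry a hypothetical model step until it succeeds or the limit is hit.

    `step` is a made-up callable: it returns True when the model gets past
    its checkpoint and False when it has to revert to that checkpoint.
    Returns a (status, restart_count) tuple.
    """
    restarts = 0
    while restarts < max_restarts:
        if step():
            return ("completed", restarts)
        restarts += 1  # the model reverted to its last checkpoint
    # Give up rather than loop forever.
    return ("aborted", restarts)


# A model that never gets past its sticking point is aborted at the limit.
status, count = run_with_restart_limit(lambda: False)
print(status, count)
```

The counter is what guarantees the loop ends: without it, a model that always reverts to the same checkpoint would tie up the computer indefinitely.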

Better luck with your next model. There have also been problems recently with HADAMs running on BOINC6 and the HADAM work queue will probably not be filled again until Milo and Tolu have discussed this.
old_user219190

Message 34844 - Posted: 31 Aug 2008, 13:46:26 UTC

Hi mo.v
Counting back through the messages, it only looped five times before crashing.
As these were only 10 minutes apart, less than an hour was lost, so no problem there.
All the others crunching the unit are also Intel based, so the only thing it may prove (assuming each unit is the same) is that my computer is on the way out!
Cheers
Chris.


mo.v
Volunteer moderator
Message 34845 - Posted: 31 Aug 2008, 18:02:04 UTC

If all your previous models were crunched on the same computer, it's performed very well so far. All the tasks in a WU start off with identical values, so usually if there's a flaw in the model it causes the same problem on all the computers running it. However, as models develop they tend to diverge slightly on different computers (the Lorenz butterfly effect), so it isn't inevitable that a model-related problem on one computer will be replicated on every other computer running the same model.
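As a rough illustration of that divergence, the sketch below integrates the classic Lorenz equations twice from starting points that differ by one part in a billion. It is a toy stand-in, not the HadAM model: the parameter values are the standard textbook ones, and the step size, step count and function names are assumptions chosen for the example.

```python
import math


def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz system by one forward-Euler step."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return (x + dt * dx, y + dt * dy, z + dt * dz)


def separations(start_a, start_b, steps=5000):
    """Return the distance between two trajectories after each step."""
    a, b = start_a, start_b
    result = []
    for _ in range(steps):
        a = lorenz_step(a)
        b = lorenz_step(b)
        result.append(math.dist(a, b))
    return result


# Two runs whose starting x values differ by only 1e-9.
seps = separations((1.0, 1.0, 1.0), (1.0 + 1e-9, 1.0, 1.0))
print("separation after first step:", seps[0])
print("largest separation seen:", max(seps))
```

The two runs stay almost identical at first and then drift far apart, which is the same kind of behaviour that lets tiny rounding differences between computers grow into visibly different model runs.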

If I were you I'd look back in a week or so to see whether any other computers encounter the same problem at the same point. If they all sail past your sticking point and complete the WU, you might want to dust the computer inside, check its temperatures and, if it's overclocked, take it back to stock settings.
Iain Inglis

Message 34856 - Posted: 1 Sep 2008, 15:54:23 UTC

By the way, the HADAM3 queue has been re-filled.
mo.v
Volunteer moderator
Message 34903 - Posted: 5 Sep 2008, 10:20:38 UTC
Last modified: 5 Sep 2008, 10:24:06 UTC

Chris, you'll be relieved to see that a second task from your WU has crashed at the same point, so your computer's off the hook.

The graphs of the two crashed models are abnormal.

I'll send a PM to the second cruncher to advise him not to spend time restoring a backup, and PMs to the other two people further behind. Better for them to know that it's a faulty WU and not a problem with their computers.
old_user219190

Message 34960 - Posted: 10 Sep 2008, 10:00:22 UTC
Last modified: 10 Sep 2008, 10:01:20 UTC

Hi mo.v
Sorry for the late reply; I have been away.
Yes, it is all the same computer, with no overclocking (I don't know how!).
I had another hadam crash back in January, and it carried the same error code.
Also, the temperature graph was the same, hovering around 0C in the last couple of months before the crash. Interestingly, both workunits have lost their graphs over the last few days; it is no longer possible to access 'Run Info' or 'time series' on either.
Although a lot more stable than the slabs, the hadams appear to have their issues.

