climateprediction.net (CPDN) home page
Thread 'model reset with no error message'

Questions and Answers : Windows : model reset with no error message

old_user12728
Joined: 5 Sep 04
Posts: 6
Credit: 8,779,686
RAC: 0
Message 28654 - Posted: 13 May 2007, 0:19:10 UTC
Last modified: 13 May 2007, 0:19:58 UTC

My computer's BOINC workload is split between two projects: SETI and CPDN. Since SETI has been down for the past few days, the machine has been running CPDN 24 hours a day. I've noticed, however, that the latest model it downloaded is behaving abnormally: since it was downloaded some 32 hours ago, it has reset itself at least twice (BOINC currently shows only ~40 minutes spent on it) without throwing any error message. What might be happening? This is the computer in question: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=525394.
old_user12728
Joined: 5 Sep 04
Posts: 6
Credit: 8,779,686
RAC: 0
Message 28657 - Posted: 13 May 2007, 4:11:10 UTC - in response to Message 28654.  

Update: it happened again approximately 1 hour ago. I'm aborting this particular model (hadcm3inct_cnsr_1920_160_05891600_0) to see whether a new one shows similar behaviour.

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 28660 - Posted: 13 May 2007, 12:54:34 UTC
Last modified: 13 May 2007, 13:27:54 UTC

My guess is that your problem is related to this message shown on the result page:
Model crashed: umshell1.f:  TRANSO2A: Missing data in ocean UV fields

I've not seen that particular error before, but given that there are 9 of them, I suspect it's caused by an incompatibility in the model's parameter set.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 28661 - Posted: 13 May 2007, 14:20:17 UTC

Thyme, doesn't the 'scan_lockfile' part of the messages mean that an AV scan ran while the model was still running, so that a model file was locked and ruined?
old_user12728
Joined: 5 Sep 04
Posts: 6
Credit: 8,779,686
RAC: 0
Message 28664 - Posted: 13 May 2007, 15:57:51 UTC
Last modified: 13 May 2007, 15:58:48 UTC

Thank you both for your input.

mo.v: As far as I'm aware there is no AV scan scheduled at this time, and in any event I have put the entire BOINC directory on the exclusion list of the AV scanner (avast).

It appears a reset has occurred again at some point in the past 8 hours for the current WU. I'll keep an eye on it as the day progresses to see whether that is indeed the case. I'm not sure if this is relevant, but this only started happening after application 5.40 was downloaded in place of 5.15. Are there any known issues between 5.40 and Dothan Pentium M machines?
old_user12728
Joined: 5 Sep 04
Posts: 6
Credit: 8,779,686
RAC: 0
Message 28673 - Posted: 13 May 2007, 23:11:22 UTC

Another update: I've been trying to track what's going on with the model. At 08:08 local time the CPU time spent on the model was reported as 4 hours 49 minutes, with 0.347% of the model completed. 40 minutes later the CPU time had advanced accordingly, with 0.369% of the model completed. By 16:03 local time the CPU time had been reset to 3 hours 18 minutes, yet the percentage completed had advanced "normally" to 0.595%.

I must say I'm quite baffled. Should I continue with the model?
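The pattern described above (CPU time going backwards while the percentage keeps rising) can be checked mechanically. This is a minimal Python sketch, assuming the readings are noted by hand from the BOINC manager; the function name and data layout are illustrative, and only the two readings with a stated CPU time are included:

```python
from datetime import timedelta

# (local time, reported CPU time, percent complete), as quoted in the post.
observations = [
    ("08:08", timedelta(hours=4, minutes=49), 0.347),
    ("16:03", timedelta(hours=3, minutes=18), 0.595),
]

def find_cpu_time_resets(obs):
    """Return (earlier, later) time pairs where CPU time fell but progress did not."""
    resets = []
    for (t0, cpu0, pct0), (t1, cpu1, pct1) in zip(obs, obs[1:]):
        if cpu1 < cpu0 and pct1 >= pct0:
            resets.append((t0, t1))
    return resets

resets = find_cpu_time_resets(observations)
print(resets)
```

A non-empty result suggests the CPU-time counter was reset while the model itself kept its progress, which is a different situation from the model restarting from scratch.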
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 28674 - Posted: 14 May 2007, 1:51:51 UTC

Hi again

I've had a look at the results pages for the 6 models that this computer has had (well, 5 really, because the new model hasn't trickled yet).

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=525394

The first 4 models the computer had all crashed with negative pressure messages. The first ran at 2.9 sec/TS. The following 3 all did about 2.29 sec/TS which is significantly faster. Was that when the new faster model version was released?

Model #5 is the one you aborted because of the missing data. It only produced one trickle but is recorded at 0.82 sec/TS. Have I read that right? Is this possible? A standard cpdn model on the same machine as the first 4 models?

#6 is the new one. If the model dates are advancing, you can keep trying.

Is this computer overclocked?
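
The sec/TS figures quoted above can be compared directly. This is a minimal Python sketch using only the three values from the results pages; the names are illustrative, and lower sec/TS means faster:

```python
# Seconds per timestep as quoted in the post (lower is faster).
sec_per_ts = {
    "model 1": 2.9,
    "models 2-4": 2.29,
    "model 5": 0.82,
}

def speedup(old, new):
    """How many times faster `new` is than `old`, both given in sec/TS."""
    return old / new

first_jump = speedup(sec_per_ts["model 1"], sec_per_ts["models 2-4"])
second_jump = speedup(sec_per_ts["models 2-4"], sec_per_ts["model 5"])

print(f"model 1 -> models 2-4: {first_jump:.2f}x")
print(f"models 2-4 -> model 5: {second_jump:.2f}x")
```

The second ratio comes out at roughly 2.8x, which may be why the 0.82 figure stands out for the same machine.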


old_user12728
Joined: 5 Sep 04
Posts: 6
Credit: 8,779,686
RAC: 0
Message 28678 - Posted: 14 May 2007, 5:13:07 UTC

I'm afraid model #6 has crashed with an error I've never seen before: "The device does not recognize the command. (0x16) - exit code 22 (0x16)"...
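
For reference, 0x16 is simply 22 in hexadecimal, and the quoted text matches the standard Windows system message for error 22 (ERROR_BAD_COMMAND), so the wording likely comes from looking up the exit code rather than from an actual device problem. This is a small Python sketch; the helper name is hypothetical, and `ctypes.FormatError` is only available on Windows:

```python
import sys

exit_code = int("0x16", 16)  # hexadecimal 0x16 as quoted in the error

def windows_error_text(code):
    """Return the Windows system message for an error code, or None on other platforms."""
    if sys.platform != "win32":
        return None
    import ctypes  # FormatError exists only in the Windows build of ctypes
    return ctypes.FormatError(code)

print(exit_code, windows_error_text(exit_code))
```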

I'm not too sure how to interpret the slowness in processing model #5; perhaps that's an indication that something has gone awry?

This computer is overclocked: it's running a Pentium M 735 pin-modded to 133*17 instead of 100*17. I swapped in this CPU, replacing a Pentium M 730 at its stock speed of 133*12, 2 days before model #1 crashed, hence the increase in speed thereafter. I have been assuming that the recent spate of errors has nothing to do with the overclocking, since it ran model #5 for more than 2 months, but if I notice something strange again with the now-current model #7, I will back off a little and see what happens.
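
The clock speeds implied by these bus-speed and multiplier figures can be worked out as follows. This is a minimal Python sketch using the nominal values from the post (the real bus clock is closer to 133.33 MHz, so actual speeds differ slightly; the function and variable names are illustrative):

```python
def core_clock_mhz(bus_mhz, multiplier):
    """Nominal core clock in MHz: bus speed times CPU multiplier."""
    return bus_mhz * multiplier

stock_730 = core_clock_mhz(133, 12)   # Pentium M 730 at stock
stock_735 = core_clock_mhz(100, 17)   # Pentium M 735 at stock
modded_735 = core_clock_mhz(133, 17)  # Pentium M 735 after the pin-mod

# Overclock relative to the 735's stock speed, in percent.
overclock_pct = 100 * (modded_735 - stock_735) / stock_735

print(stock_730, stock_735, modded_735)
print(f"overclock: {overclock_pct:.0f}%")
```

With these nominal numbers the pin-mod takes the 735 from 1700 MHz to 2261 MHz, about a third above stock.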
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 28680 - Posted: 14 May 2007, 11:17:55 UTC


Slowing down is a symptom of too much overclocking - the model has to retry sections in order to get sensible figures. When I'm overclocking I use 24 hours of Prime95's torture test to make sure the PC is still OK.

I think the exit code 22 crash is something else, but it's being looked into.

I'm a volunteer and my views are my own.
old_user12728
Joined: 5 Sep 04
Posts: 6
Credit: 8,779,686
RAC: 0
Message 28688 - Posted: 14 May 2007, 15:48:30 UTC

Sad to report that backing off my overclock by 133MHz (a multiplier of 16 instead of 17) still yielded a "resetting" model, at least in terms of reported CPU-time usage. Just as with the model immediately prior to this one, the % completion appears to have advanced normally. I'll also be keeping an eye on the model date, in the hope that it lets those of you "in the know" help me diagnose the problem.