Thread 'model reset with no error message'

Author	Message
old_user12728 Joined: 5 Sep 04 Posts: 6 Credit: 8,779,686 RAC: 0	Message 28654 - Posted: 13 May 2007, 0:19:10 UTC Last modified: 13 May 2007, 0:19:58 UTC My computer's BOINC workload is split between 2 projects: SETI and CPDN. Since SETI has been down for the past few days, it's been running 24 hours a day on CPDN. I've noticed, however, that the latest model the computer has downloaded is behaving abnormally: since it was downloaded some 32 hours ago, it has reset itself at least twice (BOINC is showing that only ~40 minutes have been spent on the project currently), without throwing any error message. What might be happening? This is the computer in question: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=525394.

old_user12728 Joined: 5 Sep 04 Posts: 6 Credit: 8,779,686 RAC: 0	Message 28657 - Posted: 13 May 2007, 4:11:10 UTC - in response to Message 28654. Update: it happened again approximately 1 hour ago. I'm aborting this particular model (hadcm3inct_cnsr_1920_160_05891600_0) to see whether a new one shows similar behaviour.

Thyme Lawn Volunteer moderator Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0	Message 28660 - Posted: 13 May 2007, 12:54:34 UTC Last modified: 13 May 2007, 13:27:54 UTC My guess is that your problem was related to this message shown on the result page: "Model crashed: umshell1.f: TRANSO2A: Missing data in ocean UV fields". I've not seen that particular error before, but given that there are 9 of them, my guess is that it's related to an incompatibility in the model's parameter set. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 28661 - Posted: 13 May 2007, 14:20:17 UTC Thyme, does the 'scan_lockfile' part of the messages not mean that an AV scan was done with the model still running, with the result that a model file was locked and ruined?

old_user12728 Joined: 5 Sep 04 Posts: 6 Credit: 8,779,686 RAC: 0	Message 28664 - Posted: 13 May 2007, 15:57:51 UTC Last modified: 13 May 2007, 15:58:48 UTC Thank you both for your input. mo.v: As far as I'm aware there is no scheduled AV scan at this time, and in any event I have put the entire BOINC directory on the exclusion list of the AV scanner (avast). It appears a reset has occurred again some time in the past 8 hours for the most current WU. I'll keep an eye on it to see if that is indeed the case as the day progresses. Not sure if this is relevant, but this started happening only after application 5.40 was downloaded in place of 5.15. Has there been any known issue between 5.40 and Dothan Pentium M machines?

old_user12728 Joined: 5 Sep 04 Posts: 6 Credit: 8,779,686 RAC: 0	Message 28673 - Posted: 13 May 2007, 23:11:22 UTC Another update: I've been trying to track what's going on with the model. At 08:08 local time the amount of CPU time spent on the model was supposedly 4 hours 49 minutes, with 0.347% of the model completed. 40 minutes later the CPU time had advanced accordingly, with 0.369% of the model completed. At 16:03 local time the CPU time had been reset to 3 hours 18 minutes, yet the amount of the model completed had advanced "normally" to 0.595%. I must say I'm quite baffled. Should I continue with the model?...
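Editorial sketch: the pattern described in the post above (reported CPU time going down while percent complete keeps going up) can be checked mechanically. The following is a minimal illustration in Python, not anything BOINC or CPDN provides. The function name `find_cpu_time_resets` and the tuple layout are hypothetical, and only the two readings for which the post gives both figures are used.

```python
from datetime import timedelta

# (local time, CPU time reported by BOINC, percent of model completed).
# Values are the two readings quoted in the post; the 08:48 reading is
# left out because its CPU time is not stated.
observations = [
    ("08:08", timedelta(hours=4, minutes=49), 0.347),
    ("16:03", timedelta(hours=3, minutes=18), 0.595),
]

def find_cpu_time_resets(obs):
    """Return (earlier, later) time pairs where CPU time fell while progress rose."""
    resets = []
    for (t0, cpu0, pct0), (t1, cpu1, pct1) in zip(obs, obs[1:]):
        if cpu1 < cpu0 and pct1 > pct0:
            resets.append((t0, t1))
    return resets

print(find_cpu_time_resets(observations))
```

A pair is flagged only when both conditions hold, so a model that genuinely restarted from scratch (progress also dropping) would not be reported by this check.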

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 28674 - Posted: 14 May 2007, 1:51:51 UTC Hi again. I've had a look at the results pages for the 6 models that this computer has had (well, 5 really, because the new model hasn't trickled yet). http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?hostid=525394 The first 4 models the computer had all crashed with negative pressure messages. The first ran at 2.9 sec/TS. The following 3 all did about 2.29 sec/TS, which is significantly faster. Was that when the new faster model version was released? Model #5 is the one you aborted because of the missing data. It only produced one trickle but is recorded at 0.82 sec/TS. Have I read that right? Is this possible? A standard cpdn model on the same machine as the first 4 models? #6 is the new one. If the model dates are advancing, you can keep trying. Is this computer overclocked?

old_user12728 Joined: 5 Sep 04 Posts: 6 Credit: 8,779,686 RAC: 0	Message 28678 - Posted: 14 May 2007, 5:13:07 UTC I'm afraid model #6 has crashed with an error I've never seen before: "The device does not recognize the command. (0x16) - exit code 22 (0x16)"... I'm not too sure how to interpret the slowness in processing model #5; perhaps that's an indication that something has gone awry? This computer is overclocked: it's running a Pentium M 735 pin-modded to 133*17 instead of 100*17. I changed the CPU from a Pentium M 730 at stock speed of 133*12, 2 days before model #1 crashed, thus the increase in speed thereafter. I have been running under the assumption that the recent spate of errors has nothing to do with the overclocking since it ran model #5 for more than 2 months, but if I notice something strange again with the now-current model #7, I will back off some and see what will happen.

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 28680 - Posted: 14 May 2007, 11:17:55 UTC Slowing down is a symptom of too much overclocking - the model has to retry sections in order to get sensible figures. When I'm overclocking I use 24 hours of Prime95's torture test to make sure the PC is still OK. I think the exit code 22 crash is something else, but it's being looked into. I'm a volunteer and my views are my own.

old_user12728 Joined: 5 Sep 04 Posts: 6 Credit: 8,779,686 RAC: 0	Message 28688 - Posted: 14 May 2007, 15:48:30 UTC Sad to report that backing off my overclock by 133MHz (a 16 instead of a 17 multiplier) still yielded a "resetting" model, at least in terms of reported CPU-time usage. Just like the model immediately prior to this one, I'm noticing the % completion has advanced seemingly normally. I'll also be keeping an eye on the model date to hopefully let you all "in the know" help me diagnose the problem.
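Editorial sketch: the clock figures discussed in the thread follow from core clock = bus clock x multiplier. The short Python illustration below uses nominal bus speeds of 100 and 133 MHz (an assumption; the real bus runs at roughly 133.33 MHz) and the multipliers 12, 16 and 17 mentioned in the posts. The labels and the helper `core_clock_mhz` are hypothetical names chosen for this example.

```python
def core_clock_mhz(bus_mhz, multiplier):
    """Nominal core clock in MHz: bus clock times CPU multiplier."""
    return bus_mhz * multiplier

# Configurations mentioned in the thread, using nominal bus speeds.
configs = {
    "Pentium M 730, stock": (133, 12),
    "Pentium M 735, 100 MHz bus": (100, 17),
    "Pentium M 735, pin-modded": (133, 17),
    "Pentium M 735, backed off": (133, 16),
}

for name, (bus, mult) in configs.items():
    print(f"{name}: {bus} x {mult} = {core_clock_mhz(bus, mult)} MHz")

# Dropping the multiplier from 17 to 16 removes exactly one bus step.
step = core_clock_mhz(133, 17) - core_clock_mhz(133, 16)
print(f"Backing off one multiplier step: {step} MHz")
```

With these nominal values, the backed-off setting (133 x 16) is still well above the 100 x 17 configuration, which may be worth keeping in mind when judging whether the overclock has been ruled out.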