Questions and Answers : Unix/Linux : SC Model won't start
Author | Message |
---|---|
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Some of us saw this in Alpha or Beta (or CMspinup?): Starting model in /home/jim/CPDNboinc/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc... Created shared memory region key = 26580 .so shmem return code = 136457540 Copying files for startup... Starting model ID 482y_a00297370 Phase 1 Waiting for model startup, this may take a minute... Model timeout at 180.00 seconds Preparing for restart... Rewinding a model-day... Starting model ID 482y_a00297370 Phase 1 Waiting for model startup, this may take a minute... And waiting, and waiting, and waiting ... If it does as before, it will fail because it can't find restart files. (P4 3.0, SuSE 9.0) Edit: As expected: Waiting for model startup, this may take a minute... Model timeout at 180.00 seconds Preparing for restart... Rewinding a model-month... Error: Restart files for dataout/restart.month not found Giving up, this result exceeded crash count for available restart files. 2005-09-13 02:45:19 [climateprediction.net] Computation for result 482y_a00297370 finished 2005-09-13 02:45:19 [climateprediction.net] Unrecoverable error for result 482y_a00297370_0 (Output file error: -1) 2005-09-13 02:45:19 [climateprediction.net] Unrecoverable error for result 482y_a00297370_0 (Output file error: -1) Another Model was D/L & started; it began to play the same game so: CTRL-C. Shutting down another box -- nothing to run. That makes two of my five... "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
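For anyone comparing their own output against the failure above, here is a minimal Python sketch that tallies the timeouts, the rewind steps, and any missing restart file in a captured log. The marker strings are copied from the post; the function name `summarize` and the idea of saving the client output to a string are assumptions for illustration, not part of BOINC or the CPDN application.

```python
import re

# Marker strings taken from the log quoted in the post above.
TIMEOUT = re.compile(r"Model timeout at ([\d.]+) seconds")
REWIND = re.compile(r"Rewinding a model-(\w+)")
MISSING = re.compile(r"Restart files for (\S+) not found")


def summarize(log_text):
    """Collect timeouts, rewind steps, and missing restart files from log_text.

    `summarize` is a hypothetical helper, not a BOINC or CPDN function.
    """
    return {
        "timeouts": [float(s) for s in TIMEOUT.findall(log_text)],
        "rewinds": REWIND.findall(log_text),
        "missing_restart": MISSING.findall(log_text),
    }


# Excerpt of the failure sequence reported in the post.
sample = (
    "Waiting for model startup, this may take a minute... "
    "Model timeout at 180.00 seconds Preparing for restart... "
    "Rewinding a model-day... "
    "Waiting for model startup, this may take a minute... "
    "Model timeout at 180.00 seconds Preparing for restart... "
    "Rewinding a model-month... "
    "Error: Restart files for dataout/restart.month not found"
)

print(summarize(sample))
```

A log that shows a day rewind followed by a month rewind and then a missing `dataout/restart.month` matches the pattern reported here; it says nothing about the underlying cause.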
Joined: 6 Aug 04 Posts: 65 Credit: 1,605,224 RAC: 0 |
What did you do to start this problem astoWX? Which boinc version? Was it a reboot, perhaps accidental improper model shut down, etc? Just curious so that I don't accidentally create a similar problem here. Is this mainly with the Sulphur models? They seem more sensitive to me, I think I've crashed a couple myself somehow. Thx Dave |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
What did you do to start this problem astoWX? Which boinc version? Was it a reboot, perhaps accidental improper model shut down, etc? Just curious so that I don't accidentally create a similar problem here. After two SC Beta Models (4.43/4.18) finished on that machine, Beta was stopped (two 4.18 Beta Models remain to be crunched, and I may return to them despite the issues with 4.28). Went to HacSM3 in a separate folder (CC 4.19/SM 4.13) to restart an old run in early Phase 1. Each Trickle generated a new Host ID and no credit was posted to the new IDs for those trickles. Figuring the Data Base was hosed again, I deleted middle IDs and merged the oldest and newest, then forced an error on the old Run. That prompted D/L of the Sulfur Model and the renewal of an old error, posted above. (Time out; must restart Dbox to verify SC & CC versions.) CC 4.19/SC 4.21. (4.19 is okay for SC except that it doesn't U/L results after each Phase.)
One failure during testing also showed a message that it couldn't sort lock files. I find it hard to believe that there is a connection between messing with the old Run and Host IDs and failure of a new Model to start -- especially when new client modules were also D/L (and when this problem was seen before). (Edited to eliminate .GT. & .LT. signs that I foolishly thought would be handled with either code or quote BBcode.) |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Once again I have SC Models failing to start. One is in Phase 2 and was running with HT off (on a P4 2.8, SuSE 9.0). Restored HT and set Global Prefs to use 2 CPUs. "Waiting for model startup, this may take a minute..." but it failed to start. Pulled the plug before it timed out and went belly up. Retried with same result (posted below.) Reset Global Prefs for one CPU and retried. Similar failure. (Posted below.) The failure has occurred on four different machines, from SC Beta to today. [Edit: This failure is not on the same machine as the original post in this thread.] If anyone has a solution for this... jim@Abox:~/CPDNboinc/BOINC> ./run_client 2005-09-26 09:10:54 [---] Starting BOINC client version 4.43 for i686-pc-linux-gnu 2005-09-26 09:10:54 [---] Data directory: /home/jim/CPDNboinc/BOINC 2005-09-26 09:10:54 [climateprediction.net] Computer ID: 144113; location: home; project prefs: home 2005-09-26 09:10:54 [---] General prefs: from climateprediction.net (last modified 2005-08-26 12:25:05) 2005-09-26 09:10:54 [---] General prefs: using separate prefs for home 2005-09-26 09:10:54 [---] Remote control not allowed; using loopback address 2005-09-26 09:10:54 [climateprediction.net] Resuming computation for result 467h_c00294941_0 using sulphur_cycle version 4.21 2005-09-26 09:10:54 [climateprediction.net] Resuming computation for result 45qb_b00294323_0 using sulphur_cycle version 4.21 2005-09-26 09:10:54 [---] Computer is overcommitted 2005-09-26 09:10:54 [---] New work fetch policy: no work fetch allowed. 2005-09-26 09:10:54 [---] New CPU scheduler policy: earliest deadline first. 2005-09-26 09:10:54 [---] schedule_cpus: must schedule 2005-09-26 09:10:54 [---] earliest deadline: 1139902673.000000 467h_c00294941_0 2005-09-26 09:10:54 [---] earliest deadline: 1140710093.000000 45qb_b00294323_0 Starting model in /home/jim/CPDNboinc/BOINC/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc... 
Created shared memory region key = 26480 Starting model in /home/jim/CPDNboinc/BOINC/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc... Created shared memory region key = 26620 .so shmem return code = 136457540 .so shmem return code = 136457540 Copying files for startup... Starting model ID 45qb_b00294323 Phase 1 Waiting for model startup, this may take a minute... Starting model ID 467h_c00294941 Phase 2 Waiting for model startup, this may take a minute... 2005-09-26 09:13:25 [---] Received signal 2 2005-09-26 09:13:25 [---] Exit requested by user 2005-09-26 09:13:26 [---] request_reschedule_cpus: exit_tasks jim@Abox:~/CPDNboinc/BOINC> ./run_client 2005-09-26 09:19:19 [---] Starting BOINC client version 4.43 for i686-pc-linux-gnu 2005-09-26 09:19:19 [---] Data directory: /home/jim/CPDNboinc/BOINC 2005-09-26 09:19:19 [climateprediction.net] Computer ID: 144113; location: home; project prefs: home 2005-09-26 09:19:19 [---] General prefs: from climateprediction.net (last modified 2005-08-26 12:25:05) 2005-09-26 09:19:19 [---] General prefs: using separate prefs for home 2005-09-26 09:19:19 [---] Remote control not allowed; using loopback address 2005-09-26 09:19:19 [climateprediction.net] Resuming computation for result 467h_c00294941_0 using sulphur_cycle version 4.21 2005-09-26 09:19:19 [climateprediction.net] Deferring computation for result 45qb_b00294323_0 2005-09-26 09:19:19 [---] Computer is overcommitted 2005-09-26 09:19:19 [---] Nearly overcommitted. 2005-09-26 09:19:19 [---] New work fetch policy: no work fetch allowed. 2005-09-26 09:19:19 [---] New CPU scheduler policy: earliest deadline first. 2005-09-26 09:19:19 [---] schedule_cpus: must schedule 2005-09-26 09:19:19 [---] earliest deadline: 1139902673.000000 467h_c00294941_0 Starting model in /home/jim/CPDNboinc/BOINC/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc... 
Created shared memory region key = 26480 .so shmem return code = 136457540 Starting model ID 467h_c00294941 Phase 2 Waiting for model startup, this may take a minute... 2005-09-26 09:20:20 [---] Received signal 2 2005-09-26 09:20:20 [---] Exit requested by user |
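The "earliest deadline" values in the client log above look like Unix epoch seconds. Assuming that reading is right, a short Python sketch can turn them into readable UTC dates; the helper name `deadlines` and the regular expression are illustrative choices, not anything taken from the BOINC source.

```python
import re
from datetime import datetime, timezone

# Matches lines such as "earliest deadline: 1139902673.000000 467h_c00294941_0".
DEADLINE = re.compile(r"earliest deadline: (\d+)\.\d+ (\S+)")


def deadlines(log_text):
    """Map each result name to its deadline as an ISO 8601 UTC string.

    Assumes the logged number is seconds since the Unix epoch.
    `deadlines` is a hypothetical helper for illustration only.
    """
    return {
        name: datetime.fromtimestamp(int(secs), tz=timezone.utc).isoformat()
        for secs, name in DEADLINE.findall(log_text)
    }


# Two lines copied from the client output in the post.
log = (
    "2005-09-26 09:10:54 [---] earliest deadline: 1139902673.000000 467h_c00294941_0 "
    "2005-09-26 09:10:54 [---] earliest deadline: 1140710093.000000 45qb_b00294323_0"
)

for name, when in deadlines(log).items():
    print(name, when)
```

Under that assumption both deadlines fall in February 2006, several months after the log was written in September 2005.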
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Ran two simultaneous copies of Prime95 Torture Test. As expected, no problems. (The error isn't one which points to hardware.) Copy #1: Torture Test ran 5 hours, 18 minutes - 0 errors, 0 warnings. Copy #2: Torture Test ran 5 hours, 18 minutes - 0 errors, 0 warnings. Given that I seem to be the only one in the world experiencing this problem, per the "plethora" of suggestions since the original post on 13 September, this box will join my growing inventory of idle hardware, all thanks to this sorry state of affairs. Pathetic waste of machinery. Next step, failing a solution from on high, will be reset_project, with the loss of three SC Models, one in Phase 2. Sigh. What a waste. |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
Sorry Jim, I haven't seen this one. But then I've typically been running a single model. I have two sulphur "gold" models running in Linux on a dual Xeon at work, but other than the benchmark crash, I've not tried to stop and restart the runs. Now you've got me scared of even trying to stop them. |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Hi, George, I'd be tempted to think it's another SuSE issue except that my WinXP box was idled, for what seems like two weeks, for failure to start either a SC or Spinup Model. Edit: They all fail at the same place. Until then, I thought it might be a client_state.xml issue but my SC and Spinup Models run from different Folders/Directories, completely independent of each other, hence different copies of client_state.xml. (Convenient, except for having to un-install/re-install boinc in XP.) It's one thing when a new Model won't start but even more strange when one won't restart. Curious. Jim |
Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0 |
Hi, I'm currently running 2 gold SC models (in phase 2) and they're both stable as a rock. I stop my machine 2h/day everyday by simply shutting off the computer and everything is OK when I start it again. Suse 9.2, CC4.43. Arnaud |
©2024 cpdn.org