Thread 'SC Model won't start'

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 15964 - Posted: 13 Sep 2005, 9:39:04 UTC
Last modified: 13 Sep 2005, 9:53:52 UTC

Some of us saw this in Alpha or Beta (or CMspinup?):
Starting model in /home/jim/CPDNboinc/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc...
Created shared memory region key = 26580
.so shmem return code = 136457540
Copying files for startup...
Starting model ID 482y_a00297370   Phase 1
Waiting for model startup, this may take a minute...
Model timeout at 180.00 seconds
Preparing for restart...
Rewinding a model-day...
Starting model ID 482y_a00297370   Phase 1
Waiting for model startup, this may take a minute...

And waiting, and waiting, and waiting ...
If it does as before, it will fail because it can't find restart files.

(P4 3.0, SuSE 9.0)

Edit:

As expected:

Waiting for model startup, this may take a minute...
Model timeout at 180.00 seconds
Preparing for restart...
Rewinding a model-month...
Error: Restart files for dataout/restart.month not found
Giving up, this result exceeded crash count for available restart files.
2005-09-13 02:45:19 [climateprediction.net] Computation for result 482y_a00297370 finished
2005-09-13 02:45:19 [climateprediction.net] Unrecoverable error for result 482y_a00297370_0 (Output file error: -1)
2005-09-13 02:45:19 [climateprediction.net] Unrecoverable error for result 482y_a00297370_0 (Output file error: -1)
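
The failure sequence in the output above (timeout, rewind, missing restart files) can be picked out of a saved copy of the client output with a small script. This is only a minimal sketch: the patterns are copied from the messages quoted in this thread, other client or model versions may word them differently, and the function name `summarize_startup` is made up for illustration.

```python
import re

# Patterns taken from the messages quoted in this thread (assumption:
# other client/model versions may phrase these lines differently).
TIMEOUT = re.compile(r"Model timeout at ([\d.]+) seconds")
REWIND = re.compile(r"Rewinding a model-(\w+)\.\.\.")
MISSING = re.compile(r"Error: Restart files for (\S+) not found")


def summarize_startup(lines):
    """Count timeouts, list rewind steps, and collect missing restart files."""
    summary = {"timeouts": 0, "rewinds": [], "missing_restart": []}
    for line in lines:
        if TIMEOUT.search(line):
            summary["timeouts"] += 1
        match = REWIND.search(line)
        if match:
            summary["rewinds"].append(match.group(1))  # e.g. "day" or "month"
        match = MISSING.search(line)
        if match:
            summary["missing_restart"].append(match.group(1))
    return summary
```

Feeding it the lines of a captured log (for example `summarize_startup(open("client.log"))`, where `client.log` is a hypothetical file name) returns a dictionary showing how many timeouts occurred and which restart files were reported missing.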

Another Model was D/L & started; it began to play the same game so: CTRL-C.

Shutting down another box -- nothing to run. That makes two of my five...
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 15964
old_user248
Joined: 6 Aug 04
Posts: 65
Credit: 1,605,224
RAC: 0
Message 15966 - Posted: 13 Sep 2005, 10:01:24 UTC

What did you do to trigger this problem, astroWX? Which BOINC version? Was it a reboot, perhaps an accidental improper model shutdown, etc.? Just curious, so that I don't accidentally create a similar problem here.

Is this mainly with the Sulphur models? They seem more sensitive to me; I think I've crashed a couple myself somehow.

Thx
Dave
ID: 15966
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 15973 - Posted: 13 Sep 2005, 15:54:51 UTC - in response to Message 15966.  
Last modified: 13 Sep 2005, 16:05:40 UTC

After two SC Beta Models (4.43/4.18) finished on that machine, Beta was stopped (two 4.18 Beta Models remain to be crunched, and I may return to them despite the issues with 4.28). Went to HadSM3 in a separate folder (CC 4.19/SM 4.13) to restart an old run in early Phase 1.

Each Trickle generated a new Host ID and no credit was posted to the new IDs for those trickles. Figuring the database was hosed again, I deleted the middle IDs and merged the oldest and newest, then forced an error on the old Run. That prompted D/L of the Sulfur Model and the renewal of an old error, posted above. (Time out; must restart Dbox to verify SC & CC versions.) CC 4.19/SC 4.21. (4.19 is okay for SC except that it doesn't U/L results after each Phase.)


core_client_version 4.19 /core_client_version
stderr_txt

/stderr_txt
message Output file error: -1 /message
active_task_state 2 /active_task_state
signal 0 /signal
upload_error
file_name 482y_a00297370_0_1.zip /file_name
error_code -1 /error_code
/upload_error
upload_error
file_name 482y_a00297370_0_2.zip /file_name
error_code -1 /error_code
/upload_error
upload_error
file_name 482y_a00297370_0_3.zip /file_name
error_code -1 /error_code
/upload_error
upload_error
file_name 482y_a00297370_0_4.zip /file_name
error_code -1 /error_code
/upload_error
upload_error
file_name 482y_a00297370_0_5.zip /file_name
error_code -1 /error_code
/upload_error


One failure during testing also showed a message that it couldn't sort lock files.

I find it hard to believe that there is a connection between messing with the old Run and Host IDs and the failure of a new Model to start -- especially when new client modules were also D/L (and when this problem was seen before).

(Edited to eliminate .GT. & .LT. signs that I foolishly thought would be handled with either code or quote BBcode.)
ID: 15973
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 16271 - Posted: 26 Sep 2005, 16:41:06 UTC
Last modified: 26 Sep 2005, 16:46:36 UTC

Once again I have SC Models failing to start. One is in Phase 2 and was running with HT off (on a P4 2.8, SuSE 9.0). Restored HT and set Global Prefs to use 2 CPUs. "Waiting for model startup, this may take a minute..." but it failed to start. Pulled the plug before it timed out and went belly up. Retried with the same result (posted below).

Reset Global Prefs for one CPU and retried. Similar failure. (Posted below.)

The failure has occurred on four different machines, from SC Beta to today. [Edit: This failure is not on the same machine as the original post in this thread.] If anyone has a solution for this...

jim@Abox:~/CPDNboinc/BOINC> ./run_client
2005-09-26 09:10:54 [---] Starting BOINC client version 4.43 for i686-pc-linux-gnu
2005-09-26 09:10:54 [---] Data directory: /home/jim/CPDNboinc/BOINC
2005-09-26 09:10:54 [climateprediction.net] Computer ID: 144113; location: home; project prefs: home
2005-09-26 09:10:54 [---] General prefs: from climateprediction.net (last modified 2005-08-26 12:25:05)
2005-09-26 09:10:54 [---] General prefs: using separate prefs for home
2005-09-26 09:10:54 [---] Remote control not allowed; using loopback address
2005-09-26 09:10:54 [climateprediction.net] Resuming computation for result 467h_c00294941_0 using sulphur_cycle version 4.21
2005-09-26 09:10:54 [climateprediction.net] Resuming computation for result 45qb_b00294323_0 using sulphur_cycle version 4.21
2005-09-26 09:10:54 [---] Computer is overcommitted
2005-09-26 09:10:54 [---] New work fetch policy: no work fetch allowed.
2005-09-26 09:10:54 [---] New CPU scheduler policy: earliest deadline first.
2005-09-26 09:10:54 [---] schedule_cpus: must schedule
2005-09-26 09:10:54 [---] earliest deadline: 1139902673.000000 467h_c00294941_0
2005-09-26 09:10:54 [---] earliest deadline: 1140710093.000000 45qb_b00294323_0
Starting model in /home/jim/CPDNboinc/BOINC/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc...
Created shared memory region key = 26480
Starting model in /home/jim/CPDNboinc/BOINC/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc...
Created shared memory region key = 26620
.so shmem return code = 136457540
.so shmem return code = 136457540
Copying files for startup...
Starting model ID 45qb_b00294323 Phase 1
Waiting for model startup, this may take a minute...
Starting model ID 467h_c00294941 Phase 2
Waiting for model startup, this may take a minute...
2005-09-26 09:13:25 [---] Received signal 2
2005-09-26 09:13:25 [---] Exit requested by user
2005-09-26 09:13:26 [---] request_reschedule_cpus: exit_tasks
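
The `earliest deadline` values in the client output above look like Unix epoch seconds. That reading is an assumption based on their magnitude, not something the log states. Under that assumption, a minimal sketch for turning such a line into a readable UTC time follows; the helper name `parse_deadline` is made up for illustration.

```python
from datetime import datetime, timezone


def parse_deadline(line):
    """Return (utc_datetime, result_name) for an 'earliest deadline' line, else None.

    Assumes the number after the marker is seconds since the Unix epoch.
    """
    marker = "earliest deadline:"
    if marker not in line:
        return None
    seconds, result = line.split(marker, 1)[1].split()
    return datetime.fromtimestamp(float(seconds), tz=timezone.utc), result
```

Read this way, both deadlines shown above fall in February 2006, several months after the log was captured, which would be consistent with the scheduler lines reporting deadline pressure on long-running models rather than an imminent cutoff.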

jim@Abox:~/CPDNboinc/BOINC> ./run_client
2005-09-26 09:19:19 [---] Starting BOINC client version 4.43 for i686-pc-linux-gnu
2005-09-26 09:19:19 [---] Data directory: /home/jim/CPDNboinc/BOINC
2005-09-26 09:19:19 [climateprediction.net] Computer ID: 144113; location: home; project prefs: home
2005-09-26 09:19:19 [---] General prefs: from climateprediction.net (last modified 2005-08-26 12:25:05)
2005-09-26 09:19:19 [---] General prefs: using separate prefs for home
2005-09-26 09:19:19 [---] Remote control not allowed; using loopback address
2005-09-26 09:19:19 [climateprediction.net] Resuming computation for result 467h_c00294941_0 using sulphur_cycle version 4.21
2005-09-26 09:19:19 [climateprediction.net] Deferring computation for result 45qb_b00294323_0
2005-09-26 09:19:19 [---] Computer is overcommitted
2005-09-26 09:19:19 [---] Nearly overcommitted.
2005-09-26 09:19:19 [---] New work fetch policy: no work fetch allowed.
2005-09-26 09:19:19 [---] New CPU scheduler policy: earliest deadline first.
2005-09-26 09:19:19 [---] schedule_cpus: must schedule
2005-09-26 09:19:19 [---] earliest deadline: 1139902673.000000 467h_c00294941_0
Starting model in /home/jim/CPDNboinc/BOINC/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc...
Created shared memory region key = 26480
.so shmem return code = 136457540
Starting model ID 467h_c00294941 Phase 2
Waiting for model startup, this may take a minute...
2005-09-26 09:20:20 [---] Received signal 2
2005-09-26 09:20:20 [---] Exit requested by user
ID: 16271
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 16281 - Posted: 27 Sep 2005, 0:54:09 UTC

Ran two simultaneous copies of Prime95 Torture Test. As expected, no problems. (The error isn't one which points to hardware.)

Copy #1:
Torture Test ran 5 hours, 18 minutes - 0 errors, 0 warnings.
Copy #2:
Torture Test ran 5 hours, 18 minutes - 0 errors, 0 warnings.

Given that I seem to be the only one in the world experiencing this problem, per the "plethora" of suggestions since the original post on 13 September, this box will join my growing inventory of idle hardware, all thanks to this sorry state of affairs. Pathetic waste of machinery.

Next step, failing a solution from on high, will be reset_project, with the loss of three SC Models, one in Phase 2. Sigh. What a waste.
ID: 16281
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 16282 - Posted: 27 Sep 2005, 2:22:10 UTC

Sorry Jim, I haven't seen this one. But then I've typically been running a single model. I have two sulphur "gold" models running in Linux on a dual Xeon at work, but other than the benchmark crash, I've not tried to stop and restart the runs. Now you've got me scared of even trying to stop them.
ID: 16282
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 16293 - Posted: 27 Sep 2005, 15:57:58 UTC
Last modified: 27 Sep 2005, 15:59:26 UTC

Hi, George,

I'd be tempted to think it's another SuSE issue except that my WinXP box was idled, for what seems like two weeks, for failure to start either an SC or a Spinup Model. Edit: They all fail at the same place.

Until then, I thought it might be a client_state.xml issue but my SC and Spinup Models run from different Folders/Directories, completely independent of each other, hence different copies of client_state.xml. (Convenient, except for having to un-install/re-install BOINC in XP.)

It's one thing when a new Model won't start but even more strange when one won't restart. Curious.

Jim
ID: 16293
Arnaud
Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 16297 - Posted: 27 Sep 2005, 18:53:57 UTC

Hi,
I'm currently running 2 gold SC models (in phase 2) and they're both stable as a rock.
I stop my machine 2h/day every day by simply shutting off the computer and everything is OK when I start it again.
Suse 9.2, CC4.43.
Arnaud
ID: 16297
