Thread 'Cannot get a model to run'

Author	Message
old_user6988 Send message Joined: 31 Aug 04 Posts: 9 Credit: 402,104 RAC: 0	Message 2780 - Posted: 3 Sep 2004, 0:49:06 UTC Just setup for CPDN, running fine for SAH. Downloads some data unzips, etc. and crashes. This is what I get: 2004-09-02 20:45:00 [---] Insufficient work; requesting more 2004-09-02 20:45:00 [climateprediction.net] Requesting 15494 seconds of work 2004-09-02 20:45:01 [climateprediction.net] Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi 2004-09-02 20:45:02 [climateprediction.net] Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded 2004-09-02 20:45:02 [climateprediction.net] Started download of 1pmo_000100690.zip 2004-09-02 20:45:03 [climateprediction.net] Finished download of 1pmo_000100690.zip 2004-09-02 20:45:03 [climateprediction.net] Throughput 22322 bytes/sec 2004-09-02 20:45:03 [climateprediction.net] Starting result 1pmo_000100690_0 using hadsm3 version 4.03 Starting model in /Applications/boinc/projects/climateprediction.net... 
Archive: hadsm3data_4.03_powerpc-apple-darwin.zip creating: 1pmo_000100690/datain/ creating: 1pmo_000100690/datain/ancil/ creating: 1pmo_000100690/datain/ancil/ctldata/ inflating: 1pmo_000100690/datain/ancil/ctldata/spec3a_lw_3_asol2c_hadcm3 inflating: 1pmo_000100690/datain/ancil/ctldata/spec3a_sw_3_asol2b_hadcm3 creating: 1pmo_000100690/datain/ancil/ctldata/stasets/ inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01001218 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01002207 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003236 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003237 extracting: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003254 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003255 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003274 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003275 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003276 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003277 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003278 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003279 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003280 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003281 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003286 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01005207 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01005208 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01005222 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01005223 inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01010206 creating: 1pmo_000100690/datain/ancil/ctldata/STASHmaster/ inflating: 1pmo_000100690/datain/ancil/ctldata/STASHmaster/STASHmaster_A inflating: 1pmo_000100690/datain/ancil/ctldata/STASHmaster/STASHmaster_O inflating: 1pmo_000100690/datain/ancil/ctldata/STASHmaster/STASHmaster_S inflating: 
1pmo_000100690/datain/ancil/ctldata/STASHmaster/STASHmaster_W inflating: 1pmo_000100690/datain/ancil/qrclim.icedp.32 inflating: 1pmo_000100690/datain/ancil/qrclim.newsst5.32 inflating: 1pmo_000100690/datain/ancil/qrclim.ozone_preind_corr inflating: 1pmo_000100690/datain/ancil/qrclim.uvcurr.32 creating: 1pmo_000100690/datain/dumps/ inflating: 1pmo_000100690/datain/dumps/slab32_1810.start inflating: 1pmo_000100690/datain/lats inflating: 1pmo_000100690/datain/ppcodes creating: 1pmo_000100690/dataout/ extracting: 1pmo_000100690/dataout/thist creating: 1pmo_000100690/jobs/ inflating: 1pmo_000100690/jobs/control.stashc inflating: 1pmo_000100690/jobs/double.stashc inflating: 1pmo_000100690/jobs/Recona.12 inflating: 1pmo_000100690/jobs/Recona.13 inflating: 1pmo_000100690/jobs/spec3a_lw_3_asol2c_hadcm3 inflating: 1pmo_000100690/jobs/spec3a_sw_3_asol2b_hadcm3 inflating: 1pmo_000100690/jobs/spin.stashc inflating: 1pmo_000100690/jobs/yabsd.ihist inflating: 1pmo_000100690/jobs/yabsd.PRESM_A extracting: 1pmo_000100690/jobs/yabsd.PRESM_O extracting: 1pmo_000100690/jobs/yabsd.PRESM_S extracting: 1pmo_000100690/jobs/yabsd.PRESM_W inflating: 1pmo_000100690/jobs/yabsd.stashc inflating: 1pmo_000100690/registration_license.txt extracting: 1pmo_000100690/stderr_um.txt extracting: 1pmo_000100690/stdout_um.txt creating: 1pmo_000100690/tmp/ extracting: 1pmo_000100690/tmp/pipe_dummy creating: 1pmo_000100690/viz/ inflating: 1pmo_000100690/viz/globe.rgb Archive: 1pmo_000100690.zip inflating: 1pmo_000100690/jobs/climate.spin inflating: 1pmo_000100690/jobs/climate.cont inflating: 1pmo_000100690/jobs/climate.doub inflating: 1pmo_000100690/jobs/ncatts.cpdc inflating: 1pmo_000100690/jobs/spec3a_sw_3_asol2b_hadcm3 Created shared memory region key = 25565 Env Used=DYLD_LIBRARY_PATH=/Applications/boinc/projects/climateprediction.net:../ Copying files for startup... 
In pre_initialise_phase (part 1 of 3) In initialise_phase (part 2 of 3) In startup_phase (part 3 of 3) Starting model ID 1pmo_000100690 Phase 1 Waiting for model startup, this may take a minute... Stack size=48.00 MB 1pmo_000100690 - PH 1 TS 000001 - 00/00/0000 00:00 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00 Model crashed...retrying...restart level 0 Preparing for restart... Rewinding a model-day... Starting model ID 1pmo_000100690 Phase 1 Stack size=48.00 MB Waiting for model startup, this may take a minute... 1pmo_000100690 - PH 1 TS 000001 - 00/00/0000 00:00 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00 Model crashed...retrying...restart level 1 Preparing for restart... Rewinding a model-month... Error: Restart files for dataout/restart.month not found Giving up, this result exceeded crash count for available restart files. adding: ncatts.cpdc (deflated 72%) adding: climate.cont (deflated 79%) adding: climate.cpdc (deflated 79%) adding: climate.doub (deflated 78%) adding: climate.spin (deflated 79%) adding: 1pmo_000100690.xml (deflated 65%) adding: ncatts.cpdc (deflated 72%) adding: ncatts.cpdc (deflated 72%) adding: ncatts.cpdc (deflated 72%) adding: stderr_um.txt (stored 0%) adding: yabsd.out (deflated 93%) adding: restart.day (deflated 43%) 2004-09-02 20:45:14 [climateprediction.net] Unrecoverable error for result 1pmo_000100690_0 (process exited with code 251 (0xfb)) 2004-09-02 20:45:14 [climateprediction.net] Unrecoverable error for result 1pmo_000100690_0 (process exited with code 251 (0xfb)) 2004-09-02 20:45:14 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds 2004-09-02 20:45:14 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds 2004-09-02 20:45:14 [climateprediction.net] Computation for result 1pmo_000100690 finished 2004-09-02 20:45:14 [climateprediction.net] Started upload of 1pmo_000100690_0_1.zip 2004-09-02 20:45:14 [climateprediction.net] Started upload of 1pmo_000100690_0_2.zip 
2004-09-02 20:45:15 [climateprediction.net] Finished upload of 1pmo_000100690_0_1.zip 2004-09-02 20:45:15 [climateprediction.net] Throughput 3766 bytes/sec 2004-09-02 20:45:15 [climateprediction.net] Started upload of 1pmo_000100690_0_3.zip 2004-09-02 20:45:16 [climateprediction.net] Finished upload of 1pmo_000100690_0_2.zip 2004-09-02 20:45:16 [climateprediction.net] Throughput 102980 bytes/sec 2004-09-02 20:45:16 [climateprediction.net] Started upload of 1pmo_000100690_0_4.zip 2004-09-02 20:45:16 [climateprediction.net] Finished upload of 1pmo_000100690_0_3.zip 2004-09-02 20:45:16 [climateprediction.net] Throughput 3037 bytes/sec 2004-09-02 20:45:16 [climateprediction.net] Started upload of 1pmo_000100690_0_5.zip 2004-09-02 20:45:17 [climateprediction.net] Finished upload of 1pmo_000100690_0_4.zip 2004-09-02 20:45:17 [climateprediction.net] Throughput 3731 bytes/sec 2004-09-02 20:45:18 [climateprediction.net] Finished upload of 1pmo_000100690_0_5.zip 2004-09-02 20:45:18 [climateprediction.net] Throughput 117367 bytes/sec ID: 2780 · Reply Quote
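For anyone comparing many machines, the failed results and their exit codes can be pulled out of a saved copy of the client messages. The following is only a rough sketch: it assumes the log text has the same "Unrecoverable error for result ..." wording shown in the post above, and the name `failed_results` is made up for illustration and is not part of BOINC.

```python
import re

# Matches client messages of the form seen above, e.g.
# "[climateprediction.net] Unrecoverable error for result 1pmo_000100690_0
#  (process exited with code 251 (0xfb))"
ERROR_LINE = re.compile(
    r"\[(?P<project>[^\]]+)\] Unrecoverable error for result "
    r"(?P<result>\S+) \(process exited with code (?P<code>\d+)"
)


def failed_results(log_text):
    """Return {result_name: exit_code}; repeated log lines collapse to one entry."""
    failures = {}
    for match in ERROR_LINE.finditer(log_text):
        failures[match.group("result")] = int(match.group("code"))
    return failures


if __name__ == "__main__":
    sample = (
        "2004-09-02 20:45:14 [climateprediction.net] Unrecoverable error for "
        "result 1pmo_000100690_0 (process exited with code 251 (0xfb)) "
        "2004-09-02 20:45:14 [climateprediction.net] Unrecoverable error for "
        "result 1pmo_000100690_0 (process exited with code 251 (0xfb))"
    )
    for name, code in failed_results(sample).items():
        print(f"{name}: exit code {code} ({hex(code)})")
```

The regular expression does not rely on line breaks, so it also works on a log that has been pasted as one long line.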

old_user1204 Send message Joined: 25 Aug 04 Posts: 5 Credit: 103,128 RAC: 0	Message 2785 - Posted: 3 Sep 2004, 2:18:11 UTC Last modified: 3 Sep 2004, 4:03:24 UTC Looks like the model failed right out of the gate. The messages do show that it uploaded the failure results back, so in a little while your log should show BOINC downloading another model to run. My question is: does the new model run, or does it fail in the same manner? Also check the stderr_um.txt file for the failed model to see if it gave any error messages. It should be in <Dir-where-BOINC-is>/projects/climateprediction.net/1pmo_000100690/stderr_um.txt; that might give a clue. BCNU, Vance ID: 2785 · Reply Quote
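If the file is hard to track down by hand, a short script can list every stderr_um.txt under the project folder. This is a minimal sketch, assuming the layout described in this thread (one folder per model run under projects/climateprediction.net); `find_model_stderr` is a hypothetical helper name, and /Applications/boinc is simply the install path that appears in the first post's log, so substitute your own.

```python
from pathlib import Path


def find_model_stderr(boinc_dir, project="climateprediction.net"):
    """Return sorted paths of stderr_um.txt files, one level below the project dir."""
    project_dir = Path(boinc_dir) / "projects" / project
    if not project_dir.is_dir():
        return []
    # Each model run is assumed to live in its own folder, e.g. 1pmo_000100690/
    return sorted(project_dir.glob("*/stderr_um.txt"))


if __name__ == "__main__":
    # Path taken from the log in the first post; adjust for your machine.
    for path in find_model_stderr("/Applications/boinc"):
        size = path.stat().st_size
        print(f"{path} ({size} bytes)")
```

An empty list may just mean the run folder has already been cleaned up after the failed result was uploaded, and a size of 0 bytes means the model wrote nothing to the file.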

Thyme Lawn Volunteer moderator Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0	Message 2859 - Posted: 3 Sep 2004, 12:48:14 UTC Hi drweaser, Your problem might be related to a Visual Fortran error that's been afflicting the Windows build recently. It seems that some workunits have gone out with a duff file. Check out this thread on the phpBB forum (http://www.climateprediction.net/board/viewtopic.php?t=2296&p=20006#20006). And thanks to sjokela for doing the investigative work and UK_Nick for providing a link to the file that gives a workaround for the problem :) ID: 2859 · Reply Quote

old_user6711 Send message Joined: 31 Aug 04 Posts: 5 Credit: 349,932 RAC: 0	Message 2918 - Posted: 3 Sep 2004, 19:13:42 UTC - in response to Message 2785. I am having this same problem on both Mac OS X and Linux boxes. All models fail in this way. It is as if they are all corrupt or something. stderr.txt says 1525-092 ID: 2918 · Reply Quote

old_user9662 Send message Joined: 2 Sep 04 Posts: 29 Credit: 13,100 RAC: 0	Message 2942 - Posted: 4 Sep 2004, 1:04:43 UTC - in response to Message 2785. > Looks like the model failed right out of the gate. The messages do show > that it uploaded the failure results back. So in a little while your log > should show boinc downloading another model to run. My question is does the > new model run or does it fail in the same manner? > Also check the stderr_um.txt file for the failed model to see if it gave > any error messages. Should be in > <Dir-where-BOINC-is>/projects/climateprediction.net/1pmo_000100690/stderr_um.txt, > that might give a clue. > BCNU, > Vance > I am having the same problem also - my log file in Terminal looks almost exactly like the one from drweaser. I had this problem earlier today, and just tried again a few minutes ago. BOINC downloaded another set of data, tried three more times, and then gave up again, so the new model fails also. I tried to find a stderr_um.txt file, but can't find one anywhere on my computer. I'm running a 1 GHz G4 PowerBook with 1 GB RAM and a 60 GB hard drive with over 30 GB available, using OS 10.3.5. Nothing else was running when I tried to run the model. I read somewhere else in the forums that I should change the debt values in client_state.xml from 0.0000 to anything else. I tried it by changing them to 1.0000, but saw no improvement. Any help would be greatly appreciated. ID: 2942 · Reply Quote

old_user6988 Send message Joined: 31 Aug 04 Posts: 9 Credit: 402,104 RAC: 0	Message 3038 - Posted: 4 Sep 2004, 21:27:42 UTC Yep, definitely still having problems..... Anyone interested can check my results: drweaser (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?userid=6988). Didn't find that output file. Maybe it was already uploaded. Tried on ~30 machines, all Mac OS X 10.3.5 with 384M+ RAM and 40G+ HD space, doing nothing else right now since neither predictor nor seti is sending out work. ID: 3038 · Reply Quote

old_user9824 Send message Joined: 2 Sep 04 Posts: 1 Credit: 28,185 RAC: 0	Message 3143 - Posted: 6 Sep 2004, 12:48:50 UTC - in response to Message 2859. Last modified: 6 Sep 2004, 12:51:49 UTC > Hi drweaser, > > Your problem might be related to a Visual Fortran error that's been > afflicting the windows build recently. Seems that some workunits have gone out > with a duff file. > No it doesn't. That problem was found to be caused by corrupt datafiles generated by the server. It is fixed now - and I still have this error on my two Macs as well. Just attached for the first time on these machines, so it cannot be anything old residual either... Hope the admins will look into this. ID: 3143 · Reply Quote

old_user6988 Send message Joined: 31 Aug 04 Posts: 9 Credit: 402,104 RAC: 0	Message 3144 - Posted: 6 Sep 2004, 13:13:30 UTC - in response to Message 3143. Judging from the returned results, about 6 of my 30 machines are now working. So there is something in the WUs that ain't right. Hopefully they will find it soon so that the WUs don't constantly get downloaded, rejected, and redownloaded. I wonder if they have taken the bad ones out of the pool, or not.....I seem to remember that Predictor had to do something like that with bad charmm units. Drweaser > > Hi drweaser, > > > > Your problem might be related to a Visual Fortran error that's > been > > afflicting the windows build recently. Seems that some workunits have > gone out > > with a duff file. > > > > No it doesn't. That problem was found to be caused by corrupt datafiles > generated by the server. It is fixed now - and I still have this error on my > two Macs as well. > Just attached for the first time on these machines, so it cannot be anything > old residual either... > > Hope the admins will look into this. ID: 3144 · Reply Quote

old_user6988 Send message Joined: 31 Aug 04 Posts: 9 Credit: 402,104 RAC: 0	Message 3153 - Posted: 6 Sep 2004, 15:23:16 UTC - in response to Message 3144. Now what the Hades is going on.... check this out http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=141230 now it can't figure out the host computer for this work unit.....something is bonkers somewhere..... drweaser > I have about 6 machines out of the 30 that are now working from the returned > results. So there is something in the WU's that ain't right. Hopefully they > will find it soon so that the WU's don't constantly get downloaded and rejected > and redownloaded. I wonder if they have taken the bad ones out of the pool, > or not.....I seem to remember that Predictor had to do something like that > with bad charmm units. > > Drweaser ID: 3153 · Reply Quote

old_user156 Send message Joined: 5 Aug 04 Posts: 186 Credit: 1,612,182 RAC: 0	Message 3154 - Posted: 6 Sep 2004, 15:34:58 UTC - in response to Message 3153. > Now what the Hades is going on.... > > check this out > > http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=141230 > > now it can't figure out the host computer for this work unit.....something is > bonkers somewhere..... Heh, a negative computer. :-) ID: 3154 · Reply Quote

old_user1 Send message Joined: 5 Aug 04 Posts: 907 Credit: 299,864 RAC: 0	Message 3169 - Posted: 6 Sep 2004, 18:21:51 UTC - in response to Message 3154. I think I've corrected a lot of the oddities such as the "negative" computer. The problem is that there seem to be some "orphan" runs out there: a problem with the BOINC "feeder" (which has since been fixed) meant that some runs were sent out with incorrect parent records, so there are a bunch of runs out there that will not trickle or update properly. If your machine has one, you're probably better off resetting and attaching so that you get a "proper" run. ID: 3169 · Reply Quote

old_user6988 Send message Joined: 31 Aug 04 Posts: 9 Credit: 402,104 RAC: 0	Message 3176 - Posted: 6 Sep 2004, 19:25:59 UTC - in response to Message 3169. So out of my 30 eMacs, 5 are currently running a model. Question 1: Should I reset those? They were the ones that had the negative computer. Question 2: Should I reset the remaining 25 machines, or simply let them keep trying to get a usable WU? At least now the trickles show up under my account, thanks for the fix! drweaser > I think I've corrected a lot of the oddities such as the "negative" computer. > The problem is there seems to be some "orphan" runs out there, i.e. a problem > with the BOINC "feeder" (which has since been fixed) means that some runs were > sent out with incorrect parent records, so there are a bunch of runs out there > that will not trickle or update properly. If your machine is one, you're > probably better off resetting and attaching so that you get a "proper" run. ID: 3176 · Reply Quote

old_user6988 Send message Joined: 31 Aug 04 Posts: 9 Credit: 402,104 RAC: 0	Message 3180 - Posted: 6 Sep 2004, 19:43:40 UTC Got a different error this time.....looks like some sort of memory problem.....maybe this will help somebody figure it out.... Starting model in /Applications/boinc/projects/climateprediction.net... Archive: hadsm3se_4.03_powerpc-apple-darwin.zip inflating: ./hadsm3se_4.03_powerpc-apple-darwin inflating: ./libxlf90.A.dylib ... inflating: 1sdl_100104286/jobs/climate.doub inflating: 1sdl_100104286/jobs/ncatts.cpdc inflating: 1sdl_100104286/jobs/spec3a_sw_3_asol2b_hadcm3 Could not create shared memory region 25780, 301228 2004-09-06 15:41:44 [climateprediction.net] Unrecoverable error for result 1sdl_100104286_1 (process exited with code 255 (0xff)) 2004-09-06 15:41:44 [climateprediction.net] Unrecoverable error for result 1sdl_100104286_1 (process exited with code 255 (0xff)) 2004-09-06 15:41:44 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds 2004-09-06 15:41:44 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds 2004-09-06 15:41:44 [climateprediction.net] Computation for result 1sdl_100104286 finished ID: 3180 · Reply Quote