Thread 'Cannot get a model to run'

Questions and Answers : Macintosh : Cannot get a model to run
old_user6988

Joined: 31 Aug 04
Posts: 9
Credit: 402,104
RAC: 0
Message 2780 - Posted: 3 Sep 2004, 0:49:06 UTC

Just set up for CPDN; BOINC is running fine for SAH. It downloads some data, unzips it, etc., and then crashes.

This is what I get:

2004-09-02 20:45:00 [---] Insufficient work; requesting more
2004-09-02 20:45:00 [climateprediction.net] Requesting 15494 seconds of work
2004-09-02 20:45:01 [climateprediction.net] Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
2004-09-02 20:45:02 [climateprediction.net] Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
2004-09-02 20:45:02 [climateprediction.net] Started download of 1pmo_000100690.zip
2004-09-02 20:45:03 [climateprediction.net] Finished download of 1pmo_000100690.zip
2004-09-02 20:45:03 [climateprediction.net] Throughput 22322 bytes/sec
2004-09-02 20:45:03 [climateprediction.net] Starting result 1pmo_000100690_0 using hadsm3 version 4.03
Starting model in /Applications/boinc/projects/climateprediction.net...
Archive: hadsm3data_4.03_powerpc-apple-darwin.zip
creating: 1pmo_000100690/datain/
creating: 1pmo_000100690/datain/ancil/
creating: 1pmo_000100690/datain/ancil/ctldata/
inflating: 1pmo_000100690/datain/ancil/ctldata/spec3a_lw_3_asol2c_hadcm3
inflating: 1pmo_000100690/datain/ancil/ctldata/spec3a_sw_3_asol2b_hadcm3
creating: 1pmo_000100690/datain/ancil/ctldata/stasets/
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01001218
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01002207
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003236
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003237
extracting: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003254
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003255
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003274
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003275
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003276
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003277
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003278
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003279
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003280
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003281
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01003286
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01005207
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01005208
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01005222
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01005223
inflating: 1pmo_000100690/datain/ancil/ctldata/stasets/X01010206
creating: 1pmo_000100690/datain/ancil/ctldata/STASHmaster/
inflating: 1pmo_000100690/datain/ancil/ctldata/STASHmaster/STASHmaster_A
inflating: 1pmo_000100690/datain/ancil/ctldata/STASHmaster/STASHmaster_O
inflating: 1pmo_000100690/datain/ancil/ctldata/STASHmaster/STASHmaster_S
inflating: 1pmo_000100690/datain/ancil/ctldata/STASHmaster/STASHmaster_W
inflating: 1pmo_000100690/datain/ancil/qrclim.icedp.32
inflating: 1pmo_000100690/datain/ancil/qrclim.newsst5.32
inflating: 1pmo_000100690/datain/ancil/qrclim.ozone_preind_corr
inflating: 1pmo_000100690/datain/ancil/qrclim.uvcurr.32
creating: 1pmo_000100690/datain/dumps/
inflating: 1pmo_000100690/datain/dumps/slab32_1810.start
inflating: 1pmo_000100690/datain/lats
inflating: 1pmo_000100690/datain/ppcodes
creating: 1pmo_000100690/dataout/
extracting: 1pmo_000100690/dataout/thist
creating: 1pmo_000100690/jobs/
inflating: 1pmo_000100690/jobs/control.stashc
inflating: 1pmo_000100690/jobs/double.stashc
inflating: 1pmo_000100690/jobs/Recona.12
inflating: 1pmo_000100690/jobs/Recona.13
inflating: 1pmo_000100690/jobs/spec3a_lw_3_asol2c_hadcm3
inflating: 1pmo_000100690/jobs/spec3a_sw_3_asol2b_hadcm3
inflating: 1pmo_000100690/jobs/spin.stashc
inflating: 1pmo_000100690/jobs/yabsd.ihist
inflating: 1pmo_000100690/jobs/yabsd.PRESM_A
extracting: 1pmo_000100690/jobs/yabsd.PRESM_O
extracting: 1pmo_000100690/jobs/yabsd.PRESM_S
extracting: 1pmo_000100690/jobs/yabsd.PRESM_W
inflating: 1pmo_000100690/jobs/yabsd.stashc
inflating: 1pmo_000100690/registration_license.txt
extracting: 1pmo_000100690/stderr_um.txt
extracting: 1pmo_000100690/stdout_um.txt
creating: 1pmo_000100690/tmp/
extracting: 1pmo_000100690/tmp/pipe_dummy
creating: 1pmo_000100690/viz/
inflating: 1pmo_000100690/viz/globe.rgb
Archive: 1pmo_000100690.zip
inflating: 1pmo_000100690/jobs/climate.spin
inflating: 1pmo_000100690/jobs/climate.cont
inflating: 1pmo_000100690/jobs/climate.doub
inflating: 1pmo_000100690/jobs/ncatts.cpdc
inflating: 1pmo_000100690/jobs/spec3a_sw_3_asol2b_hadcm3
Created shared memory region key = 25565
Env Used=DYLD_LIBRARY_PATH=/Applications/boinc/projects/climateprediction.net:../
Copying files for startup...
In pre_initialise_phase (part 1 of 3)
In initialise_phase (part 2 of 3)
In startup_phase (part 3 of 3)
Starting model ID 1pmo_000100690 Phase 1
Waiting for model startup, this may take a minute...
Stack size=48.00 MB
1pmo_000100690 - PH 1 TS 000001 - 00/00/0000 00:00 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00
Model crashed...retrying...restart level 0
Preparing for restart...
Rewinding a model-day...
Starting model ID 1pmo_000100690 Phase 1
Stack size=48.00 MB
Waiting for model startup, this may take a minute...
1pmo_000100690 - PH 1 TS 000001 - 00/00/0000 00:00 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00
Model crashed...retrying...restart level 1
Preparing for restart...
Rewinding a model-month...
Error: Restart files for dataout/restart.month not found
Giving up, this result exceeded crash count for available restart files.
adding: ncatts.cpdc (deflated 72%)
adding: climate.cont (deflated 79%)
adding: climate.cpdc (deflated 79%)
adding: climate.doub (deflated 78%)
adding: climate.spin (deflated 79%)
adding: 1pmo_000100690.xml (deflated 65%)
adding: ncatts.cpdc (deflated 72%)
adding: ncatts.cpdc (deflated 72%)
adding: ncatts.cpdc (deflated 72%)
adding: stderr_um.txt (stored 0%)
adding: yabsd.out (deflated 93%)
adding: restart.day (deflated 43%)
2004-09-02 20:45:14 [climateprediction.net] Unrecoverable error for result 1pmo_000100690_0 (process exited with code 251 (0xfb))
2004-09-02 20:45:14 [climateprediction.net] Unrecoverable error for result 1pmo_000100690_0 (process exited with code 251 (0xfb))
2004-09-02 20:45:14 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds
2004-09-02 20:45:14 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds
2004-09-02 20:45:14 [climateprediction.net] Computation for result 1pmo_000100690 finished
2004-09-02 20:45:14 [climateprediction.net] Started upload of 1pmo_000100690_0_1.zip
2004-09-02 20:45:14 [climateprediction.net] Started upload of 1pmo_000100690_0_2.zip
2004-09-02 20:45:15 [climateprediction.net] Finished upload of 1pmo_000100690_0_1.zip
2004-09-02 20:45:15 [climateprediction.net] Throughput 3766 bytes/sec
2004-09-02 20:45:15 [climateprediction.net] Started upload of 1pmo_000100690_0_3.zip
2004-09-02 20:45:16 [climateprediction.net] Finished upload of 1pmo_000100690_0_2.zip
2004-09-02 20:45:16 [climateprediction.net] Throughput 102980 bytes/sec
2004-09-02 20:45:16 [climateprediction.net] Started upload of 1pmo_000100690_0_4.zip
2004-09-02 20:45:16 [climateprediction.net] Finished upload of 1pmo_000100690_0_3.zip
2004-09-02 20:45:16 [climateprediction.net] Throughput 3037 bytes/sec
2004-09-02 20:45:16 [climateprediction.net] Started upload of 1pmo_000100690_0_5.zip
2004-09-02 20:45:17 [climateprediction.net] Finished upload of 1pmo_000100690_0_4.zip
2004-09-02 20:45:17 [climateprediction.net] Throughput 3731 bytes/sec
2004-09-02 20:45:18 [climateprediction.net] Finished upload of 1pmo_000100690_0_5.zip
2004-09-02 20:45:18 [climateprediction.net] Throughput 117367 bytes/sec
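For anyone scanning a long message log like the one above for the actual failure, here is a minimal Python 3 sketch that pulls out the "Unrecoverable error" lines. The regular expression is written against the line format shown in this log only; the function name `find_failures` is made up for illustration and is not part of BOINC.

```python
import re

# Matches the "Unrecoverable error" lines in the format seen in the log above.
ERROR_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] "
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\(process exited with code (?P<code>\d+) \(0x[0-9a-fA-F]+\)\)"
)

def find_failures(log_text):
    """Return one (timestamp, result_name, exit_code) tuple per distinct failure."""
    seen = set()
    failures = []
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if not match:
            continue
        key = (match.group("ts"), match.group("result"), int(match.group("code")))
        if key not in seen:  # the log above shows each error line twice
            seen.add(key)
            failures.append(key)
    return failures
```

Feeding it the log text from this post should give a single entry for result 1pmo_000100690_0 with exit code 251.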

old_user1204

Joined: 25 Aug 04
Posts: 5
Credit: 103,128
RAC: 0
Message 2785 - Posted: 3 Sep 2004, 2:18:11 UTC
Last modified: 3 Sep 2004, 4:03:24 UTC

Looks like the model failed right out of the gate. The messages do show that it uploaded the failure results back, so in a little while your log should show BOINC downloading another model to run. My question is: does the new model run, or does it fail in the same manner?

Also check the stderr_um.txt file for the failed model to see if it gave any error messages. It should be in <Dir-where-BOINC-is>/projects/climateprediction.net/1pmo_000100690/stderr_um.txt; that might give a clue.

BCNU,
Vance
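The stderr_um.txt check can be done across every model directory at once. Below is a rough Python 3 sketch. It assumes the layout described above (one directory per model under projects/climateprediction.net) and uses /Applications/boinc as the default only because that path appears in the original poster's log. The function names are hypothetical.

```python
from pathlib import Path

def find_stderr_files(boinc_dir):
    """Return every stderr_um.txt found one level below the CPDN project directory."""
    project_dir = Path(boinc_dir) / "projects" / "climateprediction.net"
    if not project_dir.is_dir():
        return []
    return sorted(project_dir.glob("*/stderr_um.txt"))

def show_stderr(boinc_dir="/Applications/boinc"):
    """Print each stderr_um.txt, marking empty files so they are not overlooked."""
    files = find_stderr_files(boinc_dir)
    if not files:
        print("no stderr_um.txt found under", boinc_dir)
    for path in files:
        text = path.read_text(errors="replace").strip()
        print(f"== {path} ==")
        print(text if text else "(empty)")
```

Call `show_stderr()` with your own BOINC directory if it lives elsewhere. If nothing turns up, the model directory may already have been cleaned up after the failed result was uploaded.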
Thyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 2859 - Posted: 3 Sep 2004, 12:48:14 UTC

Hi drweaser,

Your problem *might* be related to a Visual Fortran error that's been afflicting the Windows build recently. Seems that some workunits have gone out with a duff file.

Check out this thread on the phpBB forum: http://www.climateprediction.net/board/viewtopic.php?t=2296&p=20006#20006

And thanks to sjokela for doing the investigative work and UK_Nick for providing a link to the file that gives a workaround for the problem :)

old_user6711

Joined: 31 Aug 04
Posts: 5
Credit: 349,932
RAC: 0
Message 2918 - Posted: 3 Sep 2004, 19:13:42 UTC - in response to Message 2785.  

I am having this same problem on both Mac OS X and Linux boxes.
All models fail in this way. It is like they are all corrupt or something.
stderr.txt says
1525-092
old_user9662

Joined: 2 Sep 04
Posts: 29
Credit: 13,100
RAC: 0
Message 2942 - Posted: 4 Sep 2004, 1:04:43 UTC - in response to Message 2785.  

> Looks like the model failed right out of the gate. The messages do show
> that it uploaded the failure results back. So in a little while your log
> should show boinc downloading another model to run. My question is does the
> new model run or does it fail in the same manner?
>
> Also check the stderr_um.txt file for the failed model to see if it gave
> any error messages. Should be in
> <Dir-where-BOINC-is>/projects/climateprediction.net/1pmo_000100690/stderr_um.txt,
> that might give a clue.
>
> BCNU,
> Vance

I am having the same problem also - my log file in Terminal looks almost exactly like the one from drweaser. I had this problem earlier today, and just tried again a few minutes ago. BOINC downloaded another set of data, tried three more times, and then gave up again, so the new model fails also. I tried to find a stderr_um.txt file, but can't find one anywhere on my computer. I'm running a 1 GHz G4 PowerBook with 1 GB RAM and a 60 GB hard drive with over 30 GB available, using OS 10.3.5. Nothing else was running when I tried to run the model.

I read somewhere else in the forums that I should change the debt values in client_state.xml from 0.0000 to anything else. I tried it by changing them to 1.0000, but no improvement.

Any help would be greatly appreciated.
old_user6988

Joined: 31 Aug 04
Posts: 9
Credit: 402,104
RAC: 0
Message 3038 - Posted: 4 Sep 2004, 21:27:42 UTC

Yep, definitely still having problems.....

Anyone interested can check my results.....
drweaser: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/results.php?userid=6988

Didn't find that output file. Maybe it was already uploaded. Tried on ~30 machines, all Mac OS X 10.3.5 with 384 MB+ RAM and 40 GB+ HD space, doing nothing else right now since neither Predictor nor SETI is sending out work.
old_user9824

Joined: 2 Sep 04
Posts: 1
Credit: 28,185
RAC: 0
Message 3143 - Posted: 6 Sep 2004, 12:48:50 UTC - in response to Message 2859.  
Last modified: 6 Sep 2004, 12:51:49 UTC

> Hi drweaser,
>
> Your problem *might* be related to a Visual Fortran error that's been
> afflicting the windows build recently. Seems that some workunits have gone out
> with a duff file.
>

No it doesn't. That problem was found to be caused by corrupt datafiles generated by the server. It is fixed now - and I still have this error on my two Macs as well.
Just attached for the first time on these machines, so it cannot be anything old residual either...

Hope the admins will look into this.
old_user6988

Joined: 31 Aug 04
Posts: 9
Credit: 402,104
RAC: 0
Message 3144 - Posted: 6 Sep 2004, 13:13:30 UTC - in response to Message 3143.  

I have about 6 machines out of the 30 that are now working, judging from the returned results. So there is something in the WUs that ain't right. Hopefully they will find it soon so that the WUs don't constantly get downloaded, rejected, and redownloaded. I wonder whether they have taken the bad ones out of the pool or not.....I seem to remember that Predictor had to do something like that with bad CHARMM units.

Drweaser



> > Hi drweaser,
> >
> > Your problem *might* be related to a Visual Fortran error that's been
> > afflicting the windows build recently. Seems that some workunits have
> > gone out with a duff file.
> >
>
> No it doesn't. That problem was found to be caused by corrupt datafiles
> generated by the server. It is fixed now - and I still have this error on my
> two Macs as well.
> Just attached for the first time on these machines, so it cannot be anything
> old residual either...
>
> Hope the admins will look into this.
>
old_user6988

Joined: 31 Aug 04
Posts: 9
Credit: 402,104
RAC: 0
Message 3153 - Posted: 6 Sep 2004, 15:23:16 UTC - in response to Message 3144.  

Now what the Hades is going on....

check this out

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=141230

now it can't figure out the host computer for this work unit.....something is bonkers somewhere.....

drweaser


> I have about 6 machines out of the 30 that are now working from the returned
> results. So there is something in the WU's that ain't right. Hopefully they
> will find it soon so that the WU's don't constantly get downloaded and rejected
> and redownloaded. I wonder if they have taken the bad ones out of the pool,
> or not.....I seem to remember that Predictor had to do something like that
> with bad CHARMM units.
>
> Drweaser

old_user156

Joined: 5 Aug 04
Posts: 186
Credit: 1,612,182
RAC: 0
Message 3154 - Posted: 6 Sep 2004, 15:34:58 UTC - in response to Message 3153.  

> Now what the Hades is going on....
>
> check this out
>
> http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=141230
>
> now it can't figure out the host computer for this work unit.....something is
> bonkers somewhere.....

Heh, a negative computer. :-)


old_user1

Joined: 5 Aug 04
Posts: 907
Credit: 299,864
RAC: 0
Message 3169 - Posted: 6 Sep 2004, 18:21:51 UTC - in response to Message 3154.  

I think I've corrected a lot of the oddities, such as the "negative" computer. The problem is that there seem to be some "orphan" runs out there: a problem with the BOINC "feeder" (which has since been fixed) meant that some runs were sent out with incorrect parent records, so there are a bunch of runs out there that will not trickle or update properly. If your machine has one of these, you're probably better off resetting and re-attaching so that you get a "proper" run.

old_user6988

Joined: 31 Aug 04
Posts: 9
Credit: 402,104
RAC: 0
Message 3176 - Posted: 6 Sep 2004, 19:25:59 UTC - in response to Message 3169.  

So out of my 30 eMacs, 5 are currently running a model. Question 1: Should I reset those? They were the ones that had the negative computer.

Question 2. Should I reset the remaining 25 machines or simply let them keep trying to get a usable WU?

At least now the trickles show up under my account, thanks for the fix!

drweaser

> I think I've corrected a lot of the oddities such as the "negative" computer.
> The problem is there seem to be some "orphan" runs out there, i.e. a problem
> with the BOINC "feeder" (which has since been fixed) means that some runs were
> sent out with incorrect parent records, so there are a bunch of runs out there
> that will not trickle or update properly. If your machine is one, you're
> probably better off resetting and attaching so that you get a "proper" run.
>
old_user6988

Joined: 31 Aug 04
Posts: 9
Credit: 402,104
RAC: 0
Message 3180 - Posted: 6 Sep 2004, 19:43:40 UTC

Got a different error this time.....looks like some sort of memory problem.....maybe this will help somebody figure it out....

Starting model in /Applications/boinc/projects/climateprediction.net...
Archive: hadsm3se_4.03_powerpc-apple-darwin.zip
inflating: ./hadsm3se_4.03_powerpc-apple-darwin
inflating: ./libxlf90.A.dylib
...
inflating: 1sdl_100104286/jobs/climate.doub
inflating: 1sdl_100104286/jobs/ncatts.cpdc
inflating: 1sdl_100104286/jobs/spec3a_sw_3_asol2b_hadcm3


Could not create shared memory region 25780, 301228



2004-09-06 15:41:44 [climateprediction.net] Unrecoverable error for result 1sdl_100104286_1 (process exited with code 255 (0xff))
2004-09-06 15:41:44 [climateprediction.net] Unrecoverable error for result 1sdl_100104286_1 (process exited with code 255 (0xff))
2004-09-06 15:41:44 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds
2004-09-06 15:41:44 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds
2004-09-06 15:41:44 [climateprediction.net] Computation for result 1sdl_100104286 finished
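The "Could not create shared memory region" line does not say why the call failed. One thing that may be worth ruling out (a guess, not something confirmed in this thread) is that shared memory segments left behind by earlier crashed runs are still allocated, or that the kernel's System V shared memory limits are low. The Python 3 sketch below assumes the model uses System V shared memory, which the "key = 25565" message in the earlier log hints at but does not prove. What the two numbers in the error mean is also an assumption; the parser just returns them. `sysctl kern.sysv` and `ipcs -m` are the standard macOS/BSD tools for viewing the limits and the current segments. The function names are hypothetical.

```python
import re
import subprocess

SHM_ERROR_RE = re.compile(r"Could not create shared memory region (\d+), (\d+)")

def parse_shm_error(line):
    """Return the two numbers from the error line as ints, or None if no match."""
    match = SHM_ERROR_RE.search(line)
    return (int(match.group(1)), int(match.group(2))) if match else None

def run_tool(args):
    """Run a command and return its stdout, or None if it is missing or fails."""
    try:
        done = subprocess.run(args, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return done.stdout

def report():
    """Print the System V IPC limits and the shared memory segments currently held."""
    print(run_tool(["sysctl", "kern.sysv"]) or "sysctl kern.sysv not available")
    print(run_tool(["ipcs", "-m"]) or "ipcs -m not available")
```

Running `report()` on a failing machine and on one of the working ones, then comparing the output, might show whether stale segments are involved.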
