Disappearing run. Diagnostics?
Author | Message |
---|---|
Joined: 25 Aug 04 Posts: 5 Credit: 103,128 RAC: 0 |
<P>I was running 2 runs on <B>boinc_4.05_powerpc-apple-darwin</B> on a PowerMac dual 1.25 GHz G4, 1.25 GB RAM, MacOS 10.3.5. The runs started around 2004-08-25 18:54:30 for <I>00c3_300025420_0 using hadsm3 version 4.03</I> and <I>00c4_300025421_0 using hadsm3 version 4.03</I>.</P>
<P>Checking after coming home today, 8/26, it appears that run 00c4 has disappeared. The thing is, I don't know what diagnostics to look at to see <B>why</B> it disappeared. The log file for that model shows...</P>
<PRE>
00c4_300025421 - PH 1 TS 007633 - 10/05/1811 00:30 - H:M:S=0011:53:10 AVG= 5.61 DLT= 2.69
00c4_300025421 - PH 1 TS 007634 - 10/05/1811 01:00 - H:M:S=0011:53:23 AVG= 5.61 DLT=13.27
00c4_300025421 - PH 1 TS 007635 - 10/05/1811 01:30 - H:M:S=0011:53:25 AVG= 5.61 DLT= 1.94
00c4_300025421 - PH 1 TS 007636 - 10/05/1811 02:00 - H:M:S=0011:53:27 AVG= 5.61 DLT= 1.95
</PRE>
<P>Then nothing else: no messages or errors, the run just stopped reporting. Kicking in viz on that run shows a blue planet. Checking my account on the website shows no status for the 00c4 run. So, 2 questions: 1) How do I determine whether this was just a "normal" failed model or something else (like a bug)? That is, how do I diagnose this? 2) How do I get boinc to report home to y'all about the 00c4 status, or will it just do that on its own in time and then download a new model?</P>
<P>Thanks for your time.<br> BCNU,<br> Vance</P> |
Joined: 5 Aug 04 Posts: 907 Credit: 299,864 RAC: 0 |
Hi, with ./viz you can pass a command-line argument to attach to a specific model, e.g. ./viz 00c4_300025421. If it's blue, that could mean the model crashed. It's probably best to press Ctrl+C, then run ./boinc* again and see whether both models pop up. You could also run ps aux|grep hads to see what is running (for two CPUs there should be two hadsm3_ and two hadsm3um_ processes). |
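The ps check can also be scripted. This is only a minimal Python sketch, assuming the usual `ps aux` layout (ten columns before the command, no spaces inside the start-time column) and the `hadsm3_` / `hadsm3um_` name prefixes mentioned in the post. The function name `count_model_processes` is made up for illustration and is not part of BOINC or CPDN.

```python
def count_model_processes(ps_output):
    """Count monitor (hadsm3_) and model (hadsm3um_) processes in `ps aux` text.

    Entries whose STAT column starts with "Z" are defunct (zombie) processes;
    they are counted separately because they are no longer running.
    """
    counts = {"monitor": 0, "model": 0, "defunct": 0}
    for line in ps_output.splitlines():
        # Assumed layout: USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        stat, command = fields[7], fields[10]
        # ps shows defunct commands in parentheses, e.g. "(hadsm3um_4.03_po)"
        name = command.split()[0].lstrip("(")
        if not name.startswith("hadsm3"):
            continue
        if stat.startswith("Z"):
            counts["defunct"] += 1
        elif name.startswith("hadsm3um_"):
            counts["model"] += 1
        elif name.startswith("hadsm3_"):
            counts["monitor"] += 1
    return counts
```

On a healthy dual-CPU machine you would expect this to report two monitors and two models. Anything less, or a non-zero defunct count, suggests a run has died.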
Joined: 14 Aug 04 Posts: 37 Credit: 276,676 RAC: 0 |
Hi, In 'Finder', open the folder you are using to run the project. You should see a folder named 'projects'; open it. Open the folder 'climateprediction.net' and you will see a separate folder for each of the runs. Your lost run should be there. Open it and look for the file stderr_um.txt. If the work unit (WU) failed, there should be a message about it here. If there is, you can post it if it's not too big, or submit it to CPDN. K. @carl: you got in just ahead of me (LoL) In OS X 10.3.5 you can go to Utilities/Activity Monitor, where you can easily see everything that is running, including of course: 443 hadsm3um_4.02_po chuggybus 93.50 1 45.80 MB 91.16 MB K. |
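If you would rather not click through every run folder, the same check can be sketched in Python. This assumes the layout described above: one folder per run under the 'climateprediction.net' folder, each possibly holding stderr_um.txt and stdout_um.txt. The function name `report_um_logs` is hypothetical, and the path you pass in depends on where you installed BOINC.

```python
import os


def report_um_logs(project_dir, names=("stderr_um.txt", "stdout_um.txt")):
    """Return {run_folder: {file_name: size in bytes, or None if missing}}."""
    report = {}
    for entry in sorted(os.listdir(project_dir)):
        run_dir = os.path.join(project_dir, entry)
        if not os.path.isdir(run_dir):
            continue  # skip loose files next to the run folders
        sizes = {}
        for name in names:
            path = os.path.join(run_dir, name)
            sizes[name] = os.path.getsize(path) if os.path.isfile(path) else None
        report[entry] = sizes
    return report
```

A size of 0 means the file exists but the model wrote nothing to it, so a crashed run will not necessarily leave a message there.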
Joined: 25 Aug 04 Posts: 5 Credit: 103,128 RAC: 0 |
<P>Thanks for your help! The <I>stderr_um.txt</I> and <I>stdout_um.txt</I> files for the <I>00c4</I> project were both zero length. The ps listing showed </P>
<PRE>
G4 /Applications/BOINC-CPDN/projects/climateprediction.net/00c4_300025421 $ ps aux | grep had
strick 925 96.2 3.4 93344 45132 p1 RN Wed06PM 2220:52.21 hadsm3um_4.03_powerpc-apple-darwin 24090 912
strick 912 0.0 0.1 30216 1168 p1 SN Wed06PM 0:18.29 hadsm3_4.03_powerpc-apple-darwin 00c3_300025420
strick 913 0.0 0.1 30216 1164 p1 SN Wed06PM 0:11.18 hadsm3_4.03_powerpc-apple-darwin 00c4_300025421
strick 924 0.0 0.0 0 0 p1 ZN 31Dec69 0:00.00 (hadsm3um_4.03_po)
strick 2434 0.0 0.0 18172 340 std S+ 8:37AM 0:00.01 grep had
</PRE>
<P>So it appears that the <I>00c4</I> run did indeed die off; it is probably the zombie process, pid 924, above. That makes me wonder why pid 913, which was probably the parent process, didn't collect the child process's exit status. If it had <I>wait()</I>'ed appropriately, the child would have been cleaned up. Odd.</P>
<P>Anyway, a CTRL-C shut everything down cleanly, and on restart the log showed that, yes indeed, the <I>00c4</I> model had crashed. Then it uploaded the results to y'all and downloaded a new run.</P>
<PRE>
Starting model ID 00c4_300025421 Phase 1
Waiting for model startup, this may take a minute...
Stack size=48.00 MB
00c4_300025421 - PH 1 TS 007633 - 00/00/0000 00:00 - H:M:S=0011:53:10 AVG= 5.61 DLT= 0.00
Model crashed...retrying...restart level 2
Preparing for restart...
Rewinding a model-year...
Error: Restart files for dataout/restart.year not found
Giving up, this result exceeded crash count for available restart files.
... entries about zipping up files...
2004-08-27 08:47:40 [climateprediction.net] Unrecoverable error for result 00c4_300025421_0 (process exited with code 251 (0xfb))
2004-08-27 08:47:40 [climateprediction.net] Unrecoverable error for result 00c4_300025421_0 (process exited with code 251 (0xfb))
2004-08-27 08:47:40 [climateprediction.net] Computation for result 00c4_300025421 finished
2004-08-27 08:47:40 [climateprediction.net] Started upload of 00c4_300025421_0_1.zip
...
</PRE>
<P>So that's a wrap. Thank you very much for helping me diagnose this. Things are on-track and crunching away again.</P> BCNU,<BR> Vance |
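For anyone curious about the wait() remark: a child that has exited stays in the process table as a zombie (STAT "Z" in ps) until its parent collects the exit status. The Python sketch below illustrates that mechanism on a POSIX system. It is a generic illustration, not the actual hadsm3_ monitor code, and the function name `run_child_and_reap` is made up.

```python
import os


def run_child_and_reap(exit_code):
    """Fork a child that exits at once, then reap it and return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running Python cleanup handlers.
        os._exit(exit_code)
    # Parent: until waitpid() is called, the exited child remains a zombie.
    # waitpid() collects its status and lets the kernel free the entry.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None  # the child was ended by a signal rather than a normal exit


if __name__ == "__main__":
    # 251 is the decimal form of the 0xfb exit code in the log above.
    print(run_child_and_reap(251))
```

If the parent never makes the waitpid() call, the child entry lingers in ps just as pid 924 did.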
Joined: 5 Aug 04 Posts: 907 Credit: 299,864 RAC: 0 |
I looked up the error output from the upload server: LOOKUP TABLE 19328 64-bit words long Non constant polar row found in dump : field 1 Dump must be reconfigured Model run aborted IN U_MODEL1_WIN Starting hadsm3 model for ID# 24170... Changing to slots directory /Applications/BOINC-CPDN/slots/1 Model abandoned: UM has aborted the model Detaching shared memory, closing model... So it's definitely an odd crash, probably caused by a parameter for this run that made the climate model go unstable. The "monitor" program (hadsm3_) usually detects when the model (hadsm3um_) has crashed, but somehow it didn't detect the crash the first time. |
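As a rough illustration of how a monitor can notice that its model process has gone away, here is a small Python sketch that starts a child and polls it until it exits. This is not how hadsm3_ is implemented (that code is not shown in this thread); `watch_child` and its parameters are hypothetical names used only for the example.

```python
import subprocess
import sys
import time


def watch_child(argv, poll_seconds=0.1):
    """Start argv as a child process and poll until it exits; return its exit code."""
    proc = subprocess.Popen(argv)
    while True:
        code = proc.poll()  # None while the child is still running
        if code is not None:
            return code  # poll() has also reaped the child, so no zombie is left
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # Stand-in for a crashing model: a child that exits with code 251 (0xfb).
    crashed = watch_child([sys.executable, "-c", "import sys; sys.exit(251)"])
    print("child exited with code", crashed)
```

A monitor that skips the poll, or polls the wrong process, would keep waiting on a model that is already dead, which is consistent with what was seen here.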
©2024 cpdn.org