Bug narrowed down : [climateprediction.net] Computation for result xxxx finished
Author | Message |
---|---|
Joined: 5 Aug 04 Posts: 39 Credit: 87,633 RAC: 0 |
After stopping BOINC for a moment and then restarting it, it destroyed 2 results:

stdout.txt =
2004-11-03 20:40:28 [---] Starting BOINC client version 4.13 for windows_intelx86
2004-11-03 20:40:28 [climateprediction.net] Project prefs: using separate prefs for school
2004-11-03 20:40:28 [LHC@home] Project prefs: using separate prefs for school
2004-11-03 20:40:28 [climateprediction.net] Host ID is 46600
2004-11-03 20:40:28 [LHC@home] Host ID is 17130
2004-11-03 20:40:28 [---] General prefs: from LHC@home (last modified 2004-11-02 22:50:22)
2004-11-03 20:40:28 [---] General prefs: using separate prefs for school
2004-11-03 20:40:28 [climateprediction.net] Resuming computation for result 2z86_000160367_1 using hadsm3 version 4.03
2004-11-03 20:40:28 [climateprediction.net] Resuming computation for result 30jy_000162104_1 using hadsm3 version 4.03
2004-11-03 20:40:29 [LHC@home] Started upload of v64lhc1000protwelve-58s8_1053.42_1_sixvf_39751_0_0
2004-11-03 20:40:29 [LHC@home] Started upload of v64lhc1000protwelve-59s10_12553.46_1_sixvf_42652_1_0
2004-11-03 20:40:29 [climateprediction.net] Computation for result 2z86_000160367 finished
2004-11-03 20:40:29 [climateprediction.net] Starting result 3siy_000198715_0 using hadsm3 version 4.04
2004-11-03 20:40:29 [climateprediction.net] Computation for result 30jy_000162104 finished

stderr.txt =
2004-11-03 20:40:29 [climateprediction.net] Unrecoverable error for result 2z86_000160367_1 ( - exit code -1 (0xffffffff))
2004-11-03 20:40:29 [climateprediction.net] Unrecoverable error for result 30jy_000162104_1 ( - exit code -1 (0xffffffff))
2004-11-03 20:40:29 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds

It happened while LHC was completely unreachable. I am not sure whether the problem is related to that, but the chance is quite high, as the client immediately tried to upload a bunch of LHC results, which of course failed.
After stopping BOINC again, I saw that hadsm3um_4.03_windows_intelx86.exe was still running although all other BOINC processes were gone. This might be a reason for the failure too, of course.
System: BOINC 4.13 / Win2k SP4 / Dual Athlon MP 2600+
I have saved all XML and project files and will report the two damaged WUs now: resultid=268718, resultid=261770. If you need any of the files to help figure out the problem, I can upload them to some web space. 90 trickles lost in BOINC space :'( |
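For anyone who wants to check their own stdout.txt for the same symptom, here is a minimal Python sketch. It assumes the message format shown in the log above (timestamp, bracketed project name, message text) and flags any result whose "finished" line appears within a few seconds of its "Resuming computation" line. The function name `suspicious_finishes` and the 5-second threshold are illustrative choices, not part of BOINC.

```python
import re
from datetime import datetime

# Assumed line layout: "YYYY-MM-DD HH:MM:SS [project] message"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[([^\]]*)\] (.*)$")
RESUME = re.compile(r"^Resuming computation for result (\S+)")
FINISH = re.compile(r"^Computation for result (\S+) finished")


def suspicious_finishes(lines, max_seconds=5):
    """Return (result_name, seconds) pairs for results that were reported
    as finished within max_seconds of being resumed."""
    resumed = {}  # result name -> time it was resumed
    hits = []
    for raw in lines:
        match = LINE.match(raw.strip())
        if not match:
            continue  # not a client message line
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        text = match.group(3)
        resume = RESUME.match(text)
        if resume:
            resumed[resume.group(1)] = when
            continue
        finish = FINISH.match(text)
        if finish:
            # In the log above the "finished" line drops the trailing "_1",
            # so results are matched by prefix.
            for name, started in resumed.items():
                gap = (when - started).total_seconds()
                if name.startswith(finish.group(1)) and gap <= max_seconds:
                    hits.append((name, gap))
    return hits


if __name__ == "__main__":
    sample = [
        "2004-11-03 20:40:28 [climateprediction.net] Resuming computation for result 2z86_000160367_1 using hadsm3 version 4.03",
        "2004-11-03 20:40:29 [climateprediction.net] Computation for result 2z86_000160367 finished",
    ]
    for name, gap in suspicious_finishes(sample):
        print(f"{name} finished {gap:.0f}s after resuming")
```

A climate model cannot legitimately finish one second after resuming, so any hit is worth cross-checking against stderr.txt for an "Unrecoverable error" entry.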
Joined: 5 Aug 04 Posts: 39 Credit: 87,633 RAC: 0 |
I found something else that looks suspicious in stderr_um.txt:

...
OPEN: File dataout/2z86ca.dap3bj0 Created on Unit 22
OPEN: File dataout/2z86ca.dap3bm0 Created on Unit 22
OPEN: File dataout/2z86ca.dap3bp0 Created on Unit 22
OPEN: File dataout/2z86ca.dap3bs0 Created on Unit 22
OPEN: File dataout/2z86ca.dap3c10 Created on Unit 22
CLOSE: WARNING: Unit 60 Not Opened
OPEN: File dataout/2z86ca.pap4c10 Created on Unit 60
CLOSE: WARNING: Unit 63 Not Opened
OPEN: File dataout/2z86ca.pdp4c10 Created on Unit 63
CLOSE: WARNING: Unit 64 Not Opened
OPEN: File dataout/2z86ca.pep4c10 Created on Unit 64
CLOSE: WARNING: Unit 65 Not Opened
OPEN: File dataout/2z86ca.pfp4c10 Created on Unit 65
CLOSE: WARNING: Unit 66 Not Opened
OPEN: File dataout/2z86ca.pgp4c10 Created on Unit 66
CLOSE: WARNING: Unit 67 Not Opened
OPEN: File dataout/2z86ca.php4c10 Created on Unit 67
OPEN: File dataout/2z86ca.dap3c40 Created on Unit 22
OPEN: File dataout/2z86ca.dap3c70 Created on Unit 22
OPEN: File dataout/2z86ca.dap3ca0 Created on Unit 22
...

The other destroyed model did not have any errors in this file. |
Joined: 5 Aug 04 Posts: 39 Credit: 87,633 RAC: 0 |
It (nearly) happened again, this time with the CLI: I shut down the BOINC CLI, but the CPDN client kept running. This time I noticed that it was still there, so I killed CPDN from the Task Manager. The model was not destroyed this time; it restarted properly. So now I'm quite sure the problem is that, under certain circumstances, the project client doesn't exit and the BOINC client doesn't retry killing it. The machine was under heavy load when it happened (one CPU was running Seti Classic, the other was supposed to do some 3D rendering). I have now reproduced it 3 times under the same load, so it is definitely a bug. (I posted a link to this thread on the BOINC forum.) |
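To illustrate the kind of retry I mean, here is a minimal Python sketch of a shutdown routine that asks a child process to exit, waits, retries, and finally force-kills it. This is not BOINC's code; the helper name `stop_child`, the grace period, and the retry count are all assumptions made for the example.

```python
import subprocess
import sys


def stop_child(proc, grace_seconds=2.0, retries=3):
    """Ask a child process to exit, retrying, then force-kill it.

    Returns True if the child has exited by the time this returns.
    """
    for _ in range(retries):
        if proc.poll() is not None:   # child is already gone
            return True
        proc.terminate()              # SIGTERM on POSIX, TerminateProcess on Windows
        try:
            proc.wait(timeout=grace_seconds)
            return True
        except subprocess.TimeoutExpired:
            continue                  # still alive (e.g. machine under load): try again
    proc.kill()                       # last resort after all retries
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        return False                  # report failure instead of assuming success
    return True


if __name__ == "__main__":
    # Stand-in for a science application that keeps running after shutdown.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("child exited:", stop_child(child))
```

The point of the design is that the caller always learns whether the child really exited, so it never restarts work in a slot that a leftover process may still be using.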
Joined: 5 Aug 04 Posts: 39 Credit: 87,633 RAC: 0 |
One more thought about the problem: if hadsm3<b>se</b>...exe, rather than hadsm3<b>um</b>...exe, is the one that calls <i>boinc_init_options()</i> with <i>opt.main_program=true</i>, this would explain why slots/?/boinc_lockfile failed to prevent a second client from working on the same model. |
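The suspected failure mode can be shown with a generic lock-file sketch in Python. This is only an illustration of the idea, not how BOINC implements boinc_lockfile: the `SlotLock` class and its exclusive-create strategy are assumptions for the example. The lock only protects the slot while the process that took it is alive, so if that process exits and releases the lock while another process keeps working, a newcomer is let in.

```python
import os
import tempfile


class SlotLock:
    """Exclusive lock represented by a file that only one holder may create."""

    def __init__(self, slot_dir, name="boinc_lockfile"):
        self.path = os.path.join(slot_dir, name)
        self.fd = None

    def acquire(self):
        try:
            # O_EXCL makes creation fail if the file already exists.
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            return True
        except FileExistsError:
            return False

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.path)
            self.fd = None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as slot:
        holder = SlotLock(slot)      # stands in for the process that took the lock
        newcomer = SlotLock(slot)    # stands in for a freshly started client
        print(holder.acquire())      # the first process gets the lock
        print(newcomer.acquire())    # the newcomer is refused while it is held
        holder.release()             # the lock holder exits...
        print(newcomer.acquire())    # ...and the newcomer is accepted, whether or
                                     # not some other worker is still running
        newcomer.release()
```

If the process doing the actual model work is not the one holding the lock, the lock says nothing about whether that work has stopped.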
©2024 cpdn.org