Message boards : Number crunching : Error while computing???
Author | Message |
---|---|
Joined: 15 May 09 Posts: 4538 Credit: 19,007,330 RAC: 21,449 |
It has been noted that they have a high failure rate. My looking at the work units suggests failure rate is higher on AMD than Intel but there seem to be some running well past an hour on both. Will wait and see if Sihan comes up with any reason for the high failure rate. |
Joined: 9 Sep 04 Posts: 228 Credit: 30,750,791 RAC: 3,898 |
Twelve workunits without any problems so far on an Intel-cpu. |
Joined: 29 Jun 12 Posts: 31 Credit: 1,438,478 RAC: 0 |
My looking at the work units suggests failure rate is higher on AMD than Intel That's one thing we could have done without. An unscientific opinion like that is going to get all the fanbois stirred up and start flame wars. |
Joined: 15 May 09 Posts: 4538 Credit: 19,007,330 RAC: 21,449 |
fanbois? Surely the ones that would get upset over this are all on super quiet fanless liquid cooling? (I have used both so can get upset on either account :) ) |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
an unscientific opinion I said that first about AMDs, and it's NOT an opinion, it's an observation, based on the 2 failures that I received a few hours ago. Since then, more people have posted, and there's lots of Intels as well now. Whatever the reason, a failure is a failure, and will gather no credit. |
Joined: 29 Jun 12 Posts: 31 Credit: 1,438,478 RAC: 0 |
Your statement doesn't make any sense at all to me, you said what first? You had 2 failures on your computers? I'm sorry I don't understand. To me, these failures don't seem hardware related at all and I'll try to do better with using the correct terminology next time. |
Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Just about all the wah2 tasks issued between 1-4 Aug failed - about 100 of them. Ones since then seem to be OK. Computer 1415561 (Ignore the abandoned ones - that's another thing.) |
Joined: 23 Aug 06 Posts: 6 Credit: 5,365,473 RAC: 0 |
Several failures within minutes after downloading. All wah2_safr50 tasks:
Task 21397321
Task 21390625
Task 21395215
And I get a Fortran error appearing as a warning box/window outside BOINC. Below example is for Task 21393668
Intel(r) Visual Fortran run-time error
forrtl: severe (19): invalid reference to variable in NAMELIST input, unit 4, file C:\ProgramData\BOINC\projects\climateprediction.net\wah2_safr50_b2nw_199912_16_774_011679367\jobs\xadae.stashc, line 60, position 13
Image PC Routine Line Source
wah2am3m2_um_8.24 016B32AA Unknown Unknown Unknown
wah2am3m2_um_8.24 0165FC90 Unknown Unknown Unknown
wah2am3m2_um_8.24 0165EE5A Unknown Unknown Unknown
wah2am3m2_um_8.24 01641DEA Unknown Unknown Unknown
wah2am3m2_um_8.24 014F999F Unknown Unknown Unknown
wah2am3m2_um_8.24 0155F43F Unknown Unknown Unknown
wah2am3m2_um_8.24 016112E8 Unknown Unknown Unknown
wah2am3m2_um_8.24 013497F4 Unknown Unknown Unknown
wah2am3m2_um_8.24 016989FF Unknown Unknown Unknown
KERNEL32.DLL 745462C4 Unknown Unknown Unknown
ntdll.dll 77971F69 Unknown Unknown Unknown
ntdll.dll 77971F34 Unknown Unknown Unknown
Event log example for tasks 21397321, 21390625 and 21395215.
27/11/2018 11:46:10 | climateprediction.net | Finished download of atmos_restart_batch_741_safr50_a0sl_2004-12-01.gz 27/11/2018 11:46:10 | climateprediction.net | Started download of ic19611201_16_N96.gz 27/11/2018 11:46:11 | climateprediction.net | Finished download of ic19611201_16_N96.gz 27/11/2018 11:46:11 | climateprediction.net | Started download of final_ancil_2year_OSTIA_sst_2004-12-01_2006-12-30.gz 27/11/2018 11:46:13 | climateprediction.net | Finished download of region_restart_batch_741_safr50_a0sl_2004-12-01.gz 27/11/2018 11:46:13 | climateprediction.net | Started download of final_ancil_2year_OSTIA_ice_2004-12-01_2006-12-30.gz 27/11/2018 11:46:14 | climateprediction.net | Finished download of final_ancil_2year_OSTIA_ice_2004-12-01_2006-12-30.gz 27/11/2018 11:46:14 | climateprediction.net | Started download of so2dms_rcp45_N96_1999_2010.gz 27/11/2018 11:46:14 | climateprediction.net | Computation for task wah2_safr50_b0bx_198712_16_774_011676344_0 finished 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_1.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_2.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_3.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_4.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_5.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_6.zip for 
task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_7.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_8.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_9.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_10.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_11.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_12.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_13.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_14.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_15.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_16.zip for task wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:14 | climateprediction.net | Output file wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_restart.zip for task 
wah2_safr50_b0bx_198712_16_774_011676344_0 absent 27/11/2018 11:46:15 | climateprediction.net | Finished download of final_ancil_2year_OSTIA_sst_2004-12-01_2006-12-30.gz 27/11/2018 11:46:15 | climateprediction.net | Started download of ozone_rcp45_N96_1999_2010v2.gz 27/11/2018 11:46:16 | climateprediction.net | Finished download of ozone_rcp45_N96_1999_2010v2.gz 27/11/2018 11:46:16 | climateprediction.net | Started upload of wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_out.zip 27/11/2018 11:46:16 | climateprediction.net | Computation for task wah2_safr50_b5gj_201312_16_774_011682990_0 finished 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_1.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_2.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_3.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_4.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_5.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_6.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_7.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_8.zip for task 
wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_9.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_10.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_11.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_12.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_13.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_14.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_15.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_16.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:16 | climateprediction.net | Output file wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_restart.zip for task wah2_safr50_b5gj_201312_16_774_011682990_0 absent 27/11/2018 11:46:17 | climateprediction.net | Finished upload of wah2_safr50_b0bx_198712_16_774_011676344_0_r1505665212_out.zip 27/11/2018 11:46:18 | climateprediction.net | Finished download of so2dms_rcp45_N96_1999_2010.gz 27/11/2018 11:46:18 | climateprediction.net | Started upload of 
wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_out.zip 27/11/2018 11:46:19 | climateprediction.net | Finished upload of wah2_safr50_b5gj_201312_16_774_011682990_0_r162766967_out.zip 27/11/2018 11:46:20 | climateprediction.net | Starting task wah2_safr50_b3ul_200412_16_774_011680904_0 27/11/2018 11:48:25 | climateprediction.net | Computation for task wah2_safr50_b3ul_200412_16_774_011680904_0 finished 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_1.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_2.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_3.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_4.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_5.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_6.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_7.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_8.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_9.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 
absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_10.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_11.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_12.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_13.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_14.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_15.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_16.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:25 | climateprediction.net | Output file wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_restart.zip for task wah2_safr50_b3ul_200412_16_774_011680904_0 absent 27/11/2018 11:48:27 | climateprediction.net | Started upload of wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_out.zip 27/11/2018 11:48:29 | climateprediction.net | Finished upload of wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_out.zip |
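A log like the one above is easier to read once the "absent" entries are tallied per task. The following is only a minimal sketch: the function name `count_absent_outputs` and the sample string are made up for illustration, and the pattern assumes entries keep the "Output file ... for task ... absent" wording shown in the post.

```python
import re
from collections import Counter

# Matches BOINC event-log entries worded like the ones quoted above.
# \s+ is used between words because the pasted log wraps at arbitrary points.
ABSENT_RE = re.compile(r"Output\s+file\s+(\S+)\s+for\s+task\s+(\S+)\s+absent")


def count_absent_outputs(log_text: str) -> Counter:
    """Map each task name to the number of output files reported absent."""
    counts: Counter = Counter()
    for _file_name, task_name in ABSENT_RE.findall(log_text):
        counts[task_name] += 1
    return counts


# Two entries in the format of the log above; the second wraps between
# "for" and "task", as some of the pasted entries do.
sample_log = (
    "27/11/2018 11:48:25 | climateprediction.net | Output file "
    "wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_1.zip for task "
    "wah2_safr50_b3ul_200412_16_774_011680904_0 absent\n"
    "27/11/2018 11:48:25 | climateprediction.net | Output file "
    "wah2_safr50_b3ul_200412_16_774_011680904_0_r381998839_2.zip for\n"
    "task wah2_safr50_b3ul_200412_16_774_011680904_0 absent\n"
)

for task, missing in count_absent_outputs(sample_log).items():
    print(f"{task}: {missing} output file(s) absent")
```

Run against the full log, a task that reports every numbered zip plus the restart zip as absent probably produced no model output before it stopped, which would fit the failures within minutes described in the post.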
Joined: 22 Aug 06 Posts: 1 Credit: 832,463 RAC: 0 |
I started getting those this morning when I started getting work units. Going to try the settings mentioned above. I changed the settings as recommended above. This cut the number of active tasks in half and the error message went away. |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
If they are batch 744 everyone is getting the run time errors. The whole batch appears to be bad. See “new Work” thread. |
Joined: 22 Feb 06 Posts: 491 Credit: 30,976,726 RAC: 14,201 |
Getting a compute error on batch 771 model with the following error: <core_client_version>7.14.2</core_client_version> <![CDATA[ <message> (unknown error) - exit code -1073740791 (0xc0000409)</message> <stderr_txt> BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Processing restart Year 1910 Month 12 Day 1 Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Had previously failed on another machine with same error code after trickle at timestep 259272. |
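The two numbers in that message are the same 32-bit value written two ways: the negative decimal is the signed reading, the hex is the unsigned one. A small sketch of the conversion follows; `exit_code_to_hex` is a hypothetical helper name, not part of BOINC.

```python
def exit_code_to_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    unsigned = exit_code & 0xFFFFFFFF  # two's-complement wrap to 32 bits
    return f"0x{unsigned:08x}"


print(exit_code_to_hex(-1073740791))  # prints 0xc0000409
```

Microsoft's NTSTATUS documentation lists 0xC0000409 as STATUS_STACK_BUFFER_OVERRUN. The thread does not establish what triggered it in this model, so the code only helps with looking the value up.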
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I just got a work unit: https://www.cpdn.org/cpdnboinc/workunit.php?wuid=11667216 It has used 11 hours 27 minutes of CPU time so far since it started. 334 hours predicted for it to finish. You can see that two others attempted this work unit, used up a lot of machine time, and then failed. But notice that the reported run time is fantastically longer than the CPU time. I do not know what run time measures; wall-clock time perhaps? Those work units that I do get (very very few since I run Linux) seem to all be of this type: failures. And they usually complete just fine on my machine, a Dell T7600 with a 4-core 64-bit Xeon processor, 8 GBytes RAM, running Red Hat Enterprise Linux Server release 6.10 (Santiago). It makes me wonder if the other users run on unreliable hardware, or unreliable software. My machines usually run 24/7 except when I reboot after installing a new kernel. |
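On the question of what "run time" might measure: the general difference between wall-clock time and CPU time can be shown in a few lines. This is only an illustration of the two kinds of clock using Python's standard `time` module; whether BOINC's "Run time" column is computed this way is an assumption, and `measure` is a made-up name.

```python
import time


def measure(wait_seconds: float) -> tuple[float, float]:
    """Return (wall_clock_seconds, cpu_seconds) for a sleep of the given length."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    time.sleep(wait_seconds)  # waiting uses wall-clock time but almost no CPU
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


wall, cpu = measure(0.2)
print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

A process that is suspended or waiting most of the time accumulates wall-clock time while its CPU time barely moves, which is one way a large gap between the two figures can arise.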
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
It makes me wonder if the other users run on unreliable hardware, or unreliable software. My machines usually run 24/7 except when I reboot after installing a new kernel. I think a lot of people use laptops, and they are constantly shutting them down or allowing them to go into sleep mode. That kills the CPDN work units after too many times. The project should really ban machines that fail too often. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
You can see that two others attempted this work unit, used up a lot of machine time, and then failed. That model is what the researchers are looking for: one whose starting parameters eventually lead to unrealistic physics. So now they know. That is what the error message "ATM_DYN : INVALID THETA DETECTED." means. I think that the first part is an abbreviation of Atmospheric Dynamics. So you'll probably also get that error, which will be the "proof of the pudding". |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
That model is what the researchers are looking for: one whose starting parameters eventually lead to unrealistic physics. In this case, you are probably right. On the other hand, many of the work units I have gotten in the last year or so (not a lot of them) also failed for one or two other users, and completed successfully for me. Some of them died due to missing libraries (usually died very fast). Others died later because of missing trickle files (I think). In the case of this work unit, do they really need it to fail on three different machines? I will let it run, but ... |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Years ago, a test was run with the same starting values on different computers. It showed that there were slight differences between the results, enough to make some of the tests appear to have different starting values. And this IS research. Perhaps your computer is just slightly different in a way that will mean that it WON'T fail. |
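How a slight difference can end up looking like a different starting value can be illustrated with a toy chaotic system. The sketch below uses the logistic map, which has nothing to do with the climate model itself; the function name, the parameter r = 3.9 and the 1e-15 offset are all choices made for the illustration, and the thread does not say what caused the differences observed in that test.

```python
def logistic_trajectory(x0: float, r: float = 3.9, steps: int = 200) -> list[float]:
    """Iterate the logistic map x -> r * x * (1 - x) and return every value."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting values differ by about one part in 10**15,
# comparable in size to a rounding difference between machines.
a = logistic_trajectory(0.3)
b = logistic_trajectory(0.3 + 1e-15)

max_gap = max(abs(p - q) for p, q in zip(a, b))
print(f"initial gap: {abs(a[0] - b[0]):.1e}, largest gap: {max_gap:.3f}")
```

The gap grows until the two runs are unrelated, so a run that fails on one machine and completes on another is at least plausible under this kind of sensitivity.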
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
In the case of this work unit, do they really need it to fail on three different machines? I will let it run, but ... I know that I have successfully completed a number of _2 WU’s that had failed on 2 other machines. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
this IS research. Perhaps your computer is just slightly different in a way that will mean that it WON'T fail. Well, it has just sent a trickle. And unless this has been fixed (not likely), it will be the only trickle. Work unit has used 31 hours 43 minutes so far and predicts 312 hours 19 hours to go. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
predicts 312 hours 19 hours to go. OOPS! predicts 312 hours 19 minutes to go. |
Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
What I find interesting about this work unit hadcm3s_st249_190012_120_771_011667216 is the large amount of Run Time required (149,672.31, 138,538.43 seconds) to get 30 to 60 seconds of CPU time. This is on two different machines with different CPUs, both running 64-bit Windows 10. What are they spending that time on without using a CPU? On my machine with this same work unit, I already have over 56 hours of CPU time, have uploaded a trickle, and it is still running with 283 hours predicted to go. |
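To put those figures in proportion, the quoted run times can be converted to hours and compared with the CPU time. A quick sketch of the arithmetic, taking 60 seconds (the upper end of the quoted range) as the CPU time; `summarize` is a made-up helper name.

```python
def summarize(run_seconds: float, cpu_seconds: float) -> tuple[float, float]:
    """Return (run time in hours, CPU time as a percentage of run time)."""
    return run_seconds / 3600.0, 100.0 * cpu_seconds / run_seconds


run_times_s = [149_672.31, 138_538.43]  # "Run Time" values quoted in the post
cpu_time_s = 60.0  # upper end of the "30 to 60 seconds" of CPU time

for run_s in run_times_s:
    hours, cpu_pct = summarize(run_s, cpu_time_s)
    print(f"{run_s:,.2f} s = {hours:.1f} h of run time, CPU share {cpu_pct:.3f}%")
```

Both machines show well over a day and a half of run time with the CPU busy for a tiny fraction of one percent of it.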
©2024 cpdn.org