Message boards : Number crunching : Batch 742
Author | Message |
---|---|
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Is there a problem with batch 742? They are wah2_sam25’s. In the last 3 days I have tried to run three of them and each has failed in less than 10 minutes. The stderr on the latest is:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
Signal 11 received: Segment violation
Signal 11 received: Software termination signal from kill
Signal 11 received: Abnormal termination triggered by abort call
Signal 11 received, exiting...
10:55:27 (5196): called boinc_finish(193)
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=5872, iMonCtr=2
Model crash detected, will try to restart...
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=8428, iMonCtr=2
Leaving CPDN_ain::Monitor...
10:55:31 (5872): called boinc_finish(0)
</stderr_txt>
<message>
upload failure:
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_1.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_2.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_3.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_4.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_5.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_6.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_7.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_8.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_9.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_10.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_11.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_12.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_13.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_restart.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
</message>
]]>
No trickles! Are others having this problem? |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Yes, a big problem. Mine all fail after about three minutes, or are aborted upon receipt (especially if they have already failed twice). Your choice. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I have 2 problems with that batch: (1) the zips are soooooo big, and (2) I keep getting them. So far, I've only had one failure. That was very close to the end, and was a FORTRAN error. (I think that it may be the only one of those that I've had.) I get them at all stages, including some that are on their first run. At present I have 5 of them. When will they stop? :( |
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I've finished 12 on my Ryzen and one on an i7 without any failures. I've noted many, many failures on other PCs with seg faults. I'm running mine without hyperthreading turned on. I wonder if that has anything to do with it. They are big models that hit memory and cache harder than most of the other regions. Then again, I just may have been fortunate. |
Joined: 18 Jul 13 Posts: 438 Credit: 25,620,508 RAC: 4,981 |
I've had 5 of these; 4 finished successfully and the 5th is at 56%. I run them on Win7, BOINC 7.8.3 (x64), i5-2520M, with hyper-threading on, and it takes around 18 days. |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
Thanks guys. It’s nice to know it’s not just me. I just upgraded to Win10 from 7 after replacing a bad HDD. Two of the failed WUs were _2’s, so they had failed twice before. The other was a _0. They may be testing extreme conditions. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"I've finished 12 on my Ryzen and one on an i7 without any failures." I think there is something to the hyper-threading idea, at least for the Ryzens. I have seen a number of cases where a work unit that completes on my Intel chips (i7-3770, i7-4790) with hyper-threading enabled fails on the AMD chips. https://www.cpdn.org/cpdnboinc/workunit.php?wuid=11631965 I have a Ryzen 1700 on Ubuntu 18.04, and have found that when running Rosetta (for example), I get basically no errors with SMT off in the BIOS, but about one in ten fails with it enabled. Maybe the compilers are more optimized for Intel? |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
For the models that are failing near the start, it's possible that those computers are running too many climate models at the same time. I think the early failures were reported to be near the change from global to regional models. The sam's have a very big regional model, so the circumference is very big, and will have a lot of data points. That means a lot of storage area, and if there are more models than the processor can allocate space for (stack space?), then some will fail because they try to write data into prohibited areas. (On the supercomputers the models were written for, it was one computer, one model. And the translation to desktops would presumably be for one model per processor core.) Perhaps someone with a high failure rate could try running with no more than one model per core, and see how they go. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,824,485 RAC: 4,956 |
Because I benchmark the first two models I receive from each batch I rarely run more than one model on a PC at a time and have had numerous Batch #742 failures in that single-model mode - and a few successes. On two machines: 2/2 successes (Windows 7) and 1/20 successes (Windows 10). (And, as Les says, 13 x 95 MB + 1 x 115 MB is big.) |
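For a sense of scale, the figures in the post above can be worked through in a few lines. This is only a rough sketch: the variable and function names are made up for illustration, the sizes are the approximate ones quoted in the post, and treating the single 115 MB file as the restart zip is an assumption based on the file list in the first post's stderr.

```python
# Approximate figures quoted in the post above; names are illustrative only.
NUMBERED_ZIPS = 13       # numbered upload zips per task
NUMBERED_ZIP_MB = 95     # approximate size of each numbered zip, in MB
LAST_ZIP_MB = 115        # approximate size of the one larger zip
                         # (assumed here to be the restart zip)


def total_upload_mb(n_zips: int, zip_mb: float, last_mb: float) -> float:
    """Total upload volume for one task, in MB."""
    return n_zips * zip_mb + last_mb


def success_rate(successes: int, attempts: int) -> float:
    """Fraction of attempted tasks that completed."""
    return successes / attempts


total = total_upload_mb(NUMBERED_ZIPS, NUMBERED_ZIP_MB, LAST_ZIP_MB)
print(f"Upload per task: about {total} MB")
print(f"Windows 7 success rate:  {success_rate(2, 2):.0%}")
print(f"Windows 10 success rate: {success_rate(1, 20):.0%}")
```

On these numbers a single task uploads about 1350 MB in total, and the reported Windows 10 success rate is 5% against 100% on Windows 7, although two tasks is a very small sample.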
Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
"I've finished 12 on my Ryzen and one on an i7 without any failures." @Jim1348 Do you have one of the early batch Ryzens? They had problems under full load on Linux, according to many websites. Peruse this thread if you want more. AMD was replacing these early batch Ryzens for Linux users if you followed the right procedure. https://community.amd.com/thread/215773?start=1815&tstart=0 |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"Do you have one of the early batch Ryzens? Problems with full loads and Linux according to many websites." No, mine is week 33; I waited until they fixed it. (And I just ordered my Ryzen 2700 two days ago, again waiting six months after the release date, even though that is only a die shrink.) |
Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
"The zips are soooooo big I keep getting them" And batch 755 has zips of 152 MB. I go away, make a coffee (including grinding the beans), drink it, return, and they still haven't finished uploading. Edit: Second cafetière and still not finished. Maybe I need to take up growing my own beans. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Ahh. I only think that I've got a problem. I was hoping to get some of those Atlantics to see what they were like. Just as well that I didn't wish too hard :) And we ran out of work a few hours ago. |
©2024 cpdn.org