climateprediction.net (CPDN) home page
Thread 'Batch 742'

Message boards : Number crunching : Batch 742

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 58879 - Posted: 20 Oct 2018, 16:37:04 UTC

Is there a problem with batch 742? They are wah2_sam25’s. In the last 3 days I have tried to run three of them and each has failed in less than 10 minutes.

The stderr on the latest is:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
Signal 11 received: Segment violation
Signal 11 received: Software termination signal from kill
Signal 11 received: Abnormal termination triggered by abort call
Signal 11 received, exiting...
10:55:27 (5196): called boinc_finish(193)
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=5872, iMonCtr=2
Model crash detected, will try to restart...
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=8428, iMonCtr=2
Leaving CPDN_ain::Monitor...
10:55:31 (5872): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_1.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_2.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_3.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_4.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_5.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_6.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_7.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_8.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_9.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_10.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_11.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_12.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_13.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_restart.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
No trickles!

Are others having this problem?
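
For anyone comparing reports, here is a minimal Python sketch that pulls the key facts out of a task report like the one above: the signal numbers, the boinc_finish exit codes, and the uploads that failed. The function name and the shortened sample text are made up for illustration. My guess is that the -240 stat() failures just mean the zips were never written because the model crashed first, but that is an assumption, not something the log confirms.

```python
import re

# Shortened, hypothetical excerpt of a task report; paste the full text in practice.
SAMPLE_REPORT = """\
Signal 11 received: Segment violation
10:55:27 (5196): called boinc_finish(193)
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_1.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sam25_s8pt_200512_13_742_011582363_2_r1898366273_restart.zip</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
"""


def summarize_report(text):
    """Return (signals, exit_codes, failed_uploads) found in a task report."""
    # Distinct signal numbers, e.g. 11 for a segmentation fault.
    signals = sorted({int(n) for n in re.findall(r"Signal (\d+) received", text)})
    # Every exit code passed to boinc_finish, in the order logged.
    exit_codes = [int(n) for n in re.findall(r"called boinc_finish\((-?\d+)\)", text)]
    # (file name, numeric error code) for each failed upload entry.
    failed = re.findall(
        r"<file_name>([^<]+)</file_name>\s*<error_code>(-?\d+)", text
    )
    return signals, exit_codes, [(name, int(code)) for name, code in failed]


signals, exit_codes, failed_uploads = summarize_report(SAMPLE_REPORT)
print(signals, exit_codes, len(failed_uploads))
```

Run against the full report, the same function should list all 14 failed files (13 numbered zips plus the restart zip).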

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 58880 - Posted: 20 Oct 2018, 17:39:48 UTC - in response to Message 58879.  

Yes, a big problem. Mine all fail after about three minutes, or are aborted upon receipt (especially if they have already failed twice).

Your choice.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58882 - Posted: 20 Oct 2018, 19:35:04 UTC

I have 2 problems with that batch:

The zips are soooooo big
I keep getting them

So far, I've only had one failure. That was very close to the end, and was a FORTRAN error. (I think that it may be the only one of those that I've had.)

I get them at all stages, including some that are on their first run.
At present I have 5 of them.
When will they stop? :(

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 58884 - Posted: 20 Oct 2018, 22:05:11 UTC

I've finished 12 on my Ryzen and one on an i7 without any failures.

I've noted many, many failures on other PCs with seg faults. I'm running mine without hyperthreading turned on. I wonder if that has anything to do with it. They are big models that hit memory and cache harder than most of the other regions. Then again, I just may have been fortunate.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,620,508
RAC: 4,981
Message 58885 - Posted: 21 Oct 2018, 12:02:22 UTC - in response to Message 58884.  

I've had 5 of these, 4 finished successfully and the 5th is at 56%. I run them on Win7, boinc 7.8.3 (x64), i5-2520M, with hyper-threading on and it takes around 18 days.

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 58886 - Posted: 21 Oct 2018, 13:40:33 UTC

Thanks guys. It’s nice to know it’s not just me. I just upgraded to Win10 from 7 after replacing a bad HDD. Two of the failed WUs were _2’s, so they had failed twice before. The other was a _0. They may be testing extreme conditions.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 58887 - Posted: 21 Oct 2018, 16:02:01 UTC - in response to Message 58884.  
Last modified: 21 Oct 2018, 16:03:41 UTC

> I've finished 12 on my Ryzen and one on an i7 without any failures.
>
> I've noted many, many failures on other PCs with seg faults. I'm running mine without hyperthreading turned on. I wonder if that has anything to do with it. They are big models that hit memory and cache harder than most of the other regions. Then again, I just may have been fortunate.

I think there is something to it, at least for the Ryzens. I have seen a number of cases where I can complete a work unit on my Intel chips (i7-3770, i7-4790) with hyper-threading enabled that fails on the AMD chips.
https://www.cpdn.org/cpdnboinc/workunit.php?wuid=11631965

I have a Ryzen 1700 on Ubuntu 18.04, and have found that when running Rosetta (for example), I get basically no errors with SMT off in the BIOS, but about one in ten errors with it enabled.
Maybe the compilers are more optimized for Intel?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58890 - Posted: 21 Oct 2018, 20:16:51 UTC

For the models that are failing near the start, it's possible that those computers are running too many climate models at the same time.

I think the early failures were reported to be near the change from global to regional models.
The sam's have a very big regional model, so the circumference is very big, and will have a lot of data points.
Which means a lot of storage area, and if there are more models than the processor can allocate space for (stack space?), then some will fail because they try to write data into prohibited areas.

(On the supercomputers the models were written for, one computer, one model.
And the translation to desktops would presumably be for one model per processor core.)

Perhaps someone with a high failure rate could try running with no more than one model per core, and see how they go.
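
A rough sketch of one way to try this, assuming a BOINC client recent enough to read an app_config.xml containing the project_max_concurrent option (check the client documentation for your version). The function name, and the idea of typing in the physical core count by hand, are only for illustration and are not anything CPDN provides.

```python
import xml.etree.ElementTree as ET


def build_app_config(physical_cores, models_per_core=1):
    """Return app_config.xml text capping concurrent tasks for one project.

    physical_cores is supplied by the user (an assumption of this sketch),
    because the core count the OS reports includes hyper-threads.
    """
    # Never go below one task, even if a bad value is passed in.
    limit = max(1, physical_cores * models_per_core)
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(limit)
    return ET.tostring(root, encoding="unicode")


# Example: a quad-core machine with hyper-threading (8 logical CPUs).
print(build_app_config(physical_cores=4))
```

As I understand it, the output would be saved as app_config.xml in the CPDN project folder inside the BOINC data directory, and the client then told to re-read its config files. Treat that as something to verify rather than a tested recipe.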

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,803,756
RAC: 5,187
Message 58891 - Posted: 21 Oct 2018, 20:48:40 UTC

Because I benchmark the first two models I receive from each batch, I rarely run more than one model on a PC at a time. Even in that single-model mode I have had numerous Batch #742 failures, and a few successes.

On two machines: 2/2 successes (Windows 7) and 1/20 successes (Windows 10).

(And, as Les says, 13 x 95 MB + 1 x 115 MB is big.)
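
To put a number on that, here is a small Python sketch of the arithmetic. The sizes are the ones quoted above; treating the 115 MB file as the restart zip, and the 1 Mbit/s uplink used in the example, are assumptions for illustration only.

```python
def batch_upload_mb(n_zips=13, zip_mb=95, restart_mb=115):
    """Total upload per model in MB: the numbered zips plus one larger zip."""
    return n_zips * zip_mb + restart_mb


def upload_hours(total_mb, uplink_mbit_per_s):
    """Hours needed to upload total_mb over a link of the given speed.

    Ignores protocol overhead and retries, so real uploads take longer.
    """
    seconds = total_mb * 8 / uplink_mbit_per_s  # megabytes to megabits
    return seconds / 3600


total = batch_upload_mb()
print(total)                      # 1350 (MB per model)
print(upload_hours(total, 1.0))   # 3.0 (hours at an assumed 1 Mbit/s)
```

So each completed model sends back about 1.35 GB, which on a slow uplink is hours of uploading per model.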

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 58894 - Posted: 22 Oct 2018, 1:57:01 UTC - in response to Message 58887.  

>> I've finished 12 on my Ryzen and one on an i7 without any failures.
>>
>> I've noted many, many failures on other PCs with seg faults. I'm running mine without hyperthreading turned on. I wonder if that has anything to do with it. They are big models that hit memory and cache harder than most of the other regions. Then again, I just may have been fortunate.
>
> I think there is something to it, at least for the Ryzens. I have seen a number of cases where I can complete a work unit on my Intel chips (i7-3770, i7-4790) with hyper-threading enabled that fails on the AMD chips.
> https://www.cpdn.org/cpdnboinc/workunit.php?wuid=11631965
>
> I have a Ryzen 1700 on Ubuntu 18.04, and have found that when running Rosetta (for example), I get basically no errors with SMT off in the BIOS, but about one in ten errors with it enabled.
> Maybe the compilers are more optimized for Intel?

@Jim1348

Do you have one of the early batch Ryzens? Problems with full loads and Linux according to many websites. Peruse this thread if you want more. AMD was replacing these early batch Ryzens for Linux users if you followed the right procedure.

https://community.amd.com/thread/215773?start=1815&tstart=0

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 58895 - Posted: 22 Oct 2018, 6:10:58 UTC - in response to Message 58894.  

> Do you have one of the early batch Ryzens? Problems with full loads and Linux according to many websites.

No, mine is week 33; I waited until they fixed it.
(And I just ordered my Ryzen 2700 two days ago, again waiting six months after the release date, even though that is only a die shrink.)

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4538
Credit: 19,002,360
RAC: 21,497
Message 58902 - Posted: 23 Oct 2018, 7:06:27 UTC
Last modified: 23 Oct 2018, 7:18:17 UTC

> The zips are soooooo big
> I keep getting them

And batch 755 has zips of 152 MB. I go away, make a coffee (including grinding the beans), drink it, return, and they still haven't finished uploading.

Edit: Second cafetière and still not finished. Maybe I need to take up growing my own beans.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58903 - Posted: 23 Oct 2018, 7:28:33 UTC

Ahh. I only think that I've got a problem.

I was hoping to get some of those Atlantics to see what they were like. Just as well that I didn't wish too hard :)

And we ran out of work a few hours ago.

©2024 cpdn.org