climateprediction.net (CPDN) home page
Thread 'EAS batches 1001-4'

Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 70144 - Posted: 17 Jan 2024, 21:56:13 UTC

Thread for the discussion of the above batches.
ID: 70144
[AF] Kalianthys

Joined: 20 Dec 20
Posts: 13
Credit: 40,052,490
RAC: 9,149
Message 70160 - Posted: 20 Jan 2024, 8:19:15 UTC

Hello.

I have this error with task wah2_eas25_n019_200912_24_1002_012236059_0

Link: https://www.cpdn.org/result.php?resultid=22357060

<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_10.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_11.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_12.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_13.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_14.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_15.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_16.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_17.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_18.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_19.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_20.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_21.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_22.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_23.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_24.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_restart.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>



Kali
ID: 70160
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70161 - Posted: 20 Jan 2024, 9:35:52 UTC - in response to Message 70160.  

Hi,
That's not the real error. The real error is at the top of the log:

<core_client_version>7.16.20</core_client_version>
<![CDATA[
<stderr_txt>
CPDN Monitor - Quit request from BOINC...
Signal 11 received: Segment violation
Signal 11 received: Software termination signal from kill 
Signal 11 received: Abnormal termination triggered by abort call
Signal 11 received, exiting...

The stat() messages that follow just mean the output files haven't been found because the model has crashed.
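As a rough illustration of reading a task log in that order, here is a minimal Python sketch. It is not a CPDN or BOINC tool: the function name, the dictionary keys and the two regular expressions are made up for this example and are based only on the log lines quoted in this thread.

```python
import re

# Patterns taken from the log lines quoted in this thread (assumed stable).
SIGNAL_RE = re.compile(r"Signal (\d+) received")
STAT_FAILED_RE = re.compile(r"-240 \(stat\(\) failed\)")


def summarize_task_log(log_text):
    """Report crash signals first, then count the follow-on upload errors."""
    signals = sorted({int(n) for n in SIGNAL_RE.findall(log_text)})
    missing_outputs = len(STAT_FAILED_RE.findall(log_text))
    if signals:
        cause = "model crashed"
    elif missing_outputs:
        cause = "output files missing, crash reason not found in this text"
    else:
        cause = "no known error pattern found"
    return {"signals": signals, "missing_outputs": missing_outputs, "cause": cause}


sample = """\
CPDN Monitor - Quit request from BOINC...
Signal 11 received: Segment violation
Signal 11 received, exiting...
<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_10.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
"""
print(summarize_task_log(sample))
```

The point of the ordering is simply that a signal line, when present, is treated as the cause and the stat() entries as a consequence.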

Cheers, Glenn

Hello.

I have this error with task wah2_eas25_n019_200912_24_1002_012236059_0

Link: https://www.cpdn.org/result.php?resultid=22357060

<file_xfer_error>
  <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_10.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>

Kali

---
CPDN Visiting Scientist
ID: 70161
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 70162 - Posted: 20 Jan 2024, 9:40:34 UTC

The signal 11 error is a known issue with these tasks. They are particularly prone to it if interrupted, so it is best not to run them on machines that will be restarted. However, some will still fail with this error even under perfect conditions.

I am hoping the work Glenn is doing on the code will make this an issue of the past in a few batches' time, but I have no idea how much code will require rewriting, and Fortran is not a language I have ever used.
ID: 70162
rob

Joined: 5 Jun 09
Posts: 97
Credit: 3,736,855
RAC: 4,073
Message 70163 - Posted: 20 Jan 2024, 12:52:07 UTC - in response to Message 70162.  

Signal 11 is a grade one pain in the cushion polisher :-(

As for FORTRAN - well let's just say "it's different".
ID: 70163
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 70164 - Posted: 20 Jan 2024, 14:25:04 UTC - in response to Message 70163.  

Signal 11 is a grade one pain in the cushion polisher :-(

As for FORTRAN - well let's just say "it's different".
I am old enough to remember it being the dominant language for scientific work. However I was taught ALGOL at that time and never did enough computing to get past that.
ID: 70164
rob

Joined: 5 Jun 09
Posts: 97
Credit: 3,736,855
RAC: 4,073
Message 70165 - Posted: 20 Jan 2024, 14:49:25 UTC - in response to Message 70164.  

...and thus missed out (to a greater or lesser degree) on the frustration of finding that a statement was going to be one character too long, and thus one would have to work out where to split it for the extension to work correctly or try and find out how to rearrange it so it would fit onto one line.
ID: 70165
rob

Joined: 5 Jun 09
Posts: 97
Credit: 3,736,855
RAC: 4,073
Message 70166 - Posted: 20 Jan 2024, 14:56:01 UTC

OK, let's get back nearer the topic of this thread.
The dreaded signal 11 problem bit me earlier today when I suffered a short-lived power outage, and so lost several tasks that had been running quite happily for a few days. I really hope Glenn's efforts pay off soon.
ID: 70166
Paul

Joined: 14 Feb 06
Posts: 31
Credit: 4,507,116
RAC: 2,013
Message 70167 - Posted: 20 Jan 2024, 16:49:40 UTC - in response to Message 70162.  

The signal 11 error is a known issue with these tasks. They are particularly prone to it if interrupted, so it is best not to run them on machines that will be restarted. However, some will still fail with this error even under perfect conditions.

I have only one PC. It is restarted each day.

Would you prefer that I not run these tasks?
ID: 70167
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 70168 - Posted: 20 Jan 2024, 17:05:32 UTC - in response to Message 70167.  

Would you prefer that I not run these tasks?
That is above my pay(0) grade. If you restart the PC every day, you are likely to see a high failure rate with these tasks, even if you suspend them and wait long enough for all disk writes to finish before closing down BOINC, although doing that does reduce the failure rate. If you are able to suspend to RAM or to disk using the sleep/hibernate options, you should do much better.
ID: 70168
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,738,273
RAC: 62,820
Message 70169 - Posted: 20 Jan 2024, 20:48:13 UTC - in response to Message 70144.  
Last modified: 20 Jan 2024, 20:52:53 UTC

Not reporting errors. I just have some interesting data to share. I recently got a 7950X3D and was curious to see whether the additional L3 cache makes any difference. It turns out it makes a major difference.

Here is the experiment setup. I run 16 tasks on my 16C/32T CPU, assigning each task to one SMT pair. For example, logical cores 0/1 get a task, 2/3 get one, all the way to 30/31. During the period, I also had two Einstein@home WUs occupying two threads. I have a Linux script that queries `boinccmd --get_tasks` remotely, parses its output, and periodically records the WU name, elapsed time and fraction done for all CPDN tasks. It then calculates the difference for each WU and the overall progress across multiple records. The table below shows a 6-hour period.
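For anyone wanting to try something similar, a minimal Python sketch of that kind of script might look like the following. This is not the script used for the table. The field labels `name`, `fraction done` and `elapsed task time` are assumptions about what `boinccmd --get_tasks` prints, so check them against your own client's output, and the function names are made up for this example.

```python
import re
import subprocess


def fetch_tasks_text(host, password):
    """Run boinccmd against a (remote) client and return its raw text output."""
    result = subprocess.run(
        ["boinccmd", "--host", host, "--passwd", password, "--get_tasks"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_tasks(text):
    """Parse the 'key: value' task listing into {task name: {field: value}}."""
    tasks = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if re.match(r"^\d+\) -+", stripped):  # e.g. "1) -----------" starts a task
            current = {}
            continue
        key, sep, value = stripped.partition(": ")
        if current is None or not sep:
            continue
        current[key] = value
        if key == "name":
            tasks[value] = current
    return tasks


def progress_rates(before, after, prefix="wah2_"):
    """Compare two snapshots; return (name, d_fraction, d_seconds, pct/hour, hours/WU)."""
    rows = []
    for name, new in after.items():
        old = before.get(name)
        if old is None or not name.startswith(prefix):
            continue
        d_frac = float(new["fraction done"]) - float(old["fraction done"])
        d_secs = float(new["elapsed task time"]) - float(old["elapsed task time"])
        if d_frac <= 0 or d_secs <= 0:
            continue  # task restarted or not running between snapshots
        pct_per_hour = 100.0 * d_frac / (d_secs / 3600.0)
        rows.append((name, d_frac, d_secs, pct_per_hour, 100.0 / pct_per_hour))
    return rows
```

Typical use would be to call `parse_tasks(fetch_tasks_text(...))` twice, some hours apart, and pass both snapshots to `progress_rates`. Filtering on the `wah2_` name prefix is just a convenient stand-in for "all CPDN tasks".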

There are clearly two bands. Ignoring the best core in each CCD that got screwed by the E@H tasks, it's about 154 vs 184 hours/WU, almost a 20% difference. I've manually verified that the faster ones are all on CCD0 with the X3D cache, and the slower ones are on CCD1 without the additional cache. This is despite the fact that CCD1 generally runs ~100-200MHz faster from what I see in HWMonitor.

Caveats.
* With a different WU on each CPU, it's possible the variation comes from the WUs instead, but at least the name pattern doesn't suggest any obvious correlation. It would also be quite a coincidence if all the shorter WUs happened to get assigned to one CCD. It would be good to confirm whether each WU is the same amount of work. I also have data from before I did the affinity binding, when tasks could switch around, and all 16 tasks hovered around ~170 hours/WU, which further supports the idea that it's the affinity making the difference.
* I know nothing about Windows programming, so all the affinity settings are made and checked manually.
* Ideally I should stop the E@H tasks, but I'm just curious, not doing rigorous science. Good enough for me. ¯\_(ツ)_/¯
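On the affinity point, the pairing described above (task 0 on logical cores 0/1, task 1 on 2/3, and so on) can be written down as bitmasks, which is the form affinity settings usually take. The following is only a sketch of the arithmetic. The helper name is hypothetical, and it assumes logical CPUs 2i and 2i+1 are the two SMT siblings of core i, as in the numbering above.

```python
def smt_pair_mask(task_index, threads_per_core=2):
    """Bitmask selecting the logical CPUs of one physical core for one task."""
    first_cpu = task_index * threads_per_core
    mask = 0
    for cpu in range(first_cpu, first_cpu + threads_per_core):
        mask |= 1 << cpu  # set the bit for this logical CPU
    return mask


# One mask per task for 16 tasks on 32 logical CPUs.
masks = [smt_pair_mask(i) for i in range(16)]
for i, mask in enumerate(masks):
    print(f"task {i:2d}: logical CPUs {2 * i}/{2 * i + 1} -> mask 0x{mask:08X}")
```

Each mask has exactly two bits set, and together the 16 masks cover all 32 logical CPUs without overlap.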

name                                          fraction    elapsed    pct / hour    hour / WU
------------------------------------------  ----------  ---------  ------------  -----------
wah2_eas25_g38s_201712_24_1003_012246266_0    0.032948    21600.4      0.549124      182.108
wah2_eas25_g3a1_201712_24_1003_012246311_0    0.03912     21600.4      0.651989      153.377
wah2_eas25_g4ls_202012_24_1003_012248030_1    0.039059    21600.4      0.650972      153.616
wah2_eas25_a296_201412_24_1004_012251032_0    0.038575    21600.4      0.642905      155.544
wah2_eas25_h03o_200912_24_1001_012230098_1    0.032759    21600.4      0.545974      183.159
wah2_eas25_n27f_201412_24_1002_012238873_1    0.032627    21600.4      0.543774      183.9
wah2_eas25_a31q_201612_24_1004_012252060_1    0.032199    21600.4      0.536641      186.344
wah2_eas25_n48f_201912_24_1002_012241501_1    0.038752    21600.4      0.645855      154.833
wah2_eas25_n2rm_201612_24_1002_012239600_1    0.039015    21600.4      0.650239      153.79
wah2_eas25_n06f_200912_24_1002_012236245_1    0.031947    21600.4      0.532441      187.814
wah2_eas25_g2n8_201512_24_1003_012245490_0    0.030369    21600.4      0.506141      197.573
wah2_eas25_h3mc_201812_24_1001_012234658_0    0.029541    21600.4      0.492341      203.111
wah2_eas25_g17p_201212_24_1003_012243635_0    0.032622    21600.4      0.543691      183.928
wah2_eas25_h02p_200912_24_1001_012230063_1    0.033273    21600.4      0.55454       180.33
wah2_eas25_g3ap_201712_24_1003_012246335_1    0.039008    21600.4      0.650122      153.817
wah2_eas25_g4ad_202012_24_1003_012247619_0    0.039865    21600.4      0.664405      150.511
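To make the two bands easier to see, here is a small Python sketch that splits the "hour / WU" column of the table at 170 hours/WU and compares the medians. The 170 threshold is an assumption borrowed from the unpinned average mentioned in the caveats, and the table itself does not record which CCD each task ran on, so this only shows the split in the numbers, not the CCD assignment.

```python
from statistics import median

# "hour / WU" column copied from the table above, in row order.
hours_per_wu = [
    182.108, 153.377, 153.616, 155.544, 183.159, 183.9, 186.344, 154.833,
    153.79, 187.814, 197.573, 203.111, 183.928, 180.33, 153.817, 150.511,
]

THRESHOLD = 170.0  # assumed dividing line between the two bands

fast = [h for h in hours_per_wu if h < THRESHOLD]
slow = [h for h in hours_per_wu if h >= THRESHOLD]

fast_median = median(fast)
slow_median = median(slow)
ratio = slow_median / fast_median

print(f"fast band: {len(fast)} tasks, median {fast_median:.1f} hours/WU")
print(f"slow band: {len(slow)} tasks, median {slow_median:.1f} hours/WU")
print(f"slow band takes {100 * (ratio - 1):.1f}% longer per WU")
```

Using medians rather than means keeps the two slowest rows, which may be the ones sharing a core with the E@H tasks, from skewing the comparison.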
ID: 70169
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 70170 - Posted: 20 Jan 2024, 21:46:31 UTC - in response to Message 70169.  

Yes, level 3 cache makes a difference. Also, my experience is that maximum throughput of work comes from running on N-1 real cores, where N is the number of real cores present. Also, within a batch, the time taken on my machine is always within 1 or 2 percent, so I think you can trust your results.

I don't run Windows natively so current tasks are either in a VM or using WINE. If I run them using a VM I get a 20% drop in performance compared to using WINE.
ID: 70170
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 41,738,273
RAC: 62,820
Message 70171 - Posted: 20 Jan 2024, 22:58:42 UTC - in response to Message 70170.  

Also, my experience is maximum throughput of work is running on N-1 real cores where N is the number of real cores present.

I've also been playing with the number of tasks, and in my case increasing from 16 to 20 can actually eke out another 4-8% total performance. However, it's very clear the system is starved of memory bandwidth at 20 tasks (dual channel DDR5-6000). The desktop becomes sluggish and the E@H tasks suffer a 50% regression. Such a performance problem doesn't happen with other projects, even when I fill all 32 threads. I no longer have my old 5950X + DDR4-3200 setup, but back then 12 wah tasks were enough to make performance suffer, though I didn't measure the task progress like this time.

I feel the optimal number of tasks could be very system dependent. Generally, SMT is rather useless for CPDN workloads. The memory subsystem (cache + DDR) has a far greater impact than in many other BOINC projects. (I know these are well-known. Just getting and seeing the data myself is interesting. :-P )
ID: 70171
Paul

Joined: 14 Feb 06
Posts: 31
Credit: 4,507,116
RAC: 2,013
Message 70173 - Posted: 21 Jan 2024, 8:52:28 UTC - in response to Message 70168.  

You are likely to experience a high failure rate with them doing that even if you do suspend the tasks and wait long enough to ensure all disk writes are finished before closing down BOINC. Doing that does reduce the failure rate.

That seems a bit strange, at least to me. 🙂 I would have thought that as long as the last checkpoint was successful, that would have saved everything and that is where the task would start from next time??

Thanks!
ID: 70173
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 70174 - Posted: 21 Jan 2024, 9:25:13 UTC - in response to Message 70173.  

I would have thought that as long as the last checkpoint was successful, that would have saved everything and that is where the task would start from next time??


I hope that Glenn's code wrestling will resolve this. It has been an issue with the project since the days of tasks that would take months to complete with the slow processors of the time. Of course, the Met Office code was written for mainframe computers that always ran 24/7 except when there was a problem that required shutting down.
ID: 70174
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70175 - Posted: 21 Jan 2024, 19:06:42 UTC - in response to Message 70164.  

However I was taught ALGOL


I started with assembler for the IBM 704 computer (5000 vacuum tubes, IIRC) and then used the original FORTRAN for it for some things.
When Illinois-ALCOR Algol 60 came out, I really liked it for mathematical work, and SNOBOL4 and SPITBOL for text type problems. I even wrote a compiler for a special-purpose language in SPITBOL.

I seldom write programs these days. I am most at home with C and C++ currently.
ID: 70175
Glenn Carver

Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 70176 - Posted: 21 Jan 2024, 20:23:53 UTC - in response to Message 70173.  

If any process is quit, it will not completely die until it has closed its open files. I think what Dave might be referring to is flushing any I/O buffers held in memory to a hardware drive. Although code can 'tell' the OS it wants that done, the final decision is still made by the OS. Both the CPDN task and BOINC have a wait time built into the code to allow any buffers to be flushed, but again it can't be forced.

Just to be clear, this segv problem with these tasks is nothing to do with the files produced by the model - so don't waste time waiting for the OS to do its thing. It's a memory issue related to the model starting up, not reading the files.

You are likely to experience a high failure rate with them doing that even if you do suspend the tasks and wait long enough to ensure all disk writes are finished before closing down BOINC. Doing that does reduce the failure rate.

That seems a bit strange, at least to me. 🙂 I would have thought that as long as the last checkpoint was successful, that would have saved everything and that is where the task would start from next time??

Thanks!

---
CPDN Visiting Scientist
ID: 70176
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 70177 - Posted: 22 Jan 2024, 0:25:35 UTC - in response to Message 70176.  

If any process is quit, it will not completely die until it has closed its open files. I think what Dave might be referring to is flushing any I/O buffers held in memory to a hardware drive. Although code can 'tell' the OS it wants that done, the final decision is still made by the OS. Both the CPDN task and BOINC have a wait time built into the code to allow any buffers to be flushed, but again it can't be forced.


In "modern" versions of Linux, and perhaps other versions of UNIX, you can greatly increase the chances that I/O buffers are actually written to disk (at least to the input buffer of the drive itself) by calling the fsync() system call.

https://www.man7.org/linux/man-pages/man2/fsync.2.html

Here is part of the description.

DESCRIPTION

       fsync() transfers ("flushes") all modified in-core data of (i.e.,
       modified buffer cache pages for) the file referred to by the file
       descriptor fd to the disk device (or other permanent storage
       device) so that all changed information can be retrieved even if
       the system crashes or is rebooted.  This includes writing through
       or flushing a disk cache if present.  The call blocks until the
       device reports that the transfer has completed.

       As well as flushing the file data, fsync() also flushes the
       metadata information associated with the file (see inode(7)).

       Calling fsync() does not necessarily ensure that the entry in the
       directory containing the file has also reached disk.  For that an
       explicit fsync() on a file descriptor for the directory is also
       needed.
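To illustrate the two points in that description (fsync() on the file, then a separate fsync() on the containing directory), here is a minimal Python sketch using `os.fsync`. It only illustrates the man page, it is not how the CPDN tasks or BOINC write their files, and the function name is made up for this example. The directory step is attempted only on POSIX systems.

```python
import os


def durable_write(path, data):
    """Write bytes to path and ask the OS to push them to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # move Python's own buffer into the OS
        os.fsync(f.fileno())   # ask the OS to flush file data and metadata

    if os.name == "posix":
        # The directory entry is flushed separately, per the man page.
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

For example, `durable_write("checkpoint.bin", b"state")` returns only after both calls have completed, which is the "blocks until the device reports that the transfer has completed" behaviour described above.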

ID: 70177
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,014,785
RAC: 20,946
Message 70178 - Posted: 22 Jan 2024, 7:42:00 UTC

I see the faster machines have started to return results. Still over 2 days till the first of mine finish but it does mean the number of tasks waiting to be sent from these batches is going down a little faster now.
ID: 70178
wateroakley

Joined: 6 Aug 04
Posts: 195
Credit: 28,341,652
RAC: 10,508
Message 70179 - Posted: 22 Jan 2024, 12:05:27 UTC - in response to Message 70178.  

I see the faster machines have started to return results. Still over 2 days till the first of mine finish but it does mean the number of tasks waiting to be sent from these batches is going down a little faster now.
Two more days for the first tasks to complete here. They survived Storm Isha brown-outs yesterday evening. We have planned power outages tomorrow, for connecting the solar panels, battery and a new charger. So fingers crossed that the WAH models behave after 'hibernate'.
ID: 70179

©2024 cpdn.org