Thread 'EAS batches 1001-4'

Author	Message
Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946	Message 70144 - Posted: 17 Jan 2024, 21:56:13 UTC Thread for the discussion of the above batches. ID: 70144 · Reply Quote

[AF] Kalianthys Send message Joined: 20 Dec 20 Posts: 13 Credit: 40,052,490 RAC: 9,149	Message 70160 - Posted: 20 Jan 2024, 8:19:15 UTC Hello. I have this error with task wah2_eas25_n019_200912_24_1002_012236059_0 Link: https://www.cpdn.org/result.php?resultid=22357060
<file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_10.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_11.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_12.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_13.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_14.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_15.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_16.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_17.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_18.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_19.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_20.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_21.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_22.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_23.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_24.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
<file_xfer_error> <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_restart.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error>
Kali ID: 70160 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 70161 - Posted: 20 Jan 2024, 9:35:52 UTC - in response to Message 70160. Hi, That's not the real error. The real error is at the top of the log:
<core_client_version>7.16.20</core_client_version>
<![CDATA[
<stderr_txt>
CPDN Monitor - Quit request from BOINC...
Signal 11 received: Segment violation
Signal 11 received: Software termination signal from kill
Signal 11 received: Abnormal termination triggered by abort call
Signal 11 received, exiting...
The messages about stat() that follow it just mean the output files haven't been found because the model has crashed. Cheers, Glenn Hello. I have this error with task wah2_eas25_n019_200912_24_1002_012236059_0 Link: https://www.cpdn.org/result.php?resultid=22357060 <file_name>wah2_eas25_n019_200912_24_1002_012236059_0_r691533224_10.zip</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error> Kali --- CPDN Visiting Scientist ID: 70161 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946	Message 70162 - Posted: 20 Jan 2024, 9:40:34 UTC The signal 11 error is a known issue with these tasks. They are particularly prone to it if interrupted so best not to run them on machines that will be restarted. Some will still fail with this error even under perfect conditions however. I am hoping the work Glenn is doing on the code will render this an issue of the past in a few batches' time, but I have no idea how much code will require rewriting, and Fortran is not a language I have ever used. ID: 70162 · Reply Quote

rob Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073	Message 70163 - Posted: 20 Jan 2024, 12:52:07 UTC - in response to Message 70162. Signal 11 is a grade one pain in the cushion polisher :-( As for FORTRAN - well let's just say "it's different". ID: 70163 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946	Message 70164 - Posted: 20 Jan 2024, 14:25:04 UTC - in response to Message 70163. Signal 11 is a grade one pain in the cushion polisher :-( As for FORTRAN - well let's just say "it's different". I am old enough to remember it being the dominant language for scientific work. However I was taught ALGOL at that time and never did enough computing to get past that. ID: 70164 · Reply Quote

rob Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073	Message 70165 - Posted: 20 Jan 2024, 14:49:25 UTC - in response to Message 70164. ...and thus missed out (to a greater or lesser degree) on the frustration of finding that a statement was going to be one character too long, and thus one would have to work out where to split it for the extension to work correctly or try and find out how to rearrange it so it would fit onto one line. ID: 70165 · Reply Quote

rob Send message Joined: 5 Jun 09 Posts: 97 Credit: 3,736,855 RAC: 4,073	Message 70166 - Posted: 20 Jan 2024, 14:56:01 UTC OK, let's get back nearer the topic of this thread. The dreaded signal 11 problem bit me earlier today when I suffered a short-lived power outage, and so lost several tasks that had been running quite happily for a few days. I really hope Glenn's efforts pay off soon. ID: 70166 · Reply Quote

Paul Send message Joined: 14 Feb 06 Posts: 31 Credit: 4,507,116 RAC: 2,013	Message 70167 - Posted: 20 Jan 2024, 16:49:40 UTC - in response to Message 70162. The signal 11 error is a known issue with these tasks. They are particularly prone to it if interrupted so best not to run them on machines that will be restarted. Some will still fail with this error even under perfect conditions however. I have only one PC. It is restarted each day. Would you prefer that I not run these tasks? ID: 70167 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946	Message 70168 - Posted: 20 Jan 2024, 17:05:32 UTC - in response to Message 70167. Would you prefer that I not run these tasks? That is above my pay(0) grade. You are likely to experience a high failure rate with them doing that even if you do suspend the tasks and wait long enough to ensure all disk writes are finished before closing down BOINC. Doing that does reduce the failure rate. If you are able to suspend to RAM or to disk using the sleep/hibernate options you should do much better. ID: 70168 · Reply Quote

wujj123456 Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,738,273 RAC: 62,820	Message 70169 - Posted: 20 Jan 2024, 20:48:13 UTC - in response to Message 70144. Last modified: 20 Jan 2024, 20:52:53 UTC Not reporting errors. I just have some interesting data to share. I recently got a 7950X3D and was curious to see if the additional L3 makes any difference. It turns out it makes a major difference.

Here is the experiment setup. I run 16 tasks on my 16C/32T CPU, assigning each task to one SMT pair. For example, logical cores 0/1 get a task, 2/3 get one, all the way to 30/31. During the period, I also had two Einstein@home WUs occupying two threads. I have a Linux script that queries `boinccmd --get_tasks` remotely, parses its output, and periodically records WU name, elapsed time and fraction done for all CPDN tasks. Then it calculates the difference for each WU and overall progress across multiple records.

The table below shows a 6-hour period. There are clearly two bands. Ignoring the best core in each CCD that got screwed by the E@H tasks, it's about 154 vs 184 hours/WU, almost 20% difference. I've manually verified that the faster ones are all on CCD0 with X3D cache, and the slower ones are on CCD1 without additional cache. This is despite the fact that CCD1 generally runs ~100-200MHz faster from what I see in HWMonitor.

Caveats.
* With a different WU on each CPU, it's possible the variation comes from the WUs instead, but at least the name pattern doesn't suggest any obvious correlation. It would also be quite a coincidence if all shorter WUs happened to get assigned to one CCD. Would be good to confirm whether it's the same work for each WU. I also have data from before I did the affinity binding, when tasks could switch around, and all 16 tasks hovered around ~170 hours/WU, further proving it's the affinity making the difference.
* I know nothing about Windows programming, so all the affinity settings were done manually and checked manually.
* Ideally I should stop the E@H tasks, but I'm just curious, not doing rigorous science. Good enough for me. ¯\_(ツ)_/¯

name                                          fraction    elapsed    pct / hour    hour / WU
------------------------------------------  ----------  ---------  ------------  -----------
wah2_eas25_g38s_201712_24_1003_012246266_0    0.032948    21600.4      0.549124      182.108
wah2_eas25_g3a1_201712_24_1003_012246311_0     0.03912    21600.4      0.651989      153.377
wah2_eas25_g4ls_202012_24_1003_012248030_1    0.039059    21600.4      0.650972      153.616
wah2_eas25_a296_201412_24_1004_012251032_0    0.038575    21600.4      0.642905      155.544
wah2_eas25_h03o_200912_24_1001_012230098_1    0.032759    21600.4      0.545974      183.159
wah2_eas25_n27f_201412_24_1002_012238873_1    0.032627    21600.4      0.543774        183.9
wah2_eas25_a31q_201612_24_1004_012252060_1    0.032199    21600.4      0.536641      186.344
wah2_eas25_n48f_201912_24_1002_012241501_1    0.038752    21600.4      0.645855      154.833
wah2_eas25_n2rm_201612_24_1002_012239600_1    0.039015    21600.4      0.650239       153.79
wah2_eas25_n06f_200912_24_1002_012236245_1    0.031947    21600.4      0.532441      187.814
wah2_eas25_g2n8_201512_24_1003_012245490_0    0.030369    21600.4      0.506141      197.573
wah2_eas25_h3mc_201812_24_1001_012234658_0    0.029541    21600.4      0.492341      203.111
wah2_eas25_g17p_201212_24_1003_012243635_0    0.032622    21600.4      0.543691      183.928
wah2_eas25_h02p_200912_24_1001_012230063_1    0.033273    21600.4       0.55454       180.33
wah2_eas25_g3ap_201712_24_1003_012246335_1    0.039008    21600.4      0.650122      153.817
wah2_eas25_g4ad_202012_24_1003_012247619_0    0.039865    21600.4      0.664405      150.511

ID: 70169 · Reply Quote
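
[Editor's sketch] The arithmetic behind the "pct / hour" and "hour / WU" columns in the post above can be reproduced with a few lines of Python. This is not the poster's script, which was not shared; the `Sample` record, `progress_rate` and `smt_pair` are hypothetical names chosen for illustration. It assumes only what the post states: each record holds elapsed time and fraction done, rates come from the difference between two records, and task n was bound to logical cores 2n and 2n+1.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One periodic record for a task (hypothetical structure, not the poster's)."""
    elapsed_s: float  # elapsed run time in seconds
    fraction: float   # fraction done, between 0.0 and 1.0


def progress_rate(first: Sample, last: Sample) -> tuple[float, float]:
    """Return (percent per hour, hours per WU) between two records."""
    hours = (last.elapsed_s - first.elapsed_s) / 3600.0
    pct = (last.fraction - first.fraction) * 100.0
    if hours <= 0 or pct <= 0:
        raise ValueError("need a later record that shows more progress")
    pct_per_hour = pct / hours
    return pct_per_hour, 100.0 / pct_per_hour


def smt_pair(task_index: int) -> tuple[int, int]:
    """Logical cores for the n-th task as described: 0 -> (0, 1), 1 -> (2, 3), ..."""
    return 2 * task_index, 2 * task_index + 1


# Differences copied from two rows of the table (fraction gained over 21600.4 s).
start = Sample(elapsed_s=0.0, fraction=0.0)
slow = progress_rate(start, Sample(21600.4, 0.032948))
fast = progress_rate(start, Sample(21600.4, 0.039120))

print(f"slow: {slow[0]:.3f} %/h, {slow[1]:.1f} h/WU")
print(f"fast: {fast[0]:.3f} %/h, {fast[1]:.1f} h/WU")
print(f"slow task needs {slow[1] / fast[1] - 1:.1%} more time per WU")
print("last of 16 tasks runs on logical cores", smt_pair(15))
```

Working from differences between records rather than totals means the result reflects only the measured window, which is why every row in the table shares the same elapsed value.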

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946	Message 70170 - Posted: 20 Jan 2024, 21:46:31 UTC - in response to Message 70169. Yes, level 3 cache makes a difference. Also, my experience is maximum throughput of work is running on N-1 real cores where N is the number of real cores present. Also, within a batch, the time taken on my machine is always within 1 or 2 percent, so I think you can trust your results. I don't run Windows natively, so current tasks are either in a VM or using WINE. If I run them in a VM I get a 20% drop in performance compared to using WINE. ID: 70170 · Reply Quote

wujj123456 Send message Joined: 14 Sep 08 Posts: 127 Credit: 41,738,273 RAC: 62,820	Message 70171 - Posted: 20 Jan 2024, 22:58:42 UTC - in response to Message 70170. Also, my experience is maximum throughput of work is running on N-1 real cores where N is the number of real cores present. I've been playing with the number of tasks too, and in my case increasing from 16 to 20 actually ekes out another 4-8% total performance. However, it's very clear the system is starved for memory bandwidth at 20 tasks (dual channel DDR5-6000). The desktop becomes sluggish and the E@H tasks suffer a 50% regression. Such performance problems don't happen with other projects even when I fill all 32 threads. I no longer have my old 5950X + DDR4 3200 setup, and back then 12 wah tasks were enough to make performance suffer, though I didn't measure the task progress like this time. I feel the optimal number of tasks could be very system dependent. Generally, SMT is rather useless for CPDN workloads. The memory subsystem (cache + DDR) has a far greater impact than in many other BOINC projects. (I know these are well-known. Just getting and seeing the data myself is interesting. :-P ) ID: 70171 · Reply Quote

Paul Send message Joined: 14 Feb 06 Posts: 31 Credit: 4,507,116 RAC: 2,013	Message 70173 - Posted: 21 Jan 2024, 8:52:28 UTC - in response to Message 70168. You are likely to experience a high failure rate with them doing that even if you do suspend the tasks and wait long enough to ensure all disk writes are finished before closing down BOINC. Doing that does reduce the failure rate. That seems a bit strange, at least to me. 🙂 I would have thought that as long as the last checkpoint was successful, that would have saved everything and that is where the task would start from next time?? Thanks! ID: 70173 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946	Message 70174 - Posted: 21 Jan 2024, 9:25:13 UTC - in response to Message 70173. I would have thought that as long as the last checkpoint was successful, that would have saved everything and that is where the task would start from next time?? I hope that Glenn's code wrestling will resolve this. It has been an issue with the project since the days of tasks that would take months to complete with the slow processors of the time. Of course the Met Office code was written for mainframe computers that always ran 24/7 except when there was a problem that required shutting down. ID: 70174 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 70175 - Posted: 21 Jan 2024, 19:06:42 UTC - in response to Message 70164. However I was taught ALGOL I started with assembler for the IBM 704 computer (5000 vacuum tubes, IIRC) and then used the original FORTRAN for it for some things. When Illinois-ALCOR Algol 60 came out, I really liked it for mathematical work, and SNOBOL4 and SPITBOL for text type problems. I even wrote a compiler for a special-purpose language in SPITBOL. I seldom write programs these days. I am most at home with C and C++ currently. ID: 70175 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331	Message 70176 - Posted: 21 Jan 2024, 20:23:53 UTC - in response to Message 70173. If any process is quit, it will not completely die until it has closed its open files. I think what Dave might be referring to is flushing any I/O buffers held in memory to a hardware drive. Although code can 'tell' the OS it wants that done, the final decision is still made by the OS. Both the CPDN task & BOINC have a wait time built into the code to allow any buffers to be flushed, but again it can't be forced. Just to be clear, this segv problem with these tasks has nothing to do with the files produced by the model, so don't waste time waiting for the OS to do its thing. It's a memory issue related to the model starting up, not reading the files. You are likely to experience a high failure rate with them doing that even if you do suspend the tasks and wait long enough to ensure all disk writes are finished before closing down BOINC. Doing that does reduce the failure rate. That seems a bit strange, at least to me. 🙂 I would have thought that as long as the last checkpoint was successful, that would have saved everything and that is where the task would start from next time?? Thanks! --- CPDN Visiting Scientist ID: 70176 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 70177 - Posted: 22 Jan 2024, 0:25:35 UTC - in response to Message 70176. If any process is quit, it will not completely die until it has closed its open files. I think what Dave might be referring to is flushing any I/O buffers held in memory to a hardware drive. Although code can 'tell' the OS it wants that done, the final decision is still made by the OS. Both the CPDN task & BOINC have a wait time built into the code to allow any buffers to be flushed, but again it can't be forced. In "modern" versions of Linux, and perhaps other versions of UNIX, you can greatly increase the chances that IO buffers are actually written to disk (at least to the input buffer of the drive itself) by calling the fsync() system call. https://www.man7.org/linux/man-pages/man2/fsync.2.html Here is part of the description. DESCRIPTION fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7)). Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed. ID: 70177 · Reply Quote
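
[Editor's sketch] The two-step flush described in the man page excerpt above (fsync() on the file, then fsync() on its directory) looks roughly like this in Python, whose `os.fsync` wraps the same system call. This is a generic illustration only and makes no claim about what the CPDN or BOINC code actually does; `durable_write` and the file name `checkpoint.bin` are hypothetical. As noted earlier in the thread, the signal 11 failures are reported to be a memory issue at model start-up, so a flush like this would not be expected to prevent them.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data, then ask the OS to push it through to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # Python's user-space buffer -> kernel
        os.fsync(f.fileno())  # kernel buffers -> device; blocks until reported done

    # Per the man page, the directory entry needs its own fsync().
    # Opening a directory this way works on POSIX systems, not on Windows.
    if os.name == "posix":
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


# Hypothetical checkpoint file, written to a throwaway directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "checkpoint.bin")
    durable_write(target, b"model state")
    with open(target, "rb") as f:
        print(f.read())
```

The `flush()` before `os.fsync()` matters because fsync() only sees data that has already left the program's own buffer.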

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,014,785 RAC: 20,946	Message 70178 - Posted: 22 Jan 2024, 7:42:00 UTC I see the faster machines have started to return results. Still over 2 days till the first of mine finish but it does mean the number of tasks waiting to be sent from these batches is going down a little faster now. ID: 70178 · Reply Quote

wateroakley Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,341,652 RAC: 10,508	Message 70179 - Posted: 22 Jan 2024, 12:05:27 UTC - in response to Message 70178. I see the faster machines have started to return results. Still over 2 days till the first of mine finish but it does mean the number of tasks waiting to be sent from these batches is going down a little faster now. Two more days for the first tasks to complete here. They survived Storm Isha brown-outs yesterday evening. We have planned power outages tomorrow, for connecting the solar panels, battery and a new charger. So fingers crossed that the WAH models behave after 'hibernate'. ID: 70179 · Reply Quote