OpenIFS Discussion

Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68333 - Posted: 15 Feb 2023, 13:45:06 UTC - in response to Message 68332.  

Could the push notifications be used to ask users to limit the number of tasks they run concurrently?


How would that be worded? I am running 12 BOINC tasks at a time, of which 5 right now are _bl, and they all run just fine. There appears to be no memory shortage. Of all the OpenIFS tasks I have run (280), only one crashed (due to a double-free problem). I do not see how limiting the number of concurrent tasks would have helped this.

$ date; free -hw
Wed Feb 15 08:33:58 EST 2023
              total        used        free      shared     buffers       cache   available
Mem:           62Gi        21Gi       4.8Gi       118Mi       150Mi        36Gi        40Gi
Swap:          15Gi       1.2Gi        14Gi



    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                  
 766565  766546 boinc     39  19 R   4.0g   6.3  98.3  9 225:29.05 /var/lib/boinc/slots/11/oifs_43r3_model.exe                              
 766988  766985 boinc     39  19 R   3.9g   6.2  98.5 10 218:06.08 /var/lib/boinc/slots/7/oifs_43r3_model.exe                               
 768349  768346 boinc     39  19 R   3.5g   5.6  97.8  0 190:55.99 /var/lib/boinc/slots/5/oifs_43r3_model.exe                               
 762499  762494 boinc     39  19 R   3.4g   5.5  98.4  6 305:36.67 /var/lib/boinc/slots/6/oifs_43r3_model.exe                               
 768103  768098 boinc     39  19 R   3.4g   5.5  98.0  8 194:09.25 /var/lib/boinc/slots/9/oifs_43r3_model.exe        

ID: 68333
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4535
Credit: 18,980,824
RAC: 21,902
Message 68334 - Posted: 15 Feb 2023, 14:25:35 UTC


How would that be worded? I am running 12 BOINC tasks at a time, of which 5 right now are _bl, and they all run just fine. There appears to be no memory shortage. Of all the OpenIFS tasks I have run (280), only one crashed (due to a double-free problem). I do not see how limiting the number of concurrent tasks would have helped this.


I agree the wording is important. There are, however, quite a few computers out there trying to run four or even more tasks at once with only 16 GB of memory or less. Some of these are failing everything they get.
ID: 68334
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68335 - Posted: 15 Feb 2023, 14:32:17 UTC - in response to Message 68334.  

I agree the wording is important. There are, however, quite a few computers out there trying to run four or even more tasks at once with only 16 GB of memory or less. Some of these are failing everything they get.


So would the wording be something like "divide RAM size by 7G to compute the maximum number to run at a time, and be sure to subtract the sizes of the non-Oifs tasks from the RAM size first"?
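
A minimal sketch of that arithmetic, in shell. The 7 GB per-task budget is the figure discussed above; the 6 GB allowance for the OS and non-OpenIFS tasks is an illustrative assumption to adjust per machine:

# Rough sizing sketch: how many OpenIFS tasks fit in RAM at ~7 GB each,
# after setting aside memory for the OS and other BOINC tasks (estimates).
total_gb=62       # total RAM, as reported by 'free -h'
other_gb=6        # assumed allowance for the OS and non-OpenIFS tasks
per_task_gb=7     # assumed per-task budget discussed above
echo $(( (total_gb - other_gb) / per_task_gb ))   # prints 8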
ID: 68335
Yeti

Send message
Joined: 5 Aug 04
Posts: 178
Credit: 18,741,194
RAC: 54,085
Message 68336 - Posted: 15 Feb 2023, 14:47:35 UTC - in response to Message 68335.  
Last modified: 15 Feb 2023, 14:48:18 UTC

... something like "divide RAM size by 7G to compute the maximum number to run at a time, and be sure to subtract the sizes of the non-Oifs tasks from the RAM size first"?
I have seen several figures for how much RAM OpenIFS PS tasks need (7 GB, 6 GB); in reality I have never seen one use more than 4.5 GB per task.

I'm running 3 OpenIFS tasks in a 16 GB RAM environment together with a Squid instance, and so far this box has run 32 WUs successfully without any errors.
Supporting BOINC, a great concept !
ID: 68336
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68337 - Posted: 15 Feb 2023, 14:55:43 UTC - in response to Message 68334.  

My machine has many more BOINC tasks sitting around waiting to run than are actually running.
For example, there are perhaps 150 Universe tasks ready to go, and app_config allows up to three at a time to run. Notice that none are running.
Similarly, there are lots of MilkyWay tasks ready to run, and app_config allows up to three at a time to run. Notice that only two are running.
In Preferences, I tell the boinc-client to use at most 75% of memory when the machine is in use and 85% when it is not in use, and to use at most 75% of the CPUs. Do these restrictions not protect the boinc-client from using too much memory?

top - 09:39:30 up 8 days, 19:35,  1 user,  load average: 12.65, 12.67, 12.58
Tasks: 471 total,  14 running, 457 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us, 11.5 sy, 62.9 ni, 24.9 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  63897.3 total,   5678.4 free,  21547.5 used,  36671.4 buff/cache
MiB Swap:  15992.0 total,  14590.7 free,   1401.2 used.  41495.3 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 768103  768098 boinc     39  19 R   4.2g   6.7  98.7 11 262:25.35 /var/lib/boinc/slots/9/oifs_43r3_model.exe                                
 768349  768346 boinc     39  19 R   4.0g   6.5  98.6 14 259:11.76 /var/lib/boinc/slots/5/oifs_43r3_model.exe                                
 766565  766546 boinc     39  19 R   3.5g   5.6  99.0  6 293:44.84 /var/lib/boinc/slots/11/oifs_43r3_model.exe                               
 762499  762494 boinc     39  19 R   2.8g   4.4  98.7  0 373:52.14 /var/lib/boinc/slots/6/oifs_43r3_model.exe                                
 766988  766985 boinc     39  19 R   2.3g   3.6  98.8  3 286:22.15 /var/lib/boinc/slots/7/oifs_43r3_model.exe                                
 784022    2211 boinc     39  19 R 765368   1.2  98.8  1   0:16.00 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+ 
 775376    2211 boinc     39  19 R  88936   0.1  98.8  2 120:28.72 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
 779023    2211 boinc     39  19 R  77148   0.1  99.1 12  59:57.71 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
 779253    2211 boinc     39  19 R  76776   0.1  98.9  7  56:57.61 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
 781925    2211 boinc     39  19 R  71760   0.1  99.2  4  28:00.93 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
   2211       1 boinc     30  10 S  45480   0.1   0.4  9 141262:07 /usr/bin/boinc                                                            
 781003    2211 boinc     39  19 R   7180   0.0  98.6  5  38:09.60 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+ 
 780794    2211 boinc     39  19 R   7156   0.0  99.0  8  40:14.77 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+ 
 768346    2211 boinc     39  19 S   4824   0.0   0.0 10   1:15.75 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
 766985    2211 boinc     39  19 S   4820   0.0   0.0 11   1:27.40 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
 768098    2211 boinc     39  19 S   4816   0.0   0.1 11   1:34.92 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
 762494    2211 boinc     39  19 S   3368   0.0   0.0 13   2:14.72 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
 766546    2211 boinc     39  19 S   3292   0.0   0.0 10   1:39.86 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
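
For reference, the per-app limits mentioned above live in an app_config.xml in the project directory. A minimal sketch, wrapped in shell; the app name below is an assumption, so check the <app_name> entries in client_state.xml for the exact value:

# Illustrative only: cap concurrent OpenIFS BL tasks at 3.
# Run as the boinc user (or with sudo); the app name is assumed.
cat > /var/lib/boinc/projects/climateprediction.net/app_config.xml <<'EOF'
<app_config>
  <app>
    <name>oifs_43r3_bl</name>
    <max_concurrent>3</max_concurrent>
  </app>
</app_config>
EOF
# Then use "Options -> Read config files" in the BOINC Manager, or restart the client.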

ID: 68337
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 68338 - Posted: 15 Feb 2023, 17:30:49 UTC - in response to Message 68336.  

... something like "divide RAM size by 7G to compute the maximum number to run at a time, and be sure to subtract the sizes of the non-Oifs tasks from the RAM size first"?
I have seen several figures for how much RAM OpenIFS PS tasks need (7 GB, 6 GB); in reality I have never seen one use more than 4.5 GB per task.

I'm running 3 OpenIFS tasks in a 16 GB RAM environment together with a Squid instance, and so far this box has run 32 WUs successfully without any errors.
It depends which process(es) you are looking at, as there are several associated with the task as it runs; the limit is set to their sum. But yes, the BL model itself takes less memory than the PS version because it's a simpler configuration. However, because we know there are significant memory leaks in the BOINC client code for zipping files, we have to add a buffer to account for leaks accumulating during the run, to make sure the task isn't killed.

The problem with informing users is (a) getting them to read the notices/forums and putting it in a way they understand (look at the News forum now and you'll see what I mean); (b) whether they will really care; and (c) it's not really their problem, nor CPDN's; the issue is with the client itself.

Don't assume that just because one OpenIFS task uses less than 7 GB of memory the others will. Respect the rsc_memory_bound in the task. It's set to those numbers for good reasons.
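
One rough way to see those bounds on a Linux host is to read them out of the client state file; a sketch assuming the default /var/lib/boinc data directory, with the value taken to be in bytes:

# List each task's rsc_memory_bound from the client state and convert to GiB.
grep -o '<rsc_memory_bound>[0-9.]*' /var/lib/boinc/client_state.xml |
  awk -F'>' '{printf "%.1f GiB\n", $2/1024/1024/1024}'
# e.g. 8804000000 bytes comes out at about 8.2 GiB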
ID: 68338
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68339 - Posted: 15 Feb 2023, 18:09:27 UTC - in response to Message 68338.  

Don't assume that just because one OpenIFS task uses less than 7 GB of memory the others will. Respect the rsc_memory_bound in the task. It's set to those numbers for good reasons.


Respect the rsc_memory_bound in the task.

OK: I found one of those, but how do I read it?

<rsc_memory_bound>8804000000.000000</rsc_memory_bound>

Is that 8804000000.000000 Bytes?

It is not using that much at the moment, but that proves little.

   PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND
788056  788051 boinc     39  19 R   3.9g   6.2  98.9  4 137:26.93 /var/lib/boinc/slots/6/oifs_43r3_model.exe 
788051    2211 boinc     39  19 S   4736   0.0   0.0  9   0:31.71 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+

ID: 68339
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1061
Credit: 36,694,196
RAC: 10,449
Message 68340 - Posted: 15 Feb 2023, 18:24:28 UTC

Another forrtl: error (72): floating overflow on WU 12206864 - two attempts so far, and both failed with the same error at the exact same stage in the run (after step 156).
ID: 68340
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68341 - Posted: 15 Feb 2023, 18:59:08 UTC - in response to Message 68340.  

Another forrtl: error (72): floating overflow on WU 12206864 - two attempts so far, and both failed with the same error at the exact same stage in the run (after step 156).


It is clearly data-dependent. All my OpenIFS _bl tasks worked except for the one that died from the double-free problem.
ID: 68341
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1061
Credit: 36,694,196
RAC: 10,449
Message 68342 - Posted: 15 Feb 2023, 19:17:02 UTC

And this one is a real oddity - same machine as the last one, but I don't think this one was data dependent. Task 22313532

This ran as normal and without incident until the very end:

  17:15:10 STEP 1440 H= 360:00 +CPU= 19.627
It then stopped, with the worker missing from the PID list, but with the wrapper still present. It was locked solid - using no discernible CPU time, and not responding to 'kill' commands. But it hadn't written the finish file, so BOINC let it run.

And so did I, while other tasks finished. When things were quiet, I stopped and restarted the BOINC client. That's been observed to remove locked PIDs from memory, and did so on this occasion too. You can see the restart process in the stderr, culminating with a re-alignment at

  18:47:29 STEP 1440 H= 360:00 +CPU= 11.699
It then prepared and uploaded the final zip and trickle, and has been accepted as valid.
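
For reference, on a systemd-based Linux install where the client runs as a service, that client restart is typically done with the following command (the service name may vary by distribution):

sudo systemctl restart boinc-client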
ID: 68342
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 68343 - Posted: 15 Feb 2023, 19:30:47 UTC - in response to Message 68342.  

This ran as normal and without incident until the very end:

17:15:10 STEP 1440 H= 360:00 +CPU= 19.627

It then stopped, with the worker missing from the PID list, but with the wrapper still present. It was locked solid - using no discernible CPU time, and not responding to 'kill' commands. But it hadn't written the finish file, so BOINC let it run.

And so did I, while other tasks finished. When things were quiet, I stopped and restarted the BOINC client. That's been observed to remove locked PIDs from memory, and did so on this occasion too. You can see the restart process in the stderr, culminating with a re-alignment at

18:47:29 STEP 1440 H= 360:00 +CPU= 11.699

It then prepared and uploaded the final zip and trickle, and has been accepted as valid.



... and here is how my most recent one of these ended normally:

  12:29:46 STEP 1438 H= 359:30 +CPU= 17.113
  12:30:04 STEP 1439 H= 359:45 +CPU= 17.382
  12:30:27 STEP 1440 H= 360:00 +CPU= 22.512
..The child process terminated with status: 0

>>> Printing last 70 lines from file: NODE.001_01

I   VARIABLE   I      Initial value         I        Current value        I
I-------------------------------------------------------------------------I
I  OMEGA (P/S) I     -0.1227910857E-20      I       0.2519104860E-05      I
ICloud fractionI      0.0000000000E+00      I       0.1919869766E-01      I
I  Relat. Hum. I      0.3184217114E+00      I       0.3629554164E+00      I
I  PASS 01     I      0.0000000000E+00      I       0.3514421542E+03      I
DDH------------------------------------------------------------------------
  
  MAXGPFV : MAX. VALUE =   0.000000000000000E+000
  MAXGPFV : MAX. VALUE =   0.000000000000000E+000
 NSTEP =  1440 SCAN2M_HPOS  P
 IO-STREAM SETUP - IOTYPE =           2  NUMIOPROCS =           1  CPATH =
 ICMGGh7zg+001440 MODE=a
 IO-STREAM CLOSED - ICMGGh7zg+001440
 MPI-TASK:   1 -    295645393 BYTES IN      7 RECORDS TRANSFERRED IN    0.0002 SECONDS******** Mbytes/s, TOTAL TIME=    0.0079( 2.5%)
 12:30:27 STEP 1440 H= 360:00 +CPU= 22.512
 END CNT3
 NSTEP =  1440 CNT0   000000000  16.726  16.726  16.974   0.985               0               0               0               0               0
IO-STREAM STATISTICS, TOTAL NO OF REC/BYTES READ            0/           0 WRITTEN        42886/************
 -TASK-OPENED-OPEN-RECS IN -KBYTE IN    -RECS OUT-KBYTE OUT-WALL   -WALL IN-WALL
  OU-TOT IN -TOT OUT
   1    758    0        0            0    42886 ************    34.5     0.0    34.5     0.0    83.2
===-=== START OF TIMING STATISTICS ===-===
 
STATS FOR ALL TASKS
 NUM ROUTINE                                     CALLS  MEAN(ms)   MAX(ms)   FRAC(%)  UNBAL(%)
   0 CNT0     - COMPLETE EXECUTION                   1 ********* *********    100.00      0.00
   1 CNT4     - FORWARD INTEGRATION                  1 ********* *********     99.99      0.00
   8 SCAN2M - GRID-POINT DYNAMICS                 1562    2726.9    2726.9     16.29      0.00
   9 SPCM     - SPECTRAL COMP.                    1440     573.0     573.0      3.15      0.00
  10 SCAN2M - PHYSICS                             1441    8670.9    8670.9     47.78      0.00
  11 IOPACK   - OUTPUT P.P. RESULTS                121     423.0     423.0      0.20      0.00
  12 SPNORM   - SPECTRAL NORM COMP.                 63      32.3      32.3      0.01      0.00
  14 SUINIF                                          1    1945.1    1945.1      0.01      0.00
  17 GRIDFPOS IN CNT4                              121      19.6      19.6      0.01      0.00
  18 SUSPECG                                         1    1000.1    1000.1      0.00      0.00
  19 SUSPEC                                          1    1005.5    1005.5      0.00      0.00
  24 SUGRIDU                                         1     446.4     446.4      0.00      0.00
  25 SPECRT                                          1     126.3     126.3      0.00      0.00
  26 SUGRIDF                                         1     366.8     366.8      0.00      0.00
  27 RESTART FILES - WRITING                        30    1706.2    1706.2      0.20      0.00
  28 RESTART FILES - READING                         1       0.1       0.1      0.00      0.00
  29 SU4FPOS IN CNT4                               121       0.0       0.0      0.00      0.00
  30 DYNFPOS IN CNT4                               121    3800.5    3800.5      1.76      0.00
  31 POSDDH IN STEPO                                 7       5.0       5.0      0.00      0.00
  37 CPGLAG   - SL COMPUTATIONS                   1441    3827.4    3827.4     21.09      0.00
  39 SU0YOMB                                         1     132.2     132.2      0.00      0.00
  51 SCAN2M   - SL COMM. PART 1                   1441      13.3      13.3      0.07      0.00
  54 SPCM     - M TO S/S TO M TRANSP.             1440     260.2     260.2      1.43      0.00
  55 SPCIMPF  - S TO M/M TO S TRANSP.             1440      38.1      38.1      0.21      0.00
  56 SPNORM   - SPECTRAL NORM COMM.                 63       0.2       0.2      0.00      0.00
 102 LTINV_CTL   - INVERSE LEGENDRE TRANSFORM     3125     288.5     288.5      3.45      0.00
 103 LTDIR_CTL   - DIRECT LEGENDRE TRANSFORM      3006     168.8     168.8      1.94      0.00
 106 FTDIR_CTL   - DIRECT FOURIER TRANSFORM       3006      53.0      53.0      0.61      0.00
 107 FTINV_CTL   - INVERSE FOURIER TRANSFORM      3125      91.6      91.6      1.09      0.00
 140 SULEG       - COMP. OF LEGENDRE POL.            1      21.9      21.9      0.00      0.00
 152 LTINV_CTL   - M TO L TRANSPOSITION           3125      20.9      20.9      0.25      0.00
 153 LTDIR_CTL   - L TO M TRANSPOSITION           3006      12.1      12.1      0.14      0.00
 157 FTINV_CTL   - L TO G TRANSPOSITION           3125      54.8      54.8      0.66      0.00
 158 FTDIR_CTL   - G TO L TRANSPOSITION           3006      13.9      13.9      0.16      0.00
 400 GSTATS                                     167374       0.0       0.0      0.00      0.00
 401 GSTATS HOOK                                155706       0.0       0.0      0.00      0.00
TOTAL MEASURED IMBALANCE =       0.0 SECONDS,  0.0 PERCENT
TOTAL WALLCLOCK TIME    26151.9 CPU TIME   25832.3 VECTOR TIME    25832.3


===-=== END   OF TIMING STATISTICS ===-===
FORECAST DAYS PER DAY       49.6
  *** END CNT0 *** 
------------------------------------------------


Moving to projects directory: /var/lib/boinc/slots/9/ICMGGh7zg+001440
Moving to projects directory: /var/lib/boinc/slots/9/ICMSHh7zg+001440
Moving to projects directory: /var/lib/boinc/slots/9/ICMUAh7zg+001440
Adding to the zip: /var/lib/boinc/slots/9/NODE.001_01
Adding to the zip: /var/lib/boinc/slots/9/ifs.stat
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_bl_12209190/ICMGGh7zg+001344
[snip]
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_bl_12209190/ICMUAh7zg+001440
Zipping up the final file: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_bl_a1ur_2016092300_15_991_12209190_0_r208224497_14.zip
Uploading the final file: upload_file_14.zip
Uploading trickle at timestep: 1295100
12:33:54 (768098): called boinc_finish(0)

</stderr_txt>
]]>

ID: 68343
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 68344 - Posted: 15 Feb 2023, 19:57:11 UTC - in response to Message 68342.  

And this one is a real oddity - same machine as the last one, but I don't think this one was data dependent. Task 22313532
This ran as normal and without incident until the very end:
  17:15:10 STEP 1440 H= 360:00 +CPU= 19.627
It then stopped, with the worker missing from the PID list, but with the wrapper still present. It was locked solid - using no discernible CPU time, and not responding to 'kill' commands. But it hadn't written the finish file, so BOINC let it run.
This looks like the file locking problem we've seen before; it could also be a manifestation of the memory corruption that has corrupted file pointers. It's also possible we are in the boinc_finish() call but that text 'boinc_finish' hasn't been flushed to stderr yet.

In cases like this, ideally I need you to attach the debugger and generate a traceback so we can see where it is in the calling tree. In deadlocks like this, the program pointer will be right at the problem:
# get the process id <pid> of the oifs_43r3_bl_1.11_x86_64-pc-linux-gnu wrapper
ps -ef | grep '_bl_'
gdb -p <pid>
# then, at the (gdb) prompt:
bt full
detach
exit
and then PM the output to me.
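
A non-interactive variant of the same capture, if attaching by hand is awkward; a sketch assuming gdb is installed and you have permission to attach to the boinc user's processes:

# Replace <pid> with the wrapper's process id; output goes to a file you can PM.
sudo gdb -p <pid> -batch -ex 'bt full' > oifs_traceback.txt 2>&1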
ID: 68344
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 68345 - Posted: 15 Feb 2023, 20:08:04 UTC

If anyone's interested, this is another task fail which is different to the turbulence failure mentioned earlier:
https://www.cpdn.org/result.php?resultid=22313995

In this task, scroll up the stderr output until just before the long traceback and you'll see lines like this:

  15:18:49 STEP  920 H= 230:00 +CPU= 11.509
  MAX U WIND=   250.807710093462     
  15:19:00 STEP  921 H= 230:15 +CPU= 11.031
  MAX U WIND=   258.524841745508     
  MAX V WIND=   253.308999272636     
  15:19:11 STEP  922 H= 230:30 +CPU= 10.936
  MAX U WIND=   256.013634963429     
  MAX V WIND=   253.579772348885     
  15:19:22 STEP  923 H= 230:45 +CPU= 10.949
  MAX U WIND=   250.801318000386     
  MAX V WIND=   250.423648892277     
  15:19:36 STEP  924 H= 231:00 +CPU= 14.168
The model has caught that the maximum wind speed (U is the E-W component, V is the N-S component; total wind speed is sqrt(U*U + V*V)) is greater than 250 m/s (~560 mph). Usually the maximum E-W wind is ~75 m/s. I've just looked it up and the maximum wind speed ever recorded was 231 mph. So this storm got pretty powerful!
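
A quick check of the unit conversion behind those figures (1 m/s = 2.23694 mph), just as a worked example:

awk 'BEGIN { printf "%.0f mph\n", 250 * 2.23694 }'    # 250 m/s -> about 559 mph
awk 'BEGIN { printf "%.0f m/s\n", 231 / 2.23694 }'    # record 231 mph -> about 103 m/s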
ID: 68345
wateroakley

Send message
Joined: 6 Aug 04
Posts: 195
Credit: 28,318,437
RAC: 10,244
Message 68346 - Posted: 15 Feb 2023, 20:39:14 UTC - in response to Message 68345.  

If anyone's interested, this is another task fail which is different to the turbulence failure mentioned earlier:
https://www.cpdn.org/result.php?resultid=22313995

So this storm got pretty powerful!
An earlier run of the workunit reported the same results for wind speed at the same step: https://www.cpdn.org/result.php?resultid=22307272
That's very reassuring that the science/physics/programming/perturbations give consistent results.
ID: 68346
AndreyOR

Send message
Joined: 12 Apr 21
Posts: 317
Credit: 14,792,064
RAC: 19,491
Message 68347 - Posted: 15 Feb 2023, 20:49:00 UTC - in response to Message 68336.  

I'm running 3 OpenIFS tasks in a 16 GB RAM environment together with a Squid instance, and so far this box has run 32 WUs successfully without any errors.

That's impressive, and this PC is almost certainly in the minority of PCs that can do that. I'm assuming you don't use the machine for anything else? Does it have ECC RAM? Is Squid there from when you used it for LHC?
ID: 68347
Yeti

Send message
Joined: 5 Aug 04
Posts: 178
Credit: 18,741,194
RAC: 54,085
Message 68348 - Posted: 15 Feb 2023, 21:12:40 UTC - in response to Message 68347.  
Last modified: 15 Feb 2023, 21:14:28 UTC

I'm assuming you don't use the machine for anything else?
It's sitting on an older ESX server and runs only LHC ATLAS, Squid (for all my LHC machines), and now CPDN. For CPDN I have stopped LHC/ATLAS at the moment.

Does it have ECC RAM?
Sure, it's running on a server.

Is Squid there from when you used it for LHC?
I moved Squid from my former Windows VM to this Linux VM a month ago, and yes, Squid is fully active for my LHC/ATLAS machine park.
Supporting BOINC, a great concept !
ID: 68348
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4535
Credit: 18,980,824
RAC: 21,902
Message 68349 - Posted: 15 Feb 2023, 21:25:40 UTC

I am doing a second attempt at one that failed after the last timestep, with this almost at the beginning of stderr:

exceeded elapsed time limit 154748.62 (1920000.00G/9.28G)</message>


Looks like the same problem Richard described. I shall see what happens on my machine.
ID: 68349
Profile Alan K

Send message
Joined: 22 Feb 06
Posts: 491
Credit: 30,954,361
RAC: 14,009
Message 68350 - Posted: 15 Feb 2023, 23:26:21 UTC

Is this related to the floating point issue:

Task 22307753

08:48:48 STEP 61 H= 15:15 +CPU= 23.513
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1538] Received signal#8 (SIGFPE) :: 4362MB (heap), 5076MB (maxrss), 0MB (maxstack), 0 (paging), nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1542] Also activating Harakiri-alarm (SIGALRM=14) to expire after 500s elapsed to prevent hangs, nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1544] Harakiri signal handler 'signal_harakiri' for signal#14 (SIGALRM) installed at 0x81f0c0 (old at (nil))
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1617] Signal#8 was caused by floating-point overflow [memaddr=0x1cc4a8f], nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1686] Starting DrHook backtrace for signal#8, nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [c_drhook_print_@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:3843] 4362 MB (maxheap), 5076 MB (maxrss), 0 MB (maxstack)
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [c_drhook_print_@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:3897] : MASTER
ID: 68350
mikey

Send message
Joined: 18 Nov 18
Posts: 21
Credit: 6,589,364
RAC: 1,830
Message 68351 - Posted: 16 Feb 2023, 1:11:26 UTC - in response to Message 68348.  

Yeti wrote:
Is Squid there from when you used it for LHC?
I moved Squid from my former Windows VM to this Linux VM a month ago, and yes, Squid is fully active for my LHC/ATLAS machine park.

Are you using it for other BOINC projects as well, or just LHC?
ID: 68351
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1048
Credit: 16,404,330
RAC: 16,403
Message 68353 - Posted: 16 Feb 2023, 9:38:48 UTC - in response to Message 68350.  

Alan K: yes, that task is another example of a model forecast that failed due to too strong a perturbation. In that traceback, near the middle, there's:
Signal#8 was caused by floating-point overflow

which indicates model field(s) have got too big (probably winds as I mentioned before).

Is this related to the floating point issue:

Task 22307753

08:48:48 STEP 61 H= 15:15 +CPU= 23.513
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1538] Received signal#8 (SIGFPE) :: 4362MB (heap), 5076MB (maxrss), 0MB (maxstack), 0 (paging), nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1542] Also activating Harakiri-alarm (SIGALRM=14) to expire after 500s elapsed to prevent hangs, nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1544] Harakiri signal handler 'signal_harakiri' for signal#14 (SIGALRM) installed at 0x81f0c0 (old at (nil))
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1617] Signal#8 was caused by floating-point overflow [memaddr=0x1cc4a8f], nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1686] Starting DrHook backtrace for signal#8, nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [c_drhook_print_@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:3843] 4362 MB (maxheap), 5076 MB (maxrss), 0 MB (maxstack)
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [c_drhook_print_@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:3897] : MASTER
ID: 68353