Thread 'Computation Errors'

Author	Message
Aurum Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318	Message 64330 - Posted: 11 Aug 2021, 11:49:18 UTC Is this me or CP??? https://www.cpdn.org/result.php?resultid=22107426 <core_client_version>7.16.6</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234)</message> <stderr_txt> Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy Sorry, too many model crashes! :-( 04:43:17 (4924): called boinc_finish(22) </stderr_txt> ]]> ID: 64330 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 64332 - Posted: 11 Aug 2021, 12:22:00 UTC - in response to Message 64330. It's the model - NEGATIVE THETA means that the program has detected that some parameter has moved beyond what is physically possible, so the modelling has been stopped. I had one recently. It's possible that the researcher is exploring an area of climate that's really close to the edge of what's possible. ID: 64332 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 64334 - Posted: 11 Aug 2021, 12:31:30 UTC Is this me or CP??? As Les says, it is most likely the model, unless you are getting lots of them across different batches. Are you running all 36 cores? If so, I would suggest cutting the number, which will probably increase your throughput of tasks. (I see that the computer in question has yet to finish a task under its current configuration.) ID: 64334 · Reply Quote

wateroakley Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,836,638 RAC: 3,986	Message 64335 - Posted: 11 Aug 2021, 14:21:01 UTC - in response to Message 64330. Is this me or CP??? https://www.cpdn.org/result.php?resultid=22107426 This task was running on computer ID 1521341. i9-10980XE, Linux Mint with 36 cores in just 15GB of memory? 1521341 downloaded 36 tasks on 31/7 and 25 have aborted or crashed fairly quickly with a range of Signal(n) or SIGSEGV errors too. 3GB of memory per cpu is a figure discussed by others. 15GB of memory is way too small for CPDN to be happy with 36 tasks. With insufficient memory, the tasks will be 'swapped' all the time and may be exposed to soft errors in the disc I/O. (Discs will hide these soft errors from users until it's too late!). Task 22131978 (a CM3S) is running at less than half the sec/TS of our older i7, suggesting that tasks are being throttled. The 'Invalid Theta' error may simply be another emanation of the task aborts and crashes. The Intel specs suggest that i9-10980XE (Core-X) maximum memory size is 256GB? Is 1521341 a Linux virtual machine and you could increase the virtual configuration of memory and reduce the number of cores? Good luck. ID: 64335 · Reply Quote

Aurum Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318	Message 64336 - Posted: 11 Aug 2021, 15:48:34 UTC I have 7 of these WUs with a negative theta error on 5 different computers, all of which have completed at least one WU. Good to know it's not me.
In light of the CPU Cache Congestion issue I'm still trying to figure out what my rules should be. For X99 CPUs I limit the number of CP WUs to the L3 Cache (in MB) divided by 5. So for an E5-2699 with 55 MB L3 I'd run 11 or fewer WUs. Since it seems almost impossible to DL 11 WUs that hasn't been a problem yet. I further limit hadcm3s to a single WU since it appears hypersensitive to having multiple WUs running and slows down dramatically. I limit hadcm3s to one WU per 16 GB of RAM to avoid combinations that force using the Swap File. I fill it out with hadam4 and hadsm4. E.g.,
<app_config>
  <app>
    <name>hadcm3s</name> <!-- UK Met Office HadCM3 short, 3 days -->
    <!-- Xeon E5-2699 v4 22c44t 32 GB L3 Cache 55 MB -->
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>hadam4</name> <!-- UK Met Office HadAM4 at N144 resolution, 4 days -->
    <max_concurrent>4</max_concurrent>
  </app>
  <app>
    <name>hadsm4</name> <!-- UK Met Office HadSM4 at N144 resolution, 4 days -->
    <max_concurrent>4</max_concurrent>
  </app>
  <app>
    <name>hadam4h</name> <!-- UK Met Office HadAM4 at N216 resolution, 14 days -->
    <max_concurrent>2</max_concurrent>
  </app>
  <project_max_concurrent>11</project_max_concurrent>
</app_config>
My problem is what to do with the X299 CPUs that have a different CPU Cache design. Is the rule still 5 MB per WU or is it more like 2.75 MB per WU???
i9-9980XE
Level 1 cache size: 18 x 32 KiB 8-way set associative instruction caches + 18 x 32 KB 8-way set associative data caches
Level 2 cache size: 18 x 1 MiB 16-way set associative caches
Level 3 cache size: 24.75 MiB = 18 x 1.375 MiB 11-way set associative write-back
E5-2699
Level 1 cache size: 22 x 32 KiB 8-way set associative instruction caches + 22 x 32 KB 8-way set associative data caches
Level 2 cache size: 22 x 256 KiB 8-way set associative caches
Level 3 cache size: 55 MiB = 22 x 2.5 MiB 20-way set associative shared cache
ID: 64336 · Reply Quote
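The cache rule of thumb in the post above (L3 size divided by an assumed per-WU footprint) can be sketched in a few lines. This is only an illustration: the function name is made up, the 5 MB and 2.75 MB figures are the two candidates raised in the post rather than measured values, and MB and MiB are treated as interchangeable.

```python
import math


def max_tasks_by_cache(l3_mib: float, mib_per_task: float) -> int:
    """Whole number of tasks whose assumed cache footprint fits in L3."""
    if mib_per_task <= 0:
        raise ValueError("mib_per_task must be positive")
    return math.floor(l3_mib / mib_per_task)


# L3 sizes as quoted in the post.
E5_2699_L3 = 55.0
I9_9980XE_L3 = 24.75

print(max_tasks_by_cache(E5_2699_L3, 5.0))     # X99 rule of thumb: 11
print(max_tasks_by_cache(I9_9980XE_L3, 5.0))   # same rule on X299: 4
print(max_tasks_by_cache(I9_9980XE_L3, 2.75))  # alternative rule: 9
```

On the i9-9980XE the two candidate rules give 4 or 9 concurrent WUs, so which one is closer to the truth would have to be settled by measuring sec/TS at each setting.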

Aurum Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318	Message 64338 - Posted: 11 Aug 2021, 16:15:23 UTC - in response to Message 64335. Is this me or CP??? https://www.cpdn.org/result.php?resultid=22107426 This task was running on computer ID 1521341. i9-10980XE, Linux Mint with 36 cores in just 15GB of memory? 1521341 downloaded 36 tasks on 31/7 and 25 have aborted or crashed fairly quickly with a range of Signal(n) or SIGSEGV errors too. 3GB of memory per cpu is a figure discussed by others. 15GB of memory is way too small for CPDN to be happy with 36 tasks. With insufficient memory, the tasks will be 'swapped' all the time and may be exposed to soft errors in the disc I/O. (Discs will hide these soft errors from users until it's too late!). Task 22131978 (a CM3S) is running at less than half the sec/TS of our older i7, suggesting that tasks are being throttled. The 'Invalid Theta' error may simply be another emanation of the task aborts and crashes. The Intel specs suggest that i9-10980XE (Core-X) maximum memory size is 256GB? Is 1521341 a Linux virtual machine and you could increase the virtual configuration of memory and reduce the number of cores? Good luck. The Aborts are me. Before the servers were taken down I got a huge DL of hadam4h with some computers needing over a year to complete them. I hate sitting on a hoard of WUs that someone else could be running so I returned them. Fixing the Work Server configuration could easily prevent this. I'm running Linux Mint 20.2 Ubuntu 20.04 and I disable virtualization in the BIOS config. I'm trying to figure out how to interpret the stdout_mon.txt file. What is DLT? Every row is different and the time for each Time Slice increases each row until it suddenly resets (dry air vs moist air maybe?). The Average is dominated by the first value which is high, maybe yesterday's ending average when I shut down for the heat wave and sky-high summer TOU electric rates. 
Yet I see single sec/TS values being quoted. How can a single value represent such disparate data? I'm hoping if I can learn to interpret sec/TS correctly it could be the figure of merit to figure out how many of each WU can run efficiently on which class of CPU. E.g., why does DLT have spikes? Need to learn how to "see the graphs for this run."
Trying to kill old process # 2648
Trying to kill old process # 2759
Created shared memory region key = 175990 of size 13519772 bytes (version 608)
Run for 0 Years and 5 Months
pShMem->UPLOAD_INTERVAL 0
ulTotalPhaseTimestep 43488
Starting model ID hadam4h_b039_200611_5_882_012035125 Phase 1
Program launched with process id # 2629
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
Getting pthread attributes - retval=0
Setting pthread size (-1778384896 bytes) - retval=0
Executing program /var/lib/boinc-client/projects/climateprediction.net/hadam4_um_8.52_i686-pc-linux-gnu
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006193 A - 22/11/2006 12:05 - H:M:S=0089:59:11 AVG=52.31 DLT= 0.00
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006194 A - 22/11/2006 12:10 - H:M:S=0090:02:01 AVG=52.33 DLT=170.71
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006195 A - 22/11/2006 12:15 - H:M:S=0090:02:44 AVG=52.33 DLT=42.56
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006196 A - 22/11/2006 12:20 - H:M:S=0090:03:25 AVG=52.33 DLT=41.39
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006197 A - 22/11/2006 12:25 - H:M:S=0090:04:04 AVG=52.32 DLT=38.19
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006198 A - 22/11/2006 12:30 - H:M:S=0090:04:47 AVG=52.32 DLT=43.29
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006199 A - 22/11/2006 12:35 - H:M:S=0090:05:28 AVG=52.32 DLT=41.29
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006200 A - 22/11/2006 12:40 - H:M:S=0090:06:08 AVG=52.32 DLT=40.26
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006201 A - 22/11/2006 12:45 - H:M:S=0090:06:50 AVG=52.32 DLT=41.26
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006202 A - 22/11/2006 12:50 - H:M:S=0090:07:33 AVG=52.31 DLT=43.02
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006203 A - 22/11/2006 12:55 - H:M:S=0090:08:17 AVG=52.31 DLT=43.99
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006204 A - 22/11/2006 13:00 - H:M:S=0090:08:55 AVG=52.31 DLT=38.67
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006205 A - 22/11/2006 13:05 - H:M:S=0090:09:39 AVG=52.31 DLT=43.29
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006206 A - 22/11/2006 13:10 - H:M:S=0090:10:20 AVG=52.31 DLT=41.76
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006207 A - 22/11/2006 13:15 - H:M:S=0090:11:01 AVG=52.31 DLT=40.45
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006208 A - 22/11/2006 13:20 - H:M:S=0090:11:41 AVG=52.30 DLT=39.82
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006209 A - 22/11/2006 13:25 - H:M:S=0090:12:19 AVG=52.30 DLT=38.48
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006210 A - 22/11/2006 13:30 - H:M:S=0090:12:53 AVG=52.30 DLT=34.36
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006211 A - 22/11/2006 13:35 - H:M:S=0090:13:33 AVG=52.30 DLT=39.82
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006212 A - 22/11/2006 13:40 - H:M:S=0090:14:16 AVG=52.30 DLT=43.08
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006213 A - 22/11/2006 13:45 - H:M:S=0090:14:58 AVG=52.29 DLT=41.30
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006214 A - 22/11/2006 13:50 - H:M:S=0090:15:35 AVG=52.29 DLT=37.62
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006215 A - 22/11/2006 13:55 - H:M:S=0090:16:17 AVG=52.29 DLT=41.61
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006216 A - 22/11/2006 14:00 - H:M:S=0090:16:58 AVG=52.29 DLT=41.20
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006217 A - 22/11/2006 14:05 - H:M:S=0090:17:39 AVG=52.29 DLT=41.05
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006218 A - 22/11/2006 14:10 - H:M:S=0090:18:17 AVG=52.28 DLT=37.72
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006219 A - 22/11/2006 14:15 - H:M:S=0090:18:54 AVG=52.28 DLT=36.77
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006220 A - 22/11/2006 14:20 - H:M:S=0090:19:34 AVG=52.28 DLT=40.56
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006221 A - 22/11/2006 14:25 - H:M:S=0090:20:11 AVG=52.28 DLT=36.62
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006222 A - 22/11/2006 14:30 - H:M:S=0090:20:55 AVG=52.28 DLT=44.05
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006223 A - 22/11/2006 14:35 - H:M:S=0090:21:37 AVG=52.27 DLT=42.20
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006224 A - 22/11/2006 14:40 - H:M:S=0090:22:19 AVG=52.27 DLT=41.45
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006225 A - 22/11/2006 14:45 - H:M:S=0090:22:59 AVG=52.27 DLT=40.86
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006226 A - 22/11/2006 14:50 - H:M:S=0090:23:39 AVG=52.27 DLT=39.36
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006227 A - 22/11/2006 14:55 - H:M:S=0090:24:22 AVG=52.27 DLT=43.32
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006228 A - 22/11/2006 15:00 - H:M:S=0090:25:01 AVG=52.26 DLT=39.43
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006229 A - 22/11/2006 15:05 - H:M:S=0090:25:45 AVG=52.26 DLT=43.46
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006230 A - 22/11/2006 15:10 - H:M:S=0090:28:34 AVG=52.28 DLT=169.03
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006231 A - 22/11/2006 15:15 - H:M:S=0090:29:17 AVG=52.28 DLT=42.74
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006232 A - 22/11/2006 15:20 - H:M:S=0090:29:59 AVG=52.28 DLT=41.84
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006233 A - 22/11/2006 15:25 - H:M:S=0090:30:41 AVG=52.28 DLT=42.91
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006234 A - 22/11/2006 15:30 - H:M:S=0090:31:19 AVG=52.27 DLT=37.80
ID: 64338 · Reply Quote
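One way to read the log above is suggested by its own numbers rather than by any project documentation. The H:M:S column advances by roughly DLT seconds from one row to the next (0089:59:11 to 0090:02:01 is 170 s, against DLT=170.71), and H:M:S converted to seconds and divided by TS reproduces AVG (324121 s / 6194 ≈ 52.33). On that reading, DLT would be the time taken by the latest timestep and AVG the cumulative average since the task started, which would explain why AVG barely moves while DLT sits near 41. The sketch below parses such rows and flags DLT spikes; the regular expression, function names, and the 2x-median spike threshold are all assumptions made for illustration.

```python
import re
from statistics import median

# Matches one progress row, e.g.
# "... PH 1 TS 0006194 A - 22/11/2006 12:10 - H:M:S=0090:02:01 AVG=52.33 DLT=170.71"
ROW = re.compile(
    r"TS\s+(\d+).*?H:M:S=(\d+):(\d+):(\d+)\s+AVG=\s*([\d.]+)\s+DLT=\s*([\d.]+)"
)


def parse_rows(text):
    """Return (timestep, elapsed_seconds, avg, dlt) for every progress row found."""
    rows = []
    for ts, h, m, s, avg, dlt in ROW.findall(text):
        elapsed = int(h) * 3600 + int(m) * 60 + int(s)
        rows.append((int(ts), elapsed, float(avg), float(dlt)))
    return rows


def find_spikes(rows, factor=2.0):
    """Median DLT (ignoring DLT=0.00 rows) and the rows above factor x median."""
    dlts = [dlt for _, _, _, dlt in rows if dlt > 0]
    typical = median(dlts)
    spikes = [(ts, dlt) for ts, _, _, dlt in rows if dlt > factor * typical]
    return typical, spikes


# First five rows of the log, copied from the post.
sample = """
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006193 A - 22/11/2006 12:05 - H:M:S=0089:59:11 AVG=52.31 DLT= 0.00
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006194 A - 22/11/2006 12:10 - H:M:S=0090:02:01 AVG=52.33 DLT=170.71
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006195 A - 22/11/2006 12:15 - H:M:S=0090:02:44 AVG=52.33 DLT=42.56
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006196 A - 22/11/2006 12:20 - H:M:S=0090:03:25 AVG=52.33 DLT=41.39
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006197 A - 22/11/2006 12:25 - H:M:S=0090:04:04 AVG=52.32 DLT=38.19
"""

rows = parse_rows(sample)
typical, spikes = find_spikes(rows)
print(f"median DLT = {typical:.2f} s/TS, spikes = {spikes}")
```

For comparing configurations, a median of DLT over a recent window is probably a more useful figure than AVG, since AVG carries the whole history of the task, including periods under a different load.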

Aurum Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318	Message 64339 - Posted: 11 Aug 2021, 16:29:42 UTC I was curious about the WUs I completed but have zero credit. E.g. https://www.cpdn.org/workunit.php?wuid=12089673. Three different computers errored out for 3 different reasons. If my 4th wingman fails will this WU be credited??? ID: 64339 · Reply Quote

bozz4science Send message Joined: 10 May 20 Posts: 50 Credit: 3,426,221 RAC: 261	Message 64340 - Posted: 11 Aug 2021, 16:43:06 UTC Regarding the credit question, you should get credit for the finished task once the weekly script runs. If I remember correctly, it always runs on Sundays. Maybe others can help with the rest. ID: 64340 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 64341 - Posted: 11 Aug 2021, 17:17:15 UTC - in response to Message 64339. I was curious about the WUs I completed but have zero credit. E.g. https://www.cpdn.org/workunit.php?wuid=12089673. Three different computers errored out for 3 different reasons. If my 4th wingman fails will this WU be credited??? With regard to the completed task, it should get credit whatever happens to wingmen. Unless something has gone wrong, all the wingmen have failed and because you have completed the task there won't be any more wingmen. CPDN does not use validation that requires two successful completions of the task like many projects do but awards credit on the basis of the trickle-up messages. I see that task does not show any trickles but that is likely because the trickle server is still down after the recent work at Oxford. My total credit did go up on Sunday when the credit script ran but average credit dropped a lot because of trickles not being recorded. (They are still sitting in /var/log/somethingorother waiting to go.) If no change I will nudge Andy tomorrow afternoon giving a chance for things to be sorted. (I am pretty sure he is aware of the problem anyway.) Interesting what you say about the sensitivity of some tasks. Even if I use all 16 cores (8 of them real) on my Ryzen, I don't seem to get any increase in error rate. (That is with 32GB RAM.) ID: 64341 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 64342 - Posted: 11 Aug 2021, 18:37:57 UTC - in response to Message 64335. 3GB of memory per cpu is a figure discussed by others. 15GB of memory is way too small for CPDN to be happy with 36 tasks. With insufficient memory, the tasks will be 'swapped' all the time and may be exposed to soft errors in the disc I/O. (Discs will hide these soft errors from users until it's too late!). Task 22131978 (a CM3S) is running at less than half the sec/TS of our older i7, suggesting that tasks are being throttled. The 'Invalid Theta' error may simply be another emanation of the task aborts and crashes. It seems to me that 3GB is way more RAM than a CPDN task, even an N216 task, needs. But it is probably not a good idea to run 36 tasks on a machine with only 15 GBytes of RAM, though it should work if you have enough disk space for swapping. If your hardware is OK, it should just run them slowly because of processor-cache interference and thrashing to the swap space on disk, especially if it is a spinning device. If running a large number of CPDN tasks causes errors, I would suspect errors on the swap disk, or overheating of the CPU(s). Right now, mine are not overheating.
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +78.0°C (high = +88.0°C, crit = +98.0°C)
Core 1: +70.0°C (high = +88.0°C, crit = +98.0°C)
Core 2: +68.0°C (high = +88.0°C, crit = +98.0°C)
Core 3: +76.0°C (high = +88.0°C, crit = +98.0°C)
Core 5: +68.0°C (high = +88.0°C, crit = +98.0°C)
Core 8: +73.0°C (high = +88.0°C, crit = +98.0°C)
Core 9: +67.0°C (high = +88.0°C, crit = +98.0°C)
Core 11: +78.0°C (high = +88.0°C, crit = +98.0°C)
Core 12: +70.0°C (high = +88.0°C, crit = +98.0°C)
amdgpu-pci-6500
Adapter: PCI adapter
vddgfx: +0.69 V
fan1: 2051 RPM (min = 1800 RPM, max = 6000 RPM)
edge: +40.0°C (crit = +97.0°C, hyst = -273.1°C)
power1: 3.24 W (cap = 25.00 W)
dell_smm-virtual-0
Adapter: Virtual device
fan1: 4261 RPM
fan2: 891 RPM
fan3: 3033 RPM
My N216 tasks have a working set of about 1.3 GBytes until near the very end, when it goes up to 1.4 GBytes.
top - 14:07:51 up 1 day, 13:30, 1 user, load average: 8.51, 8.47, 8.48
Tasks: 448 total, 10 running, 436 sleeping, 2 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.1 sy, 49.7 ni, 47.3 id, 2.3 wa, 0.1 hi, 0.1 si, 0.0 st
MiB Mem : 63902.3 total, 875.6 free, 9775.2 used, 53251.5 buff/cache
MiB Swap: 15992.0 total, 15981.7 free, 10.2 used. 53312.8 avail Mem
PID PPID USER PR NI S RES SHR %MEM %CPU P TIME+ COMMAND
61938 61919 boinc 39 19 T 1.3g 19912 2.1 0.0 1 969:23.66 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
27768 27760 boinc 39 19 R 1.3g 19764 2.1 99.1 3 1469:19 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
39401 39389 boinc 39 19 R 1.3g 19892 2.1 99.6 6 1069:19 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
...
21832 21828 boinc 39 19 R 674088 12696 1.0 99.5 5 1482:50 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.09_i68+
33373 33367 boinc 39 19 T 674084 12696 1.0 0.0 1 1245:39 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.09_i68+
...
2039 1 boinc 30 10 S 32592 17628 0.0 0.1 11 10519:18 /usr/bin/boinc
61919 2039 boinc 39 19 S 17808 17136 0.0 0.0 14 0:42.56 ../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-g+
27760 2039 boinc 39 19 S 17484 16816 0.0 0.0 13 0:48.43 ../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-g+
39389 2039 boinc 39 19 S 17468 16804 0.0 0.0 9 0:44.24 ../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-g+
33367 2039 boinc 39 19 S 11744 9852 0.0 0.0 14 0:45.89 ../../projects/climateprediction.net/hadam4_8.09_i686-pc-linux-g+
21828 2039 boinc 39 19 S 11624 9736 0.0 0.0 13 0:51.84 ../../projects/climateprediction.net/hadam4_8.09_i686-pc-linux-g+
ID: 64342 · Reply Quote
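The RAM side of the discussion comes down to simple arithmetic, sketched below using the resident sizes visible in the top listing (about 1.3 GiB for the 8.52 processes and about 0.65 GiB for the 8.09 ones). The dictionary labels, function names, and the 2 GiB reserve for the operating system are assumptions for illustration only, and the 36-task case is the hypothetical one raised earlier in the thread, not a configuration anyone confirmed running.

```python
# Approximate resident sizes per task in GiB, read from the top output above.
# The keys are illustrative labels, not official application names.
WORKING_SET_GIB = {"hadam4h_n216": 1.3, "hadam4_n144": 0.65}


def ram_needed_gib(task_counts, working_set=WORKING_SET_GIB):
    """Total resident memory for a mix of tasks, e.g. {"hadam4h_n216": 4}."""
    return sum(n * working_set[app] for app, n in task_counts.items())


def fits_in_ram(task_counts, ram_gib, reserve_gib=2.0):
    """True if the task mix fits while leaving reserve_gib (assumed) for the OS."""
    return ram_needed_gib(task_counts) <= ram_gib - reserve_gib


print(ram_needed_gib({"hadam4h_n216": 36}))                      # about 46.8 GiB
print(fits_in_ram({"hadam4h_n216": 36}, ram_gib=15))             # False
print(fits_in_ram({"hadam4h_n216": 4, "hadam4_n144": 4}, 15))    # True
```

On these figures a 15 GB machine would hold roughly ten N216 tasks before swapping, well below the 3 GB per task mentioned earlier but also well below 36 tasks.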

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 64343 - Posted: 12 Aug 2021, 9:53:08 UTC I was curious about the WUs I completed but have zero credit. E.g. https://www.cpdn.org/workunit.php?wuid=12089673. The trickle server is now up and running and your trickles are showing for this task so next time the credit script runs, your credit will be awarded. (If all goes according to plan it will show up some time Sunday.) ID: 64343 · Reply Quote

Aurum Send message Joined: 15 Jul 17 Posts: 99 Credit: 18,701,746 RAC: 318	Message 64345 - Posted: 12 Aug 2021, 13:20:59 UTC Thanks, but I'm not concerned about RAM, just CPU cache limits. Someone pulled this running 26 WUs business out of thin air. I never said it and have never done it. Any thoughts about my questions on CPU cache limits and using sec/TS? ID: 64345 · Reply Quote

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 64346 - Posted: 12 Aug 2021, 13:36:19 UTC - in response to Message 64345. Any thoughts about my questions on CPU cache limits and using sec/TS? On my Ryzen 3600 and 3900X, I limit the N216 to no more than four at a time, though two is a bit better. I don't know how that scales for your machine, but be advised. You should be getting around 20 sec/TS. https://www.cpdn.org/results.php?hostid=1520871&offset=0&show_names=0&state=4&appid=33 The N144 take less cache, you could probably run twice as many, and are almost twice as fast. https://www.cpdn.org/results.php?hostid=1520871&offset=0&show_names=0&state=4&appid=31 I think the Ryzens, with their larger caches, will probably do better than the Intel CPUs. ID: 64346 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 64348 - Posted: 12 Aug 2021, 17:49:42 UTC - in response to Message 64346. Here is one task that finished recently on my machine. Until recently I limited my machine to four CPDN work units at a time. I have just switched to five to see how things go. I allow boinc-client to use only 8 of the 16 cores of my machine.
Time Sent (UTC) | Host ID | Result ID | Result Name | Phase | Timestep | CPU Time (sec) | Average (sec/TS)
12 Aug 2021 09:45:31 | 1511241 | 22132200 | hadam4h_10zb_209705_5_902_012078630_3 | 1 | 43,403 | 660,358 | 15.2146
12 Aug 2021 09:45:15 | 1511241 | 22132200 | hadam4h_10zb_209705_5_902_012078630_3 | 1 | 34,763 | 534,513 | 15.3759
12 Aug 2021 09:44:24 | 1511241 | 22132200 | hadam4h_10zb_209705_5_902_012078630_3 | 1 | 26,123 | 404,582 | 15.4876
12 Aug 2021 09:44:24 | 1511241 | 22132200 | hadam4h_10zb_209705_5_902_012078630_3 | 1 | 17,483 | 272,375 | 15.5794
02 Aug 2021 06:44:37 | 1511241 | 22132200 | hadam4h_10zb_209705_5_902_012078630_3 | 1 | 8,843 | 137,588 | 15.5590
My machine is like this:
CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.4 (Ootpa) [4.18.0-305.10.2.el8_4.x86_64|libc 2.28 (GNU libc)]
BOINC version: 7.16.11
Memory: 62.4 GB
Cache: 16896 KB
Swap space: 15.62 GB
Total disk space: 117.21 GB
Free Disk Space: 85.21 GB
Measured floating point speed: 6.42 billion ops/sec
Measured integer speed: 30.11 billion ops/sec
Here is a more recent task that has not finished yet. I am now allowing 5 CPDN work units, but only four are running and two of those are N144 units.
Time Sent (UTC) | Host ID | Result ID | Result Name | Phase | Timestep | CPU Time (sec) | Average (sec/TS)
12 Aug 2021 09:45:50 | 1511241 | 22069509 | hadam4h_10py_209505_5_902_012078293_4 | 1 | 8,843 | 137,635 | 15.5643
ID: 64348 · Reply Quote
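The trickle figures for result 22132200 show how the single "Average (sec/TS)" value comes about: each reported average equals cumulative CPU time divided by the timestep reached (for example 660,358 / 43,403 ≈ 15.2146). Because that is a running average, the rate within each interval between trickles can differ from it. The sketch below recomputes both from the numbers in the post; the variable and function names are illustrative.

```python
# (timestep, cumulative CPU seconds) for result 22132200, oldest first.
trickles = [
    (8843, 137588),
    (17483, 272375),
    (26123, 404582),
    (34763, 534513),
    (43403, 660358),
]


def cumulative_avg(timestep, cpu_seconds):
    """Running average in sec/TS, as shown in the Average column."""
    return cpu_seconds / timestep


def interval_rates(points):
    """sec/TS within each interval between consecutive trickles."""
    return [
        (c1 - c0) / (t1 - t0)
        for (t0, c0), (t1, c1) in zip(points, points[1:])
    ]


for ts, cpu in trickles:
    print(ts, round(cumulative_avg(ts, cpu), 4))

print([round(r, 4) for r in interval_rates(trickles)])
```

For this task the per-interval rate falls from about 15.6 to about 14.6 sec/TS over the run, while the running average only moves from 15.56 to 15.21, so interval rates are the more sensitive measure when a configuration changes partway through a task.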

klepel Send message Joined: 9 Oct 04 Posts: 82 Credit: 70,035,400 RAC: 247	Message 64349 - Posted: 12 Aug 2021, 23:54:53 UTC I also limit to 4 WUs on AMD 3950, 2600, 1700. This seems to be a good compromise: with fewer WUs the calculation is definitely faster, and with more the calculation is noticeably slower. All other threads are used for TN-grid or SiDock (and they are also impacted by the number of climateprediction.net WUs). This is why I limit the WUs on the AMD 3950 although this chip has more cache than the other two. On my two WSL computers I limit climateprediction.net to 2 WUs per virtual machine, otherwise the RAM use is too high and the other WUs on Win10 are heavily impacted. ID: 64349 · Reply Quote

Jim1348 Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 64355 - Posted: 13 Aug 2021, 16:39:11 UTC Running only two HadAM4 at N144 on a Ryzen 3900X, with the other 22 cores on SiDock, gives less than 9 sec/TS. https://www.cpdn.org/results.php?hostid=1520871&offset=0&show_names=0&state=0&appid=31 That is the best I have seen. It is usually around 12 sec/TS when running four N144 and maybe a couple of N216. ID: 64355 · Reply Quote
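If sec/TS is to serve as a figure of merit for choosing how many WUs to run, one option is to look at aggregate throughput (timesteps completed per second across all running tasks) rather than per-task speed. The sketch below applies that to the two N144 figures quoted above, taking 9 and 12 sec/TS as round numbers. It is a simplification: it ignores the N216 tasks mixed into the second case and any slowdown imposed on the other projects sharing the CPU, and the function name is made up.

```python
def aggregate_throughput(n_tasks, sec_per_ts):
    """Timesteps completed per second, summed over n_tasks identical tasks."""
    return n_tasks / sec_per_ts


two_tasks = aggregate_throughput(2, 9.0)    # two N144 at about 9 sec/TS
four_tasks = aggregate_throughput(4, 12.0)  # four N144 at about 12 sec/TS

print(f"2 tasks: {two_tasks:.3f} TS/s")
print(f"4 tasks: {four_tasks:.3f} TS/s")
print(f"ratio:   {four_tasks / two_tasks:.2f}")
```

On these rough numbers, four tasks finish about 1.5 times as many timesteps per second as two, even though each individual task runs slower. The point of diminishing returns would be where adding a task no longer raises this total.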

KAMasud Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0	Message 64358 - Posted: 15 Aug 2021, 5:41:43 UTC https://www.cpdn.org/workunit.php?wuid=12102206 Errored out. UK Met Office HadAM4 at N144 resolution v8.09 ID: 64358 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 64360 - Posted: 15 Aug 2021, 9:19:56 UTC - in response to Message 64358. The last one I got that errored out was this one, about two weeks ago. That was also an hadam4. But my machine returned another work unit, also an hadam4, at the same time, that completed successfully. And at least three more completed successfully. So did hadam4h models. And one hadsm4 model.
Task 22104168
Name hadam4_a1vn_201310_6_914_012100000_0
Workunit 12100000
Created 8 Jul 2021, 10:28:01 UTC
Sent 30 Jul 2021, 17:14:26 UTC
Report deadline 12 Jul 2022, 22:34:26 UTC
Received 31 Jul 2021, 12:14:39 UTC <---<<<
Server state Over
Outcome Computation error
Client state Compute error
Exit status 22 (0x00000016) Unknown error code
Computer ID 1511241
Run time 19 min 59 sec
CPU time 19 min 13 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 6.42 GFLOPS
Application version UK Met Office HadAM4 at N144 resolution v8.09 i686-pc-linux-gnu
Peak working set size 649.04 MB
Peak swap size 670.94 MB
Peak disk usage 0.02 MB
Stderr
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message> process exited with code 22 (0x16, -234)</message>
<stderr_txt>
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-( <---<<<
08:03:16 (803961): called boinc_finish(22)
</stderr_txt>
]]>
ID: 64360 · Reply Quote

KAMasud Send message Joined: 6 Oct 06 Posts: 204 Credit: 7,608,986 RAC: 0	Message 64371 - Posted: 17 Aug 2021, 8:50:25 UTC What is the function of trickles, this handshake that goes on between server and client? I am not seeing any zip files in the uploads section. A bit peculiar. Eight WU's running normally generates a zip file or two per day. I can see the message of trickles being stuck. ID: 64371 · Reply Quote

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 150 Credit: 12,830,559 RAC: 228	Message 64372 - Posted: 17 Aug 2021, 9:01:15 UTC - in response to Message 64371. What is the function of trickles, this handshake that goes on between server and client? I am not seeing any zip files in the uploads section. A bit peculiar. Eight WU's running normally generates a zip file or two per day. I can see the message of trickles being stuck. The trickles report the results of the computation every simulated month / year. Given the length of each WU, this insures against a machine crashing and losing the WU. ID: 64372 · Reply Quote