climateprediction.net (CPDN) home page
Thread 'Computation Errors'

Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64330 - Posted: 11 Aug 2021, 11:49:18 UTC

Is this me or CP???
https://www.cpdn.org/result.php?resultid=22107426
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
04:43:17 (4924): called boinc_finish(22)
</stderr_txt>
]]>
ID: 64330 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64332 - Posted: 11 Aug 2021, 12:22:00 UTC - in response to Message 64330.  

It's the model. NEGATIVE THETA means that the program has detected that a model variable (theta, the potential temperature) has moved beyond what is physically possible, so the modelling has been stopped.
I had one recently.

It's possible that the researcher is exploring an area of climate that's really close to the edge of what's possible.
ID: 64332 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 64334 - Posted: 11 Aug 2021, 12:31:30 UTC

Is this me or CP???


As Les says, most likely the model unless you are getting lots of them across different batches?

Are you running all 36 cores? If so, I would suggest cutting the number, which will probably increase your throughput of tasks. (I see that the computer in question has yet to finish a task under its current configuration.)
ID: 64334 · Report as offensive     Reply Quote
wateroakley

Joined: 6 Aug 04
Posts: 195
Credit: 28,332,519
RAC: 10,361
Message 64335 - Posted: 11 Aug 2021, 14:21:01 UTC - in response to Message 64330.  

Is this me or CP???
https://www.cpdn.org/result.php?resultid=22107426
This task was running on computer ID 1521341. i9-10980XE, Linux Mint with 36 cores in just 15GB of memory? 1521341 downloaded 36 tasks on 31/7 and 25 have aborted or crashed fairly quickly with a range of Signal(n) or SIGSEGV errors too.

3GB of memory per CPU is a figure discussed by others. 15GB of memory is way too small for CPDN to be happy with 36 tasks. With insufficient memory, the tasks will be 'swapped' all the time and may be exposed to soft errors in the disc I/O. (Discs will hide these soft errors from users until it's too late!) Task 22131978 (a CM3S) is running at less than half the sec/TS of our older i7, suggesting that tasks are being throttled. The 'Negative Theta' error may simply be another manifestation of the task aborts and crashes.

The Intel specs suggest that the i9-10980XE (Core-X) supports a maximum memory size of 256GB. Is 1521341 a Linux virtual machine? If so, could you increase its memory allocation and reduce the number of cores? Good luck.
ID: 64335 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64336 - Posted: 11 Aug 2021, 15:48:34 UTC

I have 7 of these WUs with a negative theta error on 5 different computers, all of which have completed at least one WU. Good to know it's not me.
In light of the CPU Cache Congestion issue I'm still trying to figure out what my rules should be. For X99 CPUs I limit the number of CP WUs to the L3 Cache divided by 5. So for an E5-2699 with 55 MB L3 I'd run 11 or fewer WUs. Since it seems almost impossible to DL 11 WUs that hasn't been a problem yet. I further limit hadcm3s to a single WU since it appears hypersensitive to having multiple WUs running and slows down dramatically. I limit hadcm3s to one WU per 16 GB of RAM to avoid combinations that force using the Swap File. I fill it out with hadam4 and hadsm4. E.g.,
<app_config>
<app>
    <name>hadcm3s</name>
    <!-- UK Met Office HadCM3 short, 3 days -->
    <!-- Xeon E5-2699 v4   22c44t   32 GB   L3 Cache 55 MB  -->
    <max_concurrent>1</max_concurrent>
</app>
<app>
    <name>hadam4</name>
    <!-- UK Met Office HadAM4 at N144 resolution, 4 days -->
    <max_concurrent>4</max_concurrent>
</app>
<app>
    <name>hadsm4</name>
    <!-- UK Met Office HadSM4 at N144 resolution, 4 days -->
    <max_concurrent>4</max_concurrent>
</app>
<app>
    <name>hadam4h</name>
    <!-- UK Met Office HadAM4 at N216 resolution, 14 days -->
    <max_concurrent>2</max_concurrent>
</app>
<project_max_concurrent>11</project_max_concurrent>
</app_config>
My problem is what to do with the X299 CPUs that have a different CPU Cache design. Is the rule still 5 MB per WU or is it more like 2.75 MB per WU???
i9-9980XE
Level 1 cache size 18 x 32 KiB 8-way set associative instruction caches + 18 x 32 KiB 8-way set associative data caches
Level 2 cache size 18 x 1 MiB 16-way set associative caches
Level 3 cache size 24.75 MiB = 18 x 1.375 MiB 11-way set associative write-back
E5-2699
Level 1 cache size 22 x 32 KiB 8-way set associative instruction caches + 22 x 32 KiB 8-way set associative data caches
Level 2 cache size 22 x 256 KiB 8-way set associative caches
Level 3 cache size 55 MiB = 22 x 2.5 MiB 20-way set associative shared cache
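
The rule of thumb described in this post (L3 cache divided by a per-WU budget) can be sketched as below. This is only an illustration of the arithmetic: the function name is hypothetical, the 5 MB figure is the poster's own rule, and the 2.75 MB figure is the alternative being asked about, not a confirmed value.

```python
import math


def max_wus_by_cache(l3_mib: float, mib_per_wu: float = 5.0) -> int:
    """Largest whole number of WUs whose assumed cache budget fits in L3."""
    return math.floor(l3_mib / mib_per_wu)


# L3 sizes as quoted in the post (MiB).
cpus = {"E5-2699": 55.0, "i9-9980XE": 24.75}

for name, l3 in cpus.items():
    at_5 = max_wus_by_cache(l3)           # poster's X99 rule: 5 MB per WU
    at_2_75 = max_wus_by_cache(l3, 2.75)  # hypothetical alternative budget
    print(f"{name}: {at_5} WUs at 5 MB/WU, {at_2_75} WUs at 2.75 MB/WU")
```

The result would map onto `<project_max_concurrent>` in app_config.xml. Which budget is right for X299 is exactly the open question, so the sketch only shows how far apart the two answers are.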
ID: 64336 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64338 - Posted: 11 Aug 2021, 16:15:23 UTC - in response to Message 64335.  

Is this me or CP???
https://www.cpdn.org/result.php?resultid=22107426
This task was running on computer ID 1521341. i9-10980XE, Linux Mint with 36 cores in just 15GB of memory? 1521341 downloaded 36 tasks on 31/7 and 25 have aborted or crashed fairly quickly with a range of Signal(n) or SIGSEGV errors too.

3GB of memory per CPU is a figure discussed by others. 15GB of memory is way too small for CPDN to be happy with 36 tasks. With insufficient memory, the tasks will be 'swapped' all the time and may be exposed to soft errors in the disc I/O. (Discs will hide these soft errors from users until it's too late!) Task 22131978 (a CM3S) is running at less than half the sec/TS of our older i7, suggesting that tasks are being throttled. The 'Negative Theta' error may simply be another manifestation of the task aborts and crashes.

The Intel specs suggest that i9-10980XE (Core-X) maximum memory size is 256GB? Is 1521341 a Linux virtual machine and you could increase the virtual configuration of memory and reduce the number of cores? Good luck.
The Aborts are me. Before the servers were taken down I got a huge DL of hadam4h with some computers needing over a year to complete them. I hate sitting on a hoard of WUs that someone else could be running so I returned them. Fixing the Work Server configuration could easily prevent this. I'm running Linux Mint 20.2 Ubuntu 20.04 and I disable virtualization in the BIOS config.
I'm trying to figure out how to interpret the stdout_mon.txt file. What is DLT? Every row is different and the time for each Time Slice increases each row until it suddenly resets (dry air vs moist air maybe?). The Average is dominated by the first value which is high, maybe yesterday's ending average when I shut down for the heat wave and sky-high summer TOU electric rates. Yet I see single sec/TS values being quoted. How can a single value represent such disparate data? I'm hoping if I can learn to interpret sec/TS correctly it could be the figure of merit to figure out how many of each WU can run efficiently on which class of CPU.
E.g., why does DLT have spikes? Need to learn how to "see the graphs for this run."
Trying to kill old process # 2648
Trying to kill old process # 2759
Created shared memory region key = 175990 of size 13519772 bytes (version 608)
Run for 0 Years and 5 Months
pShMem->UPLOAD_INTERVAL 0
ulTotalPhaseTimestep 43488
Starting model ID hadam4h_b039_200611_5_882_012035125 Phase 1
Program launched with process id # 2629
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
Getting pthread attributes - retval=0
Setting pthread size (-1778384896 bytes) - retval=0
Executing program /var/lib/boinc-client/projects/climateprediction.net/hadam4_um_8.52_i686-pc-linux-gnu
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006193 A - 22/11/2006 12:05 - H:M:S=0089:59:11 AVG=52.31 DLT= 0.00
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006194 A - 22/11/2006 12:10 - H:M:S=0090:02:01 AVG=52.33 DLT=170.71
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006195 A - 22/11/2006 12:15 - H:M:S=0090:02:44 AVG=52.33 DLT=42.56
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006196 A - 22/11/2006 12:20 - H:M:S=0090:03:25 AVG=52.33 DLT=41.39
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006197 A - 22/11/2006 12:25 - H:M:S=0090:04:04 AVG=52.32 DLT=38.19
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006198 A - 22/11/2006 12:30 - H:M:S=0090:04:47 AVG=52.32 DLT=43.29
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006199 A - 22/11/2006 12:35 - H:M:S=0090:05:28 AVG=52.32 DLT=41.29
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006200 A - 22/11/2006 12:40 - H:M:S=0090:06:08 AVG=52.32 DLT=40.26
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006201 A - 22/11/2006 12:45 - H:M:S=0090:06:50 AVG=52.32 DLT=41.26
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006202 A - 22/11/2006 12:50 - H:M:S=0090:07:33 AVG=52.31 DLT=43.02
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006203 A - 22/11/2006 12:55 - H:M:S=0090:08:17 AVG=52.31 DLT=43.99
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006204 A - 22/11/2006 13:00 - H:M:S=0090:08:55 AVG=52.31 DLT=38.67
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006205 A - 22/11/2006 13:05 - H:M:S=0090:09:39 AVG=52.31 DLT=43.29
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006206 A - 22/11/2006 13:10 - H:M:S=0090:10:20 AVG=52.31 DLT=41.76
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006207 A - 22/11/2006 13:15 - H:M:S=0090:11:01 AVG=52.31 DLT=40.45
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006208 A - 22/11/2006 13:20 - H:M:S=0090:11:41 AVG=52.30 DLT=39.82
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006209 A - 22/11/2006 13:25 - H:M:S=0090:12:19 AVG=52.30 DLT=38.48
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006210 A - 22/11/2006 13:30 - H:M:S=0090:12:53 AVG=52.30 DLT=34.36
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006211 A - 22/11/2006 13:35 - H:M:S=0090:13:33 AVG=52.30 DLT=39.82
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006212 A - 22/11/2006 13:40 - H:M:S=0090:14:16 AVG=52.30 DLT=43.08
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006213 A - 22/11/2006 13:45 - H:M:S=0090:14:58 AVG=52.29 DLT=41.30
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006214 A - 22/11/2006 13:50 - H:M:S=0090:15:35 AVG=52.29 DLT=37.62
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006215 A - 22/11/2006 13:55 - H:M:S=0090:16:17 AVG=52.29 DLT=41.61
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006216 A - 22/11/2006 14:00 - H:M:S=0090:16:58 AVG=52.29 DLT=41.20
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006217 A - 22/11/2006 14:05 - H:M:S=0090:17:39 AVG=52.29 DLT=41.05
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006218 A - 22/11/2006 14:10 - H:M:S=0090:18:17 AVG=52.28 DLT=37.72
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006219 A - 22/11/2006 14:15 - H:M:S=0090:18:54 AVG=52.28 DLT=36.77
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006220 A - 22/11/2006 14:20 - H:M:S=0090:19:34 AVG=52.28 DLT=40.56
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006221 A - 22/11/2006 14:25 - H:M:S=0090:20:11 AVG=52.28 DLT=36.62
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006222 A - 22/11/2006 14:30 - H:M:S=0090:20:55 AVG=52.28 DLT=44.05
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006223 A - 22/11/2006 14:35 - H:M:S=0090:21:37 AVG=52.27 DLT=42.20
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006224 A - 22/11/2006 14:40 - H:M:S=0090:22:19 AVG=52.27 DLT=41.45
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006225 A - 22/11/2006 14:45 - H:M:S=0090:22:59 AVG=52.27 DLT=40.86
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006226 A - 22/11/2006 14:50 - H:M:S=0090:23:39 AVG=52.27 DLT=39.36
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006227 A - 22/11/2006 14:55 - H:M:S=0090:24:22 AVG=52.27 DLT=43.32
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006228 A - 22/11/2006 15:00 - H:M:S=0090:25:01 AVG=52.26 DLT=39.43
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006229 A - 22/11/2006 15:05 - H:M:S=0090:25:45 AVG=52.26 DLT=43.46
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006230 A - 22/11/2006 15:10 - H:M:S=0090:28:34 AVG=52.28 DLT=169.03
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006231 A - 22/11/2006 15:15 - H:M:S=0090:29:17 AVG=52.28 DLT=42.74
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006232 A - 22/11/2006 15:20 - H:M:S=0090:29:59 AVG=52.28 DLT=41.84
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006233 A - 22/11/2006 15:25 - H:M:S=0090:30:41 AVG=52.28 DLT=42.91
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006234 A - 22/11/2006 15:30 - H:M:S=0090:31:19 AVG=52.27 DLT=37.80
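
One way to explore the log above is to parse each progress line and compare DLT with the change in H:M:S. In this sample DLT tracks the elapsed seconds since the previous line (for example 0090:02:01 to 0090:02:44 is 43 s against DLT=42.56), which is an inference from the data, not documented behaviour. The sketch below is a minimal parser under that assumption; the names `Row`, `parse_lines` and `find_spikes` and the spike threshold are hypothetical.

```python
import re
from statistics import median
from typing import NamedTuple


class Row(NamedTuple):
    ts: int        # timestep number
    elapsed: int   # H:M:S converted to seconds
    avg: float     # AVG field (sec/TS, cumulative)
    dlt: float     # DLT field (assumed: seconds since the previous line)


LINE_RE = re.compile(
    r"TS\s+(\d+).*?H:M:S=(\d+):(\d+):(\d+)\s+AVG=\s*([\d.]+)\s+DLT=\s*([\d.]+)"
)


def parse_lines(text: str) -> list[Row]:
    """Extract progress rows; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        ts, h, mi, s = (int(m.group(i)) for i in range(1, 5))
        rows.append(Row(ts, h * 3600 + mi * 60 + s,
                        float(m.group(5)), float(m.group(6))))
    return rows


def find_spikes(rows: list[Row], factor: float = 2.0) -> list[int]:
    """Timesteps whose DLT exceeds factor x the median non-zero DLT."""
    steady = [r.dlt for r in rows if r.dlt > 0]
    if not steady:
        return []
    typical = median(steady)
    return [r.ts for r in rows if r.dlt > factor * typical]


# First five progress lines copied from the log above.
SAMPLE = """\
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006193 A - 22/11/2006 12:05 - H:M:S=0089:59:11 AVG=52.31 DLT= 0.00
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006194 A - 22/11/2006 12:10 - H:M:S=0090:02:01 AVG=52.33 DLT=170.71
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006195 A - 22/11/2006 12:15 - H:M:S=0090:02:44 AVG=52.33 DLT=42.56
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006196 A - 22/11/2006 12:20 - H:M:S=0090:03:25 AVG=52.33 DLT=41.39
hadam4h_b039_200611_5_882_012035125 - PH 1 TS 0006197 A - 22/11/2006 12:25 - H:M:S=0090:04:04 AVG=52.32 DLT=38.19
"""

rows = parse_lines(SAMPLE)
print("spike timesteps:", find_spikes(rows))
```

Using the median of DLT rather than AVG as the figure of merit would keep a restart or an occasional slow timestep from dominating the comparison between machines.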
ID: 64338 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64339 - Posted: 11 Aug 2021, 16:29:42 UTC

I was curious about the WUs I completed but have zero credit. E.g. https://www.cpdn.org/workunit.php?wuid=12089673.
Three different computers errored out for 3 different reasons. If my 4th wingman fails will this WU be credited???
ID: 64339 · Report as offensive     Reply Quote
bozz4science

Joined: 10 May 20
Posts: 50
Credit: 3,417,917
RAC: 2,363
Message 64340 - Posted: 11 Aug 2021, 16:43:06 UTC

Regarding the credit question, you should get credit for the finished task once the weekly script runs. If I remember correctly, it always runs on Sundays. Maybe others can help with the rest.
ID: 64340 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 64341 - Posted: 11 Aug 2021, 17:17:15 UTC - in response to Message 64339.  

I was curious about the WUs I completed but have zero credit. E.g. https://www.cpdn.org/workunit.php?wuid=12089673.
Three different computers errored out for 3 different reasons. If my 4th wingman fails will this WU be credited???


With regard to the completed task, it should get credit whatever happens to wingmen. Unless something has gone wrong, all the wingmen have failed, and because you have completed the task there won't be any more wingmen. CPDN does not use validation that requires two successful completions of the task, as many projects do, but awards credit on the basis of the trickle-up messages. I see that the task does not show any trickles, but that is likely because the trickle server is still down after the recent work at Oxford.

My total credit did go up on Sunday when the credit script ran but average credit dropped a lot because of trickles not being recorded. (They are still sitting in /var/log/somethingorother waiting to go.) If no change I will nudge Andy tomorrow afternoon giving a chance for things to be sorted. (I am pretty sure he is aware of the problem anyway.)

Interesting what you say about the sensitivity of some tasks. Even if I use all 16 cores (8 of them real) on my Ryzen, I don't seem to get any increase in error rate. (That is with 32GB RAM.)
ID: 64341 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64342 - Posted: 11 Aug 2021, 18:37:57 UTC - in response to Message 64335.  


3GB of memory per CPU is a figure discussed by others. 15GB of memory is way too small for CPDN to be happy with 36 tasks. With insufficient memory, the tasks will be 'swapped' all the time and may be exposed to soft errors in the disc I/O. (Discs will hide these soft errors from users until it's too late!) Task 22131978 (a CM3S) is running at less than half the sec/TS of our older i7, suggesting that tasks are being throttled. The 'Negative Theta' error may simply be another manifestation of the task aborts and crashes.


It seems to me that 3GB is way more RAM than a CPDN task, even an N216 task, needs. But it is probably not a good idea to run 36 tasks on a machine with only 15 GBytes of RAM, though it should work if you have enough disk space for swapping. If your hardware is OK, it should just run them slowly because of processor-cache interference and thrashing to the swap space on disk, especially if it is a spinning device. If running a large number of CPDN tasks causes errors, I would suspect errors on the swap disk, or overheating of the CPU(s). Right now, mine are not overheating.
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +78.0°C  (high = +88.0°C, crit = +98.0°C)
Core 1:        +70.0°C  (high = +88.0°C, crit = +98.0°C)
Core 2:        +68.0°C  (high = +88.0°C, crit = +98.0°C)
Core 3:        +76.0°C  (high = +88.0°C, crit = +98.0°C)
Core 5:        +68.0°C  (high = +88.0°C, crit = +98.0°C)
Core 8:        +73.0°C  (high = +88.0°C, crit = +98.0°C)
Core 9:        +67.0°C  (high = +88.0°C, crit = +98.0°C)
Core 11:       +78.0°C  (high = +88.0°C, crit = +98.0°C)
Core 12:       +70.0°C  (high = +88.0°C, crit = +98.0°C)

amdgpu-pci-6500
Adapter: PCI adapter
vddgfx:       +0.69 V  
fan1:        2051 RPM  (min = 1800 RPM, max = 6000 RPM)
edge:         +40.0°C  (crit = +97.0°C, hyst = -273.1°C)
power1:        3.24 W  (cap =  25.00 W)

dell_smm-virtual-0
Adapter: Virtual device
fan1:        4261 RPM
fan2:         891 RPM
fan3:        3033 RPM


My N216 tasks have a working set of about 1.3 GBytes until near the very end, when it goes up to 1.4 GBytes.

top - 14:07:51 up 1 day, 13:30,  1 user,  load average: 8.51, 8.47, 8.48
Tasks: 448 total,  10 running, 436 sleeping,   2 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.1 sy, 49.7 ni, 47.3 id,  2.3 wa,  0.1 hi,  0.1 si,  0.0 st
MiB Mem :  63902.3 total,    875.6 free,   9775.2 used,  53251.5 buff/cache
MiB Swap:  15992.0 total,  15981.7 free,     10.2 used.  53312.8 avail Mem 

    PID    PPID USER      PR  NI S    RES    SHR  %MEM  %CPU  P     TIME+ COMMAND                                                           
  61938   61919 boinc     39  19 T   1.3g  19912   2.1   0.0  1 969:23.66 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+ 
  27768   27760 boinc     39  19 R   1.3g  19764   2.1  99.1  3   1469:19 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+ 
  39401   39389 boinc     39  19 R   1.3g  19892   2.1  99.6  6   1069:19 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+ 
 ...
  21832   21828 boinc     39  19 R 674088  12696   1.0  99.5  5   1482:50 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.09_i68+ 
  33373   33367 boinc     39  19 T 674084  12696   1.0   0.0  1   1245:39 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.09_i68+
...
   2039       1 boinc     30  10 S  32592  17628   0.0   0.1 11  10519:18 /usr/bin/boinc                                                    
  61919    2039 boinc     39  19 S  17808  17136   0.0   0.0 14   0:42.56 ../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-g+ 
  27760    2039 boinc     39  19 S  17484  16816   0.0   0.0 13   0:48.43 ../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-g+ 
  39389    2039 boinc     39  19 S  17468  16804   0.0   0.0  9   0:44.24 ../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-g+ 
  33367    2039 boinc     39  19 S  11744   9852   0.0   0.0 14   0:45.89 ../../projects/climateprediction.net/hadam4_8.09_i686-pc-linux-g+ 
  21828    2039 boinc     39  19 S  11624   9736   0.0   0.0 13   0:51.84 ../../projects/climateprediction.net/hadam4_8.09_i686-pc-linux-g+ 
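
As a rough check on the RAM discussion, the resident sizes in the `top` output above can be turned into a back-of-envelope fit calculation. This is a sketch under stated assumptions: 1.3g is taken as about 1331 MiB per N216 task, 674088 KiB as about 658 MiB per N144 v8.09 task, "15GB" is read as 15 GiB, and the 2048 MiB reserved for the OS is a hypothetical figure. The function names are illustrative only.

```python
def ram_needed_mib(n_tasks: int, mib_per_task: int) -> int:
    """Total resident memory if every task holds its full working set."""
    return n_tasks * mib_per_task


def tasks_that_fit(ram_mib: int, mib_per_task: int, reserve_mib: int = 2048) -> int:
    """Whole tasks that fit in RAM after setting aside an assumed OS reserve."""
    return max(0, (ram_mib - reserve_mib) // mib_per_task)


N216_MIB = 1331  # ~1.3 GiB, from the RES column for hadam4_um_8.52
N144_MIB = 658   # 674088 KiB, from the RES column for hadam4_um_8.09

print("36 N216 tasks need about", ram_needed_mib(36, N216_MIB), "MiB")
print("15 GiB fits about", tasks_that_fit(15 * 1024, N216_MIB), "N216 tasks")
print("15 GiB fits about", tasks_that_fit(15 * 1024, N144_MIB), "N144 tasks")
```

On these figures neither task type fits 36 at once in 15 GiB without swapping, which is consistent with both views in the thread: less than 3GB per task is needed, but 36 tasks is still too many for that machine.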
 

ID: 64342 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 64343 - Posted: 12 Aug 2021, 9:53:08 UTC

I was curious about the WUs I completed but have zero credit. E.g. https://www.cpdn.org/workunit.php?wuid=12089673.


The trickle server is now up and running and your trickles are showing for this task so next time the credit script runs, your credit will be awarded. (If all goes according to plan it will show up some time Sunday.)
ID: 64343 · Report as offensive     Reply Quote
Aurum
Joined: 15 Jul 17
Posts: 99
Credit: 18,701,746
RAC: 318
Message 64345 - Posted: 12 Aug 2021, 13:20:59 UTC

Thanks, but I'm not concerned about RAM, just CPU cache limits. Someone pulled this running 36 WUs business out of thin air. I never said it and have never done it. Any thoughts about my questions on CPU cache limits and using sec/TS?
ID: 64345 · Report as offensive     Reply Quote
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 64346 - Posted: 12 Aug 2021, 13:36:19 UTC - in response to Message 64345.  

Any thoughts about my questions on CPU cache limits and using sec/TS?

On my Ryzen 3600 and 3900X, I limit the N216 to no more than four at a time, though two is a bit better.
I don't know how that scales for your machine, but be advised.

You should be getting around 20 sec/TS.
https://www.cpdn.org/results.php?hostid=1520871&offset=0&show_names=0&state=4&appid=33

The N144 take less cache, you could probably run twice as many, and are almost twice as fast.
https://www.cpdn.org/results.php?hostid=1520871&offset=0&show_names=0&state=4&appid=31

I think the Ryzens, with their larger caches, will probably do better than the Intel CPUs.
ID: 64346 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64348 - Posted: 12 Aug 2021, 17:49:42 UTC - in response to Message 64346.  

Here is one task that finished recently on my machine. Until recently I limited my machine to four CPDN work units at a time. I have just switched to five to see how things go. I allow boinc-client to use only 8 of the 16 cores of my machine.
Time Sent (UTC) 	Host ID 	Result ID 	Result Name 	Phase 	Timestep 	CPU Time (sec) 	Average (sec/TS)
12 Aug 2021 09:45:31 	1511241 	22132200 	hadam4h_10zb_209705_5_902_012078630_3 	1 	43,403 	660,358 	15.2146
12 Aug 2021 09:45:15 	1511241 	22132200 	hadam4h_10zb_209705_5_902_012078630_3 	1 	34,763 	534,513 	15.3759
12 Aug 2021 09:44:24 	1511241 	22132200 	hadam4h_10zb_209705_5_902_012078630_3 	1 	26,123 	404,582 	15.4876
12 Aug 2021 09:44:24 	1511241 	22132200 	hadam4h_10zb_209705_5_902_012078630_3 	1 	17,483 	272,375 	15.5794
02 Aug 2021 06:44:37 	1511241 	22132200 	hadam4h_10zb_209705_5_902_012078630_3 	1 	8,843 	137,588 	15.5590
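
The Average (sec/TS) column above appears to be cumulative CPU time divided by the timestep reached (660,358 / 43,403 is about 15.2146); that is an inference from the numbers in the table, not a documented definition. A small sketch of that arithmetic, plus the rate between consecutive trickles, which reflects recent speed better than the running average. Function names are hypothetical.

```python
# (timestep, cpu_seconds) for result 22132200, oldest trickle first,
# copied from the table above.
trickles = [
    (8843, 137588),
    (17483, 272375),
    (26123, 404582),
    (34763, 534513),
    (43403, 660358),
]


def cumulative_rate(timestep: int, cpu_seconds: int) -> float:
    """Running average sec/TS since the start of the task."""
    return cpu_seconds / timestep


def interval_rates(points: list[tuple[int, int]]) -> list[float]:
    """sec/TS over each interval between consecutive trickles."""
    return [
        (cpu1 - cpu0) / (ts1 - ts0)
        for (ts0, cpu0), (ts1, cpu1) in zip(points, points[1:])
    ]


for ts, cpu in trickles:
    print(f"TS {ts}: cumulative {cumulative_rate(ts, cpu):.4f} sec/TS")

for rate in interval_rates(trickles):
    print(f"interval: {rate:.4f} sec/TS")
```

Comparing interval rates before and after a configuration change (for example going from four to five concurrent work units) isolates the effect of the change, whereas the cumulative figure blends in everything that came before.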


My machine is like this:
CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.4 (Ootpa) [4.18.0-305.10.2.el8_4.x86_64|libc 2.28 (GNU libc)]
BOINC version 	    7.16.11
Memory 	                   62.4 GB
Cache 	                  16896 KB
Swap space 	          15.62 GB
Total disk space 	 117.21 GB
Free Disk Space 	  85.21 GB
Measured floating point speed 	6.42 billion ops/sec
Measured integer speed 	       30.11 billion ops/sec


Here is a more recent task that has not finished yet. I am now allowing 5 CPDN work units, but only four are running and two of those are N144 units.
Time Sent (UTC) 	Host ID 	Result ID 	Result Name 	Phase 	Timestep 	CPU Time (sec) 	Average (sec/TS)
12 Aug 2021 09:45:50 	1511241 	22069509 	hadam4h_10py_209505_5_902_012078293_4 	1 	8,843 	137,635 	15.5643

ID: 64348 · Report as offensive     Reply Quote
klepel

Joined: 9 Oct 04
Posts: 82
Credit: 69,922,704
RAC: 8,182
Message 64349 - Posted: 12 Aug 2021, 23:54:53 UTC

I also limit to 4 WUs on AMD 3950, 2600, 1700. This seems to be a good compromise: with fewer WUs the calculation is definitely faster, and with more it is noticeably slower. All other threads are used for TN-grid or SiDock (and they are also impacted by the number of climateprediction.net WUs).
This is why I limit the WUs on the AMD 3950, although this chip has more cache than the other two.
On my two WSL computers I limit climateprediction.net to 2 WUs per virtual machine; otherwise the RAM use is too high and the other WUs on Win10 are heavily impacted.
ID: 64349 · Report as offensive     Reply Quote
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 64355 - Posted: 13 Aug 2021, 16:39:11 UTC

Running only two HadAM4 at N144 on a Ryzen 3900X, with the other 22 cores on SiDock, gives less than 9 sec/TS.
https://www.cpdn.org/results.php?hostid=1520871&offset=0&show_names=0&state=0&appid=31

That is the best I have seen. It is usually around 12 sec/TS when running four N144 and maybe a couple of N216.
ID: 64355 · Report as offensive     Reply Quote
KAMasud

Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 64358 - Posted: 15 Aug 2021, 5:41:43 UTC

https://www.cpdn.org/workunit.php?wuid=12102206

Errored out. UK Met Office HadAM4 at N144 resolution v8.09
ID: 64358 · Report as offensive     Reply Quote
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64360 - Posted: 15 Aug 2021, 9:19:56 UTC - in response to Message 64358.  

The last one I got that errored out was this one, about two weeks ago. That was also a hadam4. But my machine returned another work unit, also a hadam4, at the same time, and that one completed successfully. At least three more completed successfully, as did hadam4h models and one hadsm4 model.
Task 22104168
Name 	hadam4_a1vn_201310_6_914_012100000_0
Workunit 	12100000
Created 	8 Jul 2021, 10:28:01 UTC
Sent 	30 Jul 2021, 17:14:26 UTC
Report deadline 	12 Jul 2022, 22:34:26 UTC
Received 	31 Jul 2021, 12:14:39 UTC     <---<<<
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	22 (0x00000016) Unknown error code
Computer ID 	1511241
Run time 	19 min 59 sec
CPU time 	19 min 13 sec
Validate state 	Invalid
Credit 	0.00
Device peak FLOPS 	6.42 GFLOPS
Application version 	UK Met Office HadAM4 at N144 resolution v8.09
i686-pc-linux-gnu
Peak working set size 	649.04 MB
Peak swap size 	670.94 MB
Peak disk usage 	0.02 MB
Stderr 	

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(     <---<<<
08:03:16 (803961): called boinc_finish(22)

</stderr_txt>
]]>

ID: 64360 · Report as offensive     Reply Quote
KAMasud

Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 64371 - Posted: 17 Aug 2021, 8:50:25 UTC

What is the function of trickles, this handshake that goes on between server and client? I am not seeing any zip files in the uploads section, which is a bit peculiar: eight WUs running normally generate a zip file or two per day. I can see the message about trickles being stuck.
ID: 64371 · Report as offensive     Reply Quote
Bryn Mawr

Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 64372 - Posted: 17 Aug 2021, 9:01:15 UTC - in response to Message 64371.  

What is the function of trickles, this handshake that goes on between server and client? I am not seeing any zip files in the uploads section, which is a bit peculiar: eight WUs running normally generate a zip file or two per day. I can see the message about trickles being stuck.


The trickles report the results of the computation every simulated month / year. Given the length of each WU, it insures against a machine crashing and losing the WU.
ID: 64372 · Report as offensive     Reply Quote

©2024 cpdn.org