Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
And bar some inevitable resends, the hadcm3s tasks are all gone now and at the current rate, it won't be that many days till the current batch of HADAM4 tasks are also gone. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I just now manually terminated all the "shorts" batch 926 waiting to run, because they've all been dying with SIGSEGV. If CPDN runs out of work, I can run other projects, and maybe spend a few hours on hardware and software updates. Keep on crunching. e |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,832,769 RAC: 5,024 |
I just now manually terminated all the "shorts" batch 926 waiting to run. Thanks for doing that! My ancient Mac mini is slowly crunching through those models without failing so far, and has just picked up one of your rejects. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I just now manually terminated all the "shorts" batch 926 waiting to run. I have had about 1/3 success and 2/3 fail with sigsegv on this batch. It will be interesting to see if the batch which uses restarts from the successes of this batch has the lower failure rate the project are hoping for. I understand the theory that the initial conditions for the failing batches are thought to be a bit whacky but all too often I have seen a difference between theory and practice with some of these. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I have had about 1/3 success and 2/3 fail with sigsegv on this batch. It will be interesting to see if the batch which uses restarts from the successes of this batch has the lower failure rate the project are hoping for. I understand the theory that the initial conditions for the failing batches are thought to be a bit whacky but all too often I have seen a difference between theory and practice with some of these.

I have had a 100% failure rate with this batch, but a pretty good success rate with similar ones from last spring; most were like this:

Name hadcm3s_a1dq_191012_120_900_012072461_0
Workunit 12072461
Created 31 Mar 2021, 15:02:24 UTC
Sent 21 Apr 2021, 0:45:49 UTC
Report deadline 3 Apr 2022, 6:05:49 UTC
Received 23 Apr 2021, 17:47:34 UTC
Server state Over
Outcome Success

but one failed like this. I consider it a legitimate failure, perhaps due to unfortunate values for the initial conditions. (The other user's attempt failed due to missing libraries.)

Task 22024864
Name hadcm3s_r157_190012_240_837_011897728_1
Workunit 11897728
Created 28 Feb 2021, 11:33:16 UTC
Sent 9 Mar 2021, 12:13:36 UTC
Report deadline 19 Feb 2022, 17:33:36 UTC
Received 12 Mar 2021, 11:59:59 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 22 (0x00000016) Unknown error code
Computer ID 1511241
Run time 2 days 5 hours 3 min 21 sec
CPU time 30 sec
Validate state Invalid
Credit 3,421.44
Device peak FLOPS 6.58 GFLOPS
Application version UK Met Office HadCM3 short v8.36 i686-pc-linux-gnu
Peak working set size 175.64 MB
Peak swap size 215.60 MB
Peak disk usage 98.92 MB

Stderr
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message> process exited with code 22 (0x16, -234)</message>
<stderr_txt>
MainError: 11:23:44 AM No files match the supplied pattern.
MainError: 11:23:44 AM No files match the supplied pattern.
MainError: 04:04:46 PM No files match the supplied pattern.
MainError: 04:04:46 PM No files match the supplied pattern.
MainError: 08:50:45 PM No files match the supplied pattern.
MainError: 08:50:45 PM No files match the supplied pattern.
MainError: 01:39:23 AM No files match the supplied pattern.
MainError: 01:39:23 AM No files match the supplied pattern.
MainError: 06:24:21 AM No files match the supplied pattern.
MainError: 06:24:21 AM No files match the supplied pattern.
MainError: 11:01:18 AM No files match the supplied pattern.
MainError: 11:01:18 AM No files match the supplied pattern.
MainError: 03:43:02 PM No files match the supplied pattern.
MainError: 03:43:02 PM No files match the supplied pattern.
MainError: 08:28:44 PM No files match the supplied pattern.
MainError: 08:28:44 PM No files match the supplied pattern.
MainError: 01:12:53 AM No files match the supplied pattern.
MainError: 01:12:53 AM No files match the supplied pattern.
MainError: 05:51:28 AM No files match the supplied pattern.
MainError: 05:51:28 AM No files match the supplied pattern.
MainError: 10:26:43 AM No files match the supplied pattern.
MainError: 10:26:43 AM No files match the supplied pattern.
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Sorry, too many model crashes! :-(
06:39:07 (80419): called boinc_finish(22)
</stderr_txt>
]]>

It seems to me that if the model or the initial conditions were bad, but only marginally so, then on machines whose processors and libraries differ slightly one would expect to get floating-point exceptions, NaN exceptions, and so forth, BUT NOT SEGMENTATION VIOLATIONS.
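To make that distinction concrete, here is a minimal C sketch (illustrative only, not CPDN code; feenableexcept is glibc-specific): ordinary IEEE arithmetic on bad values just propagates NaN or Inf silently and, at most, raises SIGFPE if the program explicitly enables trapping. A SIGSEGV only comes from touching memory the process does not own.

[code]
/* fpe_demo.c - illustrative only, not CPDN code.
 * Build: gcc -O0 -o fpe_demo fpe_demo.c -lm        (Linux/glibc)
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <math.h>
#include <fenv.h>

int main(void)
{
    volatile double theta = -1.0;   /* stand-in for a "bad" input value */

    /* Default IEEE behaviour: no signal at all, the bad value simply
     * propagates as NaN and can be tested for later.                  */
    double r = sqrt(theta);
    printf("sqrt(-1.0) = %f, isnan = %d\n", r, isnan(r));

    /* Only if the program asks for it does bad arithmetic raise a
     * signal, and that signal is SIGFPE, not SIGSEGV.                 */
    feclearexcept(FE_ALL_EXCEPT);
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
    r = 1.0 / (theta + 1.0);        /* divide by zero -> SIGFPE here    */
    printf("not reached: %f\n", r);
    return 0;
}
[/code]

So a run that dies with SIGSEGV is pointing at an addressing problem rather than just unstable numbers. |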
Send message Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258 |
Dave Jackson wrote: It will be interesting to see if the batch which uses restarts from the successes of this batch has the lower failure rate the project are hoping for. I understand the theory that the initial conditions for the failing batches are thought to be a bit whacky but all too often I have seen a difference between theory and practice with some of these. My Mac looks set to complete the seventh of the seven tasks it's received from the current HadCM3 batch tomorrow. Five of the work units had previously failed on other computers and three of them on two others. The eight previous computers all have a history of failing HadCM3s. The work units in this batch only get three chances each, so if their third chances go to "bad" computers, presumably they'll be "weeded out" as bad units. This and the 11.5-month deadlines make me wonder what useful science is actually being done here. :\ NG |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
The one computer that I tried to run these on failed all 8 tasks with segmentation violations. Looking at those work units, a couple had all 3 tasks fail with segmentation violation errors, a few had all tasks fail with various errors, and a couple had one task progress and produce trickles, including trickles on another Linux PC. Quite odd. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
As you say, George, "quite odd." I've just looked at the work units with most of my failures. Two went on to produce trickles but still failed eventually, one on a Mac and one on another Linux box. One, I found, went on to complete on another Linux box. I have just closed down the links; I should really have checked whether the ones that went on to produce trickles or complete on Linux were AMD architecture like my own, or Intel. Might get around to that later. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,903,103 RAC: 18,120 |
Just finished the last 2 HadCM3s; both failed. Altogether 2 for 22. Some were on WSL2 Ubuntu 20.04, some on Hyper-V Ubuntu 20.04, both setups on the same Ryzen 9 computer. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
The one computer that I tried to run these on failed all 8 tasks with segmentation violations. Looking at those work units, a couple had all 3 tasks fail with segmentation violation errors, a few had all tasks fail with various errors, and a couple had one task progress and produce trickles, including trickles on another Linux PC. Quite odd. Odd indeed. I'm clueless: SIGSEGV?? And some work, at least a bit, on Macs? I have no further useful input. I hope someone somewhere can figure this problem out. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Odd indeed. I'm clueless: SIGSEGV?? And some work, at least a bit, on Macs? Wikipedia has this to say (long): https://en.wikipedia.org/wiki/Segmentation_fault Of course, most of the above applies mainly to C and C++ programs, which can allocate RAM dynamically and use pointers. Since CPDN programs are mostly FORTRAN, they do not use pointers, so the usual way to get these faults is to let an array subscript run off the end of the array.
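For what it's worth, here is a minimal C sketch of that failure mode (illustrative only, not CPDN code). The same thing happens in Fortran when a subscript exceeds the declared bounds; building with gfortran's -fcheck=bounds would report the bad subscript instead of segfaulting, at some cost in speed.

[code]
/* oob_demo.c - illustrative only, not CPDN code.
 * Build: gcc -O0 -o oob_demo oob_demo.c
 */
#include <stdio.h>

#define NLEV 19                 /* array size chosen only for illustration */
static double theta[NLEV];

int main(void)
{
    /* A slightly out-of-range subscript often just scribbles on
     * neighbouring data: the run carries on, perhaps producing rubbish,
     * and whether it survives can differ from machine to machine.       */
    long k = 30;
    theta[k] = 273.0;
    printf("small overrun survived: wrote theta[%ld]\n", k);

    /* A wildly out-of-range subscript lands outside the pages mapped to
     * the process, and the kernel answers with SIGSEGV.                 */
    k = 50L * 1000 * 1000;
    theta[k] = 273.0;           /* segmentation fault here               */
    printf("not reached\n");
    return 0;
}
[/code]

The sketch uses a C array, but a Fortran subscripted array overrun ends up in exactly the same place: an access to an unmapped address and a SIGSEGV from the kernel. |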
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Hopper almost empty and so far nothing seen preparing to be poured in. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
All gone. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Two OpenIFS tasks currently running in testing but no discussion to suggest they are getting near ready to launch on main site. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
My machine has now run out of ClimatePrediction work units. It has also run out of Rosetta and Universe work units, so I am running only WCG work units, and at most 5 of them. This has greatly improved my cache hit ratio. I infer that the N216 work units are the ones that gobble up the processor cache(s). This is not a surprise, but it is gratifying to be able to see why.

# perf stat -aB -e cache-references,cache-misses

 Performance counter stats for 'system wide':

     7,549,164,603   cache-references
     1,883,988,132   cache-misses          #   24.956 % of all cache refs

      65.847219356 seconds time elapsed

# ps -fu boinc
UID          PID    PPID  C STIME TTY      TIME     CMD
boinc      19484       1  0 Jan23 ?        00:08:17 /usr/bin/boinc   [this is the boinc client]
boinc     509317   19484 99 05:37 ?        05:40:44 ../../projects/www.worldcommunitygrid.org/wcgrid_arp1_wrf_7.32_x86_64-pc-linux-gnu
boinc     525303   19484 98 10:14 ?        01:05:44 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_64-pc-linux-gnu
boinc     526551   19484 99 10:38 ?        00:42:01 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_64-pc-linux-gnu
boinc     527966   19484 99 11:01 ?        00:19:20 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_64-pc-linux-gnu
boinc     528648   19484 99 11:13 ?        00:06:50 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc-linux-gnu

-Sett |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Hopper almost empty and so far nothing seen preparing to be poured in. I got an N216 and an hadcm3s work unit (both re-runs) recently and they are both running fine. I am kind of amazed at the hadcm3s one, because the 16 or so of those I ran recently all crashed with a segmentation fault after about three seconds, whereas this one has run for over 10 hours and delivered two trickles. The two previous attempts errored out for reasons I could not understand (not missing libraries, not segmentation violations); they were on apple-Darwin machines.

Task 22191699
Name hadcm3s_1k9d_200012_168_926_012129726_2
Workunit 12129726
Created 29 Jan 2022, 20:46:55 UTC
Sent 29 Jan 2022, 20:48:05 UTC
Report deadline 12 Jan 2023, 2:08:05 UTC
Received ---
Server state In progress
Outcome ---
Client state New
Exit status 0 (0x00000000)
Computer ID 1511241 |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Just downloading 4 N144 tasks from testing. I also ran some more OpenIFS tasks last week. As usual, I have no idea if and when these will translate into main site work. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I also ran some more OpenIFS tasks last week. As usual, I have no idea if and when these will translate into main site work. Do those OpenIFS tasks work, or do they crash? How much RAM do they currently take? Any idea how much processor cache they require? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I have also ran some more openiFS tasks last week. As usual I have no idea if and when these will translate into main site work. They don't crash, the last batch I checked were taking 12GB of RAM each and uploads were about 550MB Haven't tried to check on CPU cache but it hasn't been raised as an issue by other testers so I suspect not as much as the N216 tasks. Some batches have had final uploads of over 1GB so I have had them uploading while I sleep if on a day when I am doing any Zoom calls. Obviously not an issue for those with real broad as opposed to bored band. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
They don't crash. The last batch I checked were taking 12GB of RAM each and the uploads were about 550MB. I haven't tried to check on CPU cache, but it hasn't been raised as an issue by other testers, so I suspect they need less than the N216 tasks. Some batches have had final uploads of over 1GB, so I have had them uploading while I sleep if it's a day when I am doing any Zoom calls. Obviously not an issue for those with real broadband as opposed to bored band. That is good information. The memory and bandwidth requirements are quite large, but a number of us could do a few at a time if that is what it takes. Of course, that may not be enough to do them much good, but that is another question.
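If they do arrive on the main site with that sort of footprint, BOINC's app_config.xml is one way to cap how many run at once per host. A minimal sketch, with the caveat that "oifs_placeholder" is not the real application name (substitute the short name from client_state.xml once the app exists); the file goes in the CPDN project directory and is picked up via Options → Read config files in the Manager, or a client restart:

[code]
<!-- app_config.xml : limit concurrent OpenIFS tasks.
     "oifs_placeholder" is NOT a real app name; substitute the short name
     shown in client_state.xml for the OpenIFS application. -->
<app_config>
  <app>
    <name>oifs_placeholder</name>
    <max_concurrent>2</max_concurrent>  <!-- e.g. 2 x 12GB, roughly 24GB of RAM -->
  </app>
</app_config>
[/code]

The number is only an example sized against the 12GB figure mentioned above; adjust to the RAM actually available. |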