Message boards : Number crunching : New work Discussion
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
And bar some inevitable resends, the hadcm3s tasks are all gone now and at the current rate, it won't be that many days till the current batch of HADAM4 tasks are also gone. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I just now manually terminated all the "shorts" batch 926 waiting to run, because they've all been dying with SIGSEGV. If CPDN runs out of work, I can run other projects, and maybe spend a few hours on hardware and software updates. Keep on crunching. e |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,832,769 RAC: 5,024 |
I just now manually terminated all the "shorts" batch 926 waiting to run. Thanks for doing that! My ancient Mac mini is slowly crunching through those models without failing so far, and has just picked up one of your rejects. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I just now manually terminated all the "shorts" batch 926 waiting to run. I have had about 1/3 success and 2/3 fail with sigsegv on this batch. It will be interesting to see if the batch which uses restarts from the successes of this batch has the lower failure rate the project are hoping for. I understand the theory that the initial conditions for the failing batches are thought to be a bit whacky but all too often I have seen a difference between theory and practice with some of these. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I have had about 1/3 success and 2/3 fail with sigsegv on this batch. It will be interesting to see if the batch which uses restarts from the successes of this batch has the lower failure rate the project are hoping for. I understand the theory that the initial conditions for the failing batches are thought to be a bit whacky but all too often I have seen a difference between theory and practice with some of these.

I have had a 100% failure rate with this batch, but a pretty good success rate with similar ones from last spring; most were like this:

Name hadcm3s_a1dq_191012_120_900_012072461_0
Workunit 12072461
Created 31 Mar 2021, 15:02:24 UTC
Sent 21 Apr 2021, 0:45:49 UTC
Report deadline 3 Apr 2022, 6:05:49 UTC
Received 23 Apr 2021, 17:47:34 UTC
Server state Over
Outcome Success

but one failed like this. I consider it a legitimate failure, perhaps due to unfortunate values for the initial conditions. (The other user's attempt failed due to missing libraries.)

Task 22024864
Name hadcm3s_r157_190012_240_837_011897728_1
Workunit 11897728
Created 28 Feb 2021, 11:33:16 UTC
Sent 9 Mar 2021, 12:13:36 UTC
Report deadline 19 Feb 2022, 17:33:36 UTC
Received 12 Mar 2021, 11:59:59 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 22 (0x00000016) Unknown error code
Computer ID 1511241
Run time 2 days 5 hours 3 min 21 sec
CPU time 30 sec
Validate state Invalid
Credit 3,421.44
Device peak FLOPS 6.58 GFLOPS
Application version UK Met Office HadCM3 short v8.36 i686-pc-linux-gnu
Peak working set size 175.64 MB
Peak swap size 215.60 MB
Peak disk usage 98.92 MB

Stderr
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message> process exited with code 22 (0x16, -234)</message>
<stderr_txt>
MainError: 11:23:44 AM No files match the supplied pattern.
MainError: 11:23:44 AM No files match the supplied pattern.
MainError: 04:04:46 PM No files match the supplied pattern.
MainError: 04:04:46 PM No files match the supplied pattern.
MainError: 08:50:45 PM No files match the supplied pattern.
MainError: 08:50:45 PM No files match the supplied pattern.
MainError: 01:39:23 AM No files match the supplied pattern.
MainError: 01:39:23 AM No files match the supplied pattern.
MainError: 06:24:21 AM No files match the supplied pattern.
MainError: 06:24:21 AM No files match the supplied pattern.
MainError: 11:01:18 AM No files match the supplied pattern.
MainError: 11:01:18 AM No files match the supplied pattern.
MainError: 03:43:02 PM No files match the supplied pattern.
MainError: 03:43:02 PM No files match the supplied pattern.
MainError: 08:28:44 PM No files match the supplied pattern.
MainError: 08:28:44 PM No files match the supplied pattern.
MainError: 01:12:53 AM No files match the supplied pattern.
MainError: 01:12:53 AM No files match the supplied pattern.
MainError: 05:51:28 AM No files match the supplied pattern.
MainError: 05:51:28 AM No files match the supplied pattern.
MainError: 10:26:43 AM No files match the supplied pattern.
MainError: 10:26:43 AM No files match the supplied pattern.
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Sorry, too many model crashes! :-(
06:39:07 (80419): called boinc_finish(22)
</stderr_txt>
]]>

It seems to me that if the model or the initial conditions were bad, but only marginally so, then on machines whose processors and libraries differ slightly one would expect to get floating-point exceptions, NaN exceptions, and so forth, BUT NOT SEGMENTATION VIOLATIONS.
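To make that distinction concrete, here is a minimal C sketch (illustrative only, not CPDN code; feenableexcept is glibc-specific): ordinary IEEE arithmetic on bad values just propagates NaN or Inf silently and, at most, raises SIGFPE if the program explicitly enables trapping. A SIGSEGV only comes from touching memory the process does not own.

[code]
/* fpe_demo.c - illustrative only, not CPDN code.
 * Build: gcc -O0 -o fpe_demo fpe_demo.c -lm        (Linux/glibc)
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <math.h>
#include <fenv.h>

int main(void)
{
    volatile double theta = -1.0;   /* stand-in for a "bad" input value */

    /* Default IEEE behaviour: no signal at all, the bad value simply
     * propagates as NaN and can be tested for later.                  */
    double r = sqrt(theta);
    printf("sqrt(-1.0) = %f, isnan = %d\n", r, isnan(r));

    /* Only if the program asks for it does bad arithmetic raise a
     * signal, and that signal is SIGFPE, not SIGSEGV.                 */
    feclearexcept(FE_ALL_EXCEPT);
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
    r = 1.0 / (theta + 1.0);        /* divide by zero -> SIGFPE here    */
    printf("not reached: %f\n", r);
    return 0;
}
[/code]

So a run that dies with SIGSEGV is pointing at an addressing problem rather than just unstable numbers. |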
Send message Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258 |
Dave Jackson wrote: It will be interesting to see if the batch which uses restarts from the successes of this batch has the lower failure rate the project are hoping for. I understand the theory that the initial conditions for the failing batches are thought to be a bit whacky but all too often I have seen a difference between theory and practice with some of these. My Mac looks set to complete the seventh of the seven tasks it's received from the current HadCM3 batch tomorrow. Five of the work units had previously failed on other computers and three of them on two others. The eight previous computers all have a history of failing HadCM3s. The work units in this batch only get three chances each, so if their third chances go to "bad" computers, presumably they'll be "weeded out" as bad units. This and the 11.5-month deadlines make me wonder what useful science is actually being done here. :\ NG |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
The one computer that I tried to run these on failed all 8 tasks with segmentation violations. Looking at those work units, a couple had all 3 tasks fail with segmentation violation errors, a few had all tasks fail with various errors, and a couple had one task progress and produce trickles, including trickles on another Linux PC. Quite odd. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
As you say, George, "quite odd." I've just looked at the work units with most of my failures. Two went on to produce trickles but still failed eventually, one on a Mac and one on another Linux box. One, I found, went on to complete on another Linux box. I have just closed down the links; I should really have checked whether the ones that went on to produce trickles or complete on Linux were AMD architecture like my own, or Intel. Might get around to that later. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,903,103 RAC: 18,120 |
Just finished the last 2 HadCM3s; both failed. Altogether 2 for 22. Some were on WSL2 Ubuntu 20.04, some on Hyper-V Ubuntu 20.04, both setups on the same Ryzen 9 computer. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
The one computer that I tried to run these on failed all 8 tasks with segmentation violations. Looking at those work units, a couple had all 3 tasks fail with segmentation violation errors, a few had all tasks fail with various errors, and a couple had one task progress and produce trickles, including trickles on another Linux PC. Quite odd. Odd indeed. I'm clueless: SIGSEGV?? And some work, at least a bit, on Macs? I have no further useful input. I hope someone somewhere can figure this problem out. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Odd indeed. I'm clueless: SIGSEGV?? And some work, at least a bit, on Macs? Wikipedia has this to say (long): https://en.wikipedia.org/wiki/Segmentation_fault Of course, most of the above applies mainly to C and C++ programs, which can allocate RAM dynamically and use pointers. Since CPDN programs are mostly FORTRAN, they do not use pointers, so the usual way to get these faults is to let an array subscript run off the end of the array.
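For what it's worth, here is a minimal C sketch of that failure mode (illustrative only, not CPDN code). The same thing happens in Fortran when a subscript exceeds the declared bounds; building with gfortran's -fcheck=bounds would report the bad subscript instead of segfaulting, at some cost in speed.

[code]
/* oob_demo.c - illustrative only, not CPDN code.
 * Build: gcc -O0 -o oob_demo oob_demo.c
 */
#include <stdio.h>

#define NLEV 19                 /* array size chosen only for illustration */
static double theta[NLEV];

int main(void)
{
    /* A slightly out-of-range subscript often just scribbles on
     * neighbouring data: the run carries on, perhaps producing rubbish,
     * and whether it survives can differ from machine to machine.       */
    long k = 30;
    theta[k] = 273.0;
    printf("small overrun survived: wrote theta[%ld]\n", k);

    /* A wildly out-of-range subscript lands outside the pages mapped to
     * the process, and the kernel answers with SIGSEGV.                 */
    k = 50L * 1000 * 1000;
    theta[k] = 273.0;           /* segmentation fault here               */
    printf("not reached\n");
    return 0;
}
[/code]

The sketch uses a C array, but a Fortran subscripted array overrun ends up in exactly the same place: an access to an unmapped address and a SIGSEGV from the kernel. |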
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Hopper almost empty and so far nothing seen preparing to be poured in. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
All gone. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Two OpenIFS tasks currently running in testing but no discussion to suggest they are getting near ready to launch on main site. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
My machine has now run out of ClimatePrediction work units. It has also run out of Rosetta and Universe work units, so I am running only WCG work units, and at most 5 of them. This has greatly improved my cache hit ratio. I infer that the N216 work units are the ones that gobble up the processor cache(s). This is not a surprise, but it is gratifying to be able to see why.

# perf stat -aB -e cache-references,cache-misses

 Performance counter stats for 'system wide':

     7,549,164,603   cache-references
     1,883,988,132   cache-misses          #   24.956 % of all cache refs

      65.847219356 seconds time elapsed

# ps -fu boinc
UID          PID    PPID  C STIME TTY      TIME     CMD
boinc      19484       1  0 Jan23 ?        00:08:17 /usr/bin/boinc   [this is the boinc client]
boinc     509317   19484 99 05:37 ?        05:40:44 ../../projects/www.worldcommunitygrid.org/wcgrid_arp1_wrf_7.32_x86_64-pc-linux-gnu
boinc     525303   19484 98 10:14 ?        01:05:44 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_64-pc-linux-gnu
boinc     526551   19484 99 10:38 ?        00:42:01 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_64-pc-linux-gnu
boinc     527966   19484 99 11:01 ?        00:19:20 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_64-pc-linux-gnu
boinc     528648   19484 99 11:13 ?        00:06:50 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc-linux-gnu

-Sett |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
Hopper almost empty and so far nothing seen preparing to be poured in. I got an N216 and an hadcm3s work unit (both re-runs) recently and they are both running fine. I am kind of amazed at the hadcm3s one, because the 16 or so of those I ran recently all crashed with a segmentation fault after about three seconds, whereas this one has run for over 10 hours and delivered two trickles. The two previous attempts errored out for reasons I could not understand (not missing libraries, not segmentation violations); they were on apple-Darwin machines.

Task 22191699
Name hadcm3s_1k9d_200012_168_926_012129726_2
Workunit 12129726
Created 29 Jan 2022, 20:46:55 UTC
Sent 29 Jan 2022, 20:48:05 UTC
Report deadline 12 Jan 2023, 2:08:05 UTC
Received ---
Server state In progress
Outcome ---
Client state New
Exit status 0 (0x00000000)
Computer ID 1511241 |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Just downloading 4 N144 tasks from testing. I also ran some more OpenIFS tasks last week. As usual, I have no idea if and when these will translate into main site work. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I also ran some more OpenIFS tasks last week. As usual, I have no idea if and when these will translate into main site work. Do those OpenIFS tasks work, or do they crash? How much RAM do they currently take? Any idea how much processor cache they require? |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I have also ran some more openiFS tasks last week. As usual I have no idea if and when these will translate into main site work. They don't crash, the last batch I checked were taking 12GB of RAM each and uploads were about 550MB Haven't tried to check on CPU cache but it hasn't been raised as an issue by other testers so I suspect not as much as the N216 tasks. Some batches have had final uploads of over 1GB so I have had them uploading while I sleep if on a day when I am doing any Zoom calls. Obviously not an issue for those with real broad as opposed to bored band. |
Send message Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
They don't crash. The last batch I checked were taking 12GB of RAM each and the uploads were about 550MB. I haven't tried to check on CPU cache, but it hasn't been raised as an issue by other testers, so I suspect they need less than the N216 tasks. Some batches have had final uploads of over 1GB, so I have had them uploading while I sleep if it's a day when I am doing any Zoom calls. Obviously not an issue for those with real broadband as opposed to bored band. That is good information. The memory and bandwidth requirements are quite large, but a number of us could do a few at a time if that is what it takes. Of course, that may not be enough to do them much good, but that is another question.
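If they do arrive on the main site with that sort of footprint, BOINC's app_config.xml is one way to cap how many run at once per host. A minimal sketch, with the caveat that "oifs_placeholder" is not the real application name (substitute the short name from client_state.xml once the app exists); the file goes in the CPDN project directory and is picked up via Options → Read config files in the Manager, or a client restart:

[code]
<!-- app_config.xml : limit concurrent OpenIFS tasks.
     "oifs_placeholder" is NOT a real app name; substitute the short name
     shown in client_state.xml for the OpenIFS application. -->
<app_config>
  <app>
    <name>oifs_placeholder</name>
    <max_concurrent>2</max_concurrent>  <!-- e.g. 2 x 12GB, roughly 24GB of RAM -->
  </app>
</app_config>
[/code]

The number is only an example sized against the 12GB figure mentioned above; adjust to the RAM actually available. |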