Thread 'One of my oifs_43r3_bl_1018 tasks errored out.'

Author	Message
Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 70988 - Posted: 14 Jun 2024, 13:25:39 UTC Last modified: 14 Jun 2024, 13:32:14 UTC
One of my oifs_43r3_bl_1018 tasks errored out.
Task: 22443572
Name: oifs_43r3_bl_a1a0_2016092300_20_1018_12289983_0
Workunit: 12289983
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 1 (0x00000001) Unknown error code
Computer ID: 1511241
Run time: 7 hours 24 min 44 sec
CPU time: 7 hours 20 min 23 sec
Validate state: Invalid
Credit: 1,318.46
Device peak FLOPS: 5.93 GFLOPS
Application version: OpenIFS 43r3 Baroclinic Lifecycle v1.13 x86_64-pc-linux-gnu
Peak working set size: 5,548.55 MB
Peak swap size: 5,981.07 MB
Peak disk usage: 1,286.50 MB
I won't bore you with the entire stderr file, but the important part is:
[EC_DRHOOK:localhost.localdomain:1:1:1644391:1644391] [20240614:084327:1718369007:27437.112] [signal_drhook@/home/abowery/Working_folder/OpenIFS/oifs_43r3_bl/gc_oifs43r3_2/src/ifsaux/support/drhook.c:1734] DrHook backtrace done for signal#8, nsigs = 1
[EC_DRHOOK:localhost.localdomain:1:1:1644391:1644391] [20240614:084327:1718369007:27437.112] [signal_drhook@/home/abowery/Working_folder/OpenIFS/oifs_43r3_bl/gc_oifs43r3_2/src/ifsaux/support/drhook.c:1785] Calling previous signal handler at 0x1cf8da0 for signal#8, nsigs = 1
forrtl: error (72): floating overflow
I infer that the software and hardware are working correctly, but the mathematics of the model disagreed with reality. I have been processing _1018_ tasks successfully. This is the first failure.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 70989 - Posted: 14 Jun 2024, 14:47:19 UTC
I have had one error out too. I think it is just the physics of the model.

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 70990 - Posted: 14 Jun 2024, 16:12:30 UTC - in response to Message 70989.
forrtl: error (72): floating overflow
Does this mean the program is written in Fortran? It would not have occurred to me that Oifs tasks are written in that 1950s language. Or do they just call a library that is so written?
My failing task has been assigned to another user, so we will see how (s)he does with it.
Task: 22448825
Name: oifs_43r3_bl_a1a0_2016092300_20_1018_12289983_1
Workunit: 12289983

pututu Joined: 18 Jun 17 Posts: 18 Credit: 10,293,533 RAC: 33,275	Message 70991 - Posted: 14 Jun 2024, 18:52:45 UTC Last modified: 14 Jun 2024, 18:59:56 UTC
I have one WU where all three computers errored out.
https://www.cpdn.org/workunit.php?wuid=12291686
Same error: forrtl: error (72): floating overflow
This WU has two computers erroring out.
https://www.cpdn.org/cpdnboinc/workunit.php?wuid=12289928
This time the error number is slightly different: forrtl: error (65): floating invalid
Let us see if the third computer can validate the model.

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 70992 - Posted: 14 Jun 2024, 22:28:44 UTC - in response to Message 70991.
If one or both of the previous tasks in a workunit have failed with a floating point exception, the 3rd definitely will not work. It's expected some of the tasks will fail in this batch in this way. There's no need to report it.
--- CPDN Visiting Scientist

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 70993 - Posted: 14 Jun 2024, 23:31:51 UTC - in response to Message 70992. Last modified: 14 Jun 2024, 23:42:09 UTC
"It's expected some of the tasks will fail in this batch in this way. There's no need to report it."
I could expect the floating point overflow errors, but the forrtl: error (65): floating invalid is another thing entirely. If something got out of range, you can get a floating point overflow. An invalid floating point number would not be something out of range, but a bunch of bits that are not a floating point number at all.
https://lucid.co/techblog/2022/03/04/if-its-not-a-number-what-is-it-demystifying-nan-for-the-working-programmer
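For readers unfamiliar with the two messages, here is a minimal Python sketch of the general IEEE 754 distinction between an overflow (a result too large to represent, which becomes infinity) and an invalid operation (a result that is mathematically undefined, which becomes NaN). It assumes that the runtime's "floating overflow" and "floating invalid" messages correspond to these two IEEE 754 exceptions. The `classify` helper is hypothetical and is not how OpenIFS or the Fortran runtime detects these conditions.

```python
import math
import operator


def classify(a, b, op):
    """Label the outcome of op(a, b) for two non-NaN floats (illustrative only)."""
    result = op(a, b)
    if math.isnan(result):
        # No meaningful numeric answer exists, e.g. infinity minus infinity.
        return "invalid"
    if math.isinf(result) and math.isfinite(a) and math.isfinite(b):
        # Finite inputs produced a value too large for a double.
        return "overflow"
    return "ok"


print(classify(1e308, 10.0, operator.mul))         # finite * finite, too large
print(classify(math.inf, math.inf, operator.sub))  # inf - inf is undefined
print(classify(2.0, 3.0, operator.mul))            # ordinary result
```

Python quietly returns `inf` or `nan` in these cases. A program built with floating point trapping enabled would instead stop with an error at the first such result, which may be what the stderr output in this thread reflects.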

pututu Joined: 18 Jun 17 Posts: 18 Credit: 10,293,533 RAC: 33,275	Message 70994 - Posted: 15 Jun 2024, 3:32:08 UTC Last modified: 15 Jun 2024, 3:45:15 UTC
This WU is interesting. The first returned task shows "Error while computing" but full credit was granted. I didn't notice anything abnormal in the stderr output. The task then got sent to another PC, which failed with error (65): floating invalid.
https://www.cpdn.org/workunit.php?wuid=12292871
Is there a way for the standard boinc server setup to allow an exception, so that if the error is 72 (floating overflow) the task won't be recycled, since it is going to fail anyway on the next two hosts? My guess is that it's probably not easy to do, but I'm just asking. Alternatively, each volunteer could manually check the recycled tasks and decide on their own whether to abort or let them crunch.
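The manual check suggested above could be sketched roughly as follows. This is only an illustration of the idea, not part of BOINC or the CPDN server: the function names are hypothetical, the pattern is based solely on the two `forrtl` lines quoted in this thread, and the "do not resend" policy simply restates Glenn Carver's remark in Message 70992.

```python
import re

# Matches lines such as "forrtl: error (72): floating overflow".
FORRTL_FP = re.compile(r"forrtl: error \((\d+)\): floating (\w+)")


def floating_point_failure(stderr_text):
    """Return (code, kind) for the first forrtl floating point error, else None."""
    match = FORRTL_FP.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def worth_resending(previous_stderrs):
    """Hypothetical policy: a resend is pointless once any earlier
    attempt has stopped with a floating point exception."""
    return not any(floating_point_failure(text) for text in previous_stderrs)


print(floating_point_failure("forrtl: error (72): floating overflow"))
print(worth_resending(["forrtl: error (65): floating invalid"]))
```

A volunteer would paste the stderr text from each earlier result page into `worth_resending` before deciding whether to let a resent task run.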

alanb1951 Joined: 31 Aug 04 Posts: 38 Credit: 9,581,380 RAC: 3,853	Message 70995 - Posted: 15 Jun 2024, 4:45:28 UTC
I've had one of those "apparently valid but flagged as error" tasks as well.
Workunit https://www.cpdn.org/workunit.php?wuid=12293995
Looking at the stderr.txt, it appears that a SIGKILL has been issued after boinc_finish() has been called; hence the Error status :-(
Also, looking at some of the wingmen for failing tasks, I notice there are one or two cases where there's no stderr.txt visible on the result page. It looks like some systems are sometimes having problems completing the wrap-up of tasks (successfully completed or otherwise!)
Cheers - Al.

wujj123456 Joined: 14 Sep 08 Posts: 130 Credit: 44,254,664 RAC: 9,487	Message 70998 - Posted: 16 Jun 2024, 18:00:07 UTC - in response to Message 70994. Last modified: 16 Jun 2024, 18:04:17 UTC
I have a couple of interesting ones that I had to abort. Upon reaching 99.98% or so, they just never finished, with the time left continuing to count hours into negative territory. For one of them, I checked `ps` and the oifs process had actually already exited. I originally thought it was specific to one of my hosts, until another host got a similar result. However, the resends were successful. It's unclear to me what went wrong for them. Perhaps something in the wrapper that handles the final results?
https://www.cpdn.org/result.php?resultid=22441906
https://www.cpdn.org/result.php?resultid=22443068
https://www.cpdn.org/result.php?resultid=22443808
https://www.cpdn.org/result.php?resultid=22446273
It's pretty rare though, affecting ~1% of my WUs so far. Just a bit annoying to babysit because I need to abort them manually...

ChelseaOilman Joined: 24 Dec 19 Posts: 32 Credit: 41,655,699 RAC: 15,244	Message 70999 - Posted: 16 Jun 2024, 18:24:49 UTC Last modified: 16 Jun 2024, 18:25:07 UTC
I also had 2 tasks get to 99.9% and just sit there doing nothing, one each on 2 different computers. I aborted them.

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 71001 - Posted: 16 Jun 2024, 21:38:50 UTC - in response to Message 70999.
Ok, thanks for reporting. I will bring this up with Andy tomorrow as he manages the code controlling the OpenIFS model. I don't work on it anymore.
--- CPDN Visiting Scientist

pututu Joined: 18 Jun 17 Posts: 18 Credit: 10,293,533 RAC: 33,275	Message 71020 - Posted: 22 Jun 2024, 3:19:34 UTC
I got one task that stayed at 99.99% much longer than my average run time. I suspended the task and kept crunching the rest of the tasks. At some point, I restarted the boinc client and resumed that suspended task, and it got validated. Maybe I just restarted the missing or stalled application. Anyway, this was my first encounter.

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 71021 - Posted: 22 Jun 2024, 8:16:24 UTC - in response to Message 71020.
Which task was it? I can look at the logs and see what happened.
--- CPDN Visiting Scientist

pututu Joined: 18 Jun 17 Posts: 18 Credit: 10,293,533 RAC: 33,275	Message 71022 - Posted: 22 Jun 2024, 13:30:05 UTC
Sorry, I didn't write down that particular task number. After restarting the boinc client, the run time got recalculated and went back to the normal average run time (not the cpu time), and the task continued from where it thinks it left off. I believe this happens to every task when you restart the boinc client. It will recalculate the run time again. At least this is what I observed on my system.

pututu Joined: 18 Jun 17 Posts: 18 Credit: 10,293,533 RAC: 33,275	Message 71023 - Posted: 22 Jun 2024, 15:43:30 UTC - in response to Message 71022.
"Sorry, I didn't write down that particular task number. After restarting the boinc client, the run time got recalculated and went back to the normal average run time (not the cpu time), and the task continued from where it thinks it left off. I believe this happens to every task when you restart the boinc client. It will recalculate the run time again. At least this is what I observed on my system."
I meant to say that it uses data from the checkpoint to recalculate the run time.

Glenn Carver Joined: 29 Oct 17 Posts: 1067 Credit: 17,020,946 RAC: 5,160	Message 71024 - Posted: 23 Jun 2024, 14:21:24 UTC - in response to Message 71023.
To calculate the percentage done, the controlling program (not boinc) reads one of the model log files to see what model step it's on. I think that sometimes goes wrong. When boinc and therefore the task are restarted, the file is read correctly.
--- CPDN Visiting Scientist
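The mechanism described above might look something like the following Python sketch. Everything here is an assumption made for illustration: the thread does not show the real controller code or the log format, so the sketch pretends each progress line ends with an integer step number, and both function names are hypothetical.

```python
def last_step(log_text):
    """Return the step number from the last line ending in an integer, or None.

    Assumed log format (hypothetical): progress lines end with the step number.
    """
    for line in reversed(log_text.splitlines()):
        fields = line.split()
        if fields and fields[-1].isdigit():
            return int(fields[-1])
    return None


def percent_done(log_text, total_steps, previous_percent=0.0):
    """Percentage complete, falling back to the last known value when the
    log cannot be read (a guess at the failure mode described above)."""
    step = last_step(log_text)
    if step is None:
        return previous_percent
    return min(100.0, 100.0 * step / total_steps)


print(percent_done("step 10\nstep 50\n", total_steps=200))
print(percent_done("", total_steps=200, previous_percent=12.5))
```

If a read fails or returns a stale step, the reported percentage stops advancing even though the model keeps running. That would be consistent with tasks appearing stuck near 99.9% until a restart forces a fresh read.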

Jean-David Beyer Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 71222 - Posted: 12 Aug 2024, 16:13:01 UTC
I just got six 1018 work units. Four are running and two are waiting to run. They are all _1 reruns and all the _0 ones failed. Here is one of mine:
Task: 22505782
Name: oifs_43r3_bl_a0k2_2016092300_20_1018_12289049_1
Workunit: 12289049
All the _0 ones are from the same machine and user. Here is one of those work units.
Workunit 12289049
Task | Computer | Sent | Time reported or deadline | Status | Run time | CPU time | Credit | Application
22505782 | 1511241 | 12 Aug 2024, 7:18:49 UTC | 11 Oct 2024, 7:18:49 UTC | In progress | --- | --- | 1,318.46 | OpenIFS 43r3 Baroclinic Lifecycle v1.13 x86_64-pc-linux-gnu
22442624 | 1443502 | 13 Jun 2024, 7:16:37 UTC | 12 Aug 2024, 7:16:37 UTC | Timed out - no response | 0.00 | 0.00 | --- | OpenIFS 43r3 Baroclinic Lifecycle v1.13 x86_64-pc-linux-gnu
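The task names quoted in this thread can be pulled apart with a short Python sketch like the one below. The field positions are inferred only from the names and numbers posted here (batch 1018, the Workunit number, and the trailing _0 or _1), not from any documented naming scheme, and `parse_task_name` is a hypothetical helper. The three middle fields are left unlabelled because the thread does not say what they mean.

```python
def parse_task_name(name):
    """Split a task name into the fields this thread refers to (inferred layout)."""
    fields = name.split("_")
    return {
        "app": "_".join(fields[:3]),   # e.g. oifs_43r3_bl
        "batch": int(fields[-3]),      # e.g. 1018
        "workunit": int(fields[-2]),   # matches the Workunit number shown
        "attempt": int(fields[-1]),    # 0 = original task, 1 = first resend
        "other": fields[3:-3],         # meaning not stated in the thread
    }


info = parse_task_name("oifs_43r3_bl_a0k2_2016092300_20_1018_12289049_1")
print(info["batch"], info["workunit"], info["attempt"])
```

An attempt value above 0 flags a resend, which is the case where a volunteer might want to look at the earlier results for the same workunit.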

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 71223 - Posted: 12 Aug 2024, 16:28:55 UTC - in response to Message 71222.
That the originals errored out with 0 run time and your running ones have produced a first zip suggests a problem with the original computer rather than the tasks.

Helmer Bryd Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 71224 - Posted: 12 Aug 2024, 16:53:03 UTC - in response to Message 71223. Last modified: 12 Aug 2024, 17:01:51 UTC
"That the originals errored out with 0 run time and your running ones have produced a first zip suggests a problem with the original computer rather than the tasks."
No, read more carefully please: "Timed out - no response". Today is the deadline for that batch, 2 months after the tasks were sent.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4559 Credit: 19,039,635 RAC: 18,944	Message 71225 - Posted: 12 Aug 2024, 18:17:57 UTC - in response to Message 71224.
My bad. Presumably the tasks were downloaded and never started.