One of my oifs_43r3_bl_1018 tasks errored out.
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
One of my oifs_43r3_bl_1018 tasks errored out. Task: 22443572; Name: oifs_43r3_bl_a1a0_2016092300_20_1018_12289983_0; Workunit: 12289983; Server state: Over; Outcome: Computation error; Client state: Compute error; Exit status: 1 (0x00000001) Unknown error code; Computer ID: 1511241; Run time: 7 hours 24 min 44 sec; CPU time: 7 hours 20 min 23 sec; Validate state: Invalid; Credit: 1,318.46; Device peak FLOPS: 5.93 GFLOPS; Application version: OpenIFS 43r3 Baroclinic Lifecycle v1.13 x86_64-pc-linux-gnu; Peak working set size: 5,548.55 MB; Peak swap size: 5,981.07 MB; Peak disk usage: 1,286.50 MB. I won't bore you with the entire stderr file, but the important part is: [EC_DRHOOK:localhost.localdomain:1:1:1644391:1644391] [20240614:084327:1718369007:27437.112] [signal_drhook@/home/abowery/Working_folder/OpenIFS/oifs_43r3_bl/gc_oifs43r3_2/src/ifsaux/support/drhook.c:1734] DrHook backtrace done for signal#8, nsigs = 1 [EC_DRHOOK:localhost.localdomain:1:1:1644391:1644391] [20240614:084327:1718369007:27437.112] [signal_drhook@/home/abowery/Working_folder/OpenIFS/oifs_43r3_bl/gc_oifs43r3_2/src/ifsaux/support/drhook.c:1785] Calling previous signal handler at 0x1cf8da0 for signal#8, nsigs = 1 forrtl: error (72): floating overflow. I infer that the software and hardware are working correctly, but the mathematics of the model disagreed with reality. I have been processing _1018_ tasks successfully. This is the first failure. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I have had one error out too. I think it is just the physics of the model. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
`forrtl: error (72): floating overflow` Does this mean the program is written in Fortran? It would not have occurred to me that OIFS tasks are written in that 1950s language. Or do they just call a library that is so written? My failing task has been assigned to another user, so we will see how (s)he does with it. Task: 22448825; Name: oifs_43r3_bl_a1a0_2016092300_20_1018_12289983_1; Workunit: 12289983 |
Send message Joined: 18 Jun 17 Posts: 18 Credit: 9,962,503 RAC: 46,650 |
I have one WU where all three computers errored out: https://www.cpdn.org/workunit.php?wuid=12291686 It is the same error, `forrtl: error (72): floating overflow`. This other WU has two computers erroring out: https://www.cpdn.org/cpdnboinc/workunit.php?wuid=12289928 This time the error number is slightly different: `forrtl: error (65): floating invalid`. Let us see if the third computer can validate the model. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
If one or both of the previous tasks in a workunit have failed with a floating point exception, the 3rd definitely will not work. It's expected some of the tasks will fail in this batch in this way. There's no need to report it. --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
It's expected some of the tasks will fail in this batch in this way. There's no need to report it. I could expect the floating point overflow errors, but `forrtl: error (65): floating invalid` is another thing entirely. If something goes out of range, you can get a floating point overflow, but an invalid floating point number would not be something out of range; it would be a bunch of bits that are not a floating point number at all. https://lucid.co/techblog/2022/03/04/if-its-not-a-number-what-is-it-demystifying-nan-for-the-working-programmer |
Send message Joined: 18 Jun 17 Posts: 18 Credit: 9,962,503 RAC: 46,650 |
This WU is interesting: https://www.cpdn.org/workunit.php?wuid=12292871 The first returned task has "Error while computing" but full credit was granted. I didn't notice anything abnormal in the stderr output. The task was then sent to another PC, where it failed with error (65): floating invalid. Is there a way for the standard BOINC server setup to make an exception, so that if the error is 72 (floating overflow) the task is not recycled, since it is going to fail on the next two hosts anyway? My guess is that it is probably not easy to do, but I'm just asking. Alternatively, each volunteer could manually check the recycled tasks and decide on their own whether to abort or let them crunch. |
Send message Joined: 31 Aug 04 Posts: 37 Credit: 9,581,380 RAC: 3,853 |
I've had one of those "apparently valid but flagged as error" tasks as well. Workunit https://www.cpdn.org/workunit.php?wuid=12293995 Looking at the stderr.txt it appears that a SIGKILL has been issued after boinc_finish() has been called -- hence the Error status :-( Also, looking at some of the wingmen for failing tasks, I notice there are one or two cases where there's no stderr.txt visible on the result page -- looks like some systems are sometimes having problems completing the wrap-up of tasks (successfully completed or otherwise!) Cheers - Al. |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 42,540,021 RAC: 76,179 |
I have a couple of interesting ones that I had to abort. Upon reaching 99.98% or so, they just never finished, with the time remaining continuing to count hours into negative territory. For one of them, I checked `ps` and the oifs process had actually exited already. I originally thought it was specific to one of my hosts, until another host got a similar result. However, the resends were successful. It's unclear to me what went wrong for them. Perhaps something in the wrapper that handles the final results? https://www.cpdn.org/result.php?resultid=22441906 https://www.cpdn.org/result.php?resultid=22443068 https://www.cpdn.org/result.php?resultid=22443808 https://www.cpdn.org/result.php?resultid=22446273 It's pretty rare though, affecting ~1% of my WUs so far. Just a bit annoying to babysit because I need to abort them manually... |
Send message Joined: 24 Dec 19 Posts: 32 Credit: 41,364,675 RAC: 49,156 |
I also had 2 tasks get to 99.9% and just sit there doing nothing. One each on 2 different computers. I aborted them. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
Ok, thanks for reporting. I will bring this up with Andy tomorrow as he manages the code controlling the OpenIFS model. I don't work on it anymore. --- CPDN Visiting Scientist |
Send message Joined: 18 Jun 17 Posts: 18 Credit: 9,962,503 RAC: 46,650 |
I got one task that stayed at 99.99% much longer than my average run time. I suspended the task and kept crunching the rest of the tasks. At some point, I restarted the boinc client and resumed that suspended task, and it got validated. Maybe I just restarted a missing or stalled application. Anyway, this was my first encounter. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
Which task was it? I can look at the logs and see what happened. --- CPDN Visiting Scientist |
Send message Joined: 18 Jun 17 Posts: 18 Credit: 9,962,503 RAC: 46,650 |
Sorry, I didn't write down that particular task number. After restarting the boinc client, the run time got recalculated and went back to the normal average run time (not the CPU time), and the task continued from where it thought it had left off. I believe this happens to every task when you restart the boinc client: it recalculates the run time. At least this is what I observed on my system. |
Send message Joined: 18 Jun 17 Posts: 18 Credit: 9,962,503 RAC: 46,650 |
Sorry, I didn't write down that particular task number. After restarting the boinc client, the run time got recalculated and went back to the normal average run time (not the CPU time), and the task continued from where it thought it had left off. I believe this happens to every task when you restart the boinc client: it recalculates the run time. At least this is what I observed on my system. I meant to say that it uses data from the checkpoint to recalculate the run time. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,578,380 RAC: 15,009 |
To calculate the percentage done, the controlling program (not boinc) reads one of the model log files to see what model step it's on. I think that sometimes goes wrong. When boinc and therefore the task are restarted, the file is read correctly. --- CPDN Visiting Scientist |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I just got six 1018 work units. Four are running and two are waiting to run. They are all _1 reruns and all the _0 ones failed. Here is one of mine: Task: 22505782; Name: oifs_43r3_bl_a0k2_2016092300_20_1018_12289049_1; Workunit: 12289049. All the _0 ones are from the same machine and user. Here is one of those work units, Workunit 12289049: 22505782 1511241 12 Aug 2024, 7:18:49 UTC 11 Oct 2024, 7:18:49 UTC In progress --- --- 1,318.46 OpenIFS 43r3 Baroclinic Lifecycle v1.13 x86_64-pc-linux-gnu; 22442624 1443502 13 Jun 2024, 7:16:37 UTC 12 Aug 2024, 7:16:37 UTC Timed out - no response 0.00 0.00 --- OpenIFS 43r3 Baroclinic Lifecycle v1.13 x86_64-pc-linux-gnu |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
That the originals errored out with 0 run time and your running ones have produced a first zip suggests a problem with the original computer rather than the tasks. |
Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928 |
That the originals errored out with 0 run time and your running ones have produced a first zip suggests a problem with the original computer rather than the tasks. No, read more carefully please: "Timed out - no response". Today is the deadline for that batch, 2 months. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
My bad. Presumably tasks downloaded and never started. |
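
For anyone trying to tell the two error messages quoted in this thread apart, here is a minimal Python sketch. It is not CPDN or OpenIFS code. It assumes only the `forrtl: error (NN): text` shape seen in the stderr excerpts, and the helper names `find_forrtl_errors` and `classify` are made up for illustration. As far as I know, "floating overflow" corresponds to a result too large to represent, while "floating invalid" corresponds to an operation with no defined result (such as infinity minus infinity), which yields a NaN rather than an out-of-range number.

```python
import math
import re

# Matches lines shaped like the stderr excerpts quoted in this thread,
# e.g. "forrtl: error (72): floating overflow". Assumes one message per line.
FORRTL_RE = re.compile(r"forrtl: error \((\d+)\): (.+)")


def find_forrtl_errors(stderr_text):
    """Return (code, message) pairs for every forrtl error line in the text."""
    return [(int(code), msg.strip()) for code, msg in FORRTL_RE.findall(stderr_text)]


def classify(value):
    """Name the floating point condition that a result value suggests."""
    if math.isnan(value):
        return "invalid"   # no defined result, e.g. inf - inf
    if math.isinf(value):
        return "overflow"  # magnitude exceeded the largest double
    return "finite"


# Hypothetical two-line stderr sample built from the messages in this thread.
sample = (
    "forrtl: error (72): floating overflow\n"
    "forrtl: error (65): floating invalid\n"
)
errors = find_forrtl_errors(sample)

# Python does not trap these conditions by default; it quietly returns
# inf or nan. The model run in this thread was stopped instead (signal 8).
too_big = 1.0e308 * 10.0          # overflows to inf
not_a_number = too_big - too_big  # inf - inf gives nan

print(errors)
print(classify(too_big), classify(not_a_number))
```

A volunteer could run something like `find_forrtl_errors` over a saved stderr.txt to see which kind of failure a resent task's earlier attempts reported before deciding whether to let it crunch.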
©2024 cpdn.org