Message boards : Number crunching : OpenIFS Discussion
Author | Message |
---|---|
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,967,615 RAC: 14,422 |
"Initially the BOINC estimated run time is off likely due to the new app version that BOINC has no data for yet." For the six that I have from batch 990 the estimated run time is 2days 23hrs compared to 16hrs (ish) for the previous batches. Edit: Actually running at 5.04% per hour. First one 73% complete after 14 hrs, remaining estimated at 19hrs so adjusting as it goes. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Initially the BOINC estimated run time is off likely due to the new app version that BOINC has no data for yet." Unfortunately, I have no estimate of how long they were to take. Task 22250483 First one done on my Linux machine... Name oifs_43r3_bl_a051_2016092300_15_949_12166575_0 Workunit 12166575 Created 14 Dec 2022, 14:15:27 UTC Sent 14 Dec 2022, 14:24:00 UTC Report deadline 13 Jan 2023, 14:24:00 UTC Received 15 Dec 2022, 12:25:25 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x00000000) Computer ID 1511241 Run time 6 hours 46 min 55 sec CPU time 6 hours 41 min 2 sec Validate state Valid Credit 1,232.00 Application version OpenIFS 43r3 Baroclinic Lifecycle v1.07 x86_64-pc-linux-gnu Task 22250807 Most recent one done on my Linux machine. Name oifs_43r3_bl_a04c_2016092300_15_949_12166550_2 Workunit 12166550 Created 19 Dec 2022, 2:21:53 UTC Sent 19 Dec 2022, 2:23:58 UTC Report deadline 18 Jan 2023, 2:23:58 UTC Received 19 Dec 2022, 9:23:21 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x00000000) Computer ID 1511241 Run time 6 hours 12 min 40 sec CPU time 6 hours 7 min 11 sec Validate state Valid Credit 1,232.00 |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,807,823 RAC: 19,824 |
"Unfortunately, I have no estimate of how long they were to take." Those 2 tasks are from a BL test batch (949) from a couple of months ago using the old app version (1.07). I'm not sure that I'd use them for any significant info or comparison, as they were just part of the initial test runs in preparation for the OIFS release. Production runs are likely to be different and will use the latest app version (1.11 or newer). |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
Got this on the last of my tasks from 990 [EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1734] DrHook backtrace done for signal#8, nsigs = 1 [EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1785] Calling previous signal handler at 0x1ce8cf0 for signal#8, nsigs = 1 forrtl: error (65): floating invalid |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
If you want to know what the signals mean in Linux, consider the following table where they are defined. Especially #8. Floating point exception. You might wish to keep it around for reference. And here is an explanation on how it can occur. https://itslinuxfoss.com/floating-point-exception-core-dumped/ #define SIGHUP 1 #define SIGINT 2 #define SIGQUIT 3 #define SIGILL 4 #define SIGTRAP 5 #define SIGABRT 6 #define SIGIOT 6 #define SIGBUS 7 #define SIGFPE 8 #define SIGKILL 9 #define SIGUSR1 10 #define SIGSEGV 11 #define SIGUSR2 12 #define SIGPIPE 13 #define SIGALRM 14 #define SIGTERM 15 #define SIGSTKFLT 16 #define SIGCHLD 17 #define SIGCONT 18 #define SIGSTOP 19 #define SIGTSTP 20 #define SIGTTIN 21 #define SIGTTOU 22 #define SIGURG 23 #define SIGXCPU 24 #define SIGXFSZ 25 #define SIGVTALRM 26 #define SIGPROF 27 #define SIGWINCH 28 #define SIGIO 29 #define SIGPOLL SIGIO /* #define SIGLOST 29 */ #define SIGPWR 30 #define SIGSYS 31 #define SIGUNUSED 31 |
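As a quick way to check a signal number like the #8 above without keeping the table to hand, here is a minimal sketch using Python's standard `signal` module. The helper name `describe_signal` is made up for illustration; the numbers it reports are those of whatever platform it runs on, which on x86 Linux should match the table in the post.

```python
import signal


def describe_signal(number: int) -> str:
    """Return 'NAME: description' for a signal number on this platform."""
    sig = signal.Signals(number)  # raises ValueError for an unknown number
    # strsignal() asks the C library for a human-readable description.
    text = signal.strsignal(number) or "no description available"
    return f"{sig.name}: {text}"


# Signal 8 is the one reported in the DrHook traceback.
print(describe_signal(8))
```

On a typical Linux system this should print something like `SIGFPE: Floating point exception`, though the exact wording of the description comes from the system's C library.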
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
"If you want to know what the signals mean in Linux, consider the following table where they are defined. Especially #8. Floating point exception. You might wish to keep it around for reference." Thanks. Looking at the link and in a couple of other places, I suspect this one is down to the physics of the model producing a value that the program doesn't like. It will be interesting to see what happens on subsequent attempts. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Thanks, looking at the link and also in a couple of other places, this one is I suspect down to the physics of the model producing a value that the program doesn't like. It will be interesting to see what happens on subsequent attempts." I agree. But do not overlook the possibility of bad addresses, bad subscripts in arrays, or dynamically allocated memory that has been freed yet is still used by a defective program. These can all give very strange, difficult-to-reproduce errors. |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
"But do not overlook the possibility of bad addresses, bad subscripts in arrays, or using dynamically allocated memory that has been freed, yet still used by defective programs. These can all give very strange, difficult-to-reproduce, errors." Of course, though very little was running on this computer. The only programs open apart from BOINC were Firefox with a couple of tabs open, Thunderbird and LibreOffice. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Got this on the last of my tasks from 990" "Of course. Though very little running on this computer. Only programs open apart from BOINC were Firefox with only a couple of tabs open, Thunderbird and Libre Office." If you could somehow send me that work unit, I could run it on my machine. IIRC my machine has never failed to complete any of these Oifs work units. Neither the _ps nor the _bl ones. Almost 300 tasks altogether. P.S.: I just got Name oifs_43r3_ps_0561_2021050100_123_990_12206681_2 Workunit 12206681 that has failed for the two previous attempts. Each has failed for very different reasons. I betcha it works on my machine. |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
"If you could somehow send me that work unit, I could run it on my machine. IIRC my machine has never failed to complete any of these Oifs work units. Neither the _ps nor the _bl ones. Almost 300 tasks altogether." The Intel machine running the second attempt has a very good record (about 1% failure rate). I don't know whether the failure on my machine is something AMD ones are more prone to. I know the facility to send a work unit to only a specific machine exists, as it has been used on the testing site at times, but I am not aware of it ever being used on the main site. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"The Intel machine running the second attempt has a very good record (about 1% failure rate.) I don't know if the failure on my machine is something AMD ones are more prone to or not." Well, my machine is Intel; at present I am allowing 12 cores to run BOINC tasks. CPU type GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7] Number of processors 16 Operating System Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28] BOINC version 7.20.2 Memory 62.4 GB Cache 16896 KB Swap space 15.62 GB Total disk space 488.04 GB Free Disk Space 479.24 GB |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Got this on the last of my tasks from 990" I've seen that one. If you're interested, look further back in the traceback and you'll see: >OMP-RADINTG-RADLSW (1210) RADIATION_SCHEME radiation_interface:radiation radiation_cloud_optics:cloud_optics The model has failed in the radiation code. Floating invalid is usually a divide-by-zero. There were a few WUs that failed each try because the butterfly wings were perhaps too big :) Interesting though, there were a few other cases where the model failed like this on AMD hardware, the resend went to an Intel CPU and worked fine. Which is why they've been tried again. It's got nothing to do with bad memory etc. It's just normal differences in floating point arithmetic seen on different hardware (and OSes from different math libs). |
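To make the point about "normal differences in floating point arithmetic" concrete: floating-point addition is not associative, so a compiler, vector unit or math library that evaluates the same expression in a different order can legitimately produce different last bits. The sketch below is plain Python and has nothing to do with the OpenIFS source; the values are arbitrary illustrations.

```python
# Same three numbers, two evaluation orders.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)   # False: the two orders round differently
print(left, right)     # 0.6000000000000001 0.6

# A more extreme case: a small term is absorbed by a large one
# unless the large terms cancel first.
big, small = 1e16, 1.0
absorbed = (big + small) - big    # small is lost in the rounding of big + small
preserved = (big - big) + small   # large terms cancel before small is added
print(absorbed, preserved)        # 0.0 1.0
```

In a long model integration such last-bit differences feed into later time steps, which is one plausible way the same work unit can hit an invalid operation on one machine and run cleanly on another.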
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"It's got nothing to do with bad memory etc. It's just normal differences in floating point arithmetic seen on different hardware (and OSes from different math libs)." I agree that this is the most likely explanation. I did not mean that it was a failure of the memory of the machine doing the task. I meant that the current (and probably all future) Oifs models do a lot of memory allocation and freeing during their execution, and some failures seem to complain about freeing the same memory more than once, which most likely indicates a programming error. That kind of error is most vexing to find. In a former life, I was involved in writing (part of) the optimizer for the C compiler in UNIX. People accused the optimizer of being defective because it gave different results than when code was not optimized. It turns out that the optimizer was not at fault. We guaranteed that our optimizer would give the same result for correctly-written code, but were silent about what would happen for incorrect code. We even compiled and ran the UNIX kernel and all the libraries with the optimizer turned on. It turns out that there was a lot of code out there that used pointers that were not initialized, so G.O.K. what values they had. Most were zero, and it was easy to trap those since we never stored anything in the bottom page of RAM, so all traps to there were uninitialized pointers. We found so many of those that they would not even read my MRs after a while. We had a secretary file my MRs with her name on them for a while, but then they caught on. By the time I left, they had never fixed those problems. |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"It's got nothing to do with bad memory etc. It's just normal differences in floating point arithmetic seen on different hardware (and OSes from different math libs)." "I agree that this is the most likely explanation." Don't get me started on code optimizers - especially when dealing with vector instructions. I have a couple of stories there... Anyway, the OpenIFS code does do a lot of heap allocate/free (it's mostly Fortran code), but the memory problems that have been reported here are not from the model but from the C++ wrapper code that monitors it and talks to BOINC, just in case I've confused things. It's newer code and not as tried and tested as the model. I agree completely about being careful with code & optimizers. I once saw a model go from radiative heating in the model stratosphere to radiative cooling just by moving the code to a new machine & compiler (I forget what that was now). That wasn't a good thing, and it took time to understand. Before we put out these batches, which have slight model perturbations, the question of how much perturbation arises from different computers was discussed. The machine perturbations are relatively small compared to the model changes being made, so the "hardware-only" model outcomes will still be part of the perturbation space explored by the scientist's perturbations. |
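For anyone wondering how a rounding-level "machine perturbation" can matter at all, here is a toy illustration. It uses the logistic map as a stand-in for a chaotic system; this is an assumption made purely for demonstration and is in no way the OpenIFS model. The starting value, the nudge of 1e-12 and the function name `logistic_trajectory` are all hypothetical choices.

```python
def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> list:
    """Iterate x -> r * x * (1 - x), returning every value including x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs that differ only by a tiny nudge in the starting value,
# standing in for a last-bit difference between two machines.
base = logistic_trajectory(0.4, 80)
nudged = logistic_trajectory(0.4 + 1e-12, 80)

gaps = [abs(p - q) for p, q in zip(base, nudged)]
first_large = next((i for i, g in enumerate(gaps) if g > 0.1), None)
print(f"initial gap: {gaps[0]:.2e}")
print(f"largest gap: {max(gaps):.2f}")
print(f"first step with gap > 0.1: {first_large}")
```

The two runs start indistinguishably close and end up completely different, which is the "butterfly wings" effect mentioned earlier in the thread. The argument in the post is that this spread from hardware alone is still small relative to the deliberate model perturbations in each batch.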
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I just got I win. My attempt worked just fine: Task 22306953 Name oifs_43r3_ps_0561_2021050100_123_990_12206681_2 Workunit 12206681 Created 11 Feb 2023, 12:22:34 UTC Sent 11 Feb 2023, 12:25:27 UTC Report deadline 12 Apr 2023, 12:25:27 UTC Received 12 Feb 2023, 3:56:11 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x00000000) Computer ID 1511241 Run time 15 hours 20 min 39 sec CPU time 15 hours 1 min 51 sec Validate state Valid Credit 0.00 Device peak FLOPS 6.06 GFLOPS Application version OpenIFS 43r3 Perturbed Surface v1.09 x86_64-pc-linux-gnu OpenIFS 43r3 Perturbed Surface 1.09 x86_64-pc-linux-gnu Number of tasks completed 2 Max tasks per day 6 Number of tasks today 0 Consecutive valid tasks 2 Average processing rate 28.85 GFLOPS Average turnaround time 0.64 days |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
And the one that failed on my Ryzen has completed on its second attempt. |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
Now running one that has failed once on an Intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast) |
Send message Joined: 29 Oct 17 Posts: 1049 Credit: 16,432,494 RAC: 17,331 |
"Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast)" Don't take this the wrong way, but I sincerely hope that fails as well. Then we may have found a repeatable failure - which has eluded me so far. As for the other case (AMD: fail, Intel: OK), I am wondering whether to turn down the optimization level on the Intel compiler I use for the model. Thx for reporting. Links to the WUs pages are useful too. |
Send message Joined: 15 May 09 Posts: 4536 Credit: 18,993,249 RAC: 21,753 |
"Thx for reporting. Links to the WUs pages are useful too." Work unit I am only running the one task at the moment, with the maximum set to 2, which will minimise the chances of other tasks interfering. Edited to provide the correct work unit. Edit2: Intel failed after uploading zip 95. The AMD managed another 10 zips, so possibly not a smoking gun. |
Send message Joined: 2 Oct 19 Posts: 21 Credit: 47,674,094 RAC: 24,265 |
I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project? |
©2024 cpdn.org