Message boards : Number crunching : OpenIFS Discussion
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
I wonder about OpenIFS crashes. I have received a re-run work unit, https://www.cpdn.org/workunit.php?wuid=12166060, and I am the third (and last?) to try it. I am a little over two hours into it with about 13.5 hours to go, so it is a bit premature to jump to any firm conclusions. But I do notice that the two previous attempts took a lot more time than mine has so far, and both crashed with no stderr and no diagnostics. Both of those hosts are running VirtualBox, and I am not. We will see if I do better than they did. Mine is the last of these three:
Host 1: Virtualization Virtualbox (6.1.34_Ubuntur150636) installed, CPU has hardware virtualization support and it is enabled. Operating System Linux Linuxmint Linux Mint 20.3 [5.4.0-122-generic|libc 2.31]
Host 2: Virtualization Virtualbox (6.1.26_Ubuntur145957) installed, CPU has hardware virtualization support and it is enabled. Operating System Linux Ubuntu Ubuntu 21.10 [5.13.0-52-generic|libc 2.34 (Ubuntu GLIBC 2.34-0ubuntu3.2)]
Host 3 (mine): Virtualization None. Operating System Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.6 (Ootpa) [4.18.0-372.26.1.el8_6.x86_64|libc 2.28]
It has now sent 3 trickles and 23 .zip files. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,852,553 RAC: 19,917 |
Those Virtualization entries don't necessarily mean that virtualization is being used, just that BOINC detects that VBox is installed and that CPU support is enabled. I actually think that since BOINC detects those things it means that it's not running in a virtualized environment. I don't think BOINC can tell if it's running inside a VM or on bare metal. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
You are probably right. I know nothing about virtualization as practiced by UNIX or Linux systems.* My definitely non-virtual system has now completed these steps of the work unit that failed on the other two hosts. Boinc thinks I am 58% done.
Mon 05 Dec 2022 05:17:27 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_2971_2021050100_123_947_12166060_2_r52189318_69.zip
Mon 05 Dec 2022 05:24:36 AM EST | climateprediction.net | Sending scheduler request: To send trickle-up message.
_____
* Once, in the distant past, I wrote an operating system for a mini-computer that had no memory management unit. The entire system crashed when I ran the FORTRAN compiler on a particular program. The crash made everything just stop. What happened was that the transfer vectors at the bottom of memory were getting zeroed out, so when an interrupt occurred, it had nowhere to go. So I wrote a simulator for another computer, just like the one I really had, but with memory management. I then had that simulated machine run the compiler on the offending program, and found the bug in the FORTRAN compiler that was overwriting the transfer vectors when compiling that program. The simulator ran about 30x slower than the real machine, but I would never have found the error in the compiler any other way. That is all that I know about virtualization. Not enough for this current problem. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
"You are probably right. I know nothing about virtualization as practiced by UNIX or Linux systems.*" Just checked: my VM shows as no virtualisation, unlike my host machine. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I do notice that the two previous processes took a lot more time than I have, but both crashed with no stderr and no diagnostics. Both of these are running Virtual box, and I am not. We will see if I do better than they do. Mine is the last of these three,"
Well, my conjecture about VirtualBox may be incorrect, but I wonder what the problem really is. I realize this may be post hoc sed non propter hoc. But the two tasks of this work unit that may have run with it failed, and mine, without VirtualBox, completed successfully. "May have", because others have indicated that it is highly likely they did not. Could my machine just be more reliable than the other two? I do not overclock it. It is a Dell T5820, just over two years old. https://www.cpdn.org/show_host_detail.php?hostid=1511241
This is the end of my stderr file.
11:48:20 STEP 2952 H=2952:00 +CPU= 27.054
The child process terminated with status: 0
Moving to projects directory: /var/lib/boinc/slots/4/ICMGGhpi1+002952
Moving to projects directory: /var/lib/boinc/slots/4/ICMSHhpi1+002952
Moving to projects directory: /var/lib/boinc/slots/4/ICMUAhpi1+002952
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_12166060/ICMGGhpi1+002928
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_12166060/ICMSHhpi1+002928
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_12166060/ICMUAhpi1+002928
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_12166060/ICMGGhpi1+002940
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_12166060/ICMSHhpi1+002940
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_12166060/ICMUAhpi1+002940
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_12166060/ICMGGhpi1+002952
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_12166060/ICMSHhpi1+002952
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_12166060/ICMUAhpi1+002952
Zipping up the final file: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_ps_2971_2021050100_123_947_12166060_2_r52189318_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
11:52:13 (1096801): called boinc_finish(0)
</stderr_txt> ]]> |
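For anyone comparing many stderr tails by hand, a minimal Python sketch may help. It only looks for the three markers visible in the log above (the `STEP` counter, the child process status line, and the `called boinc_finish(...)` line). The function name and the returned dictionary layout are my own invention, not anything provided by CPDN or BOINC, and a "clean" result here says nothing about whether the server later validates the task.

```python
import re


def summarize_stderr_tail(text: str) -> dict:
    """Pull the last model step and exit markers out of a stderr excerpt."""
    # Every "STEP <n>" occurrence; the last one is the furthest step reached.
    steps = [int(m.group(1)) for m in re.finditer(r"\bSTEP\s+(\d+)\b", text)]
    child = re.search(r"child process terminated with status:\s*(-?\d+)", text)
    finish = re.search(r"called boinc_finish\((-?\d+)\)", text)
    child_status = int(child.group(1)) if child else None
    finish_code = int(finish.group(1)) if finish else None
    return {
        "last_step": steps[-1] if steps else None,
        "child_status": child_status,
        "finish_code": finish_code,
        # "clean" is a hypothetical convenience flag: both markers present and zero.
        "clean": child_status == 0 and finish_code == 0,
    }


if __name__ == "__main__":
    sample = (
        "11:48:20 STEP 2952 H=2952:00 +CPU= 27.054\n"
        "The child process terminated with status: 0\n"
        "11:52:13 (1096801): called boinc_finish(0)\n"
    )
    print(summarize_stderr_tail(sample))
```

A task with an empty stderr, like the two earlier attempts, would simply come back with every field set to `None` and `clean` set to `False`.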
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
"I do notice that the two previous processes took a lot more time than I have, but both crashed with no stderr and no diagnostics. Both of these are running Virtual box, and I am not. We will see if I do better than they do. Mine is the last of these three," I really think the VB thing is a red herring. Just because BOINC detects it as being installed does not mean it is running. Certainly most of the time on my computer it isn't. Currently my longest run of these without any errors is my current one of 13. There is no way to tell if VB is running, either via BOINC for another project or for something else, so finding out more would rely on a lot more people posting here. Once I have upped my RAM, which will also be a bit faster, I will see if increasing the number running at once affects the error rate. Edit: Your machine being more reliable would not surprise me. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Currently my longest run of these without any errors is my current one of 13" My longest run of OpenIFS work units is all of them: 22. Early on, I ran two at a time, but then I raised it to three at a time. My machine is now set to run those four at a time, but since then, I have received only the single unit that finished at around noon today, my time. (HadSM4 work units seem to almost all work correctly. Most of the failures get me a message like Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.) I am quite willing to try running five or six at a time if that would be useful for finding bugs. I believe that, from a computing standpoint, running more than two or three of these at a time would probably overwork my processor cache, but I am willing to try it if it makes sense in a debugging effort. What is your opinion? |
Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463 |
"I don't think BOINC can tell if it's running inside a VM or on bare metal." If you care, there are a variety of ways to ask. Most well-behaved hypervisors simply tell you that you're in a VM on one of the CPUID leaves, and most of them implement a standard enough set of leaves that carry hypervisor information. CPUID being unprivileged, you can simply ask. A hypervisor could lie about it, but most don't, because it's far too easy to tell you're in a hypervisor (at least on Intel) in other ways, too. |
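As a rough illustration of "you can simply ask": on x86 Linux the kernel reports the CPUID "hypervisor present" bit (leaf 1, ECX bit 31) as the `hypervisor` flag in `/proc/cpuinfo`, so a script does not even need to execute CPUID itself. The sketch below assumes an x86 Linux guest (other architectures do not have a `flags` line in that form), and the function names are made up for this example. As noted above, a hypervisor can hide the bit, so a `False` result is not proof of bare metal.

```python
from pathlib import Path


def has_hypervisor_flag(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line lists the 'hypervisor' flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Flags are space separated; match the whole word only.
            return "hypervisor" in value.split()
    return False


def running_in_vm(cpuinfo_path: str = "/proc/cpuinfo"):
    """Return True/False from /proc/cpuinfo, or None if it cannot be read."""
    try:
        text = Path(cpuinfo_path).read_text()
    except OSError:
        return None
    return has_hypervisor_flag(text)


if __name__ == "__main__":
    print("hypervisor flag present:", running_in_vm())
```

This reads what the guest kernel saw at boot; it does not tell you which hypervisor is in use, which would need the vendor leaves mentioned in the post.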
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
My resent task 22249228 has been sent out twice before: previous Task 22246540 and Task 22248943.
Task 22246540 has no Stderr; it failed with a Run Time of 1 Day 5 Hours and a CPU Time of 31 Minutes. It also had an unusual Peak Disk Usage of 23,961.87 MB (or 23.9 GB), way above the norm from what I have seen.
Task 22248943 has the error "Process exited with code 9" but otherwise seemed to have run fine. This one belonged to wateroakley.
I was able to run this WU to completion without error.
Another resent task I have running is Task 22249324: previous Task 22247025 and Task 22249194.
Task 22247025, on computer 1524992, had a Run Time of 42 Minutes with a CPU Time of 20 Seconds and a Peak Disk Usage of just 404.06 MB. This computer still has work on it but has not completed a successful OpenIFS WU. All its failed work units have the same long run times and short CPU times, and they have different error codes as well: codes 1, 5 and 148 all appear on this computer.
Task 22249194, on computer 1504810, has no Stderr, a Run Time of 1 Day 1 Hour and a CPU Time of 7 Hours. This computer has run 9 OpenIFS work units; all have failed with the long Run Time and short CPU Time. This computer belongs to happywetter.at.
So there are a few different reasons why some work units have failed or thrown an error.
Conan |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
My most recent task succeeded, whereas the previous two attempts failed. Mine ended like this:
Task: 22249293
Name: oifs_43r3_ps_2971_2021050100_123_947_12166060_2
Workunit: 12166060
Created: 5 Dec 2022, 1:22:38 UTC
Sent: 5 Dec 2022, 1:24:38 UTC
Report deadline: 4 Jan 2023, 1:24:38 UTC
Received: 5 Dec 2022, 16:59:33 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x00000000)
Computer ID: 1511241
Run time: 15 hours 26 min 24 sec
CPU time: 15 hours 14 min 21 sec
Validate state: Valid
Credit: 0.00
Device peak FLOPS: 6.13 GFLOPS
Application version: OpenIFS 43r3 Perturbed Surface v1.01 x86_64-pc-linux-gnu
Peak working set size: 4,451.36 MB
Peak swap size: 4,974.14 MB
Peak disk usage: 1,226.74 MB
But here are the two failures. Notice their processors are about 30% slower than mine, but that alone does not explain the much longer time they took. What were they doing with all that time? They also used considerably more disk space than mine did. What were they doing with that? Not writing their stderr files, for sure.
Task: 22248905
Name: oifs_43r3_ps_2971_2021050100_123_947_12166060_1
Workunit: 12166060
Created: 1 Dec 2022, 16:51:07 UTC
Sent: 1 Dec 2022, 16:51:36 UTC
Report deadline: 31 Dec 2022, 16:51:36 UTC
Received: 5 Dec 2022, 1:22:33 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 0 (0x00000000)
Computer ID: 1534160
Run time: 2 days 12 hours 43 min 20 sec
CPU time: 1 days 17 hours 46 min 24 sec
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 2.98 GFLOPS
Application version: OpenIFS 43r3 Perturbed Surface v1.01 x86_64-pc-linux-gnu
Peak working set size: 4,419.36 MB
Peak swap size: 4,977.94 MB
Peak disk usage: 3,165.64 MB
and
Task: 22248096
Name: oifs_43r3_ps_2971_2021050100_123_947_12166060_0
Workunit: 12166060
Created: 29 Nov 2022, 15:20:03 UTC
Sent: 30 Nov 2022, 5:27:26 UTC
Report deadline: 30 Dec 2022, 5:27:26 UTC
Received: 1 Dec 2022, 16:51:02 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 0 (0x00000000)
Computer ID: 1514203
Run time: 1 days 10 hours 3 min 37 sec
CPU time: 14 hours 10 min 40 sec
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 2.06 GFLOPS
Application version: OpenIFS 43r3 Perturbed Surface v1.01 x86_64-pc-linux-gnu
Peak working set size: 4,418.82 MB
Peak swap size: 4,977.94 MB
Peak disk usage: 2,873.10 MB |
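One way to put a number on "what were they doing with all that time" is the ratio of CPU time to run time for each task. The sketch below is only arithmetic on the three run time / CPU time pairs quoted above; the parser and the dictionary layout are my own, and it assumes durations are always written with the words day, hour, min and sec as on the task pages.

```python
import re

# Seconds per unit, keyed by the unit words used on the task pages.
_UNIT_SECONDS = {"day": 86400, "hour": 3600, "min": 60, "sec": 1}


def to_seconds(duration: str) -> int:
    """Convert e.g. '2 days 12 hours 43 min 20 sec' to seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*(day|hour|min|sec)", duration):
        total += int(amount) * _UNIT_SECONDS[unit]
    return total


# (run time, CPU time) as reported for the three tasks of workunit 12166060.
TASKS = {
    "22249293": ("15 hours 26 min 24 sec", "15 hours 14 min 21 sec"),
    "22248905": ("2 days 12 hours 43 min 20 sec", "1 days 17 hours 46 min 24 sec"),
    "22248096": ("1 days 10 hours 3 min 37 sec", "14 hours 10 min 40 sec"),
}


def cpu_efficiency(run_time: str, cpu_time: str) -> float:
    """Fraction of wall-clock run time that was spent on the CPU."""
    return to_seconds(cpu_time) / to_seconds(run_time)


if __name__ == "__main__":
    for task_id, (run_time, cpu_time) in TASKS.items():
        print(task_id, f"{cpu_efficiency(run_time, cpu_time):.0%}")
```

By this measure the successful task spent roughly 99% of its run time on the CPU, while the two failed attempts come out at roughly 69% and 42%. So a good part of their extra wall-clock time was spent waiting (suspended, swapping, or otherwise stalled), not computing. The figures alone cannot say which.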
Send message Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
"My resent task 22249228 has been sent out twice before." I completed Task 22249324 successfully in just under 17 1/2 hours. |
Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,852,553 RAC: 19,917 |
Glenn said that he'll start pulling statistics and work on figuring out the various failure reasons this Monday. The specifics he's figuring out but the general reason is the variety of configurations and usages of PCs that are out there trying to run these models. In limited tests the issues weren't there, problems started showing up when work got released to the great variety of machines out there. Some are able to run the models and others not for as yet unknown reasons. One known reason is restarts of BOINC/PC, Glenn mentioned that there's an issue with the wrapper code. For me, it's 4/18 failures, a rather high 22%. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
"Glenn said that he'll start pulling statistics and work on figuring out the various failure reasons this Monday. The specifics he's figuring out but the general reason is the variety of configurations and usages of PCs that are out there trying to run these models. In limited tests the issues weren't there, problems started showing up when work got released to the great variety of machines out there. Some are able to run the models and others not for as yet unknown reasons. One known reason is restarts of BOINC/PC, Glenn mentioned that there's an issue with the wrapper code. For me, it's 4/18 failures, a rather high 22%." 13% here, which is still high given that I haven't been turning the computer off at all. Since restricting to only running 2 tasks at any time, no more failures. Current run is 16 successes and two running, with two still to run. My machine should have enough memory to run five at a time, so it will be interesting to see if wrapper code changes improve things, though bored band can only keep up with two at a time. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,709,934 RAC: 9,107 |
Glenn has been running private tests on his own machine from the dev site, and I found I'd been sent several tasks as resends following his deliberate failures. I've been able to isolate some specific symptoms and failure causes by comparing the two results. The specific case we've been looking at is the "failure after restart, following interruption": it's clear to me that this failure mode has nothing to do with the wrapper, although that was Glenn's first suspicion. I've told him I think it's because the progress counters aren't properly restored on resumption. That's probably code added to the IFS application itself, to meet the needs of running in the BOINC environment. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Glenn said that he'll start pulling statistics and work on figuring out the various failure reasons this Monday. The specifics he's figuring out but the general reason is the variety of configurations and usages of PCs that are out there trying to run these models. In limited tests the issues weren't there, problems started showing up when work got released to the great variety of machines out there. Some are able to run the models and others not for as yet unknown reasons. One known reason is restarts of BOINC/PC, Glenn mentioned that there's an issue with the wrapper code. For me, it's 4/18 failures, a rather high 22%."
For me it's 0/22, extremely low. I just checked, though: I did not reboot my machine at all during that interval. I usually reboot my machine (1511241) only when installing software updates. I have not had any failures with this drill, but I seldom do this procedure; my most recent time was 14 days, 9 hours, 27 minutes ago. For that, my drill is:
1. Set each project to No New Tasks.
2. Let each task with a short remaining time run to completion.
3. For all projects except CPDN, Suspend all remaining tasks.
4. For all CPDN tasks waiting to start, Suspend them.
5. For those CPDN tasks still running, Suspend each one, one at a time.
6. Shut down the boinc-client.
7. Do the software update.
8. When the machine comes up, the boinc-client will be running. Resume the CPDN tasks that were running.
9. Then Resume all the other ones that were running, then Resume all the rest of the CPDN tasks.
10. Then (re-)enable new tasks for each project. |
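The pre-shutdown half of that drill could in principle be scripted with `boinccmd`, which accepts `--project URL nomorework`, `--task URL NAME suspend` and `--quit`. The sketch below only builds the command lines in the order described and does not run anything. The function name, the task tuples and the example URLs are placeholders of mine, not real project URLs. Using `--quit` for the shutdown step is also an assumption, since the post does not say how the client is stopped. The "let short tasks finish" step is left to the operator.

```python
def pre_shutdown_commands(project_urls, cpdn_url, tasks):
    """Build boinccmd argument lists for the steps before the software update.

    tasks is a list of (project_url, task_name, is_running) tuples.
    Nothing is executed; the caller decides whether to run the commands.
    """
    commands = []
    # No New Tasks for every attached project.
    for url in project_urls:
        commands.append(["boinccmd", "--project", url, "nomorework"])
    # Suspend the remaining tasks of every project except CPDN.
    for url, name, _running in tasks:
        if url != cpdn_url:
            commands.append(["boinccmd", "--task", url, name, "suspend"])
    # Suspend CPDN tasks that are still waiting to start.
    for url, name, running in tasks:
        if url == cpdn_url and not running:
            commands.append(["boinccmd", "--task", url, name, "suspend"])
    # Suspend the running CPDN tasks last, one command per task.
    for url, name, running in tasks:
        if url == cpdn_url and running:
            commands.append(["boinccmd", "--task", url, name, "suspend"])
    # Ask the client to exit (assumed stand-in for "shut down the client").
    commands.append(["boinccmd", "--quit"])
    return commands


if __name__ == "__main__":
    cpdn = "https://cpdn.example/"    # placeholder, not the real master URL
    other = "https://other.example/"  # placeholder
    demo_tasks = [(cpdn, "oifs_running", True), (other, "other_task", True),
                  (cpdn, "oifs_waiting", False)]
    for cmd in pre_shutdown_commands([cpdn, other], cpdn, demo_tasks):
        print(" ".join(cmd))
```

The resume half would be the mirror image with `resume` and `allowmorework`, in the order given in the drill.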
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638 |
For me a rather alarming 10 completed out of 22. Two of the failures I can put down to a forced reboot when everything "froze", and some of the early ones to using 4 CPUs rather than 3 for the amount of RAM I have. Some were also -ve theta fails. I have eliminated the repeated suspends I was getting but setting 100% on CPU usage. |
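On the "4 CPUs rather than 3 for the amount of RAM" point, a back-of-the-envelope estimate is easy to write down. The sketch below uses the peak working set size of about 4,451 MB reported for a successful task earlier in this thread. The 2,048 MB reserved for the operating system and other programs is purely my assumption, and so is the function name. Real limits also depend on swap, disk and upload bandwidth, which this ignores.

```python
def max_concurrent_tasks(total_ram_mb, peak_task_mb=4451.36,
                         reserve_mb=2048.0, cpu_limit=None):
    """Estimate how many OpenIFS tasks fit in RAM at their peak working set.

    reserve_mb is an assumed allowance for the OS and other programs.
    cpu_limit optionally caps the answer at the number of cores allotted.
    """
    usable_mb = total_ram_mb - reserve_mb
    if usable_mb <= 0:
        return 0
    count = int(usable_mb // peak_task_mb)
    if cpu_limit is not None:
        count = min(count, cpu_limit)
    return count


if __name__ == "__main__":
    for ram_gib in (8, 16, 32, 64):
        print(ram_gib, "GiB ->", max_concurrent_tasks(ram_gib * 1024), "tasks")
```

With these assumptions a 16 GiB machine comes out at three tasks, not four, which at least does not contradict the experience described above.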
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"I was getting but setting 100% on CPU usage." I am not quite sure what you mean by "but setting", but I would hope you would be getting mighty close to 100% CPU usage on the CPUs you are using for Boinc. I do not have any CPDN work units on my machine, but it has found plenty of work to do.
PID PPID USER PR NI S RES %MEM %CPU P TIME+ COMMAND
1257731 2146 boinc 39 19 R 318352 0.5 99.3 2 138:48.14 ../../projects/boinc.bakerlab.org_rosetta/ro+
1257637 2146 boinc 39 19 R 317548 0.5 99.1 0 141:18.53 ../../projects/boinc.bakerlab.org_rosetta/ro+
1260071 2146 boinc 39 19 R 313764 0.5 99.3 3 106:54.39 ../../projects/boinc.bakerlab.org_rosetta/ro+
1247258 2146 boinc 39 19 R 213136 0.3 99.3 8 277:09.25 ../../projects/einstein.phys.uwm.edu/einstei+
1247665 2146 boinc 39 19 R 213016 0.3 99.3 7 270:22.55 ../../projects/einstein.phys.uwm.edu/einstei+
1260446 2146 boinc 39 19 R 181744 0.3 99.3 9 99:54.08 ../../projects/www.worldcommunitygrid.org/wc+
1264629 2146 boinc 39 19 R 128404 0.2 99.3 12 26:00.47 ../../projects/www.worldcommunitygrid.org/wc+
1265512 2146 boinc 39 19 R 72756 0.1 99.4 6 7:31.83 ../../projects/www.worldcommunitygrid.org/wc+
2146 1 boinc 30 10 S 54400 0.1 0.1 14 244329:15 /usr/bin/boinc <---<<< Boinc client
1263358 2146 boinc 39 19 R 7112 0.0 99.2 5 51:52.89 ../../projects/milkyway.cs.rpi.edu_milkyway/+
1263191 2146 boinc 39 19 R 7084 0.0 99.4 11 55:24.73 ../../projects/milkyway.cs.rpi.edu_milkyway/+
1261085 2146 boinc 39 19 R 4944 0.0 99.3 4 89:04.68 ../../projects/universeathome.pl_universe/BH+
1264392 2146 boinc 39 19 R 4880 0.0 99.3 13 30:46.62 ../../projects/universeathome.pl_universe/BH+ |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,026,382 RAC: 20,431 |
And now mine are all gone. Restricting my box to never running more than two at a time, which is the maximum my bored band can keep up with anyway, seems so far to give a 100% success rate: all 20 since I started doing that have completed, including one that was on its third attempt. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 31,000,748 RAC: 14,638 |
Typo. Should have been "by setting CPU usage to 100%" rather than 90%. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154 |
"Typo. Should have been "by setting CPU usage to 100%" rather than 90%." OK: that makes much more sense. (We all make typos at one time or another.) But why would you limit CPU usage at all? Temperature problems? To reduce upload bandwidth limitations? |