OpenIFS Discussion

Author	Message
Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449	Message 66806 - Posted: 7 Dec 2022, 11:10:00 UTC - in response to Message 66805. I'm surprised that you're using 'CPU usage' at all. BOINC implements that very crudely, in quanta of 1 second: 90% means '9 seconds on, followed by 1 second off', 80% '4 on, 1 off', and so on. It was implemented very early on, primarily for thermal control on single-core CPUs. Most people with multi-core processors find it better to limit 'Use at most xx % of the CPUs'. That reduces thermal cycling, and it reduces the overall thermal & memory stresses by reducing the total number of tasks BOINC can launch concurrently. It does not limit the efficiency of any task selected for running in any way. ID: 66806 · Reply Quote
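The on/off figures quoted above follow from simple ratio arithmetic, sketched below. This is only an illustration of the numbers in the post, not BOINC's actual throttling code; the function name duty_cycle is made up for the example, and it assumes whole-number percentages.

```python
from math import gcd


def duty_cycle(cpu_usage_percent: int) -> tuple[int, int]:
    """Return (seconds_on, seconds_off) for a whole-number 'CPU usage' limit.

    Hypothetical helper: it reduces the ratio percent : (100 - percent)
    to the smallest whole numbers of seconds, which reproduces the
    '9 on, 1 off' and '4 on, 1 off' figures quoted in the post.
    """
    if not 1 <= cpu_usage_percent <= 100:
        raise ValueError("percentage must be between 1 and 100")
    off = 100 - cpu_usage_percent
    divisor = gcd(cpu_usage_percent, off)
    return cpu_usage_percent // divisor, off // divisor


for percent in (90, 80, 50):
    on, off = duty_cycle(percent)
    print(f"{percent}%: {on} s on, {off} s off")
```

Whatever the exact mechanism, a pattern like this stops and restarts every running task, whereas limiting the number of CPUs simply runs fewer tasks without interrupting any of them.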

PDW Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,352,554 RAC: 91,976	Message 66807 - Posted: 7 Dec 2022, 13:41:49 UTC - in response to Message 66753. The task page says a task failed because...
Outcome Computation error
Client state Compute error
Exit status 9 (0x00000009) Unknown error code
But the end of the stderr log says all okay...
11:20:07 STEP 2951 H=2951:00 +CPU= 18.626
11:20:37 STEP 2952 H=2952:00 +CPU= 29.460
The child process terminated with status: 0
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMGGhpi1+002952
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMSHhpi1+002952
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMUAhpi1+002952
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMGGhpi1+002928
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMGGhpi1+002952
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMSHhpi1+002940
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMUAhpi1+002952
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMUAhpi1+002940
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMUAhpi1+002928
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMGGhpi1+002940
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMSHhpi1+002952
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMSHhpi1+002928
Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0894_2021050100_123_945_12163983_1_r1746892727_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
11:24:30 (825570): called boinc_finish(0)
</stderr_txt>
The event log showed no errors or problems either: it recorded this task completing and a new task starting. There was no master executable left to kill. The run/CPU time for the task looked correct. Yet the task is reported as a failure! ID: 66807 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449	Message 66808 - Posted: 7 Dec 2022, 13:50:57 UTC - in response to Message 66807. Give us a link to the task or computer you copied that from (your computers are hidden, so we can't find them for ourselves). You'll probably find an 'error 9' by searching higher up in stderr. ID: 66808 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,979,167 RAC: 21,830	Message 66809 - Posted: 7 Dec 2022, 14:25:46 UTC - in response to Message 66808. Give us a link to the task or computer you copied that from (your computers are hidden, so we can't find them for ourselves). You'll probably find an 'error 9' by searching higher up in stderr. This one, which came to me after two failed attempts and completed on my machine, is an example: the only instances of "error" on the page are "Outcome Computation error" and "Client state Compute error". There is nothing in the stderr of the instance from the work unit linked. I did a "find in page" for the word "error" (not case-sensitive). So there is no guarantee that there is anything there. I have had some on my own machine that don't show anything either. ID: 66809 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66810 - Posted: 7 Dec 2022, 16:23:40 UTC - in response to Message 66794. My most recent task succeeded, whereas the previous two attempts failed. I got two more re-runs last night. In each case, I am the third to try them. Computer ID 1511241 https://www.cpdn.org/workunit.php?wuid=12163528 https://www.cpdn.org/workunit.php?wuid=12163156 Both are over half done and seem to be running correctly. ID: 66810 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66811 - Posted: 7 Dec 2022, 22:23:36 UTC - in response to Message 66810. One of those just finished normally. Task 22249418 Name oifs_43r3_ps_0067_2021050100_123_945_12163156_2 Workunit 12163156 Created 7 Dec 2022, 6:23:56 UTC Sent 7 Dec 2022, 6:24:02 UTC Report deadline 6 Jan 2023, 6:24:02 UTC Received 7 Dec 2022, 22:09:46 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x00000000) Computer ID 1511241 Run time 15 hours 43 min 41 sec CPU time 15 hours 31 min 14 sec Validate state Valid Credit 0.00 Device peak FLOPS 6.13 GFLOPS Application version OpenIFS 43r3 Perturbed Surface v1.01 x86_64-pc-linux-gnu Peak working set size 4,782.75 MB Peak swap size 4,974.32 MB Peak disk usage 1,214.05 MB I watched my machine as this task finished. First master.exe finished and got off the machine. It then took a very long time for the wrapper to complete: much more than a minute, but I did not think to time it. The CPDN web site did not notice it, and I figured out my machine did not report completion even though it was done. It was queued, but I ran update on it and almost immediately, the CPDN web site noticed. ID: 66811 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66812 - Posted: 7 Dec 2022, 23:16:48 UTC - in response to Message 66811. The other one just finished OK too. Task 22249420 Name oifs_43r3_ps_0439_2021050100_123_945_12163528_2 Workunit 12163528 The time between master.exe finishing and the wrapper finishing was four minutes. I guess the wrapper was rummaging around all the leftover files, compressing them into .zip files, and sending them. No wonder I was beginning to think something was wrong. ID: 66812 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,953,533 RAC: 14,026	Message 66813 - Posted: 7 Dec 2022, 23:22:23 UTC - in response to Message 66805. Suspected temperature problems. I need to clean the case fan inlets and the CPU heatsink. One of the joys of having semi-long-haired cats! ID: 66813 · Reply Quote

Alan K Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,953,533 RAC: 14,026	Message 66814 - Posted: 7 Dec 2022, 23:24:16 UTC - in response to Message 66806. I limit CPUs on the IFS models anyway as I don't have enough RAM (24 GB for a 4-core CPU). ID: 66814 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449	Message 66815 - Posted: 8 Dec 2022, 11:10:48 UTC I've just reported task 22249428, the second resend from WU 12163983. Both my wingmates errored - one at the final hurdle, with a completely blank stderr, and the other with an "exit code 9", but a normal finish to the run. Mine finished normally. My host has 64 GB RAM, and SSD storage, but the CPU is (knowingly) overcommitted with other tasks - that accounts for the difference between elapsed and CPU time. Otherwise, it ran continuously. The other two machines have 16 GB and 32 GB RAM respectively, and all of us have GPUs. I don't think the error/success difference can be attributed to the model code or data, but must be down to something in the host environment. My local event log records the final stages as:
08/12/2022 08:29:51 | climateprediction.net | Started upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_119.zip
08/12/2022 08:30:01 | climateprediction.net | Finished upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_119.zip
08/12/2022 08:36:48 | climateprediction.net | Sending scheduler request: To send trickle-up message.
08/12/2022 09:00:31 | climateprediction.net | Started upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_122.zip
08/12/2022 09:00:46 | climateprediction.net | Finished upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_122.zip
08/12/2022 09:02:32 | climateprediction.net | Computation for task oifs_43r3_ps_0894_2021050100_123_945_12163983_2 finished
08/12/2022 09:37:27 | climateprediction.net | Reporting 1 completed tasks
I think the time gaps before 'finish' and 'report' are entirely normal and deliberate: watch out for them when discussing your own results. ID: 66815 · Reply Quote
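For anyone wanting to compare their own timings, the gaps before 'finish' and 'report' can be measured from event log lines like those in the post above. The following is a minimal sketch, assuming the log uses day/month/year timestamps (consistent with the 8 Dec 2022 post date) and plain "|" separators; the helper names parse_line and gaps_in_seconds are invented for the example.

```python
from datetime import datetime

# The last three event log lines quoted in the post above.
LOG = """\
08/12/2022 09:00:46 | climateprediction.net | Finished upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_122.zip
08/12/2022 09:02:32 | climateprediction.net | Computation for task oifs_43r3_ps_0894_2021050100_123_945_12163983_2 finished
08/12/2022 09:37:27 | climateprediction.net | Reporting 1 completed tasks
"""


def parse_line(line):
    """Split one event log line into (timestamp, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    # Assumption: dates are written day/month/year.
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


def gaps_in_seconds(log_text):
    """Return (message, seconds since the previous line) for each later line."""
    events = [parse_line(line) for line in log_text.splitlines() if line.strip()]
    return [
        (later[2], int((later[0] - earlier[0]).total_seconds()))
        for earlier, later in zip(events, events[1:])
    ]


for message, gap in gaps_in_seconds(LOG):
    print(f"{gap:5d} s before: {message}")
```

On these three lines the sketch gives 106 seconds between the last upload finishing and the task being marked finished, and 2095 seconds between that and the report.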

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,979,167 RAC: 21,830	Message 66817 - Posted: 8 Dec 2022, 11:30:15 UTC I don't think the error/success difference can be attributed to the model code or data, but must be down to something in the host environment. I agree with your analysis, Richard, but whatever it is about the host environments that these tasks don't like is, I am pretty sure, not something that shows up on the computers' pages. I have looked at quite a few, trying to spot the differences between those that complete these tasks and those that appear to crash them at the end. ID: 66817 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449	Message 66818 - Posted: 8 Dec 2022, 11:40:46 UTC - in response to Message 66817. Last modified: 8 Dec 2022, 11:41:33 UTC The trouble is, one of my wingmates is anonymous, and the other has never interacted on these message boards. If the error condition isn't recorded on the host and result pages, we have no way of matching the subjective local experience with the final outcome. That's why I discussed my outcome in some detail, in the hope that other volunteers who are active here can try something similar and hopefully uncover some explanation for the differences. ID: 66818 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,979,167 RAC: 21,830	Message 66819 - Posted: 8 Dec 2022, 12:00:40 UTC - in response to Message 66818. And I notice that some of the machines crashing these are completing nearly all the HADAM type models they run, so their users are savvy enough to follow instructions to get 32bit libs and not fall into the more obvious traps. I might try stopping the client and restarting next time I get some tasks, to see what the failures look like then, though if a machine is not crashing any hadam4 tasks it almost certainly isn't being turned off regularly. ID: 66819 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4535 Credit: 18,979,167 RAC: 21,830	Message 66820 - Posted: 8 Dec 2022, 18:19:31 UTC More tasks from testing, with more tomorrow after a number of bugs are ironed out. There might be more for the main site soon if tomorrow's testing goes well. ID: 66820 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66821 - Posted: 8 Dec 2022, 19:21:11 UTC - in response to Message 66820. Last modified: 8 Dec 2022, 19:22:38 UTC I have my Boinc-client set up to run up to 4 of these oifs_43r3_ps tasks at a time with my main machine, Computer 1511241
Number of processors 16
Memory 62.28 GB
Cache 16896 KB
Swap space 15.62 GB
Total disk space 488.04 GB
Free Disk Space 482.21 GB
Measured floating point speed 6.13 billion ops/sec
Measured integer speed 26.09 billion ops/sec
Average upload rate 78.07 KB/sec
Average download rate 20482.41 KB/sec
[/var/lib/boinc/projects/climateprediction.net]$ cat app_config.xml
<app_config>
    <project_max_concurrent>5</project_max_concurrent>
    <app>
        <name>oifs_43r3</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>oifs_43r3_bl</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>oifs_43r3_ps</name>
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>
I would be willing to change some of these settings if it will help in testing in any way. Just tell me what you would like. ID: 66821 · Reply Quote
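To see how the limits in that app_config.xml combine, the file can be read with a few lines of Python. This is a minimal sketch using the XML exactly as posted above; the helper name concurrency_limits is made up for the example, and it assumes every app entry has both a name and a max_concurrent element, as this file does.

```python
import xml.etree.ElementTree as ET

# The app_config.xml contents as posted above.
APP_CONFIG = """
<app_config>
    <project_max_concurrent>5</project_max_concurrent>
    <app><name>oifs_43r3</name><max_concurrent>2</max_concurrent></app>
    <app><name>oifs_43r3_bl</name><max_concurrent>2</max_concurrent></app>
    <app><name>oifs_43r3_ps</name><max_concurrent>4</max_concurrent></app>
</app_config>
"""


def concurrency_limits(xml_text):
    """Return (project-wide cap, {app name: per-app cap}) from app_config XML."""
    root = ET.fromstring(xml_text)
    project_cap = int(root.findtext("project_max_concurrent"))
    per_app = {
        app.findtext("name"): int(app.findtext("max_concurrent"))
        for app in root.findall("app")
    }
    return project_cap, per_app


project_cap, per_app = concurrency_limits(APP_CONFIG)
print("project-wide cap:", project_cap)
for name, cap in per_app.items():
    print(f"  {name}: at most {cap} at once")
print("sum of per-app caps:", sum(per_app.values()))
```

The per-app caps add up to 8, so with work available for all three apps it is the project-wide cap of 5 that limits the total, while oifs_43r3_ps on its own is limited to 4.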

PDW Send message Joined: 29 Nov 17 Posts: 82 Credit: 14,352,554 RAC: 91,976	Message 66825 - Posted: 8 Dec 2022, 23:37:37 UTC - in response to Message 66815. I don't think the error/success difference can be attributed to the model code or data, but must be down to something in the host environment. Not sure that I can agree with that completely yet. Another task completed the application successfully but failed with error 9. I had turned task_debug on, so I have more info in the event log, but that will have to wait until tomorrow; it is still no clearer what the exact problem is. Linux says error 9 is EBADF, which is reported for a bad file descriptor. Having observed the ending of the task, the problem occurred during the 'tidying up' after the last zip file had been uploaded. It took 1 minute 45 seconds from the zip upload completing to the status 9 being reported. All the files, and there are a lot of them, need to be cleared away. I have no idea whether Boinc does that by just trying to delete the slot directory or has a list of the files/directories to remove. Perhaps something in that process either loses track of what has been deleted against what it was asked to delete, or asks for something to be deleted that is no longer there. Maybe it happens while creating the "what to tell the server" dialog, but that does make it to the server, so it seems unlikely. The host has lots of free memory. It is returning about 1,200 tasks daily; the only one that fails with this error is the oifs task. The other project's server shows the host has about 700 tasks pending validation, 16,000 valid tasks, 0 invalid and 3 error tasks (those 3 were cancelled by the server), so no problem seen there. The host has approximately 1.75 hyperthreads free according to top and uses an SSD drive. Some oifs tasks work fine and some don't and return error 9. 
Is there a better Event Log flag option that will look even deeper into the abyss and determine what is actually causing the 9 to be returned? I think the time gaps before 'finish' and 'report' are entirely normal and deliberate. The report time is most likely delayed because Boinc knows it is somewhere along the 3,636 second delay the project requested. If you actually hit Update, the task is reported immediately. Having the report_results_immediately flag set in cc_config.xml makes no difference. ID: 66825 · Reply Quote
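The EBADF reading mentioned above can be checked against the operating system's own tables. The sketch below uses only the Python standard library and shows how the number 9 maps in two standard tables, errno values and signal numbers; the helper name readings_of is invented for the example. Nothing in this thread establishes which table, if either, the task's exit status 9 actually comes from, so this is a lookup aid and not a diagnosis.

```python
import errno
import os
import signal


def readings_of(code: int) -> dict[str, str]:
    """Look up a numeric status in the errno and signal tables.

    Hypothetical helper: it only reports how this platform names the
    number; it cannot tell which table a given exit status refers to.
    """
    readings = {
        "errno name": errno.errorcode.get(code, "unknown"),
        "errno text": os.strerror(code),
    }
    try:
        readings["signal name"] = signal.Signals(code).name
    except ValueError:
        # Not every platform defines a signal with this number.
        readings["signal name"] = "no such signal on this platform"
    return readings


for label, value in readings_of(9).items():
    print(f"{label}: {value}")
```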

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,202,915 RAC: 2,154	Message 66826 - Posted: 9 Dec 2022, 1:09:25 UTC - in response to Message 66821. I just got five of these "new" work units. For each, at least one previous user failed. Mostly the master.exe task finished with a 0, but the wrapper disliked what it found. FWIW, this is what my machine thinks of its current workload. It has 16 cores of which I allow 12 to do boinc tasks.
top - 20:00:24 up 16 days, 19:35, 1 user, load average: 12.82, 12.64, 12.42
Tasks: 467 total, 13 running, 454 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 1.5 sy, 72.9 ni, 24.5 id, 0.0 wa, 0.2 hi, 0.1 si, 0.0 st
MiB Mem : 63772.8 total, 4916.1 free, 17830.9 used, 41025.8 buff/cache
MiB Swap: 15992.0 total, 15778.0 free, 214.0 used. 45043.3 avail Mem
PID PPID USER PR NI S RES %MEM %CPU P TIME+ COMMAND
1462284 1462279 boinc 39 19 R 3.7g 6.0 98.9 7 122:17.28 /var/lib/boinc/slots/0/./master.exe
1461373 1461365 boinc 39 19 R 3.6g 5.8 98.9 3 139:37.90 /var/lib/boinc/slots/1/./master.exe
1461265 1461260 boinc 39 19 R 2.7g 4.4 99.1 5 142:46.72 /var/lib/boinc/slots/7/./master.exe
1463150 1463147 boinc 39 19 R 2.5g 4.0 99.0 4 106:48.63 /var/lib/boinc/slots/4/./master.exe
1459722 2146 boinc 39 19 R 323144 0.5 98.8 0 173:19.37 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
1459726 2146 boinc 39 19 R 314272 0.5 99.0 6 173:00.81 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
1459708 2146 boinc 39 19 R 314248 0.5 99.0 13 173:34.01 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+
1452480 2146 boinc 39 19 R 213048 0.3 99.0 9 270:54.51 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+
1453354 2146 boinc 39 19 R 213000 0.3 99.0 8 255:59.86 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+
1463816 2146 boinc 39 19 R 162044 0.2 99.0 11 93:26.51 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+
1464363 2146 boinc 39 19 R 159424 0.2 98.9 2 80:33.57 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+
1465049 2146 boinc 39 19 R 156888 0.2 98.6 14 66:35.57 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+
2146 1 boinc 30 10 S 67008 0.1 0.2 4 277145:38 /usr/bin/boinc <---<<< This is the Boinc client
1461260 2146 boinc 39 19 S 4796 0.0 0.1 14 0:18.87 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-g+
1462279 2146 boinc 39 19 S 4796 0.0 0.1 10 0:16.07 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-g+
1463147 2146 boinc 39 19 S 4792 0.0 0.1 14 0:15.25 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-g+
1461365 2146 boinc 39 19 S 4752 0.0 0.0 12 0:19.02 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-g+
ID: 66826 · Reply Quote

AndreyOR Send message Joined: 12 Apr 21 Posts: 317 Credit: 14,791,235 RAC: 19,552	Message 66828 - Posted: 9 Dec 2022, 9:52:28 UTC Just successfully finished a task that failed on 2 other machines: https://www.cpdn.org/workunit.php?wuid=12165996. One of the PCs it failed on has crashed all of the few dozen OIFS tasks it attempted and still has about a dozen to go: (https://www.cpdn.org/results.php?hostid=1536378&offset=0&show_names=0&state=0&appid=39). It has a pretty old CPU, a Xeon E5530; I wonder if that has something to do with it. I wonder if it's possible that the code is too specialized, if I can put it that way, in that it runs well only on certain, preferred configurations rather than on a variety of them? ID: 66828 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,694,196 RAC: 10,449	Message 66829 - Posted: 9 Dec 2022, 10:34:26 UTC - in response to Message 66828. I've had a look, but I can't see anything that looks like a smoking gun. Both your failing wingmates have pretty hefty CPUs with high core counts: I would be surprised if that was in any way an 'unsupported configuration'. Even if it was, I'd expect an instant crash, rather than one well into the run. I think I'd rather like to look at what else is going on in those machines' environments. We'll never know about the anonymous one, but the other would be possible: but this project doesn't seem to be displaying the usual BOINC "Projects in which you are participating" section on user account pages. That would help. ID: 66829 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1048 Credit: 16,404,330 RAC: 16,403	Message 66830 - Posted: 9 Dec 2022, 11:06:17 UTC - in response to Message 66779. Those Virtualization entries don't necessarily mean that virtualization is being used, just that BOINC detects that VBox is installed and that CPU support is enabled. I actually think that since BOINC detects those things it means that it's not running in a virtualized environment. I don't think BOINC can tell if it's running inside a VM or on bare metal. The OpenIFS apps do not use virtualization. That would require a completely different app built around VirtualBox. There are no vbox apps available from CPDN. It's just boinc saying what's available on your system in case you want to subscribe to projects with vbox apps. ID: 66830 · Reply Quote