climateprediction.net (CPDN) home page
Thread 'OpenIFS Discussion'

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,709,934
RAC: 9,107
Message 66806 - Posted: 7 Dec 2022, 11:10:00 UTC - in response to Message 66805.  

I'm surprised that you're using 'CPU usage' at all. BOINC implements that very crudely, in quanta of 1 second: 90% means '9 seconds on, followed by 1 second off', 80% '4 on, 1 off', and so on. It was implemented very early on, primarily for thermal control on single-core CPUs.
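As a rough illustration of the on/off pattern described above, here is a minimal sketch that reduces a percentage to the smallest whole-second cycle with the same ratio. The function name `duty_cycle` is hypothetical, and this only reproduces the ratios quoted in this post; it is not taken from BOINC's source.

```python
from math import gcd


def duty_cycle(percent: int) -> tuple[int, int]:
    """Return (seconds_on, seconds_off) for a CPU-time percentage.

    Assumption: the cycle is the smallest whole-second pattern with the
    requested on/off ratio, as in the '9 on, 1 off' description above.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    off = 100 - percent
    divisor = gcd(percent, off)  # gcd(100, 0) is 100, giving (1, 0)
    return percent // divisor, off // divisor


for setting in (90, 80, 75):
    on, off = duty_cycle(setting)
    print(f"{setting}% -> {on} s on, {off} s off")
```

Under that assumption, 90% gives 9 on and 1 off, and 80% gives 4 on and 1 off, matching the figures quoted.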

Most people with multi-core processors find it better to limit 'Use at most xx % of the CPUs'. That reduces thermal cycling, and it reduces the overall thermal & memory stresses by reducing the total number of tasks BOINC can launch concurrently. It does not limit the efficiency of any task selected for running in any way.
ID: 66806
PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,593,899
RAC: 89,206
Message 66807 - Posted: 7 Dec 2022, 13:41:49 UTC - in response to Message 66753.  

The task page says a task failed because...

Outcome Computation error
Client state Compute error
Exit status 9 (0x00000009) Unknown error code

But the end of the stderr log says all okay...

  11:20:07 STEP 2951 H=2951:00 +CPU= 18.626
  11:20:37 STEP 2952 H=2952:00 +CPU= 29.460
The child process terminated with status: 0
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMGGhpi1+002952
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMSHhpi1+002952
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMUAhpi1+002952
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMGGhpi1+002928
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMGGhpi1+002952
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMSHhpi1+002940
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMUAhpi1+002952
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMUAhpi1+002940
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMUAhpi1+002928
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMGGhpi1+002940
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMSHhpi1+002952
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12163983/ICMSHhpi1+002928
Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0894_2021050100_123_945_12163983_1_r1746892727_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
11:24:30 (825570): called boinc_finish(0)

</stderr_txt>

The event log showed no errors or problems either; it shows this task completing and a new task starting.
There was no master executable left to kill.
The run/CPU time for the task looked correct.
Yet it is reported as a failure!
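For anyone comparing their own results, a small sketch like the following could pull the key facts out of a stderr log: the last model step, the child process status, and the value passed to boinc_finish. The regular expressions are assumptions based only on the lines quoted above, and `summarize_stderr` is a hypothetical helper, not part of BOINC or the CPDN wrapper.

```python
import re

# Patterns are guesses based on the stderr excerpt quoted in this post.
STEP_RE = re.compile(r"STEP\s+(\d+)")
CHILD_RE = re.compile(r"The child process terminated with status: (-?\d+)")
FINISH_RE = re.compile(r"called boinc_finish\((-?\d+)\)")


def summarize_stderr(text: str) -> dict:
    """Extract the last step and the reported exit values from stderr text."""
    steps = STEP_RE.findall(text)
    child = CHILD_RE.search(text)
    finish = FINISH_RE.search(text)
    return {
        "last_step": int(steps[-1]) if steps else None,
        "child_status": int(child.group(1)) if child else None,
        "boinc_finish": int(finish.group(1)) if finish else None,
    }


sample = """\
  11:20:07 STEP 2951 H=2951:00 +CPU= 18.626
  11:20:37 STEP 2952 H=2952:00 +CPU= 29.460
The child process terminated with status: 0
Uploading trickle at timestep: 10623600
11:24:30 (825570): called boinc_finish(0)
"""

print(summarize_stderr(sample))
```

A summary where both values are 0 while the task page still shows exit status 9 would match the mismatch described here.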
ID: 66807
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,709,934
RAC: 9,107
Message 66808 - Posted: 7 Dec 2022, 13:50:57 UTC - in response to Message 66807.  

Give us a link to the task or computer you copied that from (your computers are hidden, so we can't find them for ourselves).

You'll probably find an 'error 9' by searching higher up in stderr.
ID: 66808
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,027,210
RAC: 20,226
Message 66809 - Posted: 7 Dec 2022, 14:25:46 UTC - in response to Message 66808.  

Give us a link to the task or computer you copied that from (your computers are hidden, so we can't find them for ourselves).

You'll probably find an 'error 9' by searching higher up in stderr.

On this one, which came to me after two failed attempts and completed on my machine, the only instances of "error" on the page are:
Outcome Computation error
Client state Compute error


There is nothing in the stderr of the instance from the linked work unit. I did a "find in page" for the word "error" (not case sensitive), so there is no guarantee that anything is there. I have had some on my own machine that don't show anything either.
ID: 66809
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66810 - Posted: 7 Dec 2022, 16:23:40 UTC - in response to Message 66794.  

My most recent task succeeded, whereas the previous two attempts failed.


I got two more re-runs last night. In each case, I am the third to try them.
Computer ID 1511241

https://www.cpdn.org/workunit.php?wuid=12163528
https://www.cpdn.org/workunit.php?wuid=12163156

Both are over half done and seem to be running correctly.
ID: 66810
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66811 - Posted: 7 Dec 2022, 22:23:36 UTC - in response to Message 66810.  

One of those just finished normally.

Task                  22249418
Name                  oifs_43r3_ps_0067_2021050100_123_945_12163156_2
Workunit              12163156
Created               7 Dec 2022, 6:23:56 UTC
Sent                  7 Dec 2022, 6:24:02 UTC
Report deadline       6 Jan 2023, 6:24:02 UTC
Received              7 Dec 2022, 22:09:46 UTC
Server state          Over
Outcome               Success
Client state          Done
Exit status           0 (0x00000000)
Computer ID           1511241
Run time              15 hours 43 min 41 sec
CPU time              15 hours 31 min 14 sec
Validate state        Valid
Credit                0.00
Device peak FLOPS     6.13 GFLOPS
Application version   OpenIFS 43r3 Perturbed Surface v1.01 x86_64-pc-linux-gnu
Peak working set size 4,782.75 MB
Peak swap size        4,974.32 MB
Peak disk usage       1,214.05 MB


I watched my machine as this task finished.
First master.exe finished and got off the machine.
It then took a very long time for the wrapper to complete: much more than a minute, but I did not think to time it.
The CPDN web site did not notice it, and I worked out that my machine had not reported completion even though the task was done. The report was queued, but when I ran an update, the CPDN web site noticed almost immediately.
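The Run time and CPU time figures in the task summary above can be compared with a short sketch like this one. The helper `to_seconds` is hypothetical and only handles the "N hours N min N sec" wording shown on the task page; what counts as a healthy ratio is not something this thread establishes.

```python
import re

UNIT_SECONDS = {"hours": 3600, "min": 60, "sec": 1}


def to_seconds(text: str) -> int:
    """Convert a duration such as '15 hours 43 min 41 sec' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(hours|min|sec)", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


run_time = to_seconds("15 hours 43 min 41 sec")
cpu_time = to_seconds("15 hours 31 min 14 sec")

print("run time (s):", run_time)
print("cpu time (s):", cpu_time)
print("idle gap (s):", run_time - cpu_time)
print("cpu/run ratio:", round(cpu_time / run_time, 3))
```

For this task the gap is 747 seconds (12 min 27 sec), so the model was on the CPU for nearly all of its elapsed time.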
ID: 66811
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66812 - Posted: 7 Dec 2022, 23:16:48 UTC - in response to Message 66811.  

The other one just finished OK too.

Task 22249420
Name oifs_43r3_ps_0439_2021050100_123_945_12163528_2
Workunit 12163528

The time between when master.exe finished and when the wrapper finished was four minutes. I guess it was rummaging around all the leftover files, compressing them into .zip files, and sending them.

No wonder I was beginning to think something was wrong.
ID: 66812
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 66813 - Posted: 7 Dec 2022, 23:22:23 UTC - in response to Message 66805.  

Suspected temperature problems. I need to clean the case fan inlets and the CPU heatsink. One of the joys of having semi-long haired cats!
ID: 66813
Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,000,748
RAC: 14,638
Message 66814 - Posted: 7 Dec 2022, 23:24:16 UTC - in response to Message 66806.  

I limit CPUs on the IFS models anyway as I don't have enough RAM (24 GB for a 4-core CPU).
ID: 66814
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,709,934
RAC: 9,107
Message 66815 - Posted: 8 Dec 2022, 11:10:48 UTC

I've just reported task 22249428, the second resend from WU 12163983. Both my wingmates errored - one at the final hurdle, with a completely blank stderr, and the other with an "exit code 9", but a normal finish to the run. Mine finished normally.

My host has 64 GB RAM, and SSD storage, but the CPU is (knowingly) overcommitted with other tasks - that accounts for the difference between elapsed and CPU time. Otherwise, it ran continuously. The other two machines have 16 GB and 32 GB RAM respectively, and all of us have GPUs. I don't think the error/success difference can be attributed to the model code or data, but must be down to something in the host environment.

My local event log records the final stages as:

08/12/2022 08:29:51 | climateprediction.net | Started upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_119.zip
08/12/2022 08:30:01 | climateprediction.net | Finished upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_119.zip
08/12/2022 08:36:48 | climateprediction.net | Sending scheduler request: To send trickle-up message.
08/12/2022 09:00:31 | climateprediction.net | Started upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_122.zip
08/12/2022 09:00:46 | climateprediction.net | Finished upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_122.zip
08/12/2022 09:02:32 | climateprediction.net | Computation for task oifs_43r3_ps_0894_2021050100_123_945_12163983_2 finished
08/12/2022 09:37:27 | climateprediction.net | Reporting 1 completed tasks
I think the time gaps before 'finish' and 'report' are entirely normal and deliberate: watch out for them when discussing your own results.
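To measure those gaps in your own event log, a minimal sketch along these lines may help. It assumes the `timestamp | project | message` layout and the day/month/year date order shown in the excerpt above; `parse_event` is a hypothetical helper, and other BOINC installations may format dates differently.

```python
from datetime import datetime


def parse_event(line: str) -> tuple[datetime, str]:
    """Split one event log line into its timestamp and message."""
    stamp, _project, message = (part.strip() for part in line.split("|", 2))
    # Assumption: dates are day/month/year, as in the log quoted above.
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), message


log_lines = [
    "08/12/2022 09:00:46 | climateprediction.net | Finished upload of oifs_43r3_ps_0894_2021050100_123_945_12163983_2_r1529576475_122.zip",
    "08/12/2022 09:02:32 | climateprediction.net | Computation for task oifs_43r3_ps_0894_2021050100_123_945_12163983_2 finished",
    "08/12/2022 09:37:27 | climateprediction.net | Reporting 1 completed tasks",
]

events = [parse_event(line) for line in log_lines]
gaps = [
    int((later[0] - earlier[0]).total_seconds())
    for earlier, later in zip(events, events[1:])
]

for (_, message), gap in zip(events[1:], gaps):
    print(f"{gap:5d} s before: {message}")
```

For the lines quoted here, the gap before 'finished' is 106 seconds and the gap before 'Reporting' is 2,095 seconds.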
ID: 66815
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,027,210
RAC: 20,226
Message 66817 - Posted: 8 Dec 2022, 11:30:15 UTC

I don't think the error/success difference can be attributed to the model code or data, but must be down to something in the host environment.
I agree with your analysis, Richard, but whatever it is about the host environments that these tasks don't like is, I am pretty sure, not something that shows up on the computers' pages. I have looked at quite a few, trying to spot the differences between those that complete these tasks and those that appear to crash them at the end.
ID: 66817
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,709,934
RAC: 9,107
Message 66818 - Posted: 8 Dec 2022, 11:40:46 UTC - in response to Message 66817.  
Last modified: 8 Dec 2022, 11:41:33 UTC

The trouble is, one of my wingmates is anonymous, and the other has never interacted on these message boards. If the error condition isn't recorded on the host and result pages, we have no way of matching the subjective local experience with the final outcome.

That's why I discussed my outcome in some detail, in the hope that other volunteers who are active here can try something similar and hopefully uncover some explanation for the differences.
ID: 66818
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,027,210
RAC: 20,226
Message 66819 - Posted: 8 Dec 2022, 12:00:40 UTC - in response to Message 66818.  

And I notice that some of the machines crashing these are completing nearly all the HADAM-type models they run, so their users are savvy enough to follow the instructions to get the 32-bit libs and not fall into the more obvious traps. I might try stopping the client and restarting it next time I get some tasks, to see what the failures look like then, though if a machine is not crashing any hadam4 tasks it almost certainly isn't being turned off regularly.
ID: 66819
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,027,210
RAC: 20,226
Message 66820 - Posted: 8 Dec 2022, 18:19:31 UTC

More tasks from testing, with more tomorrow after a number of bugs are ironed out. There might be more for the main site soon if tomorrow's testing goes well.
ID: 66820
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66821 - Posted: 8 Dec 2022, 19:21:11 UTC - in response to Message 66820.  
Last modified: 8 Dec 2022, 19:22:38 UTC

I have my Boinc-client set up to run up to 4 of these oifs_43r3_ps tasks at a time on my main machine:

Computer                      1511241
Number of processors          16
Memory                        62.28 GB
Cache                         16896 KB
Swap space                    15.62 GB
Total disk space              488.04 GB
Free Disk Space               482.21 GB
Measured floating point speed 6.13 billion ops/sec
Measured integer speed        26.09 billion ops/sec
Average upload rate           78.07 KB/sec
Average download rate         20482.41 KB/sec


[/var/lib/boinc/projects/climateprediction.net]$ cat app_config.xml 
<app_config>
    <project_max_concurrent>5</project_max_concurrent>
    <app>
        <name>oifs_43r3</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>oifs_43r3_bl</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>oifs_43r3_ps</name>
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>



I would be willing to change some of these settings if it will help in testing in any way. Just tell me what you would like.
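As a rough sanity check on limits like these, the sketch below reads the per-app `max_concurrent` values from the app_config.xml quoted above and multiplies the oifs_43r3_ps limit by a per-task memory figure. The 4,782.75 MB value is the peak working set reported for one task earlier in this thread; treating it as typical for every task is an assumption, and `read_limits` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

APP_CONFIG = """\
<app_config>
    <project_max_concurrent>5</project_max_concurrent>
    <app>
        <name>oifs_43r3</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>oifs_43r3_bl</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>oifs_43r3_ps</name>
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>
"""


def read_limits(xml_text: str) -> dict[str, int]:
    """Map each app name to its max_concurrent setting."""
    root = ET.fromstring(xml_text)
    return {
        app.findtext("name"): int(app.findtext("max_concurrent"))
        for app in root.findall("app")
    }


limits = read_limits(APP_CONFIG)

# Assumption: one task's reported peak working set stands in for all tasks.
PEAK_WORKING_SET_MB = 4782.75
worst_case_mb = limits["oifs_43r3_ps"] * PEAK_WORKING_SET_MB

print(limits)
print(f"Worst case for oifs_43r3_ps: {worst_case_mb:.0f} MB")
```

Four concurrent tasks at that peak come to about 19,131 MB, well inside the 62.28 GB this host reports, though actual peaks may differ between tasks.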
ID: 66821
PDW
Joined: 29 Nov 17
Posts: 82
Credit: 14,593,899
RAC: 89,206
Message 66825 - Posted: 8 Dec 2022, 23:37:37 UTC - in response to Message 66815.  

I don't think the error/success difference can be attributed to the model code or data, but must be down to something in the host environment.

Not sure that I can agree with that completely yet.

Another task completed the application successfully but failed with error 9.
I had turned task_debug on, so I have more info in the event log, but that will have to wait till tomorrow; it still leaves the exact problem no clearer.
On Linux, error 9 is EBADF, which indicates a bad file descriptor.
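The errno reading of the number 9 can be checked on any Linux host with Python's standard library, as in this small sketch. Note the assumption built into it: nothing in this thread confirms that the exit status shown on the task page is an errno value at all, so this only confirms what 9 would mean if it were one. `describe_errno` is a hypothetical helper.

```python
import errno
import os


def describe_errno(code: int) -> tuple[str, str]:
    """Return the symbolic name and system message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_errno(9)
print(f"errno 9 = {name}: {message}")
```

On Linux this reports EBADF with the system's bad file descriptor message, which agrees with the reading above.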

Having observed the ending of the task, the problem occurred during the 'tidying up' after the last zip file had been uploaded. It took 1 minute 45 seconds from the zip upload completing to the status 9 being reported. All the files, and there are a lot of them, need to be cleared away. I have no idea whether Boinc does that by just trying to delete the slot directory or has a list of the files/directories to remove. Perhaps something in that process either loses the plot and fails to match what it was asked to delete against what has been deleted, or asks for something to be deleted that is no longer there. Maybe it happens while creating the "what to tell the server" dialog, but that does make it up to the server, so that seems unlikely.

The host has lots of free memory.
It is returning about 1,200 tasks daily; the only one that fails with this error is the oifs task.
The other project's server shows the host has about 700 tasks pending validation, 16,000 valid tasks, 0 invalid and 3 error tasks (those 3 were cancelled by the server), so no problem seen there.
The host has approximately 1.75 hyperthreads free according to top and uses an SSD drive.
Some oifs tasks work fine and some don't and return error 9.

Is there a better Event Log flag option that will look even deeper into the abyss and determine what actually is causing the 9 to be returned?

I think the time gaps before 'finish' and 'report' are entirely normal and deliberate

The report time is most likely delayed because within Boinc it knows it is somewhere along the 3,636 second delay the project requested.
If you actually hit Update the task is reported immediately.
Having the report_results_immediately flag set in cc_config.xml makes no difference.
ID: 66825
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66826 - Posted: 9 Dec 2022, 1:09:25 UTC - in response to Message 66821.  

I just got five of these "new" work units. For each, at least one previous user failed. Mostly the master.exe task finished with a 0, but the wrapper disliked what it found.

FWIW, this is what my machine thinks of its current workload. It has 16 cores of which I allow 12 to do boinc tasks.

top - 20:00:24 up 16 days, 19:35,  1 user,  load average: 12.82, 12.64, 12.42
Tasks: 467 total,  13 running, 454 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  1.5 sy, 72.9 ni, 24.5 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  63772.8 total,   4916.1 free,  17830.9 used,  41025.8 buff/cache
MiB Swap:  15992.0 total,  15778.0 free,    214.0 used.  45043.3 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
1462284 1462279 boinc     39  19 R   3.7g   6.0  98.9  7 122:17.28 /var/lib/boinc/slots/0/./master.exe                                       
1461373 1461365 boinc     39  19 R   3.6g   5.8  98.9  3 139:37.90 /var/lib/boinc/slots/1/./master.exe                                       
1461265 1461260 boinc     39  19 R   2.7g   4.4  99.1  5 142:46.72 /var/lib/boinc/slots/7/./master.exe                                       
1463150 1463147 boinc     39  19 R   2.5g   4.0  99.0  4 106:48.63 /var/lib/boinc/slots/4/./master.exe                                       
1459722    2146 boinc     39  19 R 323144   0.5  98.8  0 173:19.37 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
1459726    2146 boinc     39  19 R 314272   0.5  99.0  6 173:00.81 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
1459708    2146 boinc     39  19 R 314248   0.5  99.0 13 173:34.01 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
1452480    2146 boinc     39  19 R 213048   0.3  99.0  9 270:54.51 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+ 
1453354    2146 boinc     39  19 R 213000   0.3  99.0  8 255:59.86 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+ 
1463816    2146 boinc     39  19 R 162044   0.2  99.0 11  93:26.51 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+ 
1464363    2146 boinc     39  19 R 159424   0.2  98.9  2  80:33.57 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+ 
1465049    2146 boinc     39  19 R 156888   0.2  98.6 14  66:35.57 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+ 
   2146       1 boinc     30  10 S  67008   0.1   0.2  4 277145:38 /usr/bin/boinc   <---<<< This is the Boinc client                                                         
1461260    2146 boinc     39  19 S   4796   0.0   0.1 14   0:18.87 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-g+ 
1462279    2146 boinc     39  19 S   4796   0.0   0.1 10   0:16.07 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-g+ 
1463147    2146 boinc     39  19 S   4792   0.0   0.1 14   0:15.25 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-g+ 
1461365    2146 boinc     39  19 S   4752   0.0   0.0 12   0:19.02 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-g+ 

ID: 66826
AndreyOR
Joined: 12 Apr 21
Posts: 317
Credit: 14,853,382
RAC: 19,752
Message 66828 - Posted: 9 Dec 2022, 9:52:28 UTC

Just successfully finished a task that failed on 2 other machines: https://www.cpdn.org/workunit.php?wuid=12165996. One of the PCs it failed on has crashed all of the few dozen OIFS tasks it attempted and still has about a dozen to go: (https://www.cpdn.org/results.php?hostid=1536378&offset=0&show_names=0&state=0&appid=39). It has a pretty old CPU, Xeon E5530, I wonder if that has something to do with it.

I wonder if it's possible that the code is too specialized, if I can put it that way, in that it won't run well on a variety of configurations and only on certain, preferred configurations?
ID: 66828
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,709,934
RAC: 9,107
Message 66829 - Posted: 9 Dec 2022, 10:34:26 UTC - in response to Message 66828.  

I've had a look, but I can't see anything that looks like a smoking gun. Both your failing wingmates have pretty hefty CPUs with high core counts: I would be surprised if that was in any way an 'unsupported configuration'. Even if it was, I'd expect an instant crash, rather than well into the run.

I think I'd rather like to look at what else is going on in those machines' environments. We'll never know about the anonymous one, but the other would be possible: but this project doesn't seem to be displaying the usual BOINC "Projects in which you are participating" section on user account pages. That would help.
ID: 66829
Glenn Carver
Joined: 29 Oct 17
Posts: 1049
Credit: 16,432,494
RAC: 17,331
Message 66830 - Posted: 9 Dec 2022, 11:06:17 UTC - in response to Message 66779.  

Those Virtualization entries don't necessarily mean that virtualization is being used, just that BOINC detects that VBox is installed and that CPU support is enabled. I actually think that since BOINC detects those things it means that it's not running in a virtualized environment. I don't think BOINC can tell if it's running inside a VM or on bare metal.
The OpenIFS apps do not use virtualization. That would require a completely different app built around VirtualBox. There are no vbox apps available from CPDN. It's just boinc saying what's available on your system in case you want to subscribe to projects with vbox apps.
ID: 66830