climateprediction.net (CPDN)

Thread 'OpenIFS Discussion'

Message boards : Number crunching : OpenIFS Discussion

Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,767,175
RAC: 3,168
Message 66665 - Posted: 30 Nov 2022, 14:37:09 UTC - in response to Message 66664.  

As I said in my earlier message 66661, which klepel only partially quotes, disk limits can be checked either on the server before the task is issued, or on the client before the task is run. Different checks may be applied at either stage: we need to consider them as separate problems.
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,727,545
RAC: 12,617
Message 66666 - Posted: 30 Nov 2022, 14:41:08 UTC - in response to Message 66665.  

As I said in my earlier message 66661, which klepel only partially quotes, disk limits can be checked either on the server before the task is issued, or on the client before the task is run. Different checks may be applied at either stage: we need to consider them as separate problems.
Yes, apologies, I read both messages too quickly the first time. Have edited my response now.
klepel

Joined: 9 Oct 04
Posts: 82
Credit: 69,990,648
RAC: 4,159
Message 66667 - Posted: 30 Nov 2022, 15:14:22 UTC - in response to Message 66665.  

As I said in my earlier message 66661, which klepel only partially quotes, disk limits can be checked either on the server before the task is issued, or on the client before the task is run. Different checks may be applied at either stage: we need to consider them as separate problems.
Richard, you know better than I do how BOINC works. My point is: it seems to me that the OpenIFS model tells BOINC it needs more disk space than it actually uses, and as several have pointed out, this means the disk space assigned to BOINC has to be quite large. This is the case with WSL2 and Linux computers alike. Some of my Linux installations are dual-boot on small SSDs, so there is no spare 40 GB of disk space for BOINC alone (just checked: one of my Linux computers does not download OpenIFS because it is 7 GB short). As I understand Glenn, he has a lot of WUs to run, so I am trying to unlock more computers for climateprediction.net.
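(For anyone wanting to check the same thing on their own machine, a quick sketch; it assumes the boinccmd tool is installed and that the data directory is the Debian/Ubuntu default, /var/lib/boinc-client:)

# Ask the client how much disk each project is using, as the client sees it
boinccmd --get_disk_usage

# Free space on the filesystem holding the BOINC data directory (adjust the path for your distro)
df -h /var/lib/boinc-client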
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 66668 - Posted: 30 Nov 2022, 15:58:41 UTC - in response to Message 66637.  

I've got a few of these new units. So far two completed ok and two with errors.
The first error log ends with:
Uploading trickle at timestep: 1900800
00:22:36 STEP 530 H= 530:00 +CPU= 15.541
double free or corruption (out)

The other:
18:58:37 STEP 482 H= 482:00 +CPU= 10.168
free(): invalid next size (fast)
Ah! Excellent. I've been trying to understand why some tasks are apparently stopping with nothing in the stderr.txt returned to the server to explain why it stopped.

@DarkAngel - can you tell me which resultids those were so I can look them up?
Also, what machine & OS are you using these on?

This kind of error message indicates a memory problem, often caused by a bug in the code, but I've also seen it caused by certain versions of compilers/system libraries. I've never seen it with the model itself, but then I've never run the model on such a wide range of systems as this. It could also be the wrapper code we use.

Quick question. When the tasks are running, if you do 'ps -ef' you should see the same number of 'master.exe' processes as 'oifs_43r3_ps_1.01_x86_64-pc-linux-gnu'. The latter is the 'controller' for the model itself (master.exe). Do you have the same number of each? I ask because we know of one issue that can kill the 'oifs_43r3....' process running but still leave the model 'master.exe' running.
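(For example, something along these lines; a small sketch assuming the process names above, where the bracket trick just stops grep from counting itself:)

# Count running model processes and wrapper processes; the two totals should match
ps -ef | grep -c '[m]aster.exe'
ps -ef | grep -c '[o]ifs_43r3_ps'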

Thanks for your help.

I received a "double free or corruption (out)" error on this task https://www.cpdn.org/cpdnboinc/result.php?resultid=22247251 around step 1539.

Another problem has occurred on the same PC. This time the task apparently ran to the end (it got to step 2952, as listed in stderr.txt and ifs.stat) but never completed/reported. The "master.exe" associated with this process is shown as defunct in 'ps -ef' output, and the task in BOINC Manager is stuck at 3.256% progress with CPU time continuing to increase. Task: https://www.cpdn.org/cpdnboinc/result.php?resultid=22247938 I'm going to suspend this task since it is blocking others from running. If you need anything from the slots directory, let me know.

Four other tasks have run successfully to completion on this same PC.
Ryzen 5 5600 with 32 GB of DDR4 3200 running fully updated Ubuntu 20.04 LTS.
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 66670 - Posted: 30 Nov 2022, 19:17:24 UTC - in response to Message 66651.  
Last modified: 30 Nov 2022, 19:28:22 UTC

AndreyOR wrote:
xii5ku wrote:
... request to suspend the task and wait until it did. ...
At what time point (BOINC elapsed time) did you suspend the referenced task, about 16 minutes?
Sorry, I did not take note of that detail. There were at least 25 tasks, perhaps more, all with different elapsed times.
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 66671 - Posted: 30 Nov 2022, 19:18:06 UTC - in response to Message 66657.  
Last modified: 30 Nov 2022, 19:26:52 UTC

Glenn Carver wrote:
We can reduce the data output from the model, but maybe we need to control how many tasks run at once on a volunteer machine?
I don't think the latter is feasible, as there is no way to know which hosts share what bandwidth. BOINC keeps track of a sort of recent average transfer speed for each host, but IMO that figure is prone to serious measurement errors. I only noticed this BOINC feature at a project where it went horribly wrong; I don't recall the details. Furthermore, the server side can at most control a host's number of tasks in progress ("host" as in client instance, not as in physical host), whereas concurrency can only be controlled on the client side.

What you can do however is to put it into the FAQ. We do know that there are 1.72 GB of output per task, and we do know the recent average:min:max runtime across all recently reporting clients (16:9:40 h), so we do know what sort of upload bandwidth per each running task a prospective contributor should have to spare. (1.72 GB / 16:9:40 h = 250:450:100 kbit/s)
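(A rough check of those numbers, assuming the 1.72 GB of output per task and the 16 h / 9 h / 40 h runtimes quoted above:)

# Upload rate needed per running task, for the average, minimum and maximum runtime
for hours in 16 9 40; do
  awk -v h="$hours" 'BEGIN { printf "%2d h -> about %.0f kbit/s\n", h, 1.72e9*8 / (h*3600) / 1000 }'
done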

Edit: BTW, as soon as the BOINC client has got a certain number of tasks in "uploading" state at a project, it will stop requesting new work at that project. Though an OpenIFS contributor may already have a possibly frightening queue of files waiting for transfer at that point.
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 66672 - Posted: 30 Nov 2022, 19:50:50 UTC - in response to Message 66653.  
Last modified: 30 Nov 2022, 19:56:06 UTC

Dave Jackson wrote:
Just thinking, if there are problems with some versions of GCC on hosts, maybe long term the answer would be to have Linux hosts as well as Windows ones use VB?
The application binary is statically linked, so whatever the particular issues are, are they really related to the software versions installed on the host?

Also keep in mind that, so far, all vboxwrapper-based applications out there are a pain in the rear to deal with. To somewhat varying degrees, but in the end they are all trouble. For many users, a vboxwrapper-enabled BOINC host is difficult to set up. The applications are generally fragile (among other things, as I have been told, because project developers tend to ship outdated, known-buggy vboxwrapper versions; that is one point project developers presumably could improve on). Virtualized applications are also naturally more resource-hungry than native applications (mainly disk and RAM). Last but not least, vboxwrapper takes away quite a bit of the BOINC client's resource management (most notoriously: 1. process priority control, so computers running vbox apps can easily become unresponsive; 2. network traffic from inside the VM happens entirely behind the client's back).

There seem to be some proponents of vboxwrapper out there, but they must be a small minority, ignorant of the real trouble many folks have with such applications.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66673 - Posted: 30 Nov 2022, 19:54:11 UTC - in response to Message 66671.  

What you can do however is to put it into the FAQ. We do know that there are 1.72 GB of output per task, and we do know the recent average:min:max runtime across all recently reporting clients (16:9:40 h), so we do know what sort of upload bandwidth per each running task a prospective contributor should have to spare. (1.72 GB / 16:9:40 h = 250:450:100 kbit/s)


I am not disagreeing with you here. But I do not know any of the things that you say we do know. You may well be right. Where do "we" get your numbers?

I can watch the trickles go up, and they take about 5 seconds each. If each task sends 122 of those, that comes to 610 seconds. Now, if my Internet connection runs at 75 megabits per second, I could send 45.75 Gbits (45.75 x 10^9 bits, i.e. 5.72 GBytes) in that amount of time. I guess there could easily be 67% overhead in sending a 5-second message all the way to the server (wherever it is) and its sending me back an acknowledgement.

But for me, 5 seconds spent sending a trickle every 8 minutes or so (per process) is not much.
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 66674 - Posted: 30 Nov 2022, 20:02:16 UTC - in response to Message 66673.  

Jean-David Beyer wrote:
xii5ku wrote:
We do know that there are 1.72 GB of output per task,
Where do "we" get your numbers?
During a normal error-free run, each task produces exactly 123 output files; 122 are ≈14.25 MB, and 1 is ≈24.4 MB after zipping/ before upload. Somebody correct me if this is not true for all workunits which were issued so far, or have yet to be issued.
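(Which is where the 1.72 GB per task comes from, treating the sizes as binary megabytes:)

# 122 trickle zips of ~14.25 MB plus one final zip of ~24.4 MB
awk 'BEGIN { printf "%.2f GB per task\n", (122*14.25 + 24.4) / 1024 }'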
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66675 - Posted: 30 Nov 2022, 20:08:25 UTC - in response to Message 66672.  

Dave Jackson wrote:

Just thinking, if there are problems with some versions of GCC on hosts, maybe long term the answer would be to have Linux hosts as well as Windows ones use VB?

Also keep in mind, so far all vboxwrapper based applications out there are a pain in the rear to deal with.


Right on! I will not do it. But the current Oifs stuff works just fine for me. Eight completed successfully so far; three currently running with over 1.5 hours on them; no failures.
Dark Angel

Joined: 31 May 18
Posts: 53
Credit: 4,725,987
RAC: 9,174
Message 66676 - Posted: 30 Nov 2022, 20:11:27 UTC

Another error, this time right at the end of the run. Same host.

https://www.cpdn.org/result.php?resultid=22245239

The child process terminated with status: 0
Moving to projects directory: /home/michael/media/BOINC/slots/7/ICMGGhpi1+002952
Moving to projects directory: /home/michael/media/BOINC/slots/7/ICMSHhpi1+002952
Moving to projects directory: /home/michael/media/BOINC/slots/7/ICMUAhpi1+002952
Adding to the zip: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_12163296/ICMGGhpi1+002952
Adding to the zip: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_12163296/ICMGGhpi1+002940
Adding to the zip: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_12163296/ICMGGhpi1+002928
Adding to the zip: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_12163296/ICMUAhpi1+002928
Adding to the zip: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_12163296/ICMUAhpi1+002940
Adding to the zip: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_12163296/ICMSHhpi1+002928
Adding to the zip: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_12163296/ICMUAhpi1+002952
Adding to the zip: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_12163296/ICMSHhpi1+002940
Adding to the zip: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_12163296/ICMSHhpi1+002952
Zipping up the final file: /home/michael/media/BOINC/projects/climateprediction.net/oifs_43r3_ps_0207_2021050100_123_945_12163296_0_r967272684_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
00:31:56 (2177745): called boinc_finish(0)
free(): invalid pointer
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66679 - Posted: 30 Nov 2022, 22:24:59 UTC - in response to Message 66674.  

During a normal error-free run, each task produces exactly 123 output files; 122 are ≈14.25 MB, and 1 is ≈24.4 MB after zipping/ before upload.


I infer that means I can upload 14.25 MegaBytes in about 5 seconds including the time to get the acknowledgement from the server! That is 114 Megabits. So 22.8 Megabits per second. Since my ISP provides me a 75 Megabit/second connection, both up and down, it looks as though it has no trouble transmitting those trickles. I am presently running three of those tasks, and each takes about 8 minutes between trickles.
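(The same arithmetic as a one-liner:)

# One ~14.25 MB trickle zip uploaded in about 5 seconds
awk 'BEGIN { bits = 14.25*8; printf "%.0f Mbit in 5 s -> %.1f Mbit/s\n", bits, bits/5 }'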
Conan

Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 66681 - Posted: 30 Nov 2022, 23:13:22 UTC
Last modified: 30 Nov 2022, 23:23:04 UTC

As an experiment, I have downloaded a work unit to my 4 core 8 GB Linux computer to see how it would run.

The computer is running other BOINC projects and at the moment is running LODA and PRIVATE GFN SEARCH plus iThena.Measurements and WUProp@Home.
iThena.Measurements and WUProp are non-CPU-intensive. PRIVATE GFN SEARCH uses minimal resources, less than 50 kB of RAM, to run; LODA, however, is different and uses 1 GB of RAM per work unit.

When it started, the Climate model maxed out my 8 GB and used half my swap (7.6 GB, so about 3 to 4 GB), alongside the other BOINC projects.

So the computer slowed to a crawl but kept running.

Once settled down, the Climate model is now using 2 to 4.5 GB and no swap, even with 3 LODA work units running as well, but the machine does start to lag a lot. With only 2 LODA, 1 PRIVATE GFN SEARCH and 1 Climate OpenIFS running, it is quite usable.

The Open IFS Climate model is now at 76.425% after 13 hours with about 4 1/2 hours or so to go.

So it can be done on 8 GB of memory, but I would not recommend it if you also want to use the computer, because you can go to sleep waiting for the screen to change.

As an aside, I have been having no trouble with all the trickles from 5 work units (now 3, as 2 finished); they go as soon as they are ready.
I am using a hybrid fibre-to-the-node broadband system, with copper cable to the house, giving around 15 MB upload and 25+ MB download (both on good days with low usage by others on the ISP network).

I will stick to my Ryzen 5900X with 64 GB RAM; it is much less hassle, and even running 4 at a time does not use over 20 GB.

Conan
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 66683 - Posted: 1 Dec 2022, 3:10:10 UTC - in response to Message 66681.  

I will stick to my RYZEN 5900x with 64 GB RAM, much less hassle even running 4 at a time does not use over 20 GB.


On my Dell T5820 with 64 GBytes of RAM, my machine is using 13.940 GBytes of RAM for everything, which means
3 of the Oifs CPDN jobs, 2 Einstein jobs, 3 WCG jobs, 2 Milky jobs, and 2 Universe jobs. 156 MBytes of swap have been used since I booted the machine about 9 days ago. I do not know what I did to use that little bit of swap.

20 GB / 4 = 5 GB
14 GB / 3 = 4.67 GB

So this is not that big of a memory hog. But it is hitting the processor cache pretty hard.

Memory 	62.28 GB
Cache 	16896 KB

# perf stat -aB -e cache-references,cache-misses 
 Performance counter stats for 'system wide':

    20,338,306,817      cache-references                                            
     9,955,631,115      cache-misses              #   48.950 % of all cache refs    

      63.552913728 seconds time elapsed

Conan

Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 66684 - Posted: 1 Dec 2022, 5:11:39 UTC

Experiment successful, work unit completed without error in a shade under 18 hours.

The time may have been due to how loaded up the processor was during the run, but it's still good.

Don't know about the cache hits, as the experiment was done on an older Intel i5. My newer Ryzen, I believe, has a larger cache, but without looking it up I don't know what it is either.

The 2 still on the Ryzen are paused at the moment due to some PrimeGrid work I need to do; they both still have 33% left to run.

Conan
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 66685 - Posted: 1 Dec 2022, 6:07:02 UTC
Last modified: 1 Dec 2022, 6:42:17 UTC

Another thought about potential vboxwrapper applications (for Windows or otherwise): here at CPDN, task runtimes of several days or weeks are the norm, with the current OpenIFS batch still at about 0.7 days. Going by the quality of vboxwrapper-based applications at other projects, the failure percentage of vboxwrapper-based CPDN work, with runtimes as long as are typical at CPDN, would be absurdly high in an environment as fragile as vboxwrapper, unless CPDN's developers can somehow achieve a robustness of the wrapper that far surpasses what the projects already using vboxwrapper have managed so far.

Oh, and another thought on OpenIFS on Windows: What if you port OpenIFS to Cygwin?
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 66686 - Posted: 1 Dec 2022, 10:29:52 UTC - in response to Message 66685.  

I don't see the advantage of Cygwin over WSL, which wouldn't require rebuilding anything. Does BOINC run on Cygwin without the user having to build it from source? I agree that getting VB set up correctly for BOINC projects is not straightforward, though once I had done it for LHC I found that all my VB tasks for projects that use it completed successfully. I don't understand enough to know why the default setup isn't sufficient for running these tasks, or why the projects don't have a straightforward howto that explains the process of setting it up correctly from Linux.
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,727,545
RAC: 12,617
Message 66687 - Posted: 1 Dec 2022, 11:24:02 UTC - in response to Message 66668.  

geophi:
I received a "double free or corruption (out)" error on this task https://www.cpdn.org/cpdnboinc/result.php?resultid=22247251 around step 1539.

Another problem has occurred on the same PC. This time, apparently the task ran to the end (got to step 2952 (listed in stderr.txt and ifs.stat), but never completed/reported. The "master.exe" associated with this process is labeled as defunct in ps -ef master, and the task in boinc manager has a progress of 3.256% (stuck) with CPU time continuing to increase. Task: https://www.cpdn.org/cpdnboinc/result.php?resultid=22247938 I'm going to suspend this task since it is blocking others from running. If you need anything from the slots directory, let me know.
This is typical of the problems we see. The 'wrapper' process ('oifs_43r3_*'), which sits between the client and the model, dies, but for some as-yet undetermined reason it does not cleanly shut the model down, so you see the model process 'master.exe' still running. If you see a stuck process, just kill it rather than risk it corrupting the slot directory.

To work out which 'master.exe' to kill if it's stuck, use 'ps -elf', look at the parent process id of each 'master.exe', and check whether there is an 'oifs_43r3_*' process with that same process id. If you can't find one, that's the master.exe to kill. Hope that makes sense.
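(A rough sketch of that check, assuming the process names above; it only reports candidates and leaves the actual kill to you:)

# Flag master.exe processes whose parent is no longer an oifs_43r3_* wrapper
for pid in $(pgrep -x master.exe); do
  ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
  if ! ps -o comm= -p "$ppid" | grep -q '^oifs_43r3'; then
    echo "master.exe PID $pid (parent $ppid) looks orphaned"
  fi
done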

I think what's happened is that CPDN have not set the memory usage limit high enough and, depending on which process does what and when, it can blow past the limit. It's a working theory I want them to test.
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,727,545
RAC: 12,617
Message 66688 - Posted: 1 Dec 2022, 11:38:38 UTC - in response to Message 66685.  

Another thought about potential vboxwrapper applications (for Windows or otherwise): Here at CPDN, task runtimes of several days or weeks are the norm, with the current OpenIFS batch still 0.7 days. Going by the quality of vboxwrapper based applications at other projects, the failure percentage of vboxwrapper based CPDN work — with runtimes as long typical at CPDN — will be absurdly high in such a fragile environment that vboxwrapper is. Unless CPDN's developers somehow can achieve a robustness of the wrapper which by far surpasses the level which other projects who already use vboxwrapper have achieved so far.

Oh, and another thought on OpenIFS on Windows: What if you port OpenIFS to Cygwin?
The problem I encountered with vbox was on the boinc client side (latest version) which incorrectly set security in the systemctl startup, not with the virtualbox application itself. LHC tasks run happily for several days on all my mixed hardware. The implementation of OpenIFS in virtualbox looks straightforward (as usual it'll be the boinc server changes which will take time).

I've been using virtualbox and containers for many years with OpenIFS. I much prefer them as I then have more control over the environment the model needs to run successfully. It may well be the only sane way I can implement multi-core OpenIFS for CPDN, rather than deal with the variety of OSes CPDN has on its books.

I agree with your other comment about providing more information about these batches on the CPDN website/forums. I intend to talk to the senior scientist (who's away at the moment) about this. However, not everyone reads FAQs/forums etc., so I really do want to get to an 'it-just-works' setup that doesn't involve people coming to these forums to get something working that should just run in their spare time (which is how I want it to work for me!).

Cygwin is a non-starter. There are better approaches these days such as WSL.
Glenn Carver

Joined: 29 Oct 17
Posts: 1052
Credit: 16,727,545
RAC: 12,617
Message 66689 - Posted: 1 Dec 2022, 11:48:29 UTC

Update.
After meeting with CPDN yesterday, the disk and memory requirements for these tasks will be revised: memory up and disk down. What was not taken into account when setting the memory limit was the additional amount required by the wrapper code and all the BOINC functions it uses (such as zipping). Hopefully this will eliminate some of the memory errors.

The plan is to put out a repeat of the first batch with corrected limits to check how it performs before sending out the rest of this experiment.

On trickles, I agree these longer (3-month) runs are producing too many trickle files, which I'll adjust. However, I looked at the output file size per output instance and it's reasonable and at the lower limit of what the scientist needs, so I am reluctant to change it.

Question for ADSL people: knowing your bottleneck is the network, are you happy just reducing the number of tasks running concurrently? What sustainable data-flow rate would you be happy with? Give me a number to work with.