Thread 'OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected!'

Author	Message
Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647	Message 67353 - Posted: 5 Jan 2023, 9:19:10 UTC - in response to Message 67352. If it isn't checkpointing when I wake up I'll know that bit has worked and reduce the figure to induce checkpointing. It was checkpointing. So yes, BOINC needs fixing. Is it checkpointing, or is it reporting checkpoints to the log? ID: 67353 · Reply Quote

PDW Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,775,367 RAC: 72,227	Message 67354 - Posted: 5 Jan 2023, 9:22:16 UTC - in response to Message 67353. Reporting certainly, whether it was actually checkpointing as it should if it was supposed to I don't know. ID: 67354 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647	Message 67356 - Posted: 5 Jan 2023, 9:51:01 UTC - in response to Message 67354. Reporting certainly, whether it was actually checkpointing as it should if it was supposed to I don't know. How often? If it's once per second, it's a false report. Real checkpoints happen every few minutes. ID: 67356 · Reply Quote

PDW Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,775,367 RAC: 72,227	Message 67357 - Posted: 5 Jan 2023, 10:38:57 UTC - in response to Message 67356. Reporting certainly, whether it was actually checkpointing as it should if it was supposed to I don't know. How often? If it's once per second, it's a false report. Real checkpoints happen every few minutes. Every second. In this instance it should have been every 10 hours. ID: 67357 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,649,638 RAC: 12,396	Message 67364 - Posted: 5 Jan 2023, 14:58:25 UTC - in response to Message 67330. Last modified: 5 Jan 2023, 15:03:07 UTC The original problem was that the Event Log was reporting that the model was checkpointing every second. It should not be saying that. It should be saying that the model has checkpointed if and only if a real life checkpoint - all those restart files - has been completed in the last second. My interpretation is now that the model (or, probably more accurately, the wrapper) reports the current state of play to the BOINC client every second - timings, progress made, changes in status, anything like that. The client will analyse that, store what needs storing, and process all those changes in state. The client then has a snapshot of the overall status, and can respond when the Manager asks - again every second - for a summary fit for display to the user. The bits of XML I posted are a small fraction of all that. For the current bug-hunt, I think the critical data points are "checkpoint_cpu_time" and "current_cpu_time" - both are the number of seconds of CPU work done since the task started. If they are identical, I'm suggesting that the client will notice that fact, and interpret it as "a checkpoint has happened in the last second", and report that fact in the Event Log, and pass it to the Manager for display to the user. The model/wrapper should only report a change in checkpoint_cpu_time once per restart file dump. Right got it. I think I can explain what's going on. 
The wrapper code makes this call every 1 sec back to the client: // Provide the current cpu_time to the BOINC server (note: this is deprecated in BOINC) boinc_report_app_status(current_cpu_time,current_cpu_time,fraction_done); boinc_fraction_done(fraction_done); To see what this does, we have to look at the API for compound apps here: https://boinc.berkeley.edu/trac/wiki/CompoundApps and look at the very bottom for the call specs: boinc_report_app_status( double cpu_time, // CPU time since start of WU double checkpoint_cpu_time, // CPU time at last checkpoint double fraction_done ); As we can see from above, the checkpoint_cpu_time passed is just the current cpu time (and the fraction_done). So it's just the wrapper saying 'hello' to the client and nothing to do with checkpointing that the model might do. Not difficult to add in the correct checkpointing time but it's really not a priority right now. I don't know what the replacement for the deprecated boinc_report_app_status call is. And I'm not certain this qualifies as a CompoundApp as boinc seems to describe it. I'll discuss with Andy when he has more time. For completeness, every 10 secs the wrapper will check whether any trickle messages & uploads need processing (just in case you find a 10 sec loop somewhere in the logs!). P.S. Richard - while I'm here, CPDN told me this morning there's a BOINC meeting coming up, online only. I forget the date but CPDN are thinking about what to present & what issues they might like to bring up. Perhaps any thoughts might be worth another thread (or PM?). P.P.S. I've created an issue on github for this so we don't forget about it. ID: 67364 · Reply Quote
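To make the effect of that call concrete, here is a rough sketch of the status values the client would receive. The element names checkpoint_cpu_time and current_cpu_time are the ones Richard quoted from his XML; fraction_done is the third argument of the call. The numbers are invented, the exact message layout is an assumption, and the idea that the client treats matching values as a fresh checkpoint is Richard's interpretation rather than something confirmed here.

```xml
<!-- Illustrative values only; the layout of the real message is assumed. -->

<!-- As reported now: checkpoint_cpu_time is just a copy of the current
     CPU time, so the two values match on every one-second report. -->
<current_cpu_time>3600.0</current_cpu_time>
<checkpoint_cpu_time>3600.0</checkpoint_cpu_time>
<fraction_done>0.10</fraction_done>

<!-- As it would look if the wrapper passed the CPU time of the last real
     restart-file dump: the checkpoint value lags behind and only moves
     when the model actually checkpoints. -->
<current_cpu_time>3601.0</current_cpu_time>
<checkpoint_cpu_time>3000.0</checkpoint_cpu_time>
<fraction_done>0.10</fraction_done>
```

In the second case the client would have no reason to log a checkpoint every second, which is the behaviour the thread expects.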

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647	Message 67365 - Posted: 5 Jan 2023, 15:45:32 UTC - in response to Message 67364. Last modified: 5 Jan 2023, 15:46:25 UTC Right got it. I think I can explain what's going on. The wrapper code makes this call every 1 sec back to the client: // Provide the current cpu_time to the BOINC server (note: this is deprecated in BOINC) boinc_report_app_status(current_cpu_time,current_cpu_time,fraction_done); boinc_fraction_done(fraction_done); To see what this does, we have to look at the API for compound apps here: https://boinc.berkeley.edu/trac/wiki/CompoundApps and look at the very bottom for the call specs: boinc_report_app_status( double cpu_time, // CPU time since start of WU double checkpoint_cpu_time, // CPU time at last checkpoint double fraction_done ); As we can see from above, the checkpoint_cpu_time passed is just the current cpu time (and the fraction_done). So it's just the wrapper saying 'hello' to the client and nothing to do with checkpointing that the model might do. Not difficult to add in the correct checkpointing time but it's really not a priority right now. Yes, passing the same data to two different variables certainly explains what we've been observing! I don't know what the replacement for the deprecated boinc_report_app_status call is. And I'm not certain this qualifies as a CompoundApp as boinc seems to describe it. I'll discuss with Andy when he has more time. I can't understand the deprecation either. There's been a change of terminology - BOINC now supplies its own wrapper app (in precompiled form, if needed), which I suppose is designed to replace the need for project-developed compound apps. And the new, centrally-provided wrapper app contains precisely the same code: https://github.com/BOINC/boinc/blob/master/samples/wrapper/wrapper.cpp#L1317 No mention of a deprecation there. P.S. 
Richard - while I'm here, CPDN told me this morning there's a BOINC meeting coming up, online only. I forget the date but CPDN are thinking about what to present & what issues they might like to bring up. Perhaps any thoughts might be worth another thread (or PM?). P.P.S. I've created an issue on github for this so we don't forget about it. The meeting is probably the BOINC (virtual) Workshop 2023 to be held March 1 and 8, which was announced a couple of days ago at https://boinc.berkeley.edu/forum_thread.php?id=14887. I'll go and look at your issue next. ID: 67365 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647	Message 67366 - Posted: 5 Jan 2023, 15:51:39 UTC Last modified: 5 Jan 2023, 16:11:12 UTC Which repo? Oh, OK, found you now. ID: 67366 · Reply Quote

ktf Send message Joined: 28 Jun 07 Posts: 6 Credit: 929,653 RAC: 20,169	Message 67633 - Posted: 13 Jan 2023, 6:46:35 UTC Hi all, I have a 4-core CPU with 8GB of memory, no swap file. BOINC regularly crashes when it chooses to run 3 or more OpenIFS tasks concurrently. Is there any way to instruct BOINC never to run more than one Climateprediction task concurrently, even if other projects have no work available? I'd like to run a little Climateprediction alongside other projects, but I can't seem to find a way to make this work. ID: 67633 · Reply Quote

AndreyOR Send message Joined: 12 Apr 21 Posts: 318 Credit: 14,987,679 RAC: 9,968	Message 67634 - Posted: 13 Jan 2023, 7:22:16 UTC - in response to Message 67633. Last modified: 13 Jan 2023, 7:27:36 UTC ktf, My suspicion is that that machine doesn't have enough RAM to run even 1 task. OIFS tasks require 5GB RAM which leaves only 3GB overhead and I don't believe that's sufficient. I have an older 16GB RAM PC and was getting frequent errors trying to run 2 tasks. Once I reduced it to 1 I've had no problems. I believe one needs around 10GB RAM overhead in order to run these tasks with minimal risk of failure. If you still want to try, I'd make sure the PC is unused and pretty much nothing else, background or foreground, is running on it except the OIFS task. I'd also make sure that the task doesn't get paused or interrupted in any way until it finishes, which might take a couple of days with the older CPU on that machine. To limit it to 1 task, create a file called app_config.xml with the following entry and place it in the CPDN project directory, usually /var/lib/boinc/projects/climateprediction.net. Once you have the file there go to Options in BOINC manager and click on Read config files or run the equivalent boinccmd command. <app_config> <project_max_concurrent>1</project_max_concurrent> </app_config> For more info on creating BOINC configuration files see: https://boinc.berkeley.edu/wiki/Client_configuration ID: 67634 · Reply Quote

wateroakley Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,510,803 RAC: 10,061	Message 67640 - Posted: 13 Jan 2023, 10:10:44 UTC - in response to Message 67633. ktf I’d concur with Andrey. 8GB ram is less than the combined minimum recommended for Ubuntu (4GB) and a current openIFS task (about 5GB). Without enough ram, your three or four cpdn tasks will get continually sent to disc whenever a swap occurs - that’s a recipe for crashes. One openIFS task may be ok. From experience of Ubuntu and cpdn in a VM, it’s not going to be happy with anything less than 10 or 11 GB ram. Your celeron cpu should support 16GB ram, subject to the mobo and chipset limitations. That should let you run one or two of the current openIFS tasks. You can set the BOINC options to limit the cpu count to nn% and thereby limit it to one or two tasks. However, the upcoming ram requirements, which Glenn has indicated are much higher for future models, are going to exceed the maximum possible ram of your celeron cpu. H. ID: 67640 · Reply Quote
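For reference, the CPU percentage limit mentioned above can also be set in a local override file instead of through the Manager dialogs. This is a minimal sketch, assuming a 4-core machine like ktf's and the Linux data directory /var/lib/boinc quoted elsewhere in this thread; the values are illustrative, not recommendations.

```xml
<!-- global_prefs_override.xml, placed in the BOINC data directory
     (assumed here to be /var/lib/boinc) -->
<global_preferences>
    <!-- 25% of 4 cores = 1 CPU available to BOINC (illustrative value) -->
    <max_ncpus_pct>25.0</max_ncpus_pct>
    <!-- the option from the thread title: leave non-GPU tasks in memory -->
    <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
```

The client picks the file up via 'Options -> Read local prefs file' in the Manager, or `boinccmd --read_global_prefs_override`. Bear in mind that this caps the CPU count for every project, not just CPDN, so it is a blunter tool than a per-project limit.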

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,649,638 RAC: 12,396	Message 67643 - Posted: 13 Jan 2023, 11:12:27 UTC - in response to Message 67633. Last modified: 13 Jan 2023, 11:15:11 UTC I think there's a design flaw/bug with boinc. It should not be starting 3 openifs tasks concurrently because if we add up the total memory of 3 tasks (3x5GB) that's easily more than 8GB. The task memory only seems to be used to decide whether to download the task in the first place. What then happens is boinc looks at how many cpus it's allowed to use, starts up that many and then only suspends the tasks when it sees the memory go higher than your client limit. By that point, 3 running openifs models have run out of memory and crashed. I've been looking at the logs of failed tasks and we see a lot of failures on low memory machines (real and virtual machines) for this reason. Out of interest, in boincmgr, what are your Memory limits 'when computer is in use/not in use' set to? If you want to run just one OpenIFS at a time, you can create an 'app_config.xml' file but only if you don't intend doing much with the machine as it will take most of your memory. Assuming your boinc client data directory is /var/lib/boinc, use 'sudo' if you need to, create the following file in your /var/lib/boinc/projects/climateprediction.net directory, and then use 'options->read config files' in boincmgr to update it: <app_config> <project_max_concurrent>1</project_max_concurrent> <report_results_immediately/> <app> <name>oifs_43r3</name> <max_concurrent>1</max_concurrent> </app> <app> <name>oifs_43r3_ps</name> <max_concurrent>1</max_concurrent> </app> <app> <name>oifs_43r3_bl</name> <max_concurrent>1</max_concurrent> </app> </app_config> This will stop the client from starting up more than 1 OpenIFS task on your system, regardless of how many cpus you make available. Note! This will not stop boinc from running tasks from other projects as well so be careful. 
As an ex-sysadmin, no swap with 8GB memory is living on the edge, you should have at least as much swap again. I always configure at least 16GB swap regardless of how much memory I have. Swap is useful as I think suspend/hibernate uses swap? Hi all, I have a 4-core CPU with 8GB of memory, no swap file. BOINC regularly crashes when it chooses to run 3 or more OpenIFS tasks concurrently. Is there any way to instruct BOINC never to run more than one Climateprediction task concurrently, even if other projects have no work available? I'd like to run a little Climateprediction alongside other projects, but I can't seem to find a way to make this work. ID: 67643 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647	Message 67659 - Posted: 13 Jan 2023, 18:07:51 UTC - in response to Message 67643. I've had a random walk into the spaghetti forest, and emerged relatively unscathed. The memory checks, such as they are, seem to be in client/cpu_sched.cpp, which for a change seems to follow a fairly logical sequence. The client starts by creating a list of 'runnable' jobs, and then pares it down by numerous checks - first by throwing out the impossible ones (e.g. requires missing GPU), then various checks on urgency (missing deadline?), selects highest priority project, etc., etc. There's an early memory check at line 177, but it's crude: will this (single) task fit in the available RAM? If not, it probably shouldn't have been sent by the server, but just in case... And then I can't see anything else memory related until nearly the end. At line 1297, there's a test to "skip jobs whose working set is too large to fit in available RAM" - no mention of memory_bound. There's a debug log flag of <mem_usage_debug>, which would be triggered if 'won't fit' is ever enforced. I've looked at the working set size for running IFS tasks, and it seems to be around 3.5 GB smoothed, some way short of the 6.01 GB bound. Quite what size would be used for wss for a new task which has never run before is unclear. And that seems to be the end of it. I could try removing max_concurrent on my 32 GB machine, to see where it stops, but that might be an experiment best carried out on the dev server once we've gathered these production tasks in. ID: 67659 · Reply Quote
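For anyone wanting to watch for that 'won't fit' test themselves, the <mem_usage_debug> flag is switched on in the client's cc_config.xml. A minimal sketch, assuming the file lives in the BOINC data directory (/var/lib/boinc in the examples in this thread) and that no other options are already set in it:

```xml
<!-- cc_config.xml: enable memory-usage messages in the Event Log -->
<cc_config>
    <log_flags>
        <mem_usage_debug>1</mem_usage_debug>
    </log_flags>
</cc_config>
```

If a cc_config.xml already exists, only the flag line needs adding inside the existing <log_flags> section. 'Options -> Read config files' in the Manager applies the change without restarting the client.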

bernard_ivo Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,647,036 RAC: 1,892	Message 67694 - Posted: 14 Jan 2023, 10:44:58 UTC - in response to Message 67643. I'm new to the app_config wagon, but gave it a try. <app_config> <project_max_concurrent>4</project_max_concurrent> <report_results_immediately/> <app> <name>oifs_43r3</name> <max_concurrent>3</max_concurrent> </app> <app> <name>oifs_43r3_ps</name> <max_concurrent>3</max_concurrent> </app> <app> <name>oifs_43r3_bl</name> <max_concurrent>3</max_concurrent> </app> </app_config> I've got these messages in the event log and in BOINC notices: Sat 14 Jan 2023 12:38:35 PM EET | climateprediction.net | Your app_config.xml file refers to an unknown application 'oifs_43r3'. Known applications: 'hadam4', 'hadam4h', 'hadcm3s', 'hadsm4', 'oifs_43r3_ps' Sat 14 Jan 2023 12:38:35 PM EET | climateprediction.net | Your app_config.xml file refers to an unknown application 'oifs_43r3_bl'. Known applications: 'hadam4', 'hadam4h', 'hadcm3s', 'hadsm4', 'oifs_43r3_ps' and I did not get any WUs. I will turn on <file_xfer> and <sched_ops> to see what happens...in one hour ID: 67694 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647	Message 67696 - Posted: 14 Jan 2023, 11:10:38 UTC - in response to Message 67694. That's OK. We're only running oifs_43r3_ps tasks at the moment, so your machine won't have encountered either of the other two IFS variants yet. But you're prepared for when they do start arriving. ID: 67696 · Reply Quote

bernard_ivo Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,647,036 RAC: 1,892	Message 67697 - Posted: 14 Jan 2023, 11:14:47 UTC - in response to Message 67696. Last modified: 14 Jan 2023, 11:15:34 UTC Thanks Richard, But there are plenty of oifs_43r3_ps 's and I did crunch them on the same machine ID: 67697 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647	Message 67698 - Posted: 14 Jan 2023, 11:27:46 UTC - in response to Message 67694. That's why 'oifs_43r3_ps' is listed as a "Known application". It's only oifs_43r3 and oifs_43r3_bl which are unknown (yet). Your app_config will work as it is. <file_xfer> won't tell you anything useful until you get allocated work - I'd turn it off (it wastes a lot of space in the log). But <sched_op_debug> is useful, and relatively quiet. ID: 67698 · Reply Quote
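Taken literally, that advice would look something like the following in cc_config.xml. This is only a sketch of the two flags named in the post, assuming a file with no other settings in it; anything already present in an existing cc_config.xml should be kept.

```xml
<!-- cc_config.xml: log flags as suggested above -->
<cc_config>
    <log_flags>
        <!-- switched off until work is actually being allocated -->
        <file_xfer>0</file_xfer>
        <!-- extra detail on scheduler requests and replies -->
        <sched_op_debug>1</sched_op_debug>
    </log_flags>
</cc_config>
```

With <sched_op_debug> on, the Event Log should show what the client asked the server for on each request, which is usually the first place to look when no WUs arrive.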