Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected!
Author | Message |
---|---|
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
Is it checkpointing, or is it reporting checkpoints to the log? If it isn't checkpointing when I wake up I'll know that bit has worked and reduce the figure to induce checkpointing. It was checkpointing. |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,775,367 RAC: 72,227 |
Reporting certainly, whether it was actually checkpointing as it should if it was supposed to I don't know. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
Reporting certainly, whether it was actually checkpointing as it should if it was supposed to I don't know. How often? If it's once per second, it's a false report. Real checkpoints happen every few minutes. |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,775,367 RAC: 72,227 |
Every second. In this instance it should have been every 10 hours. Reporting certainly, whether it was actually checkpointing as it should if it was supposed to I don't know. How often? If it's once per second, it's a false report. Real checkpoints happen every few minutes. |
Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,649,638 RAC: 12,396 |
The original problem was that the Event Log was reporting that the model was checkpointing every second. It should not be saying that. It should be saying that the model has checkpointed if and only if a real life checkpoint - all those restart files - has been completed in the last second. Right, got it. I think I can explain what's going on. The wrapper code makes this call back to the client every 1 sec: // Provide the current cpu_time to the BOINC server (note: this is deprecated in BOINC) boinc_report_app_status(current_cpu_time,current_cpu_time,fraction_done); boinc_fraction_done(fraction_done); To see what this does, we have to look at the API for compound apps here: https://boinc.berkeley.edu/trac/wiki/CompoundApps and look at the very bottom for the call specs: boinc_report_app_status( double cpu_time, // CPU time since start of WU double checkpoint_cpu_time, // CPU time at last checkpoint double fraction_done ); As we can see from the call above, the checkpoint_cpu_time passed is just the current CPU time (along with the fraction_done). So it's just the wrapper saying 'hello' to the client, and it has nothing to do with any checkpointing that the model might do. It's not difficult to add in the correct checkpointing time, but it's really not a priority right now. I don't know what the replacement for the deprecated boinc_report_app_status call is. And I'm not certain this qualifies as a CompoundApp as BOINC seems to describe it. I'll discuss with Andy when he has more time. For completeness, every 10 secs the wrapper will check whether any trickle messages & uploads need processing (just in case you find a 10 sec loop somewhere in the logs!). P.S. Richard - while I'm here, CPDN told me this morning there's a BOINC meeting coming up, online only. I forget the date, but CPDN are thinking about what to present & what issues they might like to bring up. Perhaps any thoughts might be worth another thread (or PM?). P.P.S. I've created an issue on GitHub for this so we don't forget about it. |
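The fix described above comes down to remembering the CPU time of the last real checkpoint separately from the current CPU time. A minimal Python sketch of that bookkeeping follows. It is a model of the logic only, not the wrapper's C++ code: `StatusTracker`, `on_checkpoint` and `status_args` are hypothetical names, and the returned tuple simply mirrors the (cpu_time, checkpoint_cpu_time, fraction_done) order quoted from the CompoundApps page.

```python
from dataclasses import dataclass


@dataclass
class StatusTracker:
    """Keeps the last real checkpoint time apart from the running CPU time."""
    checkpoint_cpu_time: float = 0.0  # CPU time at the last completed checkpoint

    def on_checkpoint(self, cpu_time: float) -> None:
        # Call only once the model has finished writing its restart files.
        self.checkpoint_cpu_time = cpu_time

    def status_args(self, cpu_time: float, fraction_done: float):
        # Values in the order (cpu_time, checkpoint_cpu_time, fraction_done).
        return (cpu_time, self.checkpoint_cpu_time, fraction_done)


tracker = StatusTracker()
first = tracker.status_args(10.0, 0.01)    # no checkpoint written yet
tracker.on_checkpoint(600.0)               # restart files completed at 600 s
later = tracker.status_args(601.0, 0.05)   # one second after the checkpoint
```

With this separation the second value only changes when a checkpoint really completes, so a once-per-second status call would no longer look like a once-per-second checkpoint.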
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
Right got it. I think I can explain what's going on. Yes, passing the same data to two different variables certainly explains what we've been observing! I don't know what the replacement for the deprecated boinc_report_app_status call is. And I'm not certain this qualifies as a CompoundApp as boinc seems to describe it. I'll discuss with Andy when he has more time. I can't understand the deprecation either. There's been a change of terminology - BOINC now supplies its own wrapper app (in precompiled form, if needed), which I suppose is designed to replace the need for project-developed compound apps. And the new, centrally-provided wrapper app contains precisely the same code: https://github.com/BOINC/boinc/blob/master/samples/wrapper/wrapper.cpp#L1317 No mention of a deprecation there. P.S. Richard - while I'm here, CPDN told me this morning there's BOINC meeting coming up, online only. I forget the date but CPDN are thinking about what to present & what issues they might like to bring up. Perhaps any thoughts might be worth another thread (or PM?). The meeting is probably the BOINC (virtual) Workshop 2023 to be held March 1 and 8, which was announced a couple of days ago at https://boinc.berkeley.edu/forum_thread.php?id=14887. I'll go and look at your issue next. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
Which repo? Oh, OK, found you now. |
Send message Joined: 28 Jun 07 Posts: 6 Credit: 929,653 RAC: 20,169 |
Hi all, I have a 4-core CPU with 8GB of memory, no swap file. BOINC regularly crashes when it chooses to run 3 or more OpenIFS tasks concurrently. Is there any way to instruct BOINC never to run more than one Climateprediction task concurrently, even if other projects have no work available? I'd like to run a little Climateprediction alongside other projects, but I can't seem to find a way to make this work. |
Send message Joined: 12 Apr 21 Posts: 318 Credit: 14,987,679 RAC: 9,968 |
ktf, My suspicion is that that machine doesn't have enough RAM to run even 1 task. OIFS tasks require 5GB RAM, which leaves only 3GB overhead, and I don't believe that's sufficient. I have an older 16GB RAM PC and was getting frequent errors trying to run 2 tasks. Once I reduced it to 1 I've had no problems. I believe one needs around 10GB RAM overhead in order to run these tasks with minimal risk of failure. If you still want to try, I'd make sure the PC is unused and that pretty much nothing else, background or foreground, is running on it except the OIFS task. I'd also make sure that the task doesn't get paused or interrupted in any way until it finishes, which might take a couple of days with the older CPU on that machine. To limit it to 1 task, create a file called app_config.xml with the following entry and place it in the CPDN project directory, usually /var/lib/boinc/projects/climateprediction.net. Once you have the file there, go to Options in BOINC Manager and click on Read config files, or run the equivalent boinccmd command. <app_config> <project_max_concurrent>1</project_max_concurrent> </app_config> For more info on creating BOINC configuration files see: https://boinc.berkeley.edu/wiki/Client_configuration |
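For anyone who would rather generate the file than type it, here is a small Python sketch that builds the same app_config.xml text. `build_app_config` is a hypothetical helper name, not part of BOINC; only the `<app_config>` and `<project_max_concurrent>` elements come from the post above.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent: int) -> str:
    """Return app_config.xml text limiting the project to max_concurrent tasks."""
    root = ET.Element("app_config")
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_concurrent)
    ET.indent(root)  # pretty-print the tree; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


config_text = build_app_config(1)
print(config_text)
```

The printed text would then be saved as app_config.xml in the project directory mentioned above (which may need sudo), followed by Read config files in BOINC Manager.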
Send message Joined: 6 Aug 04 Posts: 195 Credit: 28,510,803 RAC: 10,061 |
ktf, I’d concur with Andrey. 8GB RAM is less than the combined minimum recommended for Ubuntu (4GB) plus a current OpenIFS task (about 5GB). Without enough RAM, your three or four CPDN tasks will continually be sent to disc whenever a swap occurs - that’s a recipe for crashes. One OpenIFS task may be OK. From experience of Ubuntu and CPDN in a VM, it’s not going to be happy with anything less than 10 or 11 GB RAM. Your Celeron CPU should support 16GB RAM, subject to the mobo and chipset limitations. That should let you run one or two of the current OpenIFS tasks. You can set the BOINC options to limit the CPU count to nn% and thereby limit it to one or two tasks. However, the RAM requirements of future models, which Glenn has indicated will be much higher, are going to exceed the maximum RAM your Celeron CPU can support. H. |
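The arithmetic behind those figures can be written down as a rough rule of thumb. The Python sketch below assumes the numbers quoted in this thread (about 4GB set aside for Ubuntu, about 5GB per OpenIFS task); `tasks_that_fit` is a hypothetical helper, and the result is a conservative estimate rather than a guarantee either way.

```python
import math


def tasks_that_fit(total_ram_gb: float, os_reserve_gb: float = 4.0,
                   per_task_gb: float = 5.0) -> int:
    """Whole number of tasks that fit after setting RAM aside for the OS."""
    free_gb = total_ram_gb - os_reserve_gb
    return max(0, math.floor(free_gb / per_task_gb))


for ram in (8, 16, 32):
    print(ram, "GB ->", tasks_that_fit(ram), "task(s)")
```

On these assumptions an 8GB machine has no comfortable room for even one task, which is why a single task there is borderline, while 16GB leaves room for two.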
Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,649,638 RAC: 12,396 |
I think there's a design flaw/bug with boinc. It should not be starting 3 openifs tasks concurrently because if we add up the total memory of 3 tasks (3x5GB) that's easily more than 8GB. The task memory only seems to be used to decide whether to download the task in the first place. What then happens is boinc looks at how many cpus it's allowed to use, starts up that many and then only suspends the tasks when it sees the memory go higher than your client limit. By that point, 3 running openifs models have run out of memory and crash. I've been looking at the logs of failed tasks and we see a lot of fails on low memory machines (real and virtual machines) for this reason. Out of interest, in boincmgr, what are your Memory limits 'when computer is in use/not in use' set to? If you want to run just one OpenIFS at a time, you can create an 'app_config.xml' file but only if you don't intend doing much with the machine as it will take most of your memory. Assuming your boinc client data directory is /var/lib/boinc, use 'sudo' if you need to, create the following file in your /var/lib/boinc/projects/climateprediction.net directory, and then use 'options->read config files' in boincmgr to update it: <app_config> <project_max_concurrent>1</project_max_concurrent> <report_results_immediately/> <app> <name>oifs_43r3</name> <max_concurrent>1</max_concurrent> </app> <app> <name>oifs_43r3_ps</name> <max_concurrent>1</max_concurrent> </app> <app> <name>oifs_43r3_bl</name> <max_concurrent>1</max_concurrent> </app> </app_config> This will stop the client from starting up more than 1 OpenIFS task on your system, regardless of how many cpus you make available. Note! This will not stop boinc from running tasks from other projects as well so be careful. As an ex-sysadmin, no swap with 8GB memory is living on the edge, you should have at least as much swap again. I always configure at least 16GB swap regardless of how much memory I have. 
Swap is useful as I think suspend/hibernate uses swap? |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
I've had a random walk into the spaghetti forest, and emerged relatively unscathed. The memory checks, such as they are, seem to be in client/cpu_sched.cpp, which for a change seems to follow a fairly logical sequence. The client starts by creating a list of 'runnable' jobs, and then pares it down by numerous checks - first by throwing out the impossible ones (e.g. requires missing GPU), then various checks on urgency (missing deadline?), selects highest priority project, etc., etc. There's an early memory check at line 177, but it's crude: will this (single) task fit in the available RAM? If not, it probably shouldn't have been sent by the server, but just in case... And then I can't see anything else memory related until nearly the end. At line 1297, there's a test to "skip jobs whose working set is too large to fit in available RAM" - no mention of memory_bound. There's a debug log flag of <mem_usage_debug>, which would be triggered if 'won't fit' is ever enforced. I've looked at the working set size for running IFS tasks, and it seems to be around 3.5 GB smoothed, some way short of the 6.01 GB bound. Quite what size would be used for wss for a new task which has never run before is unclear. And that seems to be the end of it. I could try removing max_concurrent on my 32 GB machine, to see where it stops, but that might be an experiment best carried out on the dev server once we've gathered these production tasks in. |
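To make the consequence concrete, here is a deliberately simplified Python model of a "skip jobs whose working set is too large" pass. It is not the code in client/cpu_sched.cpp: `pick_jobs` is a hypothetical function, the 90% usable-RAM figure is an assumption, and treating an unknown working set as zero is only a guess at one possible behaviour for brand-new tasks.

```python
def pick_jobs(wss_estimates_gb, available_ram_gb, max_cpus):
    """Greedy selection: skip any job whose working set no longer fits."""
    chosen, used = [], 0.0
    for index, wss in enumerate(wss_estimates_gb):
        if len(chosen) == max_cpus:
            break  # no free CPU left
        if used + wss > available_ram_gb:
            continue  # working set too large to fit in available RAM
        chosen.append(index)
        used += wss
    return chosen


# Three queued tasks with a smoothed working set of about 3.5 GB each, on a
# machine where 7.2 GB (an assumed 90% of 8 GB) is usable by BOINC.
started = pick_jobs([3.5, 3.5, 3.5], available_ram_gb=7.2, max_cpus=4)
peak_demand_gb = len(started) * 5.0  # roughly 5 GB peak per task, per the thread

# If a never-run task were given a working set of zero (assumption), all start.
started_unknown = pick_jobs([0.0, 0.0, 0.0], available_ram_gb=7.2, max_cpus=4)
```

In this model two tasks pass the check on the smoothed figure, yet their combined peak demand is already above 8 GB, and with no estimate at all nothing is held back. That matches the failures reported on low-memory machines, though only the real scheduler code can confirm the mechanism.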
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,647,036 RAC: 1,892 |
I'm new to the app_config wagon, but gave it a try. <app_config> <project_max_concurrent>4</project_max_concurrent> <report_results_immediately/> <app> <name>oifs_43r3</name> <max_concurrent>3</max_concurrent> </app> <app> <name>oifs_43r3_ps</name> <max_concurrent>3</max_concurrent> </app> <app> <name>oifs_43r3_bl</name> <max_concurrent>3</max_concurrent> </app> </app_config> I got this message in the event log and in BOINC notices: Sat 14 Jan 2023 12:38:35 PM EET | climateprediction.net | Your app_config.xml file refers to an unknown application 'oifs_43r3'. Known applications: 'hadam4', 'hadam4h', 'hadcm3s', 'hadsm4', 'oifs_43r3_ps' Sat 14 Jan 2023 12:38:35 PM EET | climateprediction.net | Your app_config.xml file refers to an unknown application 'oifs_43r3_bl'. Known applications: 'hadam4', 'hadam4h', 'hadcm3s', 'hadsm4', 'oifs_43r3_ps' and I did not get any WUs. I will turn on <file_xfer> and <sched_ops> to see what happens...in one hour |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
That's OK. We're only running oifs_43r3_ps tasks at the moment, so your machine won't have encountered either of the other two IFS variants yet. But you're prepared for when they do start arriving. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,647,036 RAC: 1,892 |
Thanks Richard, But there are plenty of oifs_43r3_ps 's and I did crunch them on the same machine |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
That's why 'oifs_43r3_ps' is listed as a "Known application". It's only oifs_43r3 and oifs_43r3_bl which are unknown (as yet). Your app_config will work as it is. <file_xfer> won't tell you anything useful until you get allocated work - I'd turn it off (it wastes a lot of space in the log). But <sched_op_debug> is useful, and relatively quiet. |
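The same comparison the client makes can be reproduced by hand. This Python sketch parses an app_config like the one posted above and lists the app names that are missing from the "Known applications" list in the event log message; `unknown_apps` is a hypothetical helper, and the known-application names are copied from that message.

```python
import xml.etree.ElementTree as ET

APP_CONFIG = """
<app_config>
  <project_max_concurrent>4</project_max_concurrent>
  <app><name>oifs_43r3</name><max_concurrent>3</max_concurrent></app>
  <app><name>oifs_43r3_ps</name><max_concurrent>3</max_concurrent></app>
  <app><name>oifs_43r3_bl</name><max_concurrent>3</max_concurrent></app>
</app_config>
"""

# Copied from the "Known applications" list in the event log message.
KNOWN_APPS = {"hadam4", "hadam4h", "hadcm3s", "hadsm4", "oifs_43r3_ps"}


def unknown_apps(config_xml: str, known: set) -> list:
    """Return app names in the config that the client has not seen yet."""
    root = ET.fromstring(config_xml.strip())
    names = [app.findtext("name") for app in root.findall("app")]
    return [name for name in names if name not in known]


print(unknown_apps(APP_CONFIG, KNOWN_APPS))
```

The names it reports are the ones the client warned about; the warning is informational, and the entry for oifs_43r3_ps still applies.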
©2024 cpdn.org