Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected!
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
"People playing around with app_config tend to know what they're doing, so this shouldn't impact casual crunchers who just install Boinc and add some projects because they sound interesting." I have bolded "tend" as I am sure encouraging people to play around with those files will lead to the odd one screwing things up. (Though to be fair, even when I started with CPDN and knew a lot less about Linux than I do now, I managed without doing that.) |
Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,322,658 RAC: 1,085 |
Richard Haselgrove wrote: "So the most productive single change for CPDN might be to flip the default setting for 'leave applications in memory' to ON, and run a script to change all current settings in the database similarly. It won't be a simple query, because these things are stored in XML blobs, but it could be done." I may have misunderstood what has been suggested here, but: Project admins should never manipulate the users' settings (computing prefs, community prefs, or project prefs). |
Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,649,638 RAC: 12,396 |
It's about both as I said in previous message. "That's exactly the problem, by changing one project's settings it might break the settings that CPDN needs for tasks to be successful. It seems to assume what's right for one project is right for all, which is a reasonable starting point but not generally the case." "This is not about the projects. Work might get done faster if you let things in memory, and certainly it helps if apps don't handle checkpoints well or don't have any." It's the time-stepping & complex nature of the CPDN weather models that they work that way. For every checkpoint, the model needs to dump its working arrays in 64-bit precision so it can do a bit-reproducible restart. That's a lot of I/O and a lot of data, but if you don't want to keep it in memory, it'll have to restart from checkpoint more often than we currently allow for. That means much more I/O, filespace, wearing out SSDs etc. I could indeed allow the model to checkpoint more to cope with being in & out of memory frequently, but you'd pay a price on your drives instead of RAM, and a much slower throughput because of the added I/O. We (CPDN) have done a lot of work on OpenIFS to make it work reasonably in a computing environment it was not designed for, including finding a balance in terms of computing resources for the volunteer. OpenIFS is very stable, it will restart fine if it has to. But you don't want it to do this if you want a decent throughput. In any case, I'm sure if the OS wanted the RAM for a bigger, non-niced process, the task can still be kicked out of RAM, even with that 'keep in memory' flag on. |
Send message Joined: 9 Dec 05 Posts: 116 Credit: 12,547,934 RAC: 2,738 |
And the user can always choose to use local preferences, which will override the web site preferences. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,647,864 RAC: 1,859 |
"Work might get done faster if you let things in memory, and certainly it helps if apps don't handle checkpoints well or don't have any." "It's the time-stepping & complex nature of the CPDN weather models that they work that way. For every checkpoint, the model needs to dump its working arrays in 64-bit precision so it can do a bit-reproducible restart. That's a lot of I/O and a lot of data, but if you don't want to keep it in memory, it'll have to restart from checkpoint more often than we currently allow for. That means much more I/O, filespace, wearing out SSDs etc. I could indeed allow the model to checkpoint more to cope with being in & out of memory frequently, but you'd pay a price on your drives instead of RAM, and a much slower throughput because of the added I/O." How often does the OpenIFS model need to checkpoint? Looking at my event log it seems every second? Is that normal?
Wed 04 Jan 2023 11:37:46 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:47 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:48 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:49 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:50 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:51 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:52 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:53 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:54 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed |
Send message Joined: 31 Aug 04 Posts: 37 Credit: 9,581,380 RAC: 3,853 |
I asked about the checkpoint "log spam" in the OpenIFS discussion thread on 23rd December but, given when I posted it, that message is now two pages back :-) For what it's worth, BOINC Manager on my machine doesn't seem to think the application checkpoints using the client checkpoint mechanism at all whilst it's running if you look at the task properties (there was never a last checkpoint time); that's consistent with my understanding of some of what Glenn has said about the matter, but it doesn't explain this -- is it something odd in the client libraries or something in the CPDN wrapper or main program? It would be nice if it didn't do this, and it would be interesting to know why it does do it! Cheers - Al. P.S. Please tell me it's not using something in the BOINC checkpoint mechanism as a 1 second timer :-) ... |
Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,649,638 RAC: 12,396 |
I have no idea about those messages but I notice they are referring to the results files not the model progress. OpenIFS will ignore any request to checkpoint from the boinc client. It knows best how to manage its checkpointing & generation of restart files. It's a relatively expensive I/O operation, not something we want to happen too often. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
"I notice they are referring to the results files not the model progress." That's a misinterpretation, I'm afraid. In BOINC-speak, 'result' is a synonym for (and early form of) 'task' - programmer-speak, rather than user-speak. The concept of 'checkpointing' a file doesn't really make sense. |
Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,649,638 RAC: 12,396 |
Thx for clearing that up. "I notice they are referring to the results files not the model progress." "That's a misinterpretation, I'm afraid. In BOINC-speak, 'result' is a synonym for (and early form of) 'task' - programmer-speak, rather than user-speak. The concept of 'checkpointing' a file doesn't really make sense." But as Alan in the original post said, it also doesn't make sense that the task is being checkpointed roughly every minute? What does boinc mean by 'checkpoint' in this context? Does it mean the client sent a 'you need to checkpoint' message to the task - regardless of whether the task did it or not? |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
I was trying to work that out. In BOINC's case, 'checkpointed' should mean "BOINC has successfully written the files needed for a restart at ... [time]". The message is written by https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L1551, and seems to be controlled by
old_time = atp->checkpoint_cpu_time; // the saved time of the last checkpoint
if (old_time != atp->checkpoint_cpu_time) { // if they are different
...so you shouldn't see two messages with the same time. Edit - OK, so 1 second apart is indeed 'different'. But I can never unravel David Anderson's spaghetti code much beyond that. |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
OK, mine's doing it too:
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0608_2007050100_123_976_12193252_0 checkpointed
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0447_1987050100_123_956_12173091_1 checkpointed
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0806_1998050100_123_967_12184450_1 checkpointed
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0656_2008050100_123_977_12194300_0 checkpointed
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0860_1987050100_123_956_12173504_1 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0608_2007050100_123_976_12193252_0 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0447_1987050100_123_956_12173091_1 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0806_1998050100_123_967_12184450_1 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0656_2008050100_123_977_12194300_0 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0860_1987050100_123_956_12173504_1 checkpointed
Possibly, the BOINC client is sending the science app a message "you can checkpoint now", and the app is replying "It's OK, I've done one now". I think the wisest thing is to turn off that log flag (it'll spam the system journal in no time), and stop worrying about it. |
Send message Joined: 31 Aug 04 Posts: 37 Credit: 9,581,380 RAC: 3,853 |
Richard, Thanks for having a look to see what's going on... "The message is written by https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L1551, and seems to be controlled by" But it isn't trying to checkpoint so unless something is writing a non-zero value to the task's checkpoint_cpu_time it should just do nothing (always zero) - or have I misread/misunderstood that section of code (quite likely; my opinion of David's code is much the same as yours!)? And from a later message... "I think the wisest thing is to turn off that log flag (it'll spam the system journal in no time), and stop worrying about it." That's what I did, but as soon as WCG's GPU application returns I either forego some performance analysis I'm doing that needs to know where it checkpoints (to make some sort of sense of a GPU activity trace) or I forego CPDN (as I currently have no machines with the capacity to consider setting up a second BOINC client with different log behaviour...) Now, my tiny contribution wouldn't be missed (no sarcasm intended), but if someone could find out how to stop that spam... Cheers - Al. |
Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,806,015 RAC: 71,139 |
"I was trying to work that out. In BOINC's case, 'checkpointed' should mean 'BOINC has successfully written the files needed for a restart at ... [time]'" I would hazard a guess that BOINC respects the initial "Request tasks to checkpoint at most every X seconds" set in the client but after that time has elapsed and because the checkpoint time is still showing zero it will then repeat the request every second as that code gets called every second (all being well). My next task isn't due to start until about 2am so have set the checkpoint time period to be 10 hours. If it isn't checkpointing when I wake up I'll know that bit has worked and reduce the figure to induce checkpointing. I thought the task/project could also specify a time period for checkpointing? I know I have seen it being set inside LHC vboxwrapper code. If the project isn't using them internally can it not set a massively high period for their tasks to follow? |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
BOINC doesn't control checkpoints, except in the sense of setting a minimum interval between checkpoints - default 60 seconds. And that's a global setting - it doesn't make any difference whether it's a 3-minute GPU task for WCG, or a 14-hour CPDN task. Looking at a CPDN task 'properties' in BOINC Manager, it always seems to say "CPU time since checkpoint ---". I think that means that CPDN - in this case, the CPDN wrapper app - is constantly writing "I've just checkpointed now" into the inter-process communications file: that might well be triggering the event log message. [I'll go downstairs and do some excavations in the filing system in a moment] If that turns out to be true, CPDN have chosen to do it wrongly. I'd suggest the possible options are:
1) Lie, and say it's never checkpointed - that would inhibit task switching, but would upset users who might like to know when would be a good time to shut down for the night.
2) Tell the truth, so the user knows what's going on, even if it doesn't explain why it isn't behaving the way he or she asked it to.
3) Try to fool the system, by making it report that "I have just checkpointed 10+ seconds into the future". Thus trying to invoke:
// Normally this is called every second.
(that possibility needs thorough checking) |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
OK, this is the excavation for CPDN:
<active_task>
  <project_master_url>https://climateprediction.net/</project_master_url>
  <result_name>oifs_43r3_ps_0525_2009050100_123_978_12195169_0</result_name>
  <checkpoint_cpu_time>25677.370000</checkpoint_cpu_time>
  <checkpoint_elapsed_time>25712.590836</checkpoint_elapsed_time>
  <fraction_done>0.509490</fraction_done>
  <peak_working_set_size>4621619200</peak_working_set_size>
  <peak_swap_size>5215875072</peak_swap_size>
  <peak_disk_usage>2189510262</peak_disk_usage>
</active_task>
I'll have to switch to another machine for a comparison. Back in a mo. Well, I didn't expect that.
<active_task>
  <project_master_url>http://numberfields.asu.edu/NumberFields/</project_master_url>
  <result_name>wu_sf7_DS-16x10_Grp400291of5000000_0</result_name>
  <checkpoint_cpu_time>2595.560000</checkpoint_cpu_time>
  <checkpoint_elapsed_time>2623.082207</checkpoint_elapsed_time>
  <fraction_done>0.583066</fraction_done>
  <peak_working_set_size>10067968</peak_working_set_size>
  <peak_swap_size>306024448</peak_swap_size>
  <peak_disk_usage>12271</peak_disk_usage>
</active_task>
Near enough the same. Yet NumberFields says: More excavation needed. Try these:
<active_task>
  <project_master_url>http://numberfields.asu.edu/NumberFields/</project_master_url>
  <result_name>wu_sf7_DS-16x10_Grp400291of5000000_0</result_name>
  <active_task_state>1</active_task_state>
  <app_version_num>400</app_version_num>
  <slot>0</slot>
  <checkpoint_cpu_time>2399.857000</checkpoint_cpu_time>
  <checkpoint_elapsed_time>2426.131861</checkpoint_elapsed_time>
  <checkpoint_fraction_done>0.571426</checkpoint_fraction_done>
  <checkpoint_fraction_done_elapsed_time>2426.131861</checkpoint_fraction_done_elapsed_time>
  <current_cpu_time>2416.206000</current_cpu_time>
  <once_ran_edf>0</once_ran_edf>
  <swap_size>306024448.000000</swap_size>
  <working_set_size>10067968.000000</working_set_size>
  <working_set_size_smoothed>10067968.000000</working_set_size_smoothed>
  <page_fault_rate>0.000000</page_fault_rate>
  <bytes_sent>0.000000</bytes_sent>
  <bytes_received>0.000000</bytes_received>
</active_task>
<active_task>
  <project_master_url>https://climateprediction.net/</project_master_url>
  <result_name>oifs_43r3_ps_0525_2009050100_123_978_12195169_0</result_name>
  <active_task_state>1</active_task_state>
  <app_version_num>105</app_version_num>
  <slot>2</slot>
  <checkpoint_cpu_time>27035.320000</checkpoint_cpu_time>
  <checkpoint_elapsed_time>27074.658515</checkpoint_elapsed_time>
  <checkpoint_fraction_done>0.535951</checkpoint_fraction_done>
  <checkpoint_fraction_done_elapsed_time>27074.658515</checkpoint_fraction_done_elapsed_time>
  <current_cpu_time>27035.320000</current_cpu_time>
  <once_ran_edf>0</once_ran_edf>
  <swap_size>4426641408.000000</swap_size>
  <working_set_size>3926339584.000000</working_set_size>
  <working_set_size_smoothed>3556255585.502848</working_set_size_smoothed>
  <page_fault_rate>0.000000</page_fault_rate>
  <bytes_sent>0.000000</bytes_sent>
  <bytes_received>0.000000</bytes_received>
</active_task>
Now we're getting somewhere. I see
<checkpoint_cpu_time>27035.320000</checkpoint_cpu_time>
<current_cpu_time>27035.320000</current_cpu_time>
Identical to 6 decimal places. I bet that's what's doing it. Those last two code comparisons come from the <active_task_set> in BOINC's client_state.xml |
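The comparison made by eye in the post above can be automated. What follows is a minimal sketch, not a BOINC tool: it assumes fragments shaped like the ones quoted from client_state.xml, trimmed to the three fields that matter and wrapped in an `<active_task_set>` element, and `cpu_time_since_checkpoint` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two <active_task> entries quoted above.
SAMPLE = """
<active_task_set>
  <active_task>
    <result_name>wu_sf7_DS-16x10_Grp400291of5000000_0</result_name>
    <checkpoint_cpu_time>2399.857000</checkpoint_cpu_time>
    <current_cpu_time>2416.206000</current_cpu_time>
  </active_task>
  <active_task>
    <result_name>oifs_43r3_ps_0525_2009050100_123_978_12195169_0</result_name>
    <checkpoint_cpu_time>27035.320000</checkpoint_cpu_time>
    <current_cpu_time>27035.320000</current_cpu_time>
  </active_task>
</active_task_set>
"""


def cpu_time_since_checkpoint(xml_text):
    """Map each result name to current_cpu_time - checkpoint_cpu_time (seconds)."""
    root = ET.fromstring(xml_text)
    gaps = {}
    for task in root.iter("active_task"):
        name = task.findtext("result_name")
        checkpoint = float(task.findtext("checkpoint_cpu_time"))
        current = float(task.findtext("current_cpu_time"))
        gaps[name] = current - checkpoint
    return gaps


if __name__ == "__main__":
    for name, gap in cpu_time_since_checkpoint(SAMPLE).items():
        # A gap of exactly zero means the task reports a checkpoint "right now".
        flag = "  <-- identical" if gap == 0.0 else ""
        print(f"{name}: {gap:.3f} s since reported checkpoint{flag}")
```

On the quoted data the NumberFields task shows a gap of about 16 seconds, while the OpenIFS task shows a gap of exactly zero, which is the pattern the post singles out.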
Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,806,015 RAC: 71,139 |
"BOINC doesn't control checkpoints, except in the sense of setting a minimum interval between checkpoints - default 60 seconds. And that's a global setting - it doesn't make any difference whether it's a 3-minute GPU task for WCG, or a 14-hour CPDN task." But it does observe what it is supposed to do with them. "boinc_time_to_checkpoint returns true only when sufficient time has passed since the last checkpoint. This minimum interval is the maximum of: A user preference (e.g. laptop users might want to checkpoint infrequently). An optional application-supplied value, specified by calling boinc_set_min_checkpoint_period(int nsecs);" So if the wrapper/application calls boinc_set_min_checkpoint_period() with a number > the longest amount of time it would expect to take then the BOINC code shouldn't try to request a checkpoint. PS. Apart from the ignorant, who leaves the default as 60 seconds! |
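The rule quoted above (the minimum interval is the larger of the user preference and the application-supplied period) can be modelled in a few lines. This is an illustrative sketch only: `CheckpointGate` and its methods are hypothetical names, not the BOINC API, and the class models nothing but the timing rule.

```python
class CheckpointGate:
    """Toy model of the 'is it time to checkpoint?' rule described above."""

    def __init__(self, user_pref_secs, app_min_secs=0):
        # The effective minimum interval is the maximum of the two settings.
        self.min_interval = max(user_pref_secs, app_min_secs)
        self.last_checkpoint = 0.0  # time of the last completed checkpoint

    def time_to_checkpoint(self, now):
        """True only when at least min_interval seconds have passed."""
        return now - self.last_checkpoint >= self.min_interval

    def checkpoint_completed(self, now):
        """Record that the application finished writing a checkpoint."""
        self.last_checkpoint = now


if __name__ == "__main__":
    default = CheckpointGate(user_pref_secs=60)
    long_period = CheckpointGate(user_pref_secs=60, app_min_secs=10 * 3600)
    for t in (30, 60, 3600, 36000):
        print(t, default.time_to_checkpoint(t), long_period.time_to_checkpoint(t))
```

Under this model, an application that supplies a period longer than its own run time would never be told it is time to checkpoint, which is the behaviour the post is hoping for. Whether the real client behaves this way is exactly what the thread is trying to establish.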
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
"... the BOINC code shouldn't try to request a checkpoint." BOINC can't request a checkpoint. The only options are 'allow' or 'deny'. |
Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,649,638 RAC: 12,396 |
I confess I am a little lost now in the discussion (which is also wildly off the topic of the thread but no matter..). That code blob points to the function 'get_msgs' and I think that's all the client is doing. It's the terminology that's confusing us here I think. To me a 'checkpoint' is when the model writes its internal state to disk so it can restart after a client stop. But! To a boinc client, 'checkpoint' means what the fn 'get_msgs' does. The code at the end of the loop does: atp->get_trickle_up_msg(); atp->get_graphics_msg(); Ignoring all the previous guff about checking task state/cpu-time etc, that's essentially all a checkpoint is to the client, get any trickle_up message (same as 'trickles' from CPDN?), and any messages from the graphics. Nothing to do with OpenIFS's checkpointing mechanism at all. Richard, you lost me at this bit: Richard wrote: "Now we're getting somewhere. I see" 'I bet that's what's doing it' - what is 'it'? The client/wrapper doing what? There's a comment in that get_msgs function that it's usually called every sec, so there's the time difference you see in the logs. I have checked the OpenIFS wrapper code that talks to the client. There are no explicit calls to any boinc functions with 'checkpoint' in their name. As far as I can see, the wrapper can't be sending anything to the client about having completed a checkpoint. And checkpoint probably means different things anyway to model & boinc client. It would be straightforward to send a checkpoint message though. We know how often the model will checkpoint and the wrapper monitors the model's step count, so we can use that (strictly speaking we should check for the presence of the files too but for now...). Am I anywhere close to understanding this? (then again, I leave the boinc interface stuff to Andy who I should probably go talk to..) |
Send message Joined: 1 Jan 07 Posts: 1061 Credit: 36,748,059 RAC: 5,647 |
"Richard, you lost me at this bit:" The original problem was that the Event Log was reporting that the model was checkpointing every second. It should not be saying that. It should be saying that the model has checkpointed if and only if a real life checkpoint - all those restart files - has been completed in the last second. My interpretation is now that the model (or, probably more accurately, the wrapper) reports the current state of play to the BOINC client every second - timings, progress made, changes in status, anything like that. The client will analyse that, store what needs storing, and process all those changes in state. The client then has a snapshot of the overall status, and can respond when the Manager asks - again every second - for a summary fit for display to the user. The bits of XML I posted are a small fraction of all that. For the current bug-hunt, I think the critical data points are "checkpoint_cpu_time" and "current_cpu_time" - both are the number of seconds of CPU work done since the task started. If they are identical, I'm suggesting that the client will notice that fact, and interpret it as "a checkpoint has happened in the last second", and report that fact in the Event Log, and pass it to the Manager for display to the user. The model/wrapper should only report a change in checkpoint_cpu_time once per restart file dump. |
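The interpretation in the post above can be illustrated with a small simulation. This is a hedged model, not BOINC's actual code: `count_checkpoint_messages` is a hypothetical helper, and it assumes the client writes one "[checkpoint]" line whenever the checkpoint_cpu_time it receives differs from the value seen at the previous one-second poll.

```python
def count_checkpoint_messages(reported_checkpoint_times):
    """Count the log lines a change-detecting client would write.

    reported_checkpoint_times: one checkpoint_cpu_time value per
    one-second poll, as reported by the task or its wrapper.
    """
    last_seen = 0.0  # assumed starting value: no checkpoint yet
    messages = 0
    for checkpoint_cpu_time in reported_checkpoint_times:
        if checkpoint_cpu_time != last_seen:
            messages += 1  # client would log "... checkpointed"
            last_seen = checkpoint_cpu_time
    return messages


# Wrapper that echoes the current CPU time as the checkpoint time every second.
echoing = [float(t) for t in range(1, 11)]

# Wrapper that only changes the value when a real checkpoint completes (at t=5).
honest = [0.0] * 4 + [5.0] * 6

if __name__ == "__main__":
    print("echoing wrapper:", count_checkpoint_messages(echoing), "messages in 10 polls")
    print("honest wrapper: ", count_checkpoint_messages(honest), "message in 10 polls")
```

Under these assumptions the echoing wrapper produces a message on every poll, matching the once-per-second log entries, while a wrapper that updates the value only at real restart-file dumps produces one message per dump.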
Send message Joined: 29 Nov 17 Posts: 82 Credit: 15,806,015 RAC: 71,139 |
"If it isn't checkpointing when I wake up I'll know that bit has worked and reduce the figure to induce checkpointing." It was checkpointing. So yes, BOINC needs fixing. |
©2024 cpdn.org