Message boards : Number crunching : HadCM3s post-completion artifacts
Author | Message |
---|---|
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
I've recently reconnected to this project. The last two models that I ran were HadCM3s models. After completion and successful reporting, each model left behind a folder structure of 850 MB in the BOINC data directory. This is quickly consuming the allowed BOINC disk space, and other projects complain that there's not enough free disk space to download more work. Can the models or BOINC be configured to clean up this disk space after successful reporting? Or do I have to clear out these folders manually? The two folders are ./boinc/projects/climateprediction.net/hadcm3s_a17h_1996_2_009904602 and ./boinc/projects/climateprediction.net/hadcm3s_9srm_1986_2_009893663 |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
This is a known problem. Some of the hadCM3s don't clean up after themselves even when they finish normally. There is nothing to do about it except check the ProgramData folder occasionally for leftovers and clean them out. |
Joined: 15 May 09 Posts: 4538 Credit: 19,008,987 RAC: 21,524 |
Certainly this is a known problem on Linux. Several people have reported that the tasks are deleting the folders after reporting on Windows boxes. On a recent clean up I cleared almost 20GB of data from completed tasks of this type from my laptop. Only got one of the short tasks running on it now and the longer tasks clean up after themselves except with some types of crashes. |
Joined: 28 May 14 Posts: 34 Credit: 705,936 RAC: 0 |
"This is a known problem. Some of the hadCM3s don't clean up after themselves even when they finish normally. There is nothing to do about it except check the ProgramData folder occasionally for left-overs and clean them out." I hadn't heard about this until this thread. Which file types are OK to delete? I'm seeing a couple of "Compressed (zipped) Folder" and "Application" files for each task that has already been completed. Can I delete both of those for each completed task? Or are they kind of like cookies that I may need later if I run another task from that batch? |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There are a lot of files that are reusable. The folders (with files), that can be deleted, are the folders for the models. These will have a 4 letter code name, similar to those that are currently running. They're mostly the hadcm3s models, and any of any type that crash. Note down the 4 letter code of all running models, and delete those that aren't among them. |
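The check described in the post above (note the four-letter codes of the running models, then look for folders that don't match) could be sketched roughly as follows. This is only an illustration: the folder-name pattern is an assumption inferred from the two names quoted earlier in this thread, `find_leftover_folders` is a hypothetical helper, and it only lists candidates without deleting anything.

```python
import re
from pathlib import Path

# Assumed naming pattern, inferred from the two folder names quoted in this
# thread: <model>_<4-char code>_<year>_<n>_<id>,
# e.g. hadcm3s_a17h_1996_2_009904602. Other task types may differ.
TASK_DIR = re.compile(r"^[a-z0-9]+_([a-z0-9]{4})_\d{4}_\d+_\d+$")


def find_leftover_folders(project_dir, running_codes):
    """Return names of task folders whose 4-character code is not running.

    Nothing is deleted; the result is a list to review by hand.
    """
    running = {code.lower() for code in running_codes}
    leftovers = []
    for entry in sorted(Path(project_dir).iterdir()):
        match = TASK_DIR.match(entry.name.lower())
        # Only directories that look like task folders are considered, so
        # reusable program/data files in the same directory are skipped.
        if entry.is_dir() and match and match.group(1) not in running:
            leftovers.append(entry.name)
    return leftovers
```

For example, `find_leftover_folders("./boinc/projects/climateprediction.net", ["9srm"])` would list every matching folder except those belonging to the running model 9srm. Reviewing the list before removing anything by hand seems the safer choice, since the pattern is a guess.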
Joined: 28 May 14 Posts: 34 Credit: 705,936 RAC: 0 |
"There are a lot of files that are reusable." Sorry, maybe I wasn't clear. The file types I named all have the model codes like you posted. There are usually a couple of files/folders associated with each model: a "Compressed (zipped) folder", an "Application" and sometimes an "XML file". For instance, for a single completed task I have: Name: hadam3p_pnw_data_7.27_windows_intelx86, Type: Compressed (zipped) folder; Name: hadam3p_pnw_se_7.27_windows_intelx86, Type: Compressed (zipped) folder; Name: hadam3p_pnw_um_7.27_windows_intelx86, Type: Application. Notice that one is pnw_DATA_7.27, one is pnw_SE_7.27, and the last is pnw_UM_7.27. Can all of those be deleted as long as the task has been completed? |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Those files contain programs and/or data. They can be deleted if you want to. They'll be downloaded next time they're needed. |
Joined: 21 Jul 15 Posts: 8 Credit: 575,541 RAC: 0 |
Sorry if this is a noob question, but I started computing on this project about two months ago and have been noticing an issue with memory usage. If I don't reboot after climateprediction models have completed, the processes appear not to close out gracefully and stay in memory. Over a period of days I end up with low-memory problems that force my system to reboot. I have 24GB of memory on this system and have found that it is all used up, with multiple hadXXXX processes still in place even after reporting a successful completion. Is the answer to simply reboot every few days? I tried that once by suspending the climateprediction project, waiting for the threads to stop and then rebooting, but upon restarting BOINC I received processing errors for the climateprediction tasks that were underway and suspended prior to the reboot. I don't want to lose work unnecessarily. Is there a way to gracefully pause computation for a reboot? Or even better, is there something I can do differently to prevent memory issues to begin with? Thanks in advance... |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Welcome to the project and boards, SAK. New machine with a pair of heavy-duty GPUs and we can't look at your settings. I have no experience with GPUs and can't offer specific advice but suggest a thorough review of your settings. Reboots "cure" many problems in Windows operations but not everything. (You have 8.1, which is slightly better than 8.) That said, I've never seen the symptoms you describe. Typically CPDN leaves piles of detritus when tasks fail (though not in memory), but it usually does a good job cleaning up after itself when tasks complete cleanly. The boards are full of posts on the proper way to shut down CPDN. The reason is that it takes time to save the many open files required to run these complex mainframe models. (Yes, they are mainframe models shoe-horned into PC capabilities, no small accomplishment.) So, manually suspend the tasks before exiting CPDN, else the OS might, in a burst of impatience, shut down before all active files are saved. That results in a mismatch on startup -- a fatal error. This isn't your problem because tasks apparently resume from where they were. Checking graphics, are all timesteps completed? Does CPU time continue to increase when tasks are "completed"? How many days? (It takes a lot of days to complete enough tasks to use up 24 Gig of RAM.) As you can tell, I'm grasping at straws here, looking for symptoms to explain the end you describe. [Edited to remove a redundancy.] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
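The "suspend first, then exit" advice above can also be scripted with BOINC's command-line tool, `boinccmd`. The sketch below is a rough illustration under several assumptions: `boinccmd` is on the PATH and able to authenticate to the local client, the project URL is passed in exactly as BOINC Manager shows it for this project, and the 60-second pause is an arbitrary guess at how long the models need to save their files. The helper names are hypothetical.

```python
import subprocess
import time


def shutdown_commands(project_url, boinccmd="boinccmd"):
    """Build the two boinccmd invocations: suspend the project, then quit."""
    return [
        [boinccmd, "--project", project_url, "suspend"],
        [boinccmd, "--quit"],
    ]


def suspend_then_quit(project_url, settle_seconds=60, boinccmd="boinccmd"):
    """Suspend the project, wait, then ask the BOINC client to exit."""
    suspend, quit_client = shutdown_commands(project_url, boinccmd)
    subprocess.run(suspend, check=True)
    # Assumed pause: gives the suspended models time to finish writing
    # their open files before the client is told to exit.
    time.sleep(settle_seconds)
    subprocess.run(quit_client, check=True)
```

After the reboot, the project would need to be resumed again, either from BOINC Manager or with the matching `resume` operation of `boinccmd --project`.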
Joined: 21 Jul 15 Posts: 8 Credit: 575,541 RAC: 0 |
Thank you astroWX, I will try to give you a bit more information. I am pretty confident this is a BOINC / ClimatePrediction.net issue because if I run just SETI@Home I don't have the memory creep issue. As far as my computing goes, I am only running SETI@Home and ClimatePrediction.net. I believe I have my settings configured to give ClimatePrediction.net priority, and that appears to be the case when I look at the Tasks tab in the BOINC Manager. The only thing using the GPUs is SETI@Home, so basically my system is running SETI@Home on the GPUs and ClimatePrediction.net on all the CPU cores unless there is no work to do, and then it falls back to SETI@Home on the CPU. Below are my settings: [screenshot] And my BOINC-related processes currently running in memory: [screenshot] And lastly a grab of my current processing tasks in BOINC Manager: [screenshot] Not sure what else I can give you at this point to help. It looks like the currently processing round of ClimatePrediction.net units will complete tonight, so I will report back tomorrow morning with where my memory stands. Right now my memory snapshot looks like this: [screenshot] Thanks in advance for any help you can provide, and if you have any setting suggestions please let me know. |
Joined: 21 Jul 15 Posts: 8 Credit: 575,541 RAC: 0 |
Forgot to answer one of your questions... astroWX said: Checking graphics, are all timesteps completed? Does CPU time continue to increase when tasks are "completed"? How many days? (It takes a lot of days to complete enough tasks to use-up 24 Gig of RAM.) Afraid I am not sure what you mean by "checking graphics". Completely new to this, sorry. I find the memory use grows to consume my 24GB within 4-5 days typically. After a reboot and restarting BOINC, my memory use is at a baseline 4GB or so and then increases in jumps that seem to correspond to new work units starting until I have to eventually reboot again. Thanks again for your help. I notice in your signature you say "Greetings from Coastal Washington State"... I am in South Puget Sound. |
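For a sense of scale, the figures in the post above (a baseline of about 4GB after a reboot, 24GB consumed within 4-5 days) imply a growth rate that can be worked out directly. The snippet is just that arithmetic; it assumes roughly linear growth, whereas the poster describes jumps when new work units start.

```python
def growth_rate_gb_per_day(baseline_gb, ceiling_gb, days):
    """Average memory growth per day between a baseline and a ceiling."""
    return (ceiling_gb - baseline_gb) / days


# Figures taken from the post: ~4 GB baseline, 24 GB installed, 4-5 days.
slow = growth_rate_gb_per_day(4, 24, 5)  # if it takes 5 days
fast = growth_rate_gb_per_day(4, 24, 4)  # if it takes 4 days
print(f"roughly {slow:.0f} to {fast:.0f} GB of growth per day")
```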
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Thanks for the many visuals. Good job. I checked some of your failed tasks (something I should have done yesterday but was short on time) and found two possible problems, both illustrated in this sequence: Suspended CPDN Monitor - Suspend request from BOINC... Most of the failures have pages of "Suspended CPDN Monitor - Suspend request..."; that usually suggests settings don't include the option to leave tasks in memory when suspended. We strongly recommend that tasks be left in memory when suspended. The million lines of mainframe UK Met Office Fortran code were not written with BOINC-scheduler interruptions in mind. We can get away with it much of the time but . . . In the third snapshot you posted, "disk and memory usage," note the item at the bottom of the page: "Leave applications in memory when suspended." We recommend that it be checked. "Can't acquire lockfile" typically means something external to BOINC grabbed the file and locked it -- probably an anti-virus program. We recommend excluding the BOINC Program & Data folders from security scans. (I'm not aware of any negative issues with doing this. On the other hand, CPDN files have, over time, given a few false positives and that's troublesome -- I had to report one to "Avast!".) Your anti-virus program should have a way to exclude files, folders, etc. Re. graphics, in your other message: In BOINC Manager, click the 'Tasks' tab. You'll see a list of tasks in your queue, some running. Click a running task and the "Show graphics" button on the left will activate. Click that and, on the left, is an alternating display of "how goes it" information. There are also several options to view real-time (expensive to run) graphics of what your model is doing. (It has an option to make the display full-screen.) None of this is guaranteed to eliminate the memory creep problem. Many of your failed tasks are the "Short" type. They have problems and some folks stopped running them, me included. So, it might be back to the drawing board and I'll have to call for help. |
Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653 |
"Many of your failed tasks are the "Short" type. They have problems and some folks stopped running them, me included. So, it might be back to the drawing board and I'll have to call for help." I am sure we all cringe at the thought of the shorts. But this latest batch is looking good. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1371296 |
Joined: 21 Jul 15 Posts: 8 Credit: 575,541 RAC: 0 |
Thanks again astroWX, your help is much appreciated. I have enabled the setting to leave the application in memory when suspended. I forgot I had disabled that when I was trying to troubleshoot my memory issues. Regarding the antivirus, I had already disabled scanning of the \ProgramData\BOINC folder but just now excluded the \Program Files\BOINC folder as well. Regarding the graphics, I had not noticed that functionality because whenever I select one of the seven short units currently running, the Show Graphics option in BOINC is grayed out. I am able to use it on the one Australia New Zealand 6.10 unit that is also running, very cool! Does the fact that the Show Graphics tab is not available for the short units indicate a problem and if so, should I abort them? Thanks again... |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
The graphics button is grayed out because the hadcm3s tasks don't have any graphics. Fewer and fewer WU types have them now. |
Joined: 21 Jul 15 Posts: 8 Credit: 575,541 RAC: 0 |
Thanks for the info on the graphics, Jim. I won't worry about units that don't have the Show Graphics option. Going back to my original problem, the last round of work units finished processing about nine hours ago and I am about to hit the memory wall again. First, the BOINC Manager listing the currently processing units: [screenshot] And my current memory status: [screenshot] And the processes still in memory, including completed CPDN work units: [screenshot] Any thoughts appreciated... am I really the only person experiencing this? |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
"am I really the only person experiencing this?" You're the first person that I can recall ever posting about memory retention problems. Everyone else just has problems with data files left over. |
Joined: 21 Jul 15 Posts: 8 Credit: 575,541 RAC: 0 |
I guess I should feel special... Hopefully the graphics above prove I am not making this stuff up! If there is anything else I can do to help troubleshoot please let me know. Thanks again for the help and advice. |
Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192 |
Note that there is nothing obviously wrong with the set of processes reported: each PNW and ANZ task reported in BOINC Manager will have three processes reported by Windows - an AM3/UM, an RM3/UM and an AM3. There may well be some increase in memory consumed as tasks progress, so perhaps there are just too many tasks for that machine's memory. |
Joined: 21 Jul 15 Posts: 8 Credit: 575,541 RAC: 0 |
Thanks Ian. The thing that doesn't make sense to me is that if I am using 23GB of memory and suspend processing, reboot and then restart the eight work units, why does my memory usage fall back to about 4GB overall on the system instead of picking back up to the 23GB previously being used? If it only needs a fraction of the 23GB when restarted, why is it using so much memory over time without releasing unused memory? Is the fix for me to somehow limit the number of CPUs that CPDN can use so that it throttles back the number of concurrent work units? I still find it strange that I seem to be the only person experiencing this. Don't other folks processing on multi-core processors run all cores simultaneously? Thanks again for the input, appreciate it. |
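On the question of throttling the number of concurrent work units: one option BOINC offers is a per-project `app_config.xml` containing a `project_max_concurrent` limit. The sketch below only generates the text of such a file. The limit of 4 used in the example is a hypothetical value, not a recommendation from this thread, and capping concurrency would only reduce the number of processes; it would not explain or fix memory that is never released.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent):
    """Return app_config.xml text capping this project's concurrent tasks."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    root = ET.Element("app_config")
    # project_max_concurrent limits how many tasks of the project run at once.
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# Hypothetical example: allow at most 4 tasks of the project at a time.
print(build_app_config(4))
```

The resulting text would be saved as `app_config.xml` in the project's folder under the BOINC data directory (the `projects/climateprediction.net` folder mentioned at the start of this thread), after which the client has to re-read its config files or be restarted. The simpler alternative is the "use at most N% of the CPUs" computing preference, which applies to all projects rather than just this one.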