climateprediction.net (CPDN) home page
Thread 'HadCM3s post-completion artifacts'

DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 52382 - Posted: 6 Aug 2015, 4:14:12 UTC

I've recently reconnected to this project. The last 2 models that I've run were HadCM3s models. After completion and successful reporting, each model left behind a folder structure (in the BOINC data directory) of 850MB. This is quickly consuming the allowed BOINC disk space, and other projects complain that there's not enough free disk space to download more work.

Can the models or BOINC be configured to clean up this disk space after successful reporting? Or do I have to manually clear out these folders?

./boinc/projects/climateprediction.net/hadcm3s_a17h_1996_2_009904602

./boinc/projects/climateprediction.net/hadcm3s_9srm_1986_2_009893663

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 52383 - Posted: 6 Aug 2015, 5:46:20 UTC

This is a known problem. Some of the hadCM3s don't clean up after themselves even when they finish normally. There is nothing to do about it except check the ProgramData folder occasionally for left-overs and clean them out.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 52384 - Posted: 6 Aug 2015, 6:46:34 UTC

Certainly this is a known problem on Linux. Several people have reported that the tasks are deleting the folders after reporting on Windows boxes. On a recent clean-up I cleared almost 20GB of data from completed tasks of this type from my laptop. I've only got one of the short tasks running on it now, and the longer tasks clean up after themselves except with some types of crashes.

Andrew Sanchez
Joined: 28 May 14
Posts: 34
Credit: 705,936
RAC: 0
Message 52400 - Posted: 10 Aug 2015, 0:41:55 UTC - in response to Message 52383.  

This is a known problem. Some of the hadCM3s don't clean up after themselves even when they finish normally. There is nothing to do about it except check the ProgramData folder occasionally for left-overs and clean them out.


I hadn't heard about this until this thread. Which file types are OK to delete?
I'm seeing a couple of "Compressed (zipped) Folder" and "Application" files for each task that has already been completed. Can I delete both of those for each completed task? Or are they kinda like cookies that I may need later if I run another task from that batch?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52401 - Posted: 10 Aug 2015, 1:15:44 UTC - in response to Message 52400.  

There are a lot of files that are reusable.

The folders (and the files in them) that can be deleted are the folders for the models. These will have a 4-letter code in the name, similar to those that are currently running.
They're mostly left by the hadcm3s models, and by models of any type that crash.

Note down the 4 letter code of all running models, and delete those that aren't among them.
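
A rough Python sketch of that check, for anyone who would rather not compare folder names by eye. It is not an official CPDN or BOINC tool. It assumes model folders sit directly inside the project folder, that their names start with "had" and contain the 4-letter code as one of the underscore-separated fields (as in hadcm3s_a17h_1996_2_009904602), and that you type in the codes of the running models yourself. The function names are made up for illustration. It only lists candidates with their sizes and deletes nothing.

```python
from pathlib import Path


def folder_size(path):
    """Total size in bytes of all regular files under path."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def leftover_model_folders(project_dir, running_codes, prefix="had"):
    """List (folder name, size in bytes) for model folders not tied to a running model.

    running_codes: the 4-letter codes noted down from the running tasks.
    prefix: assumed start of every model folder name (an assumption, adjust as needed).
    """
    running = {code.lower() for code in running_codes}
    leftovers = []
    for entry in sorted(Path(project_dir).iterdir()):
        # Skip plain files (zips, executables, XML); only folders are candidates.
        if not entry.is_dir() or not entry.name.lower().startswith(prefix):
            continue
        # Folder names look like hadcm3s_a17h_1996_2_009904602: split on "_" and
        # treat the folder as in use if any field matches a running code.
        fields = set(entry.name.lower().split("_"))
        if running.isdisjoint(fields):
            leftovers.append((entry.name, folder_size(entry)))
    return leftovers
```

Point it at the project folder (for example ./boinc/projects/climateprediction.net) with the set of running codes, and review the list by hand before deleting anything.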


Andrew Sanchez
Joined: 28 May 14
Posts: 34
Credit: 705,936
RAC: 0
Message 52409 - Posted: 11 Aug 2015, 1:40:55 UTC - in response to Message 52401.  

There are a lot of files that are reusable.

The folders (and the files in them) that can be deleted are the folders for the models. These will have a 4-letter code in the name, similar to those that are currently running.
They're mostly left by the hadcm3s models, and by models of any type that crash.

Note down the 4 letter code of all running models, and delete those that aren't among them.




Sorry, maybe I wasn't clear. The file types I named all have model codes like the ones you posted. There are usually a few files/folders associated with each model: a "Compressed (zipped) folder", an "Application" and sometimes an "XML file".
For instance, for a single completed task I have:
Name: hadam3p_pnw_data_7.27_windows_intelx86 Type: Compressed (zipped) folder
Name: hadam3p_pnw_se_7.27_windows_intelx86 Type: Compressed (zipped) folder
Name: hadam3p_pnw_um_7.27_windows_intelx86 Type: Application

Notice that one is a pnw_DATA_7.27, one is pnw_SE_7.27, and the last is pnw_UM_7.27.
Can all of those be deleted as long as the task has been completed?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52412 - Posted: 11 Aug 2015, 4:27:22 UTC - in response to Message 52409.  

Those files contain programs and/or data.
They can be deleted if you want to. They'll be downloaded next time they're needed.

SAK
Joined: 21 Jul 15
Posts: 8
Credit: 575,541
RAC: 0
Message 52442 - Posted: 20 Aug 2015, 17:22:24 UTC

Sorry if this is a noob question, but I started computing on this project about two months ago and have been noticing an issue with memory usage. If I don't reboot after climateprediction models have completed, the processes appear not to close out gracefully and stay in memory; over a period of days I end up with low-memory problems that force my system to reboot. I have 24GB of memory on this system and have found that it is all used up, with multiple hadXXXX processes still in place even after reporting a successful completion.

Is the answer simply to reboot every few days? I tried that once by suspending the climateprediction project, waiting for the threads to stop and then rebooting, but upon restarting BOINC I received processing errors for the climateprediction tasks that were underway and suspended prior to the reboot. I don't want to lose work unnecessarily. Is there a way to gracefully pause computation for a reboot? Or even better, is there something I can do differently to prevent memory issues to begin with?

Thanks in advance...

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 52443 - Posted: 21 Aug 2015, 1:20:03 UTC - in response to Message 52442.  
Last modified: 21 Aug 2015, 1:25:18 UTC

Welcome to the project and boards, SAK.

You have a new machine with a pair of heavy-duty GPUs, and we can't look at your settings.

I have no experience with GPUs and can't offer specific advice but suggest a thorough review of your settings.

Reboots "cure" many problems in Windows operations but not everything. (You have 8.1, which is slightly better than 8.) That said, I've never seen the symptoms you describe. Typical is for CPDN to leave piles of detritus when tasks fail (though not in memory) but it usually does a good job cleaning up after itself when tasks complete cleanly.

The boards are full of advice on the proper way to shut down CPDN. The reason is the time needed to save the many open files required to run these complex mainframe models. (Yes, they are mainframe models shoe-horned into PC capabilities, no small accomplishment.) So, manually suspend the tasks before exiting CPDN; otherwise the OS might, in a burst of impatience, shut down before all active files are saved. That results in a mismatch on startup -- a fatal error. This isn't your problem because tasks apparently resume from where they were.

Checking graphics, are all timesteps completed? Does CPU time continue to increase when tasks are "completed"? How many days? (It takes a lot of days to complete enough tasks to use-up 24 Gig of RAM.)

As you can tell, I'm grasping at straws here looking for symptoms to explain the end you describe.

[Edited to remove a redundancy.]
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

SAK
Joined: 21 Jul 15
Posts: 8
Credit: 575,541
RAC: 0
Message 52449 - Posted: 21 Aug 2015, 6:47:41 UTC - in response to Message 52443.  

Thank you astroWX, I will try to give you a bit more information. I am pretty confident this is a BOINC / ClimatePrediction.net issue because if I run just SETI@Home I don't have the memory creep issue. As far as my computing goes, I am only running SETI@Home and ClimatePrediction.net. I believe I have my settings configured to give ClimatePrediction.net priority, and that appears to be the case when I look at the Tasks tab in the BOINC Manager. The only thing using the GPUs is SETI@Home, so basically my system is running SETI@Home on the GPUs and ClimatePrediction.net on all the CPU cores unless there is no work to do, and then it falls back to SETI@Home on the CPU. Below are my settings:

And my BOINC related processes currently running in memory:


And lastly a grab of my current processing tasks in BOINC Manager:


Not sure what else I can give you at this point to help. It looks like I will have the currently processing round of ClimatePrediction.net units complete tonight so I will report back tomorrow morning with where my memory stands. Right now my memory snapshot looks like this:


Thanks in advance for any help you can provide, and if you have any setting suggestions please let me know.

SAK
Joined: 21 Jul 15
Posts: 8
Credit: 575,541
RAC: 0
Message 52450 - Posted: 21 Aug 2015, 7:03:42 UTC - in response to Message 52443.  

Forgot to answer one of your questions...

astroWX said:
Checking graphics, are all timesteps completed? Does CPU time continue to increase when tasks are "completed"? How many days? (It takes a lot of days to complete enough tasks to use-up 24 Gig of RAM.)


Afraid I am not sure what you mean by "checking graphics". Completely new to this, sorry. I find the memory use grows to consume my 24GB within 4-5 days typically. After a reboot and restarting BOINC, my memory use is at a baseline 4GB or so and then increases in jumps that seem to correspond to new work units starting until I have to eventually reboot again.
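
For scale, here is a back-of-the-envelope calculation using only the figures above (roughly 4GB after a reboot, 24GB installed, full within 4-5 days). It is a sketch with a made-up helper name, and it assumes the growth is spread evenly over the period, which the jumps described above suggest is only a rough approximation.

```python
def growth_per_day(baseline_gb, limit_gb, days):
    """Average daily memory growth needed to go from baseline to limit in `days`."""
    return (limit_gb - baseline_gb) / days


# Figures taken from the post: ~4GB baseline, 24GB total, exhausted in 4-5 days.
for days in (4, 5):
    print(f"{days} days: about {growth_per_day(4, 24, days):.1f} GB/day")
```

That works out to somewhere between 4 and 5 GB of extra memory held per day.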

Thanks again for your help. I notice in your signature you say "Greetings from Coastal Washington State"... I am in South Puget Sound.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 52451 - Posted: 21 Aug 2015, 23:54:07 UTC - in response to Message 52449.  

Thanks for the many visuals. Good job.

I checked some of your failed tasks (something I should have done yesterday but was short on time) and found two possible problems, both illustrated in this sequence:
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
08:17:37 (18936): Can't acquire lockfile (32) - waiting 35s
08:18:12 (18936): Can't acquire lockfile (32) - exiting
08:18:12 (18936): Error: The process cannot access the file because it is being used by another process.

(0x20)
09:25:50 (24192): Can't acquire lockfile (32) - waiting 35s
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...

Most of the failures have pages of "Suspended CPDN Monitor - Suspend request..."; that usually suggests the option to leave tasks in memory when suspended isn't enabled. We strongly recommend that tasks be left in memory when suspended. The million lines of UK Met Office mainframe Fortran code were not written with BOINC scheduler interruptions in mind. We can get away with it much of the time but . . .

In the third snapshot you posted, "disk and memory usage," note the item at bottom of the page: "Leave applications in memory when suspended." We recommend that it be checked-on.
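
If you'd rather confirm the setting outside BOINC Manager, here is a small Python sketch. It rests on two assumptions that you should verify on your own installation: that locally set preferences are saved as XML in a file named global_prefs_override.xml in the BOINC data directory, and that this option appears there as a leave_apps_in_memory element holding 0 or 1. The function name is invented for illustration.

```python
import xml.etree.ElementTree as ET


def leaves_apps_in_memory(prefs_xml):
    """Return True or False if the setting is present, or None if it is missing.

    prefs_xml: the text of a preferences file (assumed element name:
    leave_apps_in_memory, with 0 meaning off and non-zero meaning on).
    """
    root = ET.fromstring(prefs_xml)
    value = root.findtext(".//leave_apps_in_memory")
    if value is None or not value.strip():
        return None
    return float(value.strip()) != 0
```

Read the file's contents into a string and pass it in; a result of None means the element wasn't found, so the setting would have to be checked in BOINC Manager instead.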


"Can't acquire lockfile" typically means something external to boinc grabbed the file and locked it -- probably an anti-virus program. We recommend excluding boinc Program & Data folders from security scans. (I'm not aware of any negative issues with doing this. On the other hand, CPDN files have, over time, given a few false-positives and that's troublesome -- I had to report one to "Avast!".)

Your anti-virus program should have a way to exclude files, folders, etc.
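
To see how often each symptom turns up in a task's output, a small Python sketch like this can tally them. It matches only the literal strings visible in the excerpt above ("Suspend request from BOINC" and "Can't acquire lockfile"); the function name and the result keys are invented, and other task types may word their messages differently.

```python
def summarize_task_log(text):
    """Count suspend requests and lockfile problems in a task's output text."""
    counts = {"suspend_requests": 0, "lockfile_waits": 0, "lockfile_exits": 0}
    for line in text.splitlines():
        if "Suspend request from BOINC" in line:
            counts["suspend_requests"] += 1
        elif "Can't acquire lockfile" in line:
            # The excerpt shows two variants: "... - waiting 35s" and "... - exiting".
            if line.rstrip().endswith("exiting"):
                counts["lockfile_exits"] += 1
            else:
                counts["lockfile_waits"] += 1
    return counts
```

Paste a task's output into a string and pass it in. Many suspend requests point towards the leave-in-memory setting; lockfile exits point towards something else holding the files.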

Re. graphics, in your other message: in BOINC Manager, click the 'Tasks' tab. You'll see a list of tasks in your queue, some running. Click a running task and the "Show graphics" button on the left will activate. Click that and, on the left, is an alternating display of "how goes it" information. There are also several options to view real-time graphics (expensive to run) of what your model is doing. (It has an option to make the display full-screen.)


None of this is guaranteed to eliminate the memory creep problem. Many of your failed tasks are the "Short" type. They have problems and some folks stopped running them, me included. So, it might be back to the drawing board and I'll have to call for help.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 52452 - Posted: 22 Aug 2015, 1:31:19 UTC - in response to Message 52451.  

Many of your failed tasks are the "Short" type. They have problems and some folks stopped running them, me included. So, it might be back to the drawing board and I'll have to call for help.

I am sure we all cringe at the thought of the shorts. But this latest batch is looking good.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1371296

SAK
Joined: 21 Jul 15
Posts: 8
Credit: 575,541
RAC: 0
Message 52453 - Posted: 22 Aug 2015, 3:14:14 UTC - in response to Message 52451.  

Thanks again astroWX, your help is much appreciated. I have enabled the setting to leave the application in memory when suspended. I forgot I had disabled that when I was trying to troubleshoot my memory issues. Regarding the antivirus, I had already disabled scanning of the \ProgramData\BOINC folder but just now excluded the \Program Files\BOINC folder as well.

Regarding the graphics, I had not noticed that functionality because whenever I select one of the seven short units currently running, the Show Graphics option in BOINC is grayed out. I am able to use it on the one Australia New Zealand 6.10 unit that is also running, very cool! Does the fact that the Show Graphics tab is not available for the short units indicate a problem and if so, should I abort them?

Thanks again...

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 52454 - Posted: 22 Aug 2015, 4:37:38 UTC

The graphics button is grayed out because the hadcm3s tasks don't have any graphics. Fewer and fewer WU types have them now.

SAK
Joined: 21 Jul 15
Posts: 8
Credit: 575,541
RAC: 0
Message 52456 - Posted: 23 Aug 2015, 4:07:35 UTC

Thanks for the info on the graphics Jim, I won't worry about units that don't have the Show Graphics option.

Going back to my original problem, the last round of work units finished processing about nine hours ago and I am about to hit the memory wall again. First the BOINC Manager listing the currently processing units:



And my current memory status:



And the processes still in memory, including completed CPDN work units:



Any thoughts appreciated... am I really the only person experiencing this?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52457 - Posted: 23 Aug 2015, 4:25:14 UTC - in response to Message 52456.  

am I really the only person experiencing this?

You're the first person that I can recall ever posting about memory retention problems.
Everyone else just has problems with data files left over.

SAK
Joined: 21 Jul 15
Posts: 8
Credit: 575,541
RAC: 0
Message 52458 - Posted: 23 Aug 2015, 4:30:24 UTC - in response to Message 52457.  

I guess I should feel special...

Hopefully the graphics above prove I am not making this stuff up! If there is anything else I can do to help troubleshoot please let me know.

Thanks again for the help and advice.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 52461 - Posted: 23 Aug 2015, 11:38:58 UTC

Note that there is nothing obviously wrong with the set of processes reported: each PNW and ANZ task reported in BOINC Manager will have three processes reported by Windows - an AM3/UM, an RM3/UM and an AM3. There may well be some increase in memory consumed as tasks progress, so perhaps there are just too many tasks for that machine's memory.

SAK
Joined: 21 Jul 15
Posts: 8
Credit: 575,541
RAC: 0
Message 52462 - Posted: 23 Aug 2015, 17:49:03 UTC - in response to Message 52461.  

Thanks Iain. The thing that doesn't make sense to me is this: if I am using 23GB of memory, then suspend processing, reboot and restart the eight work units, why does my memory usage fall back to about 4GB overall on the system instead of picking back up to the 23GB previously being used? If it only needs a fraction of the 23GB when restarted, why is it using so much memory over time without releasing unused memory?

Is the fix for me to somehow limit the number of CPUs that CPDN can use so that it throttles back the number of concurrent work units? I still find it strange that I seem to be the only person experiencing this. Don't other folks processing on multi-core processors run all cores simultaneously?

Thanks again for the input, appreciate it.