Message boards : Number crunching : Memory Allocation Failure
Author | Message |
---|---|
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
I've just had a slab model (hadsm3fub_0316_005909008_6) which has failed with a Memory Allocation Failure. I have been struggling to resuscitate it for several days, restoring it from several different backups, but it fails at exactly the same place each time: during the post-processing at the end of Phase 2. I have rerun it to the fail point 4 times and on the last occasion allowed it to report. The only difference is that sometimes it appears to freeze at the fail point, showing 66.666% complete, and other times it goes to 100%. I have also tried it on Intel and AMD systems. Does anybody have any ideas about what might make a difference if I were to try it a fifth time? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Highlighting Memory Allocation Failure in your post and then clicking the Google button on my browser gives a large number of 'hits'. Just reading the summaries of a few seems to indicate that your climate program keeps running out of memory. I remember that you posted a while back about lack of memory. So perhaps if you make sure that nothing else is running, that work units for any other project are suspended, and that you have set your prefs option to give the maximum amount of memory to the program when the computer is not in use, it will succeed. Better still, suspend the climate model and let other project WUs complete to get them out of the way. This obviously means first of all setting everything in the Projects tab to No new tasks. Your computer specs say 991.48 MB, which is less than 1 GB (1024 MB), so unless you have an odd amount of memory installed, some of it is 'missing'. Is this because you don't have a separate graphics card? If so, the onboard chips may be using a fair bit of memory at a crucial point, for some reason. As for BOINC sometimes showing 100%, this happens when BOINC loses contact with the model. It then assumes that the model has completed, and so sets its percentage number to 100% to show this 'completion'. |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Thanks, Les. I had assumed the error message was a BOINC/CPDN-specific message, so didn't do what you've done. The 991.48 ?!? relates to the AMD Sempron machine I tried it on last. Mostly, this model runs on a 2 GB Core 2 Duo, alongside a Coupled Model which has about 3 months still to go, but otherwise with nothing else much running (apart from email). Normal practice is Network is Off, Screensaver is Off, Windows and Antivirus updates are Off. The message I posted about lack of memory some time ago was for a completely different slab model running on a 512 MB laptop, which incidentally is still running. I can try it one more time when I get back to my Dual in a few days, as you suggest, and change the memory in Preferences. I'll see what that does. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
That's a very interesting crash, the first time I've ever seen one like it. A lot of debug info. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6932855 ... The most likely reason for a malloc failure is an out-of-memory condition, as Les indicates. You could also get an error if the memory allocation pool has been corrupted somehow, or if the pool has been fragmented to the point where malloc can't get a single block of memory large enough to satisfy a request. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Thanks, MikeMars. I've tried it again this morning after a reboot and pushing the Memory Preference up from 50% to 70%, albeit on my 1 GB single-core AMD system, but it just froze at the 66.666% point, i.e. the end of Phase 2. Freezing is the most common symptom. If I suspend the offending slab model and resume my coupled model, the CM carries on just fine. Do slabs have a larger memory requirement? |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Slab uses 50 MB for the worker process, with another 2 MB for the controller process (increasing to 12 MB if you open the graphics window). Coupled uses 100 MB for the worker and 15 MB for the controller after opening the graphics window. As far as freezing at 66.66% goes, that's the end of the second phase. The controller process does lots of post-processing at that point. If you're monitoring the percentage completed in BOINC Manager, slab models can appear to be stuck at that point for a few minutes before phase 3 starts and the percentage starts increasing again. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Quote: "Slab uses 50 MB for the worker process, with another 2 MB for the controller process (increasing to 12 MB if you open the graphics window). Coupled uses 100 MB for the worker and 15 MB for the controller after opening the graphics window." Thanks, Thyme Lawn. Trouble is, the slab froze at 66.666% for about 7 hours, i.e. overnight, and Windows Task Manager showed 0% CPU being used. |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Is there still a directory BOINC\projects\climateprediction.net\hadsm3fub_0316_005909008\dataout on that system? If so, check the names of the 0316ba.* files. If post-processing completed they should all have been given the suffix .x2.nc, but if something went wrong you'll have some without that suffix. If something did go wrong, have a look at the BOINC Manager messages tab (or the file BOINC/stdoutdae.txt) to check whether anything strange happened around the timestamp on the newest file with the .x2.nc suffix (e.g. did a benchmark force the application to be removed from memory?). |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Quote: "Is there still a directory BOINC\projects\climateprediction.net\hadsm3fub_0316_005909008\dataout on that system, and if so check the names of the 0316ba.* files. If post-processing completed they should all have been given the suffix .x2.nc but if something went wrong you'll have some without that suffix." Thanks, Thyme Lawn. The directory is still there and the suffixes are all .x2.nc. stdoutdae shows no messages at the relevant time. |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
Quote: "Thanks, MikeMars. I've tried it again this morning after a reboot and pushing the Memory Preference up from 50% to 70%, albeit on my 1 GB single-core AMD system, but it just froze at the 66.666% point, i.e. the end of Phase 2. Freezing is the most common symptom. If I suspend the offending slab model and resume my coupled model, the CM carries on just fine. Do slabs have a larger memory requirement?" Malloc errors would not be related to BOINC settings, as CPDN requests memory from the OS, not from BOINC. So BOINC preferences should make no difference in this case. I would be very curious to see how much memory the Windows kernel is using and how much free memory is available on the system after a reboot with CPDN running. You can find kernel memory usage in Task Manager under the Performance tab. |
Send message Joined: 10 Jun 05 Posts: 10 Credit: 4,863 RAC: 0 |
Quote: "Malloc errors would not be related to BOINC settings, as CPDN requests memory from the OS, not BOINC. So, BOINC preferences should make no difference in this case." If you mean CPDN requests memory from the OS when told by BOINC that it is allowed to ask for more memory, then I would approve of that. If CPDN completely ignores the user's preferences, that would be bad. |
Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
BOINC itself monitors memory usage, and when it gets too high will suspend enough tasks to reduce usage to below the limit. As DJStarfox says, CPDN asks the OS directly for memory. However, BOINC does not know how much system memory is free, or how much is used by non-BOINC tasks. It just uses the fixed percentages of the total memory. If you are running something else which uses an excessive amount of memory, or if you run out of disk space on the drive which holds your swap file (virtual memory), then the system will be in big trouble but BOINC won't realise. Various things I'd suggest trying: * Run a memory tester (Memtest86), or failing that a stress tester with a large memory size set (Prime95's torture test), for 24 hours. * As Thyme says, look in the Task Manager on the Processes tab after the system has been running for a while to see if there is anything using a lot of memory. You may need to add the 'peak memory usage' column to the display. * I don't know how much memory is used during the post-processing phase. If you watch in the Task Manager as above while post-processing is taking place, this may give you additional information. * How long does the system stay up at a stretch? If this is too long (weeks), you may suffer memory fragmentation after a while. Similarly, if something has a memory leak (such as vsmon), it may make the system crash with memory-related errors. |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Quote: "The directory is still there and the suffices are all .x2.nc . stdoutdae shows no messages at the relevant time." Which indicates that post-processing completed. There are 10 hours between the last phase 2 trickle and the model crashing, but going by the distribution of your earlier trickles it might have been significantly longer. Check the file BOINC\projects\climateprediction.net\hadsm3fub_0316_005909008.xml. It will indicate exactly where you got to in the processing. Post the lines from <PH> to <RSYT> if you need any help deciphering it. |
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
Quote: "The directory is still there and the suffices are all .x2.nc . stdoutdae shows no messages at the relevant time." That's very interesting if true. I thought the problem was post-processing, which on my Linux system more than doubles the model's memory requirement (for just that step). If the error happens afterward, then I'm not sure his system is really out of memory. I have a slab at 50% now. If it's still relevant to this thread when my model hits 66%, I'll observe the memory usage more closely. |
Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
I have now had the opportunity to rerun the slab model (hadsm3fub_0316_005909008_6) from a backup to the 66.666% point once more, using my Core 2 Duo with 2 GB memory. All other BOINC tasks were detached and the other coupled model deleted before system reboot. No other apps were running apart from AVG antivirus. Memory preferences were set to "use 90%". A new CPID was allocated at the start of the run. It has now frozen again at 66.666%. CPU usage has dropped to 0-1%, so processing has materially ceased. There does not appear to be a malloc failure this run. Once again, all the dataout/----0316da.* files have the .x2.nc suffix. The lines in file hadsm3fub_0316_005909008.xml which Thyme Lawn enquired about are as follows: <PH>2</PH> <TS>259248</TS> <DAY>2</DAY> <MTH>12</MTH> <YR>1840</YR> <HR>0</HR> <MIN>0</MIN> <SEC>0</SEC> <CSF>372</CSF> <TR>259248</TR> <ST>1</ST> <RS>3</RS> <RSC>0</RSC> <RSDT>259200</RSDT> <RSMT>259056</RSMT> <RSYT>259056</RSYT> I'm now off to restore the coupled model from its backup and see if I can continue with it. Thanks everyone for your assistance. |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
Those values indicate that it hasn't progressed into phase 3 (the <PH> tag is still set to 2). If you sort the dataout folder by date the oldest compressed working data file should be 0316ba.pa26c10.x2.nc, with a set of more recent files generated at the end of post-processing in the following increasing age order: 0316ba.pa.8yac.x2.nc 0316aa.pc.8yac.x2.nc (yes, it is 'aa' rather than 'ba'!) 0316ba.pa.gmts.x2.nc 0316ba.pa.rmts.x2.nc 0316ba.pd.gmts.x2.nc 0316ba.pd.rmts.x2.nc 0316ba.pe.gmts.x2.nc 0316ba.pe.rmts.x2.nc 0316ba.pf.gmts.x2.nc 0316ba.pf.rmts.x2.nc 0316ba.pg.gmts.x2.nc 0316ba.pg.rmts.x2.nc 0316ba.pw.8yac.x2.nc |
Send message Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
I've just watched three slab models run through the Phase 1 Zip upload (at 33.333%). This may, of course, differ from the Phase 2 situation. The sequence of events was as follows: 1. At 33.327% there is a processing drop, which possibly coincides with a checkpoint. 2. At 33.333% the 'hadsm3_um_5.06_windows_intelx86.exe' process stops. The memory used drops by ~49 MB, which is approximately the reported 'Mem Usage' and 'VM Size' for that process. The BOINC Manager progress percentage doesn't change. 3. Processing remains at very low levels for about 12 minutes, during which time the 'hadsm3_5.06_windows_intelx86.exe' process writes about 1 GB, with occasional interventions by the 'hadsm3_se_5.06_windows_intelx86.exe' process. Numerous trickles are uploaded. Memory occasionally increases by a few MB (presumably the '_se_' process running) but never approaches the original amount. 4. After using about 80-90 seconds of CPU, 'hadsm3_5.06_windows_intelx86.exe' finishes writing, a new 'hadsm3_um_5.06_windows_intelx86.exe' process starts, the progress percentage begins to increment, and the Zip file is uploaded. 5. The memory used settles back at about the original value plus ~2 MB. 6. Quitting/resuming BOINC Manager returns the memory used to the original value. So, CPU usage dropping to approximately zero is usual, but Lockleys' model should evidently restart for Phase 3 ... [Edit: Are the folder permissions OK? You can safely clear any read-only flags and also ensure that the running user has 'full control'.] |
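
The two manual checks suggested earlier in this thread (whether every file in the model's dataout folder carries the .x2.nc suffix, and what the <PH> tag in the model's XML state file says) could be scripted roughly as below. This is only a sketch under stated assumptions: the function names `files_missing_suffix` and `read_state_tags` are made up for illustration, the state values are assumed to be plain integers inside simple tags as in the lines posted above, and nothing here comes from the CPDN or BOINC code.

```python
import re
from pathlib import Path


def files_missing_suffix(dataout_dir, pattern="*", suffix=".x2.nc"):
    """Return sorted names of files matching `pattern` that lack `suffix`.

    Hypothetical helper: an empty list would suggest that post-processing
    renamed every matching file, as described in the thread.
    """
    return sorted(
        p.name
        for p in Path(dataout_dir).glob(pattern)
        if p.is_file() and not p.name.endswith(suffix)
    )


def read_state_tags(text, tags=("PH", "TS", "DAY", "MTH", "YR")):
    """Pull integer values out of simple <TAG>number</TAG> pairs.

    Hypothetical helper: it assumes flat, non-nested tags like the ones
    quoted in the thread, so a regular expression is enough here.
    """
    values = {}
    for tag in tags:
        match = re.search(rf"<{tag}>\s*(-?\d+)\s*</{tag}>", text)
        if match:
            values[tag] = int(match.group(1))
    return values


if __name__ == "__main__":
    # State lines as posted in the thread (shortened to the first few tags).
    sample = "<PH>2</PH> <TS>259248</TS> <DAY>2</DAY> <MTH>12</MTH> <YR>1840</YR>"
    state = read_state_tags(sample)
    print("Phase recorded in state file:", state.get("PH"))
```

To check a real folder, something like `files_missing_suffix(r"BOINC\projects\climateprediction.net\hadsm3fub_0316_005909008\dataout", pattern="0316ba.*")` would list any 0316ba.* files that were not renamed; the path is the one quoted in the thread and will differ between installations.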
©2024 cpdn.org