Thread 'hadcm3n Full Res Ocean out of memory error'

Author	Message
MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 50584 - Posted: 22 Oct 2014, 2:18:12 UTC Last modified: 22 Oct 2014, 2:36:45 UTC All these tasks are failing with: - exit code -529697949 (0xe06d7363) </message> <stderr_txt> Unhandled Exception Detected... - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception)(0xe06d7363) at address 0x7688C42D Engaging BOINC Windows Runtime Debugger... When I check the workunits, all the other PCs' tasks are failing with the same error. For a sample of mine see task 17232786. Just in case anyone wonders, I am not out of memory (32GB). Well, not on the PC anyway ;-)
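For anyone puzzled by the two forms of the exit code in the log above, here is a minimal Python sketch (not part of BOINC or CPDN; the helper name `to_unsigned32` is made up for illustration). It shows that -529697949 and 0xe06d7363 are the same 32-bit value read as signed and unsigned. 0xE06D7363 is the code the Microsoft Visual C++ runtime uses when it raises a C++ exception, and its low three bytes are the ASCII letters "msc". So the number itself only says that a C++ exception went unhandled; the "Out Of Memory" wording comes from the debugger's description of that exception.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


exit_code = -529697949              # as reported in the task's <message> block
unsigned = to_unsigned32(exit_code)
hex_form = hex(unsigned)

# The low three bytes of the MSVC C++ exception code are ASCII characters.
tag = (unsigned & 0x00FFFFFF).to_bytes(3, "big").decode("ascii")

print(exit_code, "->", hex_form, "| low bytes spell:", tag)
```

The same conversion works for any negative exit code shown in a Windows task report.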

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50586 - Posted: 22 Oct 2014, 3:12:20 UTC - in response to Message 50584. We were told about this earlier, so I've made a new thread in the Science section, here. You are now right at the leading edge of climate research, so buckle your seat belts and prepare for a wild ride. :)

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50588 - Posted: 22 Oct 2014, 8:47:11 UTC Got a couple more short models to get through before the one I picked up starts crunching. Will keep a close eye on it when it does. Good to have a heads up on this.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50591 - Posted: 22 Oct 2014, 22:08:30 UTC - in response to Message 50584. Last modified: 23 Oct 2014, 3:46:32 UTC Hi Martin, I suspect that the memory mentioned is "stack memory". I found a post about that error number, which suggests that it's a registry problem. Edited as suggested.

MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 50593 - Posted: 23 Oct 2014, 1:49:57 UTC - in response to Message 50591. Last modified: 23 Oct 2014, 1:54:55 UTC I would definitely NOT recommend using ANY so-called registry cleaner, and Les, you may want to consider removing the link if you can edit that post. I'll come back to why in a moment, as I want to address the error issues first. Yes, it could be the allocated memory that is an issue, but it is not just my PC; it is all the others that I have looked at. I haven't seen one successful one yet (although some of mine are now up to 50 hrs running - fingers crossed). Mac and Linux boxes are also failing. Your link to the new experiments suggests early failure with Invalid Theta. None of the tasks I looked at have Invalid Theta. All seem to run for 15k - 35k secs CPU time; Win PCs fail with the memory error stated in my earlier post, Macs and Linux boxes with different variants of 'process exited with code 193': Mac sample, Linux sample (Hi Eric). There is an interesting BOINC-related post here from 2008 on the 'exit code -529697949' error, and it suggests memory leakage by a wrapper - starting to get a bit technical for me, although I understand the purpose of a wrapper. I was thinking of doing a memory test when I get back from the long weekend, but I note this was done in the link and no errors were found. It would be nice to know if these are expected errors in addition to the Invalid Theta. I guess the researchers might come back at some point and let us know. Registry cleaners. Personally I don't think they are a good idea, and most independent writers advise against them: Microsoft link, Microsoft MVP comment link. Firstly, I would NEVER EVER run anything that tinkered with the registry that wasn't well documented and proven. Secondly, when I went to the linked page, I immediately had a download file box asking if I wanted to save or run the file - AND I hadn't clicked anything on the page! I didn't realise that was possible - definitely get me out of here time.

MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 50594 - Posted: 23 Oct 2014, 2:58:36 UTC - in response to Message 50593. Just looked at my tasks in progress at 10+ hrs and compared those to my failures. All the failures happened at around 23000 CPU secs, and this just happens to be around the time of the first upload trickle at timestep 25,920 on my rig. None of the failures had any trickles. Interesting?

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 50595 - Posted: 23 Oct 2014, 4:39:24 UTC The situation is similar on my Linux boxes, as Martin noted. I've seen 5 failures and no successes so far. The failures all came, like Martin said, just before the first trickle. All are on the new batch of hadcm3n models, specifically those with "r" starting the 4-character part of their name. (The "s" ones have got past the first trickle or farther so far.) The error message is <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> terminate called after throwing an instance of 'St9bad_alloc' what(): std::bad_alloc SIGABRT: abort called Stack trace ( frames): Exiting... </stderr_txt> ]]> Which could likely be a problem with space for stack memory allocation, but I'm not certain about that. Also, I have an unclear memory of something similar a while (more than a year, less than a decade) back.
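The Linux message above packs in two encodings that are easy to decode by hand. As a rough sketch (the helper names `to_signed8` and `demangle_std_name` are invented for this example and are not BOINC functions): 193, 0xc1 and -63 are the same byte, and 'St9bad_alloc' is the compiler-mangled spelling of `std::bad_alloc`, where "St" stands for `std::` and "9" is the length of the name that follows. `std::bad_alloc` is the exception that C++'s `operator new` throws when a memory allocation request cannot be satisfied. The demangler below handles only this one simple pattern, not mangled names in general.

```python
import re


def to_signed8(code: int) -> int:
    """Reinterpret an 8-bit exit status as a signed byte."""
    code &= 0xFF
    return code - 256 if code >= 128 else code


def demangle_std_name(mangled: str) -> str:
    """Demangle only the simple 'St<length><name>' form used for std:: types."""
    match = re.fullmatch(r"St(\d+)(\w+)", mangled)
    if not match or int(match.group(1)) != len(match.group(2)):
        raise ValueError(f"unsupported mangled name: {mangled!r}")
    return "std::" + match.group(2)


exit_status = 193
print(exit_status, hex(exit_status), to_signed8(exit_status))
print(demangle_std_name("St9bad_alloc"))
```

On a Linux box with binutils installed, the `c++filt -t` command does the same demangling job properly.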

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50596 - Posted: 23 Oct 2014, 4:48:40 UTC - in response to Message 50595. I was just about to post the same error messages for my Linux machine. :) Good point about the s vs r series. One lot may be a control. I'm running some of each.

Pete B Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424	Message 50597 - Posted: 23 Oct 2014, 9:40:44 UTC All 3 of mine on PC 827263, 2 'r' and 1 's', have gone with the "Invalid Theta". WUs 9221758, 9224048 and 9222404. It's interesting that the third one had already failed twice on other PCs before mine, once with 'Invalid Theta' and once with 'out of memory'. Because I suspended all the various other WUs in the queue to force the CM3n's to run first, the computer will not now run any more till I can reset them later.
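Since the thread now has three different failure signatures in play, a small script can help sort saved stderr text into them when comparing tasks. This is only a sketch under stated assumptions: the labels are made up, the match strings are taken from the messages quoted in this thread, and the exact wording of the "Invalid Theta" line in a real stderr file is assumed, since no such log is quoted here.

```python
from collections import Counter

# Lower-case substrings to look for; labels are hypothetical names for this sketch.
SIGNATURES = {
    "windows_out_of_memory": "out of memory (c++ exception)",
    "unix_bad_alloc": "std::bad_alloc",
    "invalid_theta": "invalid theta",  # assumed wording, not confirmed from a log
}


def classify(stderr_text: str) -> str:
    """Return the label of the first known signature found in the text."""
    text = stderr_text.lower()
    for label, needle in SIGNATURES.items():
        if needle in text:
            return label
    return "unknown"


# Sample lines: the first two are excerpts quoted in this thread,
# the third is a placeholder for an Invalid Theta failure.
samples = [
    "Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x75E5C42D",
    "terminate called after throwing an instance of 'St9bad_alloc' what(): std::bad_alloc",
    "INVALID THETA",
]

labels = [classify(s) for s in samples]
print(Counter(labels))
```

Pasting the stderr text of each failed task into `samples` gives a quick tally of which failure mode each 'r' or 's' model hit.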

Alan K Joined: 22 Feb 06 Posts: 491 Credit: 31,084,581 RAC: 14,886	Message 50598 - Posted: 23 Oct 2014, 21:12:55 UTC - in response to Message 50597. Last modified: 23 Oct 2014, 21:13:48 UTC Getting something similar: Unhandled Exception Detected... - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x75E5C42D Engaging BOINC Windows Runtime Debugger... ****************** BOINC Windows Runtime Debugger Version 6.13.0 Dump Timestamp : 10/22/14 23:30:14 Install Directory : C:\Program Files (x86)\BOINC\ Data Directory : C:\ProgramData\BOINC Project Symstore : LoadLibraryA( C:\Program Files (x86)\BOINC\\dbghelp.dll ): GetLastError = 8 LoadLibraryA( dbghelp.dll ): GetLastError = 8 * Dump of the Process Statistics: * - I/O Operations Counters - Read: 0, Write: 0, Other 0 - I/O Transfers Counters - Read: 0, Write: 0, Other 0 - Paged Pool Usage - QuotaPagedPoolUsage: 0, QuotaPeakPagedPoolUsage: 0 QuotaNonPagedPoolUsage: 0, QuotaPeakNonPagedPoolUsage: 0 - Virtual Memory Usage - VirtualSize: 0, PeakVirtualSize: 0 - Pagefile Usage - PagefileUsage: 0, PeakPagefileUsage: 0 - Working Set Size - WorkingSetSize: 0, PeakWorkingSetSize: 0, PageFaultCount: 0 * Dump of thread ID 2712 (state: Initialized): * - Information - Status: Base Priority: Normal, Priority: Normal, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 0.000000 - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x75E5C42D * Debug Message Dump ** * Foreground Window Data *** Window Name : Window Class : Window Process ID: 0 Window Thread ID : 0 Exiting... Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6264, selfPID=6264, iMonCtr=1 Again, it looks as if it's just before the first trickle.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50600 - Posted: 24 Oct 2014, 7:55:53 UTC As I am using BOINC 7.4.22, which enables me to set debugging options from the GUI, are there any particular flags it would be worth my setting when the first of my two HADAM3CN models starts later today? (Running Linux.)

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 50601 - Posted: 24 Oct 2014, 11:04:47 UTC Unlikely that BOINC debug options will help. Might could. It is very clear that the originally posted "out of memory error" is a stack overflow error both on Windows and Linux. I've refreshed my personal memory, and there is no reasonable doubt that stack overflow is not the endpoint that the researchers were looking for, according to what Les posted http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7935 about this project. "Stack overflow" is almost always a programming error. If you are looking for "INVALID THETA" and get "stack overflow" -- does not compute. BUT the "stack overflow" errors do not originate in the hadcm3n model -- for sure -- because the hadcm3n model is in FORTRAN - and the errors come from the Windows and Linux C++. WTF? Do the "out of memory" errors come from -- where? Something in BOINC, something in "wrappers" - whatever -- The problem for me is -- FOR EVERY ONE OF THESE models that fails, there's another one that wants to run for 3 weeks. Really. Like for every 'r...' that fails there is a companion 's...' that needs to run for two or three weeks to prove -- yeah, you understand. Or maybe I misunderstand, but I don't think so -- I've got the "s..." models already backing up on my machines' queues -- the corresponding "r..." all failed - and not in the expected way, but for every "r..." failure that took a few hours there are 5 "s..." models that will take 3 weeks each. ??

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 50602 - Posted: 24 Oct 2014, 11:08:19 UTC Meanwhile, I let the models run -- being almost sure that there's a problem with the config or the design or that the plan to test the cm3n model actually tested the BOINC framework and found it wanting -- or -- who knows -- I'll keep on crunching.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50603 - Posted: 24 Oct 2014, 11:23:56 UTC Last modified: 24 Oct 2014, 11:35:33 UTC I wonder if there is a howto around somewhere for using the diagnostic flags? Tried searching this board without finding anything. Will do a more general search and report back. Edit: This seems to have most of what I want to know, even if I do have to look up some of the acronyms.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50608 - Posted: 24 Oct 2014, 19:30:48 UTC - in response to Message 50601. Hi Eric, the main modelling program is FORTRAN, but there's also at least one C++ program. I've no idea what this is for, but my guess is that it's a feeder program between the main program and all of the data files. A bit like the "wrapper" that some projects use to interact with BOINC. Also, I think that this is the one that the project people have to wrestle with each time a researcher comes up with a new idea for modelling the climate at some time and place on the planet. And it would be why we have to do beta testing of new modelling ideas. We've passed on to Andy various error messages gleaned from various failures, so he knows that we think that there's a problem. ************** After going through 4 "r"s, all of which failed at 5 hours 20 minutes, I've now got 4 "s"s, the newest at a bit over 20 hours, and 2 originals at 48 hours. On with the research! :)

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50609 - Posted: 24 Oct 2014, 19:46:42 UTC - in response to Message 50603. Dave, those flags will only be good for anything going wrong with the BOINC side of things. For info on the current problem, something like MS's Process Explorer will be needed. I was using another one for a while a couple of months back, but I don't remember what it was called, and it wasn't as useful as the MS one.

ed2353 Joined: 15 Feb 06 Posts: 137 Credit: 35,377,018 RAC: 12,908	Message 50610 - Posted: 24 Oct 2014, 22:52:04 UTC - in response to Message 50608. Likewise, all my "r..." models are failing with the "out of Memory" message (just?) before the first trickle on my Windows 8.1 box. The "s..." do not seem to have this failure.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50612 - Posted: 25 Oct 2014, 6:12:51 UTC Last modified: 25 Oct 2014, 6:46:45 UTC Thanks Les, will try procexp and have a play, though given my level of technical knowledge I expect it will be a steep learning curve. Edit: And that curve includes just getting it to install.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50616 - Posted: 25 Oct 2014, 22:17:05 UTC - in response to Message 50612. And the first of my 2 r tasks has now failed with <core_client_version>7.4.22</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... 21:23:06 (20715): No heartbeat from core client for 30 sec - exiting terminate called after throwing an instance of 'St9bad_alloc' what(): std::bad_alloc SIGABRT: abort called Stack trace ( frames): Exiting... </stderr_txt> ]]>

brown Joined: 24 Feb 06 Posts: 10 Credit: 10,142,658 RAC: 0	Message 50631 - Posted: 27 Oct 2014, 0:35:12 UTC I seem to be a magnet for these 2.5% 10 hour 'r' failures. Should I continue to let them run, or abort them? Thanks.