climateprediction.net (CPDN) home page
Thread 'hadcm3n Full Res Ocean out of memory error'

MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50584 - Posted: 22 Oct 2014, 2:18:12 UTC
Last modified: 22 Oct 2014, 2:36:45 UTC

All these tasks are failing with:
- exit code -529697949 (0xe06d7363)
</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception)(0xe06d7363) at address 0x7688C42D

Engaging BOINC Windows Runtime Debugger...

When I check the workunits, all the other PCs' tasks are failing with the same error.

For a sample of mine see task 17232786

Just in case anyone wonders, I am not out of memory (32GB). Well not on the PC anyway ;-)
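
For reference, the two forms of the exit code quoted above are the same 32-bit value, printed once as a signed integer and once in hex. A minimal Python sketch of the conversion follows; the helper name `to_signed32` is made up for illustration. The low three bytes spell "msc" in ASCII, and 0xE06D7363 is the code the Microsoft C++ runtime raises for an unhandled C++ exception, so by itself it does not identify which exception type was thrown.

```python
def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


code = 0xE06D7363
signed = to_signed32(code)

# The low three bytes of the code are printable ASCII characters.
tag = (code & 0x00FFFFFF).to_bytes(3, "big").decode("ascii")

print(signed)  # the signed form that BOINC reports as the exit code
print(tag)     # the ASCII tag embedded in the code
```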
ID: 50584

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50586 - Posted: 22 Oct 2014, 3:12:20 UTC - in response to Message 50584.  

We were told about this earlier, so I've made a new thread in the Science section.

Here

You are now right at the leading edge of climate research, so buckle your seat belts and prepare for a wild ride. :)


ID: 50586

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50588 - Posted: 22 Oct 2014, 8:47:11 UTC

Got a couple more short models to get through before the one I picked up starts crunching. Will keep a close eye on it when it does. Good to have a heads up on this.
ID: 50588

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50591 - Posted: 22 Oct 2014, 22:08:30 UTC - in response to Message 50584.  
Last modified: 23 Oct 2014, 3:46:32 UTC

Hi Martin

I suspect that the memory mentioned is "stack memory".

I found a post about that error number, which suggests that it's a registry problem.

Edited as suggested.
ID: 50591

MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50593 - Posted: 23 Oct 2014, 1:49:57 UTC - in response to Message 50591.  
Last modified: 23 Oct 2014, 1:54:55 UTC

I would definitely NOT recommend using ANY so-called registry cleaner, and Les, you may want to consider removing the link if you can edit that post. I'll come back to why in a moment, as I want to address the error issues first.

Yes, it could be the allocated memory that is the issue, but it is not just my PC; it is all the others that I have looked at. I haven't seen a successful one yet (although some of mine are now up to 50 hrs running - fingers crossed). Mac and Linux boxes are also failing. Your link to the new experiments suggests early failure with Invalid Theta, but none of the tasks I looked at have Invalid Theta. All seem to run for 15k - 35k secs of CPU time; Win PCs then fail with the memory error stated in my earlier post, and Macs and Linux boxes fail with different variants of 'process exited with code 193': Mac sample, Linux sample (Hi Eric).

An interesting BOINC related post here from 2008 on the 'exit code -529697949' error and it suggests memory leakage by a wrapper - starting to get a bit technical for me although I understand the purpose of a wrapper. I was thinking of doing a memory test when I get back from the long weekend, but I note this was done in the link and no errors found.

It would be nice to know if these are expected errors in addition to the Invalid Theta. I guess the researchers might come back at some point and let us know.

Registry cleaners. Personally I don't think they are a good idea, and most independent writers advise against them: Microsoft link, Microsoft MVP comment link. Firstly, I would NEVER EVER run anything that tinkered with the registry that wasn't well documented and proven. Secondly, when I went to the linked page, I immediately had a download file box asking if I wanted to save or run the file - AND I hadn't clicked anything on the page! I didn't realise that was possible - definitely 'get me out of here' time.
ID: 50593

MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50594 - Posted: 23 Oct 2014, 2:58:36 UTC - in response to Message 50593.  

Just looked at my tasks in progress at 10+ hrs and compared those to my failures. All the failures happened at around 23000 CPU secs, and this just happens to be around the time of the first upload trickle at timestep 25,920 on my rig. None of the failures had any trickles. Interesting?
ID: 50594

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 50595 - Posted: 23 Oct 2014, 4:39:24 UTC

The situation is similar on my Linux boxes, as Martin noted.
I've seen 5 failures and no successes so far.
All the failures, like Martin said, came just before the first trickle.
All are on the new batch of hadcm3n models, specifically those whose 4-character name part starts with "r". (The "s" ones have got past the first trickle or farther so far.)

The error message is

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort called
Stack trace ( frames):

Exiting...

</stderr_txt>
]]>


Which could likely be a problem with space for stack memory allocation, but I'm not certain about that.
Also I have an unclear memory of something similar a while (more than a year, less than a decade) back
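
For reference, the numbers and names in that log can be decoded mechanically. The sketch below is illustrative only, and the helper names `to_signed8` and `demangle_std_name` are made up. It shows that 193, 0xc1 and -63 are the same 8-bit value, and that 'St9bad_alloc' is the compiler-mangled spelling of std::bad_alloc ("St" stands for the std namespace, followed by a length-prefixed name). In standard C++, std::bad_alloc is the exception thrown when a dynamic (heap) allocation made with `new` cannot be satisfied. The demangler handles only this one simple form, not mangled names in general.

```python
import re


def to_signed8(value: int) -> int:
    """Reinterpret an unsigned 8-bit value as a signed 8-bit integer."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def demangle_std_name(mangled: str) -> str:
    """Decode only the simple form 'St<length><name>', meaning std::<name>."""
    match = re.fullmatch(r"St(\d+)(\w+)", mangled)
    if not match or int(match.group(1)) != len(match.group(2)):
        raise ValueError(f"unsupported mangled name: {mangled!r}")
    return "std::" + match.group(2)


print(hex(193), to_signed8(193))          # the three forms of the exit code
print(demangle_std_name("St9bad_alloc"))  # the exception type named in stderr
```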



ID: 50595

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50596 - Posted: 23 Oct 2014, 4:48:40 UTC - in response to Message 50595.  

I was just about to post the same error messages for my Linux machine. :)

Good point about the s Vs r series. One lot may be a control.
I'm running some of each.


ID: 50596

Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50597 - Posted: 23 Oct 2014, 9:40:44 UTC

All 3 of mine on PC 827263 (2 'r' and 1 's') have gone with the "Invalid Theta": WUs 9221758, 9224048 and 9222404. It's interesting that the third one had already failed twice on other PCs before mine, once with 'Invalid Theta' and once with 'out of memory'. Because I suspended all the various other WUs in the queue to force the CM3n's to run first, the computer will not now run any more till I can reset them later.
ID: 50597

Alan K
Joined: 22 Feb 06
Posts: 491
Credit: 31,036,409
RAC: 14,604
Message 50598 - Posted: 23 Oct 2014, 21:12:55 UTC - in response to Message 50597.  
Last modified: 23 Oct 2014, 21:13:48 UTC

Getting something similar:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x75E5C42D

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.13.0


Dump Timestamp : 10/22/14 23:30:14
Install Directory : C:\Program Files (x86)\BOINC\
Data Directory : C:\ProgramData\BOINC
Project Symstore :
LoadLibraryA( C:\Program Files (x86)\BOINC\\dbghelp.dll ): GetLastError = 8
LoadLibraryA( dbghelp.dll ): GetLastError = 8
*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 0, Write: 0, Other 0

- I/O Transfers Counters -
Read: 0, Write: 0, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 0, QuotaPeakPagedPoolUsage: 0
QuotaNonPagedPoolUsage: 0, QuotaPeakNonPagedPoolUsage: 0

- Virtual Memory Usage -
VirtualSize: 0, PeakVirtualSize: 0

- Pagefile Usage -
PagefileUsage: 0, PeakPagefileUsage: 0

- Working Set Size -
WorkingSetSize: 0, PeakWorkingSetSize: 0, PageFaultCount: 0

*** Dump of thread ID 2712 (state: Initialized): ***

- Information -
Status: Base Priority: Normal, Priority: Normal, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 0.000000

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x75E5C42D



*** Debug Message Dump ****


*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0

Exiting...
Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6264, selfPID=6264, iMonCtr=1

Again, it looks as if it's just before the first trickle.
ID: 50598

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50600 - Posted: 24 Oct 2014, 7:55:53 UTC

As I am using BOINC 7.4.22, which enables me to set debugging options from the GUI, are there any particular flags it would be worth my setting when the first of my two HADAM3CN models starts later today? (Running Linux.)
ID: 50600

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 50601 - Posted: 24 Oct 2014, 11:04:47 UTC

Unlikely that BOINC debug options will help. Might could.

It is very clear that the originally posted "out of memory error" is a stack overflow error both on Windows and Linux. I've refreshed my personal memory, and there is no reasonable doubt that stack overflow is not the endpoint that the researchers were looking for according to what Les posted
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7935 about this project.

"Stack overflow" is almost always a programming error. If you are looking for "INVALID THETA" and get "stack overflow" -- does not compute.
BUT the "stack overflow" errors do not originate in the hadcm3n model -- for sure -- because the hadcm3n model is in FORTRAN - and the errors come from the Windows and Linux C++. WTF?
Do the "out of memory" errors come from -- where?
Something in BOINC, something in "wrappers" - whatever --

The problem for me is --
FOR EVERY ONE OF THESE models that fails, there's another one that wants to run for 3 weeks. Really. Like for every "r..." that fails there is a companion "s..." that needs to run for two or three weeks to prove -- yeah, you understand.

Or maybe I misunderstand, but I don't think so -- I've got the "s..." models already backing up on my machines' queues -- the corresponding "r..." all failed - and not in the expected way, but for every "r..." failure that took a few hours there are 5 "s..." models that will take 3 weeks each. ??
ID: 50601

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 50602 - Posted: 24 Oct 2014, 11:08:19 UTC

Meanwhile, I let the models run -- being almost sure that there's a problem with the config or the design or that the plan to test the cm3n model actually tested the BOINC framework and found it wanting -- or -- who knows -- I'll keep on crunching.
ID: 50602

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50603 - Posted: 24 Oct 2014, 11:23:56 UTC
Last modified: 24 Oct 2014, 11:35:33 UTC

I wonder if there is a howto around somewhere for using the diagnostic flags? I tried searching this board without finding anything. Will do a more general search and report back.

Edit: This seems to have most of what I want to know even if I do have to look up some of the acronyms.
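
For anyone who prefers a file to the GUI, the BOINC client reads its log flags from a `cc_config.xml` file in the BOINC data directory. The sketch below only builds the XML text for such a file. The helper name `build_cc_config` is made up, and the two flags shown (`task_debug`, `mem_usage_debug`) are examples that should be checked against the client configuration documentation for your BOINC version before use. As noted elsewhere in this thread, these flags only cover the BOINC side and may not say anything about this particular failure.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag switched on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


# Example flag names; verify them against the documentation for your client.
xml_text = build_cc_config(["task_debug", "mem_usage_debug"])
print(xml_text)
```

The resulting text would be saved as `cc_config.xml` in the data directory, after which the client has to re-read its config files or be restarted.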
ID: 50603

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50608 - Posted: 24 Oct 2014, 19:30:48 UTC - in response to Message 50601.  

Hi Eric

The main modelling program is FORTRAN, but there's also at least one C++ program.
I've no idea what this is for, but my guess is that it's a feeder program between the main program and all of the data files. A bit like the "wrapper" that some projects use to interact with BOINC.
Also, I think that this is the one that the project people have to wrestle with each time a researcher comes up with a new idea for modelling the climate at some time and place on the planet. And it would be why we have to do beta testing of new modelling ideas.

We've passed on to Andy various error messages gleaned from various failures, so he knows that we think that there's a problem.

**************

After going through 4 "r"s, all of which failed at 5 hours 20 minutes, I've now got 4 "s"s, the newest at a bit over 20 hours, and 2 originals at 48 hours.

On with the research! :)


ID: 50608

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50609 - Posted: 24 Oct 2014, 19:46:42 UTC - in response to Message 50603.  

Dave

Those flags will only be good for anything going wrong with the BOINC side of things.
For info on the current problem, something like MS's Process Explorer will be needed.

I was using one for a while a couple of months back, but I don't remember what it was called, and it wasn't as useful as the MS one.
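
On Linux, a rough stand-in for watching a process in Process Explorer is to read the memory fields the kernel exposes in `/proc/<pid>/status`. The sketch below is only an illustration: the helper name `parse_vm_fields` is made up, and the sample text is hypothetical, not taken from a real model run.

```python
def parse_vm_fields(status_text: str) -> dict:
    """Return the Vm* fields of a /proc/<pid>/status dump as {name: kB}."""
    fields = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name.startswith("Vm"):
            parts = rest.split()
            if parts and parts[0].isdigit():
                fields[name] = int(parts[0])
    return fields


# Hypothetical sample in the same layout as /proc/<pid>/status.
sample = "Name:\tmodel\nVmPeak:\t 1048576 kB\nVmRSS:\t  524288 kB\n"
print(parse_vm_fields(sample))
```

Against a live process the input would be the contents of `/proc/<pid>/status` for the model's process ID, read every so often in the run-up to the first trickle to see whether the peak keeps growing.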

ID: 50609

ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,338,894
RAC: 13,030
Message 50610 - Posted: 24 Oct 2014, 22:52:04 UTC - in response to Message 50608.  

Likewise, all my "r..." models are failing with the "out of Memory" message (just?) before the first trickle on my Windows 8.1 box.

The "s..." models do not seem to have this failure.
ID: 50610

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50612 - Posted: 25 Oct 2014, 6:12:51 UTC
Last modified: 25 Oct 2014, 6:46:45 UTC

Thanks Les, will try procexp and play, though given my level of technical knowledge I expect it will be a steep learning curve.

Edit: And that curve includes just getting it to install.
ID: 50612

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50616 - Posted: 25 Oct 2014, 22:17:05 UTC - in response to Message 50612.  

And the first of my 2 r tasks has now failed with



<core_client_version>7.4.22</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
21:23:06 (20715): No heartbeat from core client for 30 sec - exiting
terminate called after throwing an instance of 'St9bad_alloc'
what(): std::bad_alloc
SIGABRT: abort called
Stack trace ( frames):

Exiting...

</stderr_txt>
]]>
ID: 50616

brown
Joined: 24 Feb 06
Posts: 10
Credit: 10,142,658
RAC: 0
Message 50631 - Posted: 27 Oct 2014, 0:35:12 UTC

I seem to be a magnet for these 2.5% 10 hour 'r' failures.
Should I continue to let them run or abort them?
Thanks.
ID: 50631