Thread 'hadcm3n Full Res Ocean out of memory error'

Author	Message
Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50632 - Posted: 27 Oct 2014, 7:56:56 UTC I am going to abort any re-issues of the r tasks, especially if they have already failed on Linux boxes. I will maybe give a couple more a go after the current one fails, assuming it goes the way of the last one, before stopping on them. My impression is that everyone is having the r series fail with what looks like a programming error, and at the same point, so there doesn't seem much to be gained from running more of them. But, as I say, I will wait a little while longer before making that decision, partly to give those at Oxford a chance to respond.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 50634 - Posted: 27 Oct 2014, 8:31:15 UTC Just thinking: as these crashed tasks don't clean up after themselves (at least the one on my Linux box so far hasn't), is there likely to be any information that is of use in working out what the problem is, or should I just delete the folder?

Pete B Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424	Message 50635 - Posted: 27 Oct 2014, 8:38:38 UTC All mine have failed so far, mainly 'r's but a couple of 's's too. On my system, they all seem to go just before the first checkpoint at the end of the sixth model day, all with the 'Invalid Theta Detected' error. They have all crashed on other machines too, so any I get now that are re-issues after previous fails, I'll probably abort. I'll just see what happens with any WUs I get that are not re-issues of previous failures before the fifth re-issue cutoff point. There can't be many more first-issue WUs left.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50641 - Posted: 27 Oct 2014, 21:10:07 UTC I've just made a News announcement about these. Basically, Abort the "r" series.

MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 50642 - Posted: 27 Oct 2014, 22:37:00 UTC - in response to Message 50641. Thanks Les. My 's' series are running well, now up to 60% complete on the first ones, but as they have 300+ hours run time, it will be a while before I see if they complete OK. Given the high failure rate on the 'r' series, it seems there was a lack of testing on that one. I guess these things happen, and it must be frustrating for the researchers, who have more than enough to do anyway.

Byron Leigh Hatch @ team Carl ... Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 50647 - Posted: 28 Oct 2014, 3:15:00 UTC - in response to Message 50641. I've just made a News announcement about these. Basically, Abort the "r" series. Hi Les, thank you for your post, and thank you for your volunteer work for climateprediction.net; your contribution is greatly appreciated. I just want to double-check before I abort these tasks. In your post in News and Announcements you wrote: "r" series of Hadam3n models. Did you mean: "r" series of Hadcm3n models? Or is there a difference between the a and the c? I am sorry if I am reading this wrong. I have the following tasks in my BOINC queue: hadcm3n_r15o_1940_40_009092774 hadcm3n_r0aq_1940_40_009091660 hadcm3n_r117_1940_40_009092613 hadcm3n_r0yf_1940_40_009092513 Should I abort the above tasks? Thanks.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50649 - Posted: 28 Oct 2014, 4:36:05 UTC - in response to Message 50647. Thanks Byron. You're right, it should have been a C, for Coupled Ocean. Too many model types with similar names. One that we're testing at the moment has 14 letters and numbers in its name, which is a problem when typing in the file name in Linux to look at or delete old tests. Thank goodness for the storage of keyboard entries. And, yes, kill off those 4 models.
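For anyone with a longer queue than the four tasks listed above, picking out the r-series by eye gets tedious. Here is a minimal Python sketch, assuming task names always follow the pattern seen in those four names: `hadcm3n_` followed by a run ID whose first character is the series letter. The function name `hadcm3n_series` is made up for illustration and is not part of BOINC or the project.

```python
def hadcm3n_series(task_name):
    """Return the series letter of a hadcm3n task name, or None for other models.

    Assumes the layout seen in this thread:
    hadcm3n_<run id>_<year>_<length>_<number>,
    with the series being the first character of the run id.
    """
    parts = task_name.lower().split("_")
    if len(parts) < 2 or parts[0] != "hadcm3n" or not parts[1]:
        return None
    return parts[1][0]


# The four task names quoted in the thread.
queue = [
    "hadcm3n_r15o_1940_40_009092774",
    "hadcm3n_r0aq_1940_40_009091660",
    "hadcm3n_r117_1940_40_009092613",
    "hadcm3n_r0yf_1940_40_009092513",
]

# Keep only the tasks whose series letter is "r".
to_abort = [name for name in queue if hadcm3n_series(name) == "r"]
print(to_abort)
```

All four names come back in `to_abort`, which matches the advice to kill off those 4 models. A name that starts with `hadam3n_` returns `None`, so the a/c mix-up discussed above would not cause an accidental abort.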

MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 50651 - Posted: 28 Oct 2014, 6:01:00 UTC - in response to Message 50649. Just a thought, Les: if we keep aborting the 'r' series, won't the re-runs keep being sent out until someone finally ends up running them? Given that the tasks kill themselves after a few minutes' run, would we be better off running them so they quickly reach their max number of re-runs, or does an Abort count as a re-run?

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50652 - Posted: 28 Oct 2014, 6:14:07 UTC - in response to Message 50651. It's the download attempt that counts: the number of times "sent" by the server. I'm trying to be general here, for those new to the project. The re-sends are a pain in this case, but good if it's something that crashes on, say, Windows, and one is running Linux. Which is the case with one of mine, which is on its last chance. And it's good that they run for so long, because I can now "sit it out and wait for things to get better". I learnt that from my cats. :)

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1084 Credit: 7,841,902 RAC: 5,047	Message 50656 - Posted: 28 Oct 2014, 11:02:38 UTC FWIW my protocol for dispatching HADCM3N r-series is:
1. Set "no new tasks" by default.
2. Change to "allow new tasks" for the dispatch.
3. Update. If an r-series task downloads, suspend the task before it has finished downloading. If an s-series task downloads, then go to #7.
4. After downloading, abort the task (this works even while the task is suspended).
5. Update (this works even during a "communication deferred").
6. Go to #1.
7. Run s-series task.
Why? Because aborted models don't seem to tidy up completely if they've unzipped and started running: aborting the tasks before they unzip reduces the number of "reset/remove" actions necessary to ultimately reclaim disk space. NB This is for a Mac, which can only run HADCM3N or EU/ANZ because of "Error 9".
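The steps above can be sketched as command lines for `boinccmd`, the command-line tool that ships with the BOINC client. This is only a sketch under stated assumptions: the protocol as posted does not say which tool is used, and the thread does not confirm that suspending through `boinccmd` behaves the same way during a download. The project URL and both function names are assumptions for illustration. The code only builds and prints the commands; it does not run them.

```python
# Assumption: replace with the project URL shown by your own BOINC client.
PROJECT_URL = "http://climateprediction.net/"


def fetch_commands(url=PROJECT_URL):
    """Steps 2-3: allow new tasks, then ask the client to contact the server."""
    return [
        ["boinccmd", "--project", url, "allowmorework"],
        ["boinccmd", "--project", url, "update"],
    ]


def handle_task_commands(task_name, url=PROJECT_URL):
    """Steps 3-7 for whichever task arrived after the update."""
    # Assumes names look like hadcm3n_<run id>_..., series = first letter of run id.
    parts = task_name.split("_")
    is_r_series = (
        len(parts) > 1 and parts[0] == "hadcm3n" and parts[1].startswith("r")
    )
    cmds = []
    if is_r_series:
        cmds += [
            ["boinccmd", "--task", url, task_name, "suspend"],  # step 3
            ["boinccmd", "--task", url, task_name, "abort"],    # step 4
            ["boinccmd", "--project", url, "update"],           # step 5
        ]
    # Step 1 again (and, by assumption, also once an s-series task is kept).
    cmds.append(["boinccmd", "--project", url, "nomorework"])
    return cmds


steps = fetch_commands() + handle_task_commands("hadcm3n_r15o_1940_40_009092774")
for cmd in steps:
    print(" ".join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to try: nothing is aborted until the lines are checked and run by hand.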

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 50673 - Posted: 29 Oct 2014, 21:48:53 UTC - in response to Message 50651. Martin, Andy has removed all of the unsent data sets, and is killing off the resends as he finds them. That isn't easy with computers having large numbers of processors and a cache set to suck in large numbers of data sets, which means that they don't stay long in the queue before being resent.

MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 50676 - Posted: 29 Oct 2014, 23:08:31 UTC - in response to Message 50673. Thanks Les. For me, as I have a full quota of 's' series running, it's just a matter of knocking the 'r' series off once a day, but it's great that it is being sorted out at the server end. For those with lesser bandwidth I'm sure it will be welcome. Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly seem to appear on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. One example that is not in the Pentium camp, and that seems to be making a useful contribution, but has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts', they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd.

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 50684 - Posted: 30 Oct 2014, 7:12:36 UTC - in response to Message 50676. Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly seem to appear on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. One example that is not in the Pentium camp, and that seems to be making a useful contribution, but has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts', they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd. It is possible that they are editing the client state file to lower the time-remaining estimate to an artificially low figure. This would trick BOINC into giving them extra work, letting them build up the number of WUs waiting on their machine. Kind of like storing water in a reservoir against periodic droughts.

ed2353 Joined: 15 Feb 06 Posts: 137 Credit: 35,377,018 RAC: 12,908	Message 50688 - Posted: 30 Oct 2014, 10:10:21 UTC - in response to Message 50676. Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly seem to appear on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. One example that is not in the Pentium camp, and that seems to be making a useful contribution, but has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts', they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd. Actually, you can have 20 days' maximum work by setting the Minimum work buffer to 10 days AND the Maximum Additional Work buffer to 10 days. With a mix of tasks in my queue, I often choose to run the long-running tasks first, so the Short tasks may sit in my queue for a week or so before they get to run. That may make them appear to take longer than the couple of days' run-time that they actually need. With lots of Short tasks available I have around 120 in my queue at present, but my i7 can run 12 threads, so that is not excessive.
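The figures in the post above can be checked with a few lines of arithmetic. This is a rough sketch only: it reads "a couple of days" as exactly 2 days per Short task, which is an assumption, and it ignores the long-running tasks that share the same queue.

```python
# Buffer settings quoted in the post above.
min_buffer_days = 10         # "Minimum work buffer"
additional_buffer_days = 10  # "Maximum Additional Work buffer"
max_cache_days = min_buffer_days + additional_buffer_days

# Queue figures quoted in the post above.
queued_tasks = 120
threads = 12
days_per_short_task = 2      # assumption: "a couple of days" read as 2

# Each thread works through its share of the queue one task at a time.
days_of_work = queued_tasks / threads * days_per_short_task
print(max_cache_days, days_of_work)
```

On those assumptions, 120 Short tasks on 12 threads come to about 20 days of work, which is what a 10 + 10 day buffer allows.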

MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 50706 - Posted: 2 Nov 2014, 3:52:46 UTC - in response to Message 50688. Last modified: 2 Nov 2014, 3:54:30 UTC OK, last off-topic post here from me. I see there are mechanisms, but I really can't see the point. To me it's up to the researchers to allocate the tasks, and when the tasks run out, they run out: time for some housekeeping. Given that on this PC I run 10-12 tasks, and at a reasonable pace, it is actually not all that often that I run out anyway. When I do, it's normally for less than a week. Hey, I can live with that. So I'll just run with the default cache settings.