Message boards : Number crunching : hadcm3n Full Res Ocean out of memory error
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
I am going to abort any re-issues of the r tasks, especially if they have already failed on Linux boxes. I will give maybe a couple more a go after the current one fails (assuming it goes the way of the last one) before stopping on them. My impression is that everyone is seeing the r series fail at the same point with what looks like a programming error, so there doesn't seem to be much to gain from running more of them. As I say, though, I will wait a little while longer before making that decision, partly to give those at Oxford a chance to respond. |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Just thinking: as these crashed tasks don't clean up after themselves (at least the one on my Linux box so far hasn't), is there likely to be any information left behind that would help in working out what the problem is, or should I just delete the folder? |
Send message Joined: 26 Aug 04 Posts: 67 Credit: 10,299,683 RAC: 10,424 |
All mine have failed so far, mainly 'r's but a couple of 's's too. On my system, they all seem to go just before the first checkpoint at the end of the sixth model day, all with the 'Invalid Theta Detected' error. They have all crashed on other machines too, so any I get now that are re-issues after previous failures I'll probably abort. I'll just see what happens with any WUs I get that are not re-issues of previous failures before the fifth re-issue cutoff point. There can't be many more first-issue WUs left. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I've just made a News announcement about these. Basically, Abort the "r" series. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Thanks Les. My 's' series are running well, now up to 60% complete on the first ones, but as they have 300+ hours of run time, it will be a while before I see if they complete OK. Given the high failure rate on the 'r' series, it looks like a lack of testing on that one. I guess these things happen, and it must be frustrating for the researchers, who have more than enough to do anyway. |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
"I've just made a News announcement about these." Hi Les, thank you for your post, and thank you for your volunteer work for climateprediction.net. Your contribution is greatly appreciated. I just want to double-check before I abort these tasks. In your post in News and Announcements you wrote: "r" series of Hadam3n models. Did you mean: "r" series of Hadcm3n models? Or is there a difference between the "a" and the "c"? I am sorry if I am reading this wrong. I have the following tasks in my BOINC queue: hadcm3n_r15o_1940_40_009092774, hadcm3n_r0aq_1940_40_009091660, hadcm3n_r117_1940_40_009092613, hadcm3n_r0yf_1940_40_009092513. Should I abort the above tasks? Thanks. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Thanks Bryon. You're right, it should have been a C. For Coupled Ocean. Too many model types with similar names. One that we're testing at the moment has 14 letters and numbers in its name. Which is a problem when typing in the file name in Linux to look at or delete old tests. Thank goodness for the storage of keyboard entries. And, yes, kill off those 4 models. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Just a thought, Les: if we keep aborting the 'r' series, won't the re-runs keep being sent out until someone finally ends up running them? Given that the tasks kill themselves after a few minutes' run, would we be better off running them so they quickly reach their max number of re-runs, or does an Abort count as a re-run? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
It's the download attempt that counts. The number of times "sent" by the server. I'm trying to be general here, for those new to the project. The re-sends are a pain in this case, but good if it's something that crashes on, say, Windows, and one is running Linux. Which is the case with one of mine, which is on its last chance. And it's good that they run for so long, because I can now "sit it out and wait for things to get better". I learnt that from my cats. :) |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,841,902 RAC: 5,047 |
FWIW my protocol for dispatching HADCM3N r-series is: 1. Set "no new tasks" by default. 2. Change to "allow new tasks" for the dispatch. 3. Update. If an r-series task downloads, suspend the task before it has downloaded. If an s-series downloads, then goto #7. 4. After downloading, abort the task (this works through a suspension). 5. Update (this works through a "communication deferred"). 6. Goto #1. 7. Run s-series task. Why? Because aborted models don't seem to tidy up completely if they've unzipped and started running: aborting the unzipped tasks reduces the number of "reset/remove" actions necessary to ultimately reclaim disk space. NB This is for a Mac, which can only run HADCM3N or EU/ANZ because of "Error 9". |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Martin, Andy has removed all of the unsent data sets, and is killing off the resends as he finds them. Which isn't easy, with computers having large numbers of processors, and a cache set to suck in large numbers of data sets, which means that they don't stay long in the queue before being resent. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Thanks Les. For me, as I have a full quota of 's' series running, it's just a matter of knocking the 'r' series off once a day, but it's great that it is being sorted out at the server end. For those with lesser bandwidth I'm sure it will be welcome. Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly appeared on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. Not in the Pentium camp, and a PC that seems to be making a useful contribution, but one that has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts', they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd. |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
"Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly appeared on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. Not in the Pentium camp, and a PC that seems to be making a useful contribution, but one that has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts', they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd." It is possible that they are editing the client state file to lower the time-remaining estimate to an artificially low figure. This would trick BOINC into giving them extra work, which would let them build up the number of WUs waiting on their machine. Kind of like storing water in a reservoir against periodic droughts. |
Send message Joined: 15 Feb 06 Posts: 137 Credit: 35,377,018 RAC: 12,908 |
"Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly appeared on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. Not in the Pentium camp, and a PC that seems to be making a useful contribution, but one that has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts', they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd." Actually, you can have 20 days' maximum work by setting the Minimum work buffer to 10 days AND the Maximum Additional Work buffer to 10 days. With a mix of tasks in my queue, I often choose to run the long-running tasks first, so the Short tasks may sit in my queue for a week or so before they get to run. That may make them appear to take longer than the couple of days of run time that they actually need. With lots of Short tasks available I have around 120 in my queue at present, but my i7 can run 12 threads, so that is not excessive. |
Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
OK, last off-topic post here from me. I see there are mechanisms, but I really can't see the point. To me it's up to the researchers to allocate the tasks, and when the tasks run out, they run out: time for some housekeeping. Given that on this PC I run 10-12 tasks, and at a reasonable pace, it is actually not all that often that I run out anyway. When I do, it's normally for less than a week. Hey, I can live with that. So I'll just run with the default cache settings. |
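
For anyone with a long queue who wants to pick out the "r" series tasks discussed in this thread, here is a minimal Python sketch. It rests on an assumption drawn only from the task names quoted above (for example hadcm3n_r15o_1940_40_009092774): that the series letter is the first character of the second underscore-separated field. The function names `series_letter` and `split_by_series` are made up for illustration and are not part of BOINC or any CPDN tool. The script only sorts names into two lists; the actual abort still has to be done by hand in the BOINC Manager.

```python
def series_letter(task_name: str) -> str:
    """Return the series letter of a task name.

    Assumption: the letter is the first character of the second
    underscore-separated field, as in 'hadcm3n_r15o_1940_40_009092774'.
    """
    parts = task_name.split("_")
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"unexpected task name: {task_name!r}")
    return parts[1][0].lower()


def split_by_series(task_names, abort_series=("r",)):
    """Split hadcm3n task names into (to_abort, to_keep) lists."""
    to_abort, to_keep = [], []
    for name in task_names:
        is_hadcm3n = name.lower().startswith("hadcm3n_")
        if is_hadcm3n and series_letter(name) in abort_series:
            to_abort.append(name)
        else:
            to_keep.append(name)
    return to_abort, to_keep


# The four task names quoted earlier in the thread.
queue = [
    "hadcm3n_r15o_1940_40_009092774",
    "hadcm3n_r0aq_1940_40_009091660",
    "hadcm3n_r117_1940_40_009092613",
    "hadcm3n_r0yf_1940_40_009092513",
]

to_abort, to_keep = split_by_series(queue)
print("abort:", to_abort)
print("keep:", to_keep)
```

With the four names from the thread, all of them land in the abort list, which matches the advice given above to kill off those four models. Task names from other model types fall through to the keep list untouched.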
©2024 cpdn.org