climateprediction.net (CPDN) home page
Thread 'hadcm3n Full Res Ocean out of memory error'

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50632 - Posted: 27 Oct 2014, 7:56:56 UTC

I am going to abort any re-issues of the r tasks, especially if they have already failed on Linux boxes. I may give a couple more a go after the current one fails, assuming it goes the way of the last one, before stopping on them. My impression is that everyone is seeing the r series fail at the same point, with what looks like a programming error, so there doesn't seem much to be gained from running more of them. But, as I say, I will wait a little while longer before making that decision, partly to give those at Oxford a chance to respond.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50634 - Posted: 27 Oct 2014, 8:31:15 UTC

Just thinking: as these crashed tasks don't clean up after themselves (at least the one on my Linux box hasn't so far), is there likely to be any information that is of use in working out what the problem is, or should I just delete the folder?

Pete B
Joined: 26 Aug 04
Posts: 67
Credit: 10,299,683
RAC: 10,424
Message 50635 - Posted: 27 Oct 2014, 8:38:38 UTC

All mine have failed so far, mainly 'r's but a couple of 's's too. On my system, they all seem to go just before the first checkpoint at the end of the sixth model day, all with the 'Invalid Theta Detected' error. They have all crashed on other machines too, so I'll probably abort any I get now that are re-issues after previous failures.

I'll just see what happens with any WUs I get that are not re-issues of previous failures, before the fifth re-issue cutoff point. There can't be many more first-issue WUs left.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50641 - Posted: 27 Oct 2014, 21:10:07 UTC

I've just made a News announcement about these.
Basically, Abort the "r" series.

MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50642 - Posted: 27 Oct 2014, 22:37:00 UTC - in response to Message 50641.  

Thanks Les.

My 's' series are running well, now up to 60% complete on the first ones, but as they have 300+ hours run time, it will be a while before I see if they complete OK.

Given the high failure rate on the 'r' series, it seems there was a lack of testing on that one. I guess these things happen, and it must be frustrating for the researchers, who have more than enough to do anyway.

Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 50647 - Posted: 28 Oct 2014, 3:15:00 UTC - in response to Message 50641.  

I've just made a News announcement about these.
Basically, Abort the "r" series.

hi Les . . . thank you for your post, and thank you for your volunteer work for climateprediction.net . . . your contribution is greatly appreciated.

I just want to double check . . . before I abort these tasks. In your post in News and Announcements you posted:

"r" series of Hadam3n models

did you mean:

"r" series of Hadcm3n models

or is there a difference between the a and the c?

I am sorry if I am reading this wrong.

I have the following tasks in my BOINC queue:

hadcm3n_r15o_1940_40_009092774
hadcm3n_r0aq_1940_40_009091660
hadcm3n_r117_1940_40_009092613
hadcm3n_r0yf_1940_40_009092513

should I abort the above tasks ?

thanks

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50649 - Posted: 28 Oct 2014, 4:36:05 UTC - in response to Message 50647.  

Thanks Byron.

You're right, it should have been a C. For Coupled Ocean.
Too many model types with similar names.

One that we're testing at the moment has 14 letters and numbers in its name, which is a problem when typing in the file name in Linux to look at or delete old tests.
Thank goodness for the storage of keyboard entries.


And, yes, kill off those 4 models.
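
For anyone sorting a longer queue, the series can be read off the task name. This is a minimal sketch, assuming the series letter is the first character of the second underscore-separated field. That rule is inferred only from the four names quoted above and is not a documented CPDN naming convention; `series_of` is a hypothetical helper.

```python
def series_of(task_name):
    """Return the series letter of a name like 'hadcm3n_r15o_1940_40_009092774'.

    Assumption: the second underscore-separated field starts with the series letter.
    """
    fields = task_name.split("_")
    if len(fields) < 2 or not fields[1]:
        raise ValueError(f"unexpected task name: {task_name!r}")
    return fields[1][0].lower()


queue = [
    "hadcm3n_r15o_1940_40_009092774",
    "hadcm3n_r0aq_1940_40_009091660",
    "hadcm3n_r117_1940_40_009092613",
    "hadcm3n_r0yf_1940_40_009092513",
]

# Only hadcm3n tasks from the faulty 'r' series are candidates for aborting.
to_abort = [t for t in queue if t.startswith("hadcm3n_") and series_of(t) == "r"]
print(len(to_abort), "task(s) to abort")
```

The abort itself would still be done by hand in the BOINC Manager; the sketch only picks out the names.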

MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50651 - Posted: 28 Oct 2014, 6:01:00 UTC - in response to Message 50649.  

Just a thought, Les: if we keep aborting the 'r' series, won't the re-runs keep being sent out until someone finally ends up running them? Given that the tasks kill themselves after a few minutes' run, would we be better off running them so they quickly reach their max number of re-runs, or does an abort count as a re-run?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50652 - Posted: 28 Oct 2014, 6:14:07 UTC - in response to Message 50651.  

It's the download attempt that counts. The number of times "sent" by the server.
I'm trying to be general here, for those new to the project.

The re-sends are a pain in this case, but good if it's something that crashes on, say, Windows, and one is running Linux.
Which is the case with one of mine, which is on its last chance.

And it's good that they run for so long, because I can now "sit it out and wait for things to get better". I learnt that from my cats. :)
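
To put the send-counting rule in concrete terms: every send uses up one attempt, whether the task then crashes or is aborted by the volunteer. The following is a rough sketch, not the server's actual logic. The limit of 5 sends is an assumption borrowed from the "fifth re-issue" cutoff mentioned earlier in the thread, and all names are hypothetical.

```python
MAX_SENDS = 5  # assumed limit, not a confirmed server setting


def sends_until_retired(outcomes, max_sends=MAX_SENDS):
    """Count how many times a work unit is sent before re-issues stop.

    `outcomes` lists what happened after each send:
    "completed", "crashed" or "aborted".
    Each send counts as one attempt, whatever the outcome.
    """
    sends = 0
    for outcome in outcomes:
        if sends >= max_sends:
            break  # limit reached: no further re-issues
        sends += 1
        if outcome == "completed":
            break  # a successful result ends the re-issues
    return sends


# Aborting and crashing use up attempts at the same rate.
print(sends_until_retired(["aborted"] * 10))
print(sends_until_retired(["crashed"] * 10))
print(sends_until_retired(["crashed", "aborted", "completed"]))
```

Under these assumptions, aborting a doomed task retires the work unit no more slowly than letting it crash, and it saves the few minutes of run time.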


Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,841,902
RAC: 5,047
Message 50656 - Posted: 28 Oct 2014, 11:02:38 UTC

FWIW my protocol for dispatching HADCM3N r-series is:

1. Set "no new tasks" by default.

2. Change to "allow new tasks" for the dispatch.

3. Update. If an r-series task starts to download, suspend the task before the download has finished. If an s-series task downloads, then goto #7.

4. After downloading, abort the task (this works through a suspension).

5. Update (this works through a "communication deferred").

6. Goto #1.

7. Run s-series task.

Why? Because aborted models don't seem to tidy up completely if they've unzipped and started running: aborting tasks while they are still zipped reduces the number of "reset/remove" actions necessary to ultimately reclaim disk space.

NB This is for a Mac, which can only run HADCM3N or EU/ANZ because of "Error 9".
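
Read as a loop, the protocol above looks roughly like this. It is only a sketch of the manual steps: `dispatch` and the action strings are hypothetical labels for clicks in the BOINC Manager, not BOINC commands, and the input is simply the series letter of each task the server happens to offer.

```python
def dispatch(offers):
    """Return the log of manual actions for a sequence of offered series letters.

    The loop ends as soon as an s-series task arrives and is left to run.
    """
    log = ["no new tasks"]                        # step 1
    for series in offers:
        log += ["allow new tasks", "update"]      # steps 2 and 3
        if series == "s":
            log.append("run task")                # step 7
            break
        log += [
            "suspend before download finishes",   # step 3
            "abort once downloaded",              # step 4
            "update",                             # step 5
            "no new tasks",                       # step 6, back to step 1
        ]
    return log


for action in dispatch(["r", "r", "s"]):
    print(action)
```

The point of suspending first is that the r-series task never gets to unzip and start, so the abort leaves nothing behind to clean up.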

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50673 - Posted: 29 Oct 2014, 21:48:53 UTC - in response to Message 50651.  

Martin

Andy has removed all of the unsent data sets, and is killing off the resends as he finds them.
That isn't easy: computers with large numbers of processors, and a cache set to suck in large numbers of data sets, mean that the resends don't stay long in the queue before being sent out again.


MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50676 - Posted: 29 Oct 2014, 23:08:31 UTC - in response to Message 50673.  

Thanks Les. As I have a full quota of 's' series running, for me it's just a matter of knocking the 'r' series off once a day, but it's great that it is being sorted out at the server end. I'm sure it will be welcome for those with less bandwidth.


Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly appeared on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. Not in the Pentium camp, and a PC that seems to be making a useful contribution, but has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts', they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd.

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 50684 - Posted: 30 Oct 2014, 7:12:36 UTC - in response to Message 50676.  

Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly appeared on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. Not in the Pentium camp, and a PC that seems to be making a useful contribution, but has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts', they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd.


It is possible that they are editing the client state file to lower the time-remaining estimate to an artificially low figure. This would trick BOINC into giving them extra work, and would let them build up the number of WUs waiting on their machine. Kind of like storing water in a reservoir against periodic droughts.

ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 35,377,018
RAC: 12,908
Message 50688 - Posted: 30 Oct 2014, 10:10:21 UTC - in response to Message 50676.  

Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly appeared on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. Not in the Pentium camp, and a PC that seems to be making a useful contribution, but has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts', they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd.

Actually you can have 20 days maximum work by setting the Minimum work buffer to 10 days AND the Maximum Additional Work buffer to 10 days.

With a mix of tasks in my queue, I often choose to run the long running tasks first, so the Short tasks may sit in my queue for a week or so before they get to run. That may make them appear to take longer than the couple of days run-time that they actually need.

With lots of Short tasks available I have around 120 in my queue at present, but my i7 can run 12 threads, so that is not excessive.
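
As a back-of-envelope check on those numbers, here is a small sketch. It takes the two 10-day buffer settings and the 12 threads from the post, and assumes that "a couple of days" means 2 days of run time per Short task. The real BOINC work-fetch calculation uses its own run-time estimates, so treat this only as rough arithmetic; `queue_capacity` is a hypothetical helper.

```python
def queue_capacity(buffer_days, threads, days_per_task):
    """Number of tasks needed to keep `threads` busy for `buffer_days`."""
    return buffer_days * threads / days_per_task


min_buffer_days = 10         # "Minimum work buffer" setting from the post
additional_buffer_days = 10  # "Maximum additional work buffer" setting from the post
total_days = min_buffer_days + additional_buffer_days

# 12 threads from the post; 2 days per task is an assumed reading of "a couple of days".
capacity = queue_capacity(total_days, threads=12, days_per_task=2)
print(total_days, "days of buffer,", capacity, "tasks")
```

On those assumptions, a full 20-day buffer on 12 threads works out to the same order as the queue of around 120 Short tasks described above.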

MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50706 - Posted: 2 Nov 2014, 3:52:46 UTC - in response to Message 50688.  
Last modified: 2 Nov 2014, 3:54:30 UTC

OK, last off-topic post here from me. I see there are mechanisms, but I really can't see the point. To me it's up to the researchers to allocate the tasks, and when the tasks run out, they run out: time for some housekeeping.

Given that on this PC I run 10-12 tasks, and at a reasonable pace, it is actually not all that often that I run out anyway. When I do, it's normally for less than a week - hey, I can live with that. So I'll just run with the default cache settings.