climateprediction.net (CPDN) home page
Thread 'Iceworld Appeal'

old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 38091 - Posted: 11 Oct 2009, 11:20:04 UTC

Hi Iain, assume you are still collecting iceworlds?

Got my first one for ages. Spotted it (phase 2, at 62.96%) when doing the usual morning check,
which is essential now that I'm running 9 slabs (+ one good ol' CM3). This one here.
It turned blue early this morning, soon after phase 2 timestep 226,842, which is 62.5% progress.

Have restored from the backup taken at 8am yesterday but have suspended the quad's
other 3 models to help the iceworld get to that point as quickly as possible. Currently at 56%.

Will send you the .cpdn file probably tomorrow (Monday). Just need your email address.

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 38092 - Posted: 11 Oct 2009, 16:20:21 UTC - in response to Message 38091.  
Last modified: 11 Oct 2009, 16:26:38 UTC

[iansm wrote:] Hi Iain, assume you are still collecting iceworlds?
Sure am.

[iansm wrote:] Just need your email address.
PM on its way.

Thanks.

It doesn't look like anyone else in that WU is going to produce anything useful...

... though it looks like you put your foot on the gas at trickle #3!

old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 38093 - Posted: 11 Oct 2009, 17:48:25 UTC - in response to Message 38092.  


[Iain Inglis wrote:] ... though it looks like you put your foot on the gas at trickle #3!


Sort of. Transferred the quads from a slower m/c after #3.

old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 38094 - Posted: 11 Oct 2009, 18:58:02 UTC - in response to Message 38092.  

P.S. Like the chart but can you explain the Y axis "spot (relative)" please?

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 38095 - Posted: 11 Oct 2009, 19:08:37 UTC - in response to Message 38094.  

[iansm wrote:] P.S. Like the chart but can you explain the Y axis "spot (relative)" please?

Sorry: I use these charts to spot iceworlds - so it's really seconds/timestep vs trickle number. The seconds/timestep number on the BOINC Web site is cumulative, so I take the difference between adjacent values to get a "spot" value, then divide by the seconds/timestep value for the first trickle to get a relative value. The graph then starts at 1.0 and all results within a work unit can be plotted on the same graph.

[PS It's pChart, which has a slightly odd franglais interface - but it does the job.]
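
The "spot (relative)" calculation described above can be sketched in Python. This is only an illustration, not the script actually used for the charts (which is not shown in the thread). It assumes the Web site reports a cumulative average seconds/timestep at each trickle and that trickles are evenly spaced in timesteps, so "the difference between adjacent values" is read here as the difference between cumulative totals. The function name and the sample figures are made up.

```python
def relative_spot(cumulative_avg):
    """Convert cumulative average s/TS per trickle into relative spot values.

    Assumes trickles are evenly spaced in timesteps, so trickle n covers
    n equal blocks and n * average is proportional to total elapsed time.
    Assumes all values are positive.
    """
    if not cumulative_avg:
        return []
    # Cumulative elapsed time after each trickle, in units of one block.
    totals = [(i + 1) * avg for i, avg in enumerate(cumulative_avg)]
    # Spot value: time spent on each block alone (difference of neighbours).
    spots = [totals[0]] + [b - a for a, b in zip(totals, totals[1:])]
    # Relative value: divide by the first trickle so every curve starts at 1.0.
    first = spots[0]
    return [s / first for s in spots]


# Hypothetical work unit: steady for two trickles, then the model slows down.
print(relative_spot([1.0, 1.0, 2.0]))
```

Because every curve starts at 1.0, results from fast and slow machines in the same work unit can share one graph, and a sudden jump or drop in the relative value is what stands out.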

old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 38103 - Posted: 12 Oct 2009, 16:06:20 UTC - in response to Message 38095.  

Thanks for the explanation.

So the restored model now goes past the point when it turned blue!
Had even aborted it yesterday, but it has trickled up at the next
timestep. Will just keep an eye on it and take more frequent backups
in case it hits another wall further on.

Sorry not to supply another iceworld for your collection!

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 38106 - Posted: 12 Oct 2009, 19:00:09 UTC - in response to Message 38103.  
Last modified: 12 Oct 2009, 19:03:03 UTC

[iansm wrote:] Sorry not to supply another iceworld for your collection!
Good news for you ...

Collecting statistics about iceworlds, I've seen quite a few occasions where something that looks like an iceworld isn't confirmed by other computers. So, they are sometimes caused by something on the PC and a restored backup will carry on. Never happened to me though. :-(

A slam-dunk Windows/Intel iceworld looks like this on one of my charts:

... and, before you ask, this WU was repeated a number of times - so there are more than the usual number of repeats!

Here's a Windows/AMD iceworld:

... it recovers! As far as I can tell, any iceworld on any platform that starts in phase 2 or later will recover at the next phase. Phase 1 iceworlds are doomed. On Windows/Intel, where the model slows down, you would have to have the patience of a saint to persist. I don't.

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 38111 - Posted: 13 Oct 2009, 4:42:35 UTC - in response to Message 38106.  

Hi, Iain

Does this mean that in the (unlikely) event that I should have a fast processing iceworld on my AMD machine I should just keep running it instead of aborting?

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 38113 - Posted: 13 Oct 2009, 8:50:08 UTC - in response to Message 38111.  

[JIM wrote:] Does this mean that in the (unlikely) event that I should have a fast processing iceworld on my AMD machine I should just keep running it instead of aborting?

I don't know the answer to that: it does mean that, practically speaking, you can keep running it. My general rule is that it isn't a good idea to try to guess what the project uses data for, so I assume they'll use anything and therefore finish anything that I can. So, if I had an AMD, I would finish an iceworld. However, unless a Windows/Intel iceworld is very near a phase boundary, the slowdown is such that it would usually be possible to run a number of complete models in the time it would take to finish one iceworld - and there I do make my own decision and abort.

As to likelihood, from what I can tell, iceworlds are as prevalent on Windows/AMD and Mac (and possibly Linux/AMD) as on Windows/Intel. Windows/Intel users tend to notice because their progress will stall (and their much-maligned RAC will plummet). Iceworlds on other platforms don't get reported because they run fast and people just don't notice them. The rate is something like 15% - i.e. about one in seven.

PS And if you do get an AMD iceworld, I would really like the '.cpdn' file. It would double the sample of Windows/AMD! ;-)

old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 38133 - Posted: 16 Oct 2009, 17:45:12 UTC

So my restored iceworld (task 9964633) from last week eventually completed and duly sent up the final trickle -
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=9964633

As expected, since I had already aborted it before attempting the restore, we get the
"Completed result (model name) refused: result already reported as error" (in BOINC manager).

Assume that this can be ignored so the full set of results will be available to the scientists.

As you say Iain, this may have been a machine "blip". In the past I have tried a few restores but no luck.
Thereafter have always aborted and just taken another one - i.e. the usual advice. Now I will always restore
an iceworld (don't like abandoning any model!) unless I've been lazy and not backed up for several days.

Looking at my records (from our team stats), have 16 fails from 146 slabs (10%) and 8 from 28 mids (30%).
Presumably the majority, if not all, were iceworlds. Could do better!

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 38134 - Posted: 17 Oct 2009, 11:09:18 UTC - in response to Message 38133.  

[iansm wrote:] As expected, since I had already aborted it before attempting the restore, we get the
"Completed result (model name) refused: result already reported as error" (in BOINC manager).

Assume that this can be ignored so the full set of results will be available to the scientists.

That's my assumption too.

[iansm wrote:] Looking at my records (from our team stats), have 16 fails from 146 slabs (10%) and 8 from 28 mids (30%).
Presumably the majority, if not all, were iceworlds. Could do better!

Looking at those mid-holocene crashes, I suspect that some of them aren't iceworlds (in the sense that they're not repeatable - they may well have frozen at the time they crashed). Have a look at this sequence of the eight potential iceworlds here; use the 'previous' and 'next' links at the bottom of the page to move through the sequence. I would speculate that ki3v, ki7l and km6d crashed for some other reason and could be restored from backup. It depends whether you want three CPUs down while you run them to completion!

If they aren't reproducible iceworlds, then that brings your mid-holocene iceworld rate down from a very discouraging 8 in 28 to an almost bearable 5 in 28 (i.e. 18%).
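
For anyone checking the arithmetic, the rates quoted in this exchange can be recomputed with a few lines of Python. The counts are the ones given in the thread; the helper name is made up, and the percentages in the posts are rounded a little more loosely than the figures this prints.

```python
def failure_rate(failed, total):
    """Return the failure rate as a percentage, rounded to one decimal place."""
    return round(100.0 * failed / total, 1)


print(failure_rate(16, 146))  # slabs
print(failure_rate(8, 28))    # mid-Holocene, all eight crashes counted
print(failure_rate(5, 28))    # mid-Holocene, three non-repeatable crashes excluded
```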

Iain

old_user353238
Joined: 15 Mar 06
Posts: 41
Credit: 3,581,078
RAC: 0
Message 38135 - Posted: 17 Oct 2009, 13:45:18 UTC

Iain,
Aha, confession time. Found out. It's a fair cop.
Checked own records again and found another 8 SM3/SM3MH models that also
failed during a period when that Q6600 was breaking the speed limit.
s/TS rates of around 1.00 or less for those 11 models have now reminded me!
That's 7 MHs + 4 slabs. Another 4 MHs in the iceworld detector slideshow
(on the same 880110 m/c) are km3a, km9s, km39, kk9v. The other one, kj30,
is from another (stock!) Q6600 (990567), so that has definitely got to be a proper iceball.

Am a good driver now, having throttled back to a sensible SM3 speed this year (1.05 - 1.10).
Last week's iceworld was just a small "blip" (grin).

Your great iceworld detector reveals all.

Unfortunately, not possible to restore those old models as I only keep
short term backups. Pity, I'd be happy to dedicate one machine to retry a few.
It would be good practice now to keep iceworlds in a "BOINCDataIce" folder to
retry when convenient - and send them to the appeal if they get iced up a second time.

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 38136 - Posted: 17 Oct 2009, 14:17:37 UTC
Last modified: 17 Oct 2009, 14:19:52 UTC

Ha! I've deleted most of my backups as well.

My backup procedure is now to download models when there's about 10 days to go on the currently running set (i.e. the maximum), then immediately suspend the newly downloaded models. The current set is then run to completion, leaving four (or whatever) suspended downloads. When the completed models have reported, BOINC is then stopped and a single backup taken. The advantages of this timing are that:

a) other people with faster machines get up to ten days ahead of me, so I can see the iceworlds coming (and turn on recording only when necessary)

b) the downloaded models are still only Zip files, so the backup is small and quick and can be moved to another machine without 'contamination' by the download machine. (I do quite a bit of that to check iceworld repeatability.)

A reasonable 'raw' backup history can now be kept without taking up too much space. If I'm feeling nervous I take interim backups after phase changes (to prevent the re-uploading of Zip files), but throw them away when the new set starts and the Web site shows the right graphs for the old models.

The method works best for someone like me who doesn't expect many crashes, but it does make the whole backup business a bit less of a chore.
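
The "single backup" step could be scripted along the following lines. This is a minimal sketch, not the procedure actually used in the thread: it assumes the backup is simply a zip of the whole BOINC data folder, the function name and paths are hypothetical, and it does not stop BOINC for you (do that first, as described above). The backup folder should sit outside the data folder so the archive does not try to include itself.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir, backup_root):
    """Zip data_dir into backup_root under a timestamped name.

    Returns the path of the archive. BOINC should already be stopped;
    this function does not check for or stop a running client.
    """
    data_dir = Path(data_dir)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    # Timestamped base name so older backups are kept, not overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = backup_root / f"boinc-backup-{stamp}"
    # make_archive appends ".zip" and returns the full archive path.
    archive = shutil.make_archive(str(base), "zip", root_dir=str(data_dir))
    return Path(archive)


# Example call with hypothetical paths (adjust to your own installation):
# backup_boinc_data(r"C:\BOINCData", r"D:\Backups\BOINC")
```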

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 38178 - Posted: 24 Oct 2009, 16:51:13 UTC

The collection of iceworlds now amounts to 26, with recent additions by mo.v and Dibb Fosdyke (and two of mine) - for which, many thanks! The current batch of Windows/Intel iceworlds all seem to start in the same place, on the west coast of North America.

Here is the map again, together with the Mediterranean crashes.


[Map: West Coast]
[Map: Western Med. - Straits of Gibraltar]
[Map: Eastern Med. - Cyprus]

Green and blue blobs are model grid points; the red blob is where the freeze started; a green tint indicates model land and a blue tint indicates model ocean.

adrianxw
Joined: 31 Aug 04
Posts: 145
Credit: 2,080,724
RAC: 753
Message 38198 - Posted: 28 Oct 2009, 6:38:33 UTC

I'd have helped (note past tense), but what I came to the board to do was report an iceworld, hadsm3fub_jowe_006398408 (this one). I have suspended it rather than aborting in case there is anything to be gained here, but I doubt it. I do not have any backups of the model. If there is anything to recover, let me know. (It crunched for 250+ hours before "freezing".)
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 38200 - Posted: 28 Oct 2009, 9:15:39 UTC

Thanks, Adrian.

Only one other model is running in that work unit (6596415). When that catches up with yours then it will be apparent whether Windows/Intel computers suffer repeatable iceworlds in that work unit, or whether the iceworld on your PC was just a random freeze.

It's a bit of a problem having to know that the iceworld is coming before it actually freezes! An occasional backup is the simplest method and the backup can also be used to recover from the random crashes that occur from time to time.

Your iceworld is half way through the last phase, which means it has 12 trickles still to go. Windows/Intel slow-processing iceworlds can take about a week per trickle - so that would be 12 weeks to complete that model. You could probably do five or six other slabs in that time. Abort.
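
The trade-off behind that advice is simple arithmetic and can be written down as a small sketch. The 12 trickles and the week per trickle come from the post; the two weeks assumed for a normal slab is a hypothetical figure, chosen only because it is consistent with "five or six other slabs", and the function name is made up.

```python
def iceworld_cost(trickles_left, weeks_per_trickle, weeks_per_normal_model):
    """Return (weeks to finish the iceworld, normal models forgone meanwhile)."""
    weeks = trickles_left * weeks_per_trickle
    return weeks, weeks / weeks_per_normal_model


# 12 trickles at about a week each, against an assumed 2 weeks per normal slab.
weeks, forgone = iceworld_cost(12, 1.0, 2.0)
print(f"{weeks:.0f} weeks to finish, about {forgone:.0f} slabs forgone")
```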

Iain

old_user582229
Joined: 12 Aug 09
Posts: 20
Credit: 3,063,648
RAC: 0
Message 38207 - Posted: 28 Oct 2009, 14:34:50 UTC

Iain,

I will gladly help with your Iceworld project, as I can.

I am currently running 19 models, and have set all to record mode. Only four are AMD. As I only have 400GB dedicated to BOINC per computer I will delete the .cpdn files after each phase change.

It pleases me to think some use will be made of my blues.

Is there any way to set the models to automatically record when they start to run?

Cheers
David

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 38208 - Posted: 28 Oct 2009, 15:00:05 UTC
Last modified: 28 Oct 2009, 15:12:52 UTC

That's great, David. My iceworld rate is about one in seven, so you should find they start to come in pretty quickly from 19 models. Remember that AMD iceworlds are sneaky: they run fast - how you're going to spot those I don't know. (I could set up a logging script here, actually, and send you a PM heads-up; that's how I do my own.)
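
To put a rough number on "pretty quickly", here is a small sketch of the odds. It assumes each model independently has the same one-in-seven chance of becoming an iceworld, which is a simplification: the thread suggests iceworlds are tied to particular work units. The function name is made up. Under that assumption, 19 models give roughly 2.7 expected iceworlds.

```python
def iceworld_odds(models, rate=1 / 7):
    """Return (expected number of iceworlds, chance of at least one).

    Assumes every model freezes independently with probability `rate`.
    """
    expected = models * rate
    p_at_least_one = 1.0 - (1.0 - rate) ** models
    return expected, p_at_least_one


expected, p_any = iceworld_odds(19)
print(f"expected iceworlds: {expected:.1f}, chance of at least one: {p_any:.0%}")
```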

I also delete the '.cpdn' files after each phase change (having snoozed BOINC, just in case there's a clash between the BOINC client writing a file and the operating system trying to delete it). I sometimes make a backup at the phase changes anyway, so it's a good time to do some housekeeping. (Backing up a BOINC folder with 100 GB of '.cpdn' files is not a good idea!)
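
That housekeeping step could be scripted roughly as follows. This is a sketch under assumptions: the thread does not say where the '.cpdn' files are written, so the code simply searches everything under a folder you give it, and the function name is hypothetical. It defaults to a dry run that only reports what it would delete. Snooze or stop BOINC before deleting for real, as described above.

```python
from pathlib import Path


def remove_cpdn_files(boinc_dir, dry_run=True):
    """Find every '.cpdn' file under boinc_dir and optionally delete it.

    Returns (list of files found, total size in bytes). With dry_run=True
    nothing is deleted, so the result can be checked first.
    """
    found = sorted(Path(boinc_dir).rglob("*.cpdn"))
    # Measure sizes before deleting so the total reflects what was freed.
    freed = sum(p.stat().st_size for p in found)
    if not dry_run:
        for p in found:
            p.unlink()
    return found, freed


# Example with a hypothetical path: report first, then delete.
# files, size = remove_cpdn_files(r"C:\BOINCData")
# files, size = remove_cpdn_files(r"C:\BOINCData", dry_run=False)
```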

I've not found any way of starting the recording automatically. However, there are command-line options for BOINC that some users know about - not me, though. Perhaps I should explore that a bit more thoroughly.

What I have found is that:

1. Recording survives a phase change, at which point it will start overwriting any '.cpdn' files with the same name (the filename includes the timestep but not the phase). The '.cpdn' file itself contains a record of the current phase, so I can 'disambiguate' without needing any other information.

2. Recording of models is independent, so BOINC will record as many models as the machine can take. However, I have found that my Q9550 just cannot handle four models recording at the same time: it runs for a few days, then all four models crash with the 'no finished file' error. The models then restart without needing to be restored from backup, but the recording does not restart. That's why I take phase-change backups and record. Belt and braces.

3. Other ways of crashing the models and stopping the recording include 'looking at the disk tab in BOINC Manager', 'looking at the tmp folder while BOINC is running' and 'anything unusual happening on the PC'.

PS I don't know whether the new batch of slabs turn into iceworlds. I guess we'll find out.

adrianxw
Joined: 31 Aug 04
Posts: 145
Credit: 2,080,724
RAC: 753
Message 38209 - Posted: 28 Oct 2009, 15:00:19 UTC
Last modified: 28 Oct 2009, 15:03:52 UTC

Okay, done. Current wu Ctrl-Q'd.

old_user582229
Joined: 12 Aug 09
Posts: 20
Credit: 3,063,648
RAC: 0
Message 38210 - Posted: 28 Oct 2009, 16:08:25 UTC - in response to Message 38208.  

[Iain Inglis wrote:] That's great, David. My iceworld rate is about one in seven, so you should find they start to come in pretty quickly from 19 models. Remember that AMD iceworlds are sneaky: they run fast - how you're going to spot those I don't know. (I could set up a logging script here, actually, and send you a PM heads-up; that's how I do my own.)

Great, thanks. From tomorrow I will be at 20 models on three computers: two i7s and the Phenom.

I could buy another screen and put the four AMD models on it so that when I am working I would see a bluey when it happens. Belt and braces, I know.

[Iain Inglis wrote:] What I have found is that:

1. Recording survives a phase change, at which point it will start overwriting any '.cpdn' files with the same name (the filename includes the timestep but not the phase). The '.cpdn' file itself contains a record of the current phase, so I can 'disambiguate' without needing any other information.

Great, so if they overwrite I don't need to delete after each phase, as I should not exceed 8 * ~30GB (240GB) of data. Is that a correct understanding?

This also brings up another question. Do the .tmp files stay after the model ends?

[Iain Inglis wrote:] 2. Recording of models is independent, so BOINC will record as many models as the machine can take. However, I have found that my Q9550 just cannot handle four models recording at the same time: it runs for a few days, then all four models crash with the 'no finished file' error. The models then restart without needing to be restored from backup, but the recording does not restart. That's why I take phase-change backups and record. Belt and braces.

This could be a concern as I run two i7s, each with eight cores. I guess we will know in a couple of days how many it can record at a time. lol.

[Iain Inglis wrote:] 3. Other ways of crashing the models and stopping the recording include 'looking at the disk tab in BOINC Manager', 'looking at the tmp folder while BOINC is running' and 'anything unusual happening on the PC'.

Currently running version 6.10.16 but will avoid looking at the disk tab while models are running. Thanks for this tip.

[Iain Inglis wrote:] PS I don't know whether the new batch of slabs turn into iceworlds. I guess we'll find out.

