climateprediction.net (CPDN) home page
Thread 'New model versions released'


Message boards : Number crunching : New model versions released

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 27823 - Posted: 11 Apr 2007, 18:17:15 UTC
Last modified: 11 Apr 2007, 18:20:09 UTC

The programmers in Oxford have announced here that they have released new climate model versions for Windows, Linux and Intel Macs. The new versions have a 92% reduction in disk I/O (input-output), which should make them more suitable for use on laptops. Laptop users should, however, still check that their machine does not overheat.

The new models are already being handed out. Please note that a BOINC version > 5.0 is required.

Members are asked to complete their current models before getting a new version.

So enjoy crunching!

You may also like to read Milo's comments here.


Cpdn news
ID: 27823
Jord
Joined: 5 Aug 04
Posts: 250
Credit: 93,274
RAC: 0
Message 27827 - Posted: 11 Apr 2007, 22:52:23 UTC

I see there's some redundancy built in as well? Initial replication is 4?
Does that mean the trickles won't get credit before they are validated against others?
Jord.
ID: 27827
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 27828 - Posted: 11 Apr 2007, 23:35:54 UTC
Last modified: 11 Apr 2007, 23:56:18 UTC

I don't think they'd want to set up a quorum. Two possibilities: either they're using the initial ensemble parameter (several models started off with slightly different random seeds), or they're only issuing one at a time from each WU.

Looking at your work unit, I note that 'initial quorum' is set to 1, and #success is also 1? (I'm not sure of the significance of most of the WU settings, since I've always avoided projects with quorums > 1.)

I also note that the parameters aren't showing for these results, so we can't see whether the ensemble parameter has been set or not.

-- Edit:

I've had a scan through around 30 or so WUs, and I've only found one with more than one issued result. But in this example, there was a download error. I also note that it created a new result rather than using one of the 3 available ones.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6022927

-- Edit:

Another 50 WUs later, and here's a second example. But this time there hasn't been a crashed model.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6022888
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 27828
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 27829 - Posted: 12 Apr 2007, 0:16:11 UTC

If I've understood Carl's previous posts correctly...

On average, only about one in four cpdn models reaches the end. So to be fairly sure of getting a completed result for each set of parameters, four copies of each are made. Usually the extra copies are not for the purpose of validation (though sometimes they do compare the same WU run on different machines) but simply because of the crash-to-success ratio.
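The arithmetic behind "one in four" can be checked with a short sketch. It assumes, purely for illustration, that each copy completes independently with probability 0.25; real crashes need not be independent (for example, if a parameter set tends to crash on every machine). The function names are hypothetical.

```python
def expected_completions(p_complete: float, copies: int) -> float:
    """Average number of copies that reach the end."""
    return p_complete * copies


def prob_at_least_one(p_complete: float, copies: int) -> float:
    """Chance that at least one of `copies` independent runs finishes."""
    # Each copy fails with probability (1 - p); all of them fail with
    # probability (1 - p) ** copies, so at least one succeeds otherwise.
    return 1.0 - (1.0 - p_complete) ** copies


for n in (4, 5):
    # prints "4 1.0 0.684" and then "5 1.25 0.763"
    print(n, expected_completions(0.25, n), round(prob_at_least_one(0.25, n), 3))
```

Under these assumptions, four copies give one finished run on average but only about a 68% chance of at least one; five copies raise that to about 76%.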
Cpdn news
ID: 27829
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 27860 - Posted: 13 Apr 2007, 2:44:12 UTC
Last modified: 13 Apr 2007, 2:44:49 UTC

(I've also posted this in the Linux section.)

For Linux users who have downloaded a new model in the last two days

The new Linux climate models, version 5.40, were discovered within hours to contain an error: a line of code inserted for testing purposes had not been removed. When a Linux 5.40 model contacts the server, it will receive a killer trickle to abort it, and a new version 5.41 model will be downloaded to replace it.

Apologies from Oxford for the error. Fortunately not much crunching time will have been wasted on the flawed models.


Cpdn news
ID: 27860
Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 27866 - Posted: 13 Apr 2007, 9:57:58 UTC

I'd like to add that the 5.40 models should produce valid results; it's just that they will produce them at approximately half the speed of the 5.41 models.
ID: 27866
robert.mouris
Joined: 16 Jul 05
Posts: 6
Credit: 31,694,022
RAC: 9,355
Message 27867 - Posted: 13 Apr 2007, 10:58:43 UTC - in response to Message 27828.  
Last modified: 13 Apr 2007, 11:22:41 UTC

I've had a scan through around 30 or so WUs, and I've only found one with more than one issued result. But in this example, there was a download error. I also note that it created a new result rather than using one of the 3 available ones.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6022927

-- Edit:

Another 50 WUs later, and here's a second example. But this time there hasn't been a crashed model.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6022888

When I read about the initial replication of 4, my first thought was to download 1 or 2 WUs on each PC right away, even though I'm still busy crunching a climate model on each of them (progress between 22 and 86%). As I have no other project, these will all be finished in a few weeks. By then I could see which fellow crunchers had had a crash and run the WUs where other people had the most problems. This sounds a bit selfish, as it can't work if everyone behaves like that, but I thought I could minimize the waste caused by duplicate processing. But now I see that a failed WU immediately triggers a replacement WU, and this continues until the first result for the WU is sent in and the quorum of 1 is met, at which point the unsent replacement WUs are marked as "didn't need".

I'm not sure I like this. Crunching for thousands of hours and then being told that my WU goes in the trash bin, just because someone else was faster than me? Or, on the other hand, I would be the first one and make the other results worthless? Isn't there a way to avoid creating replacement WUs immediately: issue only 1 result for each WU and, once they are all issued, create replacement WUs only for those that crashed? Even if only 1 in 4 results is returned, that is just an average, not a uniform distribution, and I don't want to be in a team of 4 successful crunchers. That would be too much waste.
ID: 27867
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 27868 - Posted: 13 Apr 2007, 12:13:43 UTC
Last modified: 13 Apr 2007, 12:17:08 UTC

Hi Robert

The situation has always been that result ......_0 (the first) was given out and normally only one computer would run it. If it crashed or failed to contact the server for more than about 6 weeks, it would be handed out again to another computer as result ......_1. I don't think crashed models are handed out again immediately; this gives the first owner time to restore their backup, try again and resume trickles.
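The reissue rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not CPDN's or BOINC's actual server code: the `Result` class, the function names and the `crash_grace` period are hypothetical. Only the "about 6 weeks" silence limit and the `_0`, `_1`, ... suffixes come from the post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# "About 6 weeks" without contact, as described in the post.
SILENCE_LIMIT = timedelta(weeks=6)


@dataclass
class Result:
    name: str                # e.g. "..._0"
    crashed: bool            # has this copy reported a crash?
    last_contact: datetime   # last time it reported to the server


def should_reissue(result: Result, now: datetime, crash_grace: timedelta) -> bool:
    """Hypothetical rule: hand the workunit out again only if the current
    copy crashed and a grace period has passed (so its owner can restore a
    backup), or if it has been silent for longer than SILENCE_LIMIT."""
    waited = now - result.last_contact
    if result.crashed:
        return waited > crash_grace
    return waited > SILENCE_LIMIT


def next_result_name(name: str) -> str:
    """Turn "..._0" into "..._1", "..._1" into "..._2", and so on."""
    base, _, index = name.rpartition("_")
    return f"{base}_{int(index) + 1}"


start = datetime(2007, 4, 13)
copy0 = Result("example_wu_0", crashed=True, last_contact=start)
later = start + timedelta(weeks=2)
# The one-week grace period here is an illustrative value, not a CPDN setting.
if should_reissue(copy0, later, crash_grace=timedelta(weeks=1)):
    print("issue", next_result_name(copy0.name))
```

Keeping the crash case separate from the silence case reflects the post's point that a crashed model is not necessarily handed out again at once.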

I am currently crunching a ......_1 model. It crashed on another computer about 6 weeks before it was sent to me. That person has had about 150 models, so I know he's not duplicating my work!

If you look at the details of your workunit on your server web pages, you can always see which computers have had the model in the past or have it now. As far as I know, a few models are run to completion on more than one computer for control purposes. Most workunits are not control models.

So if a model is sent to your computer, it's because the researchers need you to crunch it.

I don't think it's a good idea to store models for future use, because the programmers sometimes make improvements to the parameter values etc. They then send out a new batch. The best thing is then for everybody to get a model from the newest batch (but only when they've finished their previous model).

We don't need to select or reject models for any reason.


Cpdn news
ID: 27868
robert.mouris
Joined: 16 Jul 05
Posts: 6
Credit: 31,694,022
RAC: 9,355
Message 27869 - Posted: 13 Apr 2007, 12:28:20 UTC
Last modified: 13 Apr 2007, 12:31:23 UTC

Thank you for your quick answer, mo.v!

Of course, if a WU is only reissued after an earlier result fails or goes dead, then it's perfectly OK. It has always been that way.

So I understand that this is one of those control runs. It was this one that worried me, especially as the third result that was sent out finished/failed and immediately led to the creation of 6455180.

I will, as always, let my current WUs finish and then download 1 new WU.

Robert
ID: 27869
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 27878 - Posted: 13 Apr 2007, 19:38:29 UTC
Last modified: 13 Apr 2007, 19:40:30 UTC

It looks as if 5 copies of each workunit are being made (_0 _1 _2 _3 _4), not 4 as I said previously.

I can't find that workunit on your server web pages. Maybe it will show up tomorrow. It looks as if _0 and _4 haven't been sent to anyone. _3 was sent to Fionn, but she's had 181 models and needs to post for advice. _2 went to Grace P's Mac, but I'm not sure whether she can successfully run two models simultaneously with only 512 MB of RAM, even on a Mac. _1 went to Okita's Athlon. I don't think that's you.

Anyway, the important thing is just to look after one's current model and back up the contents of the boinc folder regularly, as this is the surest way to complete the 160 years.

If a model fails to download correctly, I think another copy can be sent to another computer immediately.
Cpdn news
ID: 27878
robert.mouris
Joined: 16 Jul 05
Posts: 6
Credit: 31,694,022
RAC: 9,355
Message 27879 - Posted: 13 Apr 2007, 19:44:01 UTC - in response to Message 27878.  

I don't think that's you.

Anyway, the important thing is just to look after one's current model and back up the contents of the boinc folder regularly, as this is the surest way to complete the 160 years.


No, it's not me; I just came across this one while reading the discussion forum. My question was simply whether it would be sensible to download new WUs now, in anticipation. I now understand that I should not download anything at all until each WU running on my computers has finished.
ID: 27879
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 27904 - Posted: 14 Apr 2007, 17:22:39 UTC

Mo, here's an interesting case:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6024069

hadcm3inct_cn65_1920_160_05865070_2

It was issued on three successive days: 11/12/13 April. I received it first BUT have the 'nnn_2' RunID. More than one curious thing about that...

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 27904
Jord
Joined: 5 Aug 04
Posts: 250
Credit: 93,274
RAC: 0
Message 27905 - Posted: 14 Apr 2007, 17:24:58 UTC - in response to Message 27904.  
Last modified: 14 Apr 2007, 17:27:20 UTC

I received it first BUT have the 'nnn_2' RunID. More than one curious thing about that...

Even funnier, the last person got the _0 version.

Perhaps it's initially split 4 ways and the first person gets the 4th one?

I think that's it. Looking at my model, I got it first and have _3 ...
Jord.
ID: 27905
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 27909 - Posted: 15 Apr 2007, 1:14:49 UTC
Last modified: 15 Apr 2007, 1:18:35 UTC

I wonder whether _0 _1 _2 _3 and _4 are now issued in random order? My impression is that in the past, _0 was always issued first. From the scientific point of view, of course, it makes no difference because they're all identical.

If the distribution is now random, I wonder whether this is what the programmers intended, or whether they've forgotten to include a previous command.
Cpdn news
ID: 27909
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 27913 - Posted: 15 Apr 2007, 9:21:12 UTC
Last modified: 15 Apr 2007, 9:25:38 UTC

IIRC, originally only _0 was generated, and then if it crashed, _1 would be created to replace it, etc.? So what we're seeing here is 4 times as many copies of the initial models within each WU.

On the other hand, SAP, for example, created 4 separate WUs for each parameter set (so effectively doing the same thing as we're seeing here, but with more WUs instead of more results within each WU). This would be a more efficient way of doing the same thing.

We're also not seeing the parameters for the models on the results page. And here's a new and unique error I've never seen before (the other issued models in the same set look OK, so my guess is that this is the result of a download error).

From the crashed result in the following work unit:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6022966

Model crashed: umshell1.f: U_MODEL: Illegal combination of submodels A

Model crashed: umshell1.f: U_MODEL: Illegal combination of submodels A

Model crashed: umshell1.f: U_MODEL: Illegal combination of submodels A

Model crashed: umshell1.f: U_MODEL: Illegal combination of submodels A
Not a JPEG file: starts with 0x01 0xda

Model crashed: umshell1.f: U_MODEL: Illegal combination of submodels A

Model crashed: umshell1.f: U_MODEL: Illegal combination of submodels A
Sorry, too many model crashes! :-(


I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 27913
old_user91851
Joined: 8 Aug 05
Posts: 9
Credit: 46,744
RAC: 0
Message 28478 - Posted: 6 May 2007, 20:41:09 UTC

I have been running a CPDN model since August last year using version 5.15 of the application. I have a dual core CPU.

Today BOINC downloaded 5.40 and another model to run on the other core, as it couldn't do any work for the other projects I was attached to. I caught this quickly and aborted the model. But now my computer has version 5.40 installed, and I know it isn't a good idea to upgrade the app mid-model. The old model still seems to be using 5.15, but will it keep using it? Should I delete 5.40?
ID: 28478
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 28481 - Posted: 6 May 2007, 21:18:37 UTC

None of the cpdn models will change to a new version while they're running. Each model stays the same version from start to finish. I'm still running a BBC model that's version 5.08; I got it more than a year ago. The important thing is to try to finish the models you've started. It doesn't matter what version they are, because scientifically they're all the same. You can run an older version on one core and a newer version on the other core, no problem.

You do have to think about the best moment to upgrade your version of boinc. I think it\'s a good idea to make a backup of the complete contents of the boinc folder before a boinc upgrade, just in case anything goes wrong. In fact regular backups are a good idea anyway with such long workunits.
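The regular backup advised above can be scripted. Below is a minimal Python sketch. It assumes BOINC has been shut down first, so that no files change mid-copy. The function name is hypothetical, and the folder locations in the example comment are placeholders, because they differ between Windows, Linux and Mac installations.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_boinc_folder(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy the complete BOINC folder into a new, timestamped folder.

    Assumes BOINC is not running while the copy is made.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    destination = backup_root / f"boinc-backup-{stamp}"
    # copytree creates `destination` (and any missing parents) and raises
    # an error if it already exists, so an earlier backup is never overwritten.
    shutil.copytree(boinc_dir, destination)
    return destination


# Example (hypothetical paths; adjust for your own installation):
# backup_boinc_folder(Path("/path/to/BOINC"), Path("/path/to/backups"))
```

Writing each backup to its own timestamped folder keeps older copies available in case the most recent one was taken after a problem had already started.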

You can avoid getting extra unwanted cpdn models (two is enough!) by going to the Projects tab, highlighting cpdn and clicking No new work. The day you do want a new model, you'll have to click the button again.
Cpdn news
ID: 28481
old_user91851
Joined: 8 Aug 05
Posts: 9
Credit: 46,744
RAC: 0
Message 28482 - Posted: 6 May 2007, 21:25:05 UTC

Thanks for the info. I've set it to no new work now, so it should be OK. I was panicking for a while because I've already had to restore from a backup once after a crash and wasn't sure how I would roll back to 5.15 if I had to. Good to know it will be OK.
ID: 28482
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 28486 - Posted: 6 May 2007, 23:47:40 UTC

The new model versions are better than the older ones because with each new version, certain problems with the way they run (bugs) are put right in Oxford. But the most important thing is to try to finish the models we've started. Anyway, now you can relax!

Mo
Cpdn news
ID: 28486
Billy Ewell 1931
Joined: 14 Aug 06
Posts: 22
Credit: 6,487,725
RAC: 13,143
Message 30631 - Posted: 22 Sep 2007, 3:11:54 UTC - in response to Message 28486.  

The new model versions are better than the older ones because with each new version, certain problems with the way they run (bugs) are put right in Oxford. But the most important thing is to try to finish the models we've started. Anyway, now you can relax!

Mo


Are there in fact work units with a planned CPU time of approximately 500 hours to complete? I have read somewhere on the message boards that the slab(?) units fall within this 500-hour CPU-time range. If this is true, how do I find and specifically download such units? I suspect I may be wrong about this.
Bill
ID: 30631


©2024 cpdn.org