Message boards : Number crunching : New model versions released
Author | Message |
---|---|
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
The programmers in Oxford have announced here that they have released new climate model versions for Windows, Linux and Intel Macs. The new versions have a 92% reduction in disk I/O (input-output), which should make them more suitable for use on laptops. Laptop users should, however, still check that their machine does not overheat. The new models are already being handed out. Please note that a BOINC version > 5.0 is required. Members are asked to complete their current models before getting a new version. So enjoy crunching! You may also like to read Milo's comments here. Cpdn news |
Joined: 5 Aug 04 Posts: 250 Credit: 93,274 RAC: 0 |
I see there's some redundancy built in as well? Initial replication is 4? Does that mean the trickles won't get credit before they are validated against others? Jord. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I don't think they'd want to set up a quorum. Two possibilities: either they're using the initial ensemble parameter (several models started off with slightly different random seeds), or they're only issuing one at a time from each WU. Looking at your work unit, I note that 'initial quorum' is set to 1, and #success is also 1? (I'm not sure of the significance of most of the WU settings, since I've always avoided projects with quorums > 1.) I also note that the parameters aren't showing for these results, so we can't see whether the ensemble parameter has been set or not. -- Edit: I've had a scan through around 30 or so WUs, and I've only found one with more than one issued result. But in this example, there was a download error. I also note that it created a new result rather than using one of the 3 available ones. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6022927 -- Edit: Another 50 WUs later, and here's a second example. But this time there hasn't been a crashed model. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6022888 I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
If I've understood Carl's previous posts correctly... On average, about one in four cpdn models reaches the end. So to be fairly sure of getting a completed result for each set of parameters, four copies of each are made. Usually the extra copies are not for the purpose of validation (though sometimes they do compare the same WU run on different machines) but simply because of the crash-to-success ratio. Cpdn news |
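As a rough illustration of the "one in four" reasoning above, the chance that at least one copy finishes can be worked out directly. This is only a sketch: it assumes each copy succeeds independently with the same probability, and it takes the 1-in-4 figure from the post as given rather than as a measured value. The function name is made up for this example.

```python
def chance_at_least_one_finishes(copies: int, success_rate: float) -> float:
    """Probability that at least one of `copies` independent runs completes.

    Assumes every copy has the same, independent chance of finishing.
    """
    return 1.0 - (1.0 - success_rate) ** copies


if __name__ == "__main__":
    # 0.25 is the "about one in four" figure quoted in the post.
    for n in range(1, 6):
        print(n, "copies:", round(chance_at_least_one_finishes(n, 0.25), 3))
```

Under these assumptions, four copies give roughly a 68% chance of at least one completed run per parameter set, so "fairly sure" is a fair description rather than a guarantee.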
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
(I've also posted this in the Linux section.) For Linux users who have downloaded a new model in the last two days: the new Linux climate models, version 5.40, were discovered within hours to contain an error. A line of code inserted for testing purposes had not been removed. When a Linux 5.40 model contacts the server, it will receive a killer trickle to abort it. A new version 5.41 model will be downloaded to replace it. Apologies from Oxford for the error. Fortunately not much crunching time will have been wasted on the flawed models. Cpdn news |
Joined: 2 Mar 06 Posts: 253 Credit: 363,646 RAC: 0 |
I'd like to add that the 5.40 models should produce valid results; it's just that they will produce them at approximately half the speed of the 5.41 models. |
Joined: 16 Jul 05 Posts: 6 Credit: 31,694,022 RAC: 9,355 |
"I've had a scan through around 30 or so WUs, and I've only found one with more than one issued result. But in this example, there was a download error. I also note that it created a new result rather than using one of the 3 available ones." When I read about the initial replication of 4, my first thought was to download 1 or 2 WUs on each PC right away, even though I'm still busy crunching a climate model on each of them (progress between 22 and 86%). As I have no other project, these will all be finished in a few weeks. By then I could see which fellow crunchers had had a crash and run the WUs where other people had the most problems. This sounds a bit selfish, as it can't work if everyone behaves like that, but I thought that I could minimize the waste from multiple processing. But now I see that a failed WU immediately triggers a replacement WU, and this will go on until the first result for the WU is sent in and the quorum = 1 is met, thus marking the unsent replacement WUs as "didn't need". I'm not sure if I like this. Crunching for thousands of hours, and then being told that my WU goes in the trash bin, just because someone else was faster than me? Or, on the other hand, I would be the first one and make the other results worthless? Isn't there a way of not creating replacement WUs immediately, issuing only 1 result for each WU, and, when they are all issued, creating replacement WUs only for those which had crashed? Even if only 1 in 4 results is returned, this is just an average, not uniformly distributed, and I don't want to be in a team of 4 successful crunchers. This is too much waste. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Hi Robert. The situation has always been that result ......_0 (the first) was given out and normally only one computer would run it. If it crashed or failed to contact the server for more than about 6 weeks, it would be handed out again to another computer as result ......_1. I don't think crashed models are handed out again immediately; this gives the first owner time to restore their backup, try again and resume trickles. I am currently crunching a ......_1 model. It crashed on another computer about 6 weeks before it was sent to me. That person has had about 150 models, so I know he's not duplicating my work! If you look at the details of your workunit on your server web pages, you can always see which computers have had the model, in the past or now. As far as I know, a few models are run to completion on more than one computer for control purposes. Most workunits are not control models. So if a model is sent to your computer, it's because the researchers need you to crunch it. I don't think it's a good idea to store models for future use because the programmers sometimes make improvements to the parameter values etc. They then send out a new batch. The best thing is then for everybody to get a model from the newest batch (but only when they've finished their previous model). We don't need to select or reject models for any reason. Cpdn news |
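The reissue behaviour described in the post above can be sketched as a small model. This is not the project's server code: the class names, the `needs_reissue` check and the exact six-week limit are assumptions made for illustration, based only on the description that a result is handed out again after it has been out of contact for "about 6 weeks" (a crashed model simply stops reporting, so it is covered by the same silence check).

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List

# "About 6 weeks" comes from the post; the exact value is an assumption.
SILENCE_LIMIT = timedelta(weeks=6)


@dataclass
class Result:
    name: str
    last_contact: date
    finished: bool = False


@dataclass
class WorkUnit:
    name: str
    results: List[Result] = field(default_factory=list)

    def issue(self, today: date) -> Result:
        """Hand out the next copy: <name>_0 first, then _1, _2, ..."""
        result = Result(f"{self.name}_{len(self.results)}", last_contact=today)
        self.results.append(result)
        return result

    def needs_reissue(self, today: date) -> bool:
        """True if no copy has finished and the latest one has gone silent."""
        if not self.results:
            return True
        if any(r.finished for r in self.results):
            return False
        return today - self.results[-1].last_contact > SILENCE_LIMIT


if __name__ == "__main__":
    wu = WorkUnit("example_wu")              # hypothetical work unit name
    wu.issue(date(2007, 1, 1))
    print(wu.needs_reissue(date(2007, 1, 20)))  # still within the silence limit
    print(wu.needs_reissue(date(2007, 3, 1)))   # silent for more than six weeks
```

In this sketch a second copy only exists once the first has been silent for longer than the limit, which is why the owner of a `_1` result would not normally be duplicating live work.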
Joined: 16 Jul 05 Posts: 6 Credit: 31,694,022 RAC: 9,355 |
Thank you for your quick answer, mo.v! Of course, if a WU is only reissued after a former result fails or is dead, then it's perfectly OK. It has always been that way. So I understand that this is one of those control runs. It was this one that caused me trouble, especially as the third result that was sent out finished/failed, and immediately led to the creation of 6455180. I will, as always, let my current WUs finish and then download 1 new WU. Robert |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
It looks as if 5 copies of each workunit are being made (_0 _1 _2 _3 _4), not 4 as I previously said. I can't find that workunit on your server web pages. Maybe it will show up tomorrow. It looks as if _0 and _4 haven't been sent to anyone. _3 was sent to Fionn, but she's had 181 models and needs to post for advice. _2 went to Grace P's Mac, but I'm not sure whether she can successfully run two models simultaneously with only 512 MB RAM, even on a Mac. _1 went to Okita's Athlon. I don't think that's you. Anyway, the important thing is just to look after one's current model and back up the contents of the boinc folder regularly, as this is the surest way to complete the 160 years. If a model fails to download correctly, I think another copy can be sent to another computer immediately. Cpdn news |
Joined: 16 Jul 05 Posts: 6 Credit: 31,694,022 RAC: 9,355 |
"I don't think that's you." No, it's not me, I just came across this one while reading the discussion forum. My question was just whether it would be sensible to download new WUs now, in anticipation. I have understood that I should not download anything at all before having finished each WU that is running on my computers. |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Mo, here's an interesting case: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6024069 hadcm3inct_cn65_1920_160_05865070_2 It was issued on three successive days: 11/12/13 April. I received it first BUT have the 'nnn_2' RunID. More than one curious thing about that... "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Joined: 5 Aug 04 Posts: 250 Credit: 93,274 RAC: 0 |
"I received it first BUT have the 'nnn_2' RunID. More than one curious thing about that..." Even funnier, the last person got the _0 version. Perhaps it's initially split 4 times and the first person gets the 4th one? I think that's it. Looking at my model, I got it first and have _3 ... Jord. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I wonder whether _0 _1 _2 _3 and _4 are now issued in random order? My impression is that in the past, _0 was always issued first. From the scientific point of view of course it makes no difference because they're all identical. If the distribution is now random, I wonder whether this is what the programmers intended, or whether they've forgotten to include a previous command. Cpdn news |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
IIRC, originally only _0 was generated, and then if it crashed, _1 would be created to replace it, etc? So what we're seeing here is 4 times as many versions of the initial models within each WU. On the other hand, SAP for example created 4 separate WUs for each parameter set (so effectively doing the same thing as we're seeing here, but with more WUs instead of more results within the WU). This would be a more efficient way of doing the same thing. We're also not seeing the parameters for the models on the results page. And here's a new and unique error I've never seen before (the other issued models look OK in the same set, so my guess is that this is the result of a download error). From the crashed result in the following work unit: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6022966
I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 8 Aug 05 Posts: 9 Credit: 46,744 RAC: 0 |
I have been running a CPDN model since August last year using version 5.15 of the application. I have a dual core CPU. Today BOINC downloaded 5.40 and another model to run on the other core, as it couldn't do any work on the other projects I was attached to. I caught this quickly and aborted the model. But now my computer has version 5.40 on it, and I know it isn't a good idea to upgrade the app whilst working mid-model. The old model seems to still be using 5.15, but will it keep on using it? Should I delete 5.40? |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
None of the cpdn models will change to a new version while they're running. Each model stays the same version from start to finish. I'm still running a BBC model that's version 5.08 - I got it more than a year ago. The important thing is to try to finish the models you've started. It doesn't matter what version they are, because scientifically they're all the same. You can run an older version on one core and a newer version on the other core, no problem. You do have to think about the best moment to upgrade your version of boinc. I think it's a good idea to make a backup of the complete contents of the boinc folder before a boinc upgrade, just in case anything goes wrong. In fact regular backups are a good idea anyway with such long workunits. You can avoid getting extra unwanted cpdn models (two is enough!) by going to the Projects tab, highlighting cpdn and clicking No new work. The day you do want a new model, you'll have to click the button again. Cpdn news |
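For anyone who wants to script the backup suggested above, here is a minimal sketch. It is an illustration only: the function name and folder locations are hypothetical, and it assumes BOINC has been shut down first so that the model files are not changing while they are copied.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_boinc_folder(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy the whole BOINC folder into a new timestamped folder.

    Assumes BOINC is not running, and that `backup_root` is NOT inside
    `boinc_dir` (otherwise the copy would try to include itself).
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    shutil.copytree(boinc_dir, target)  # fails if the target already exists
    return target
```

A call would look something like `backup_boinc_folder(Path("C:/Program Files/BOINC"), Path("D:/backups"))`, with both paths replaced by wherever your own boinc folder and backup drive actually are. Each run creates a separate dated copy, so an older backup is never overwritten.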
Joined: 8 Aug 05 Posts: 9 Credit: 46,744 RAC: 0 |
Thanks for the info. I've set it to no new work now so should be OK. Was panicking for a while because I have already had to restore from a backup once after a crash and wasn't sure how I would roll back to 5.15 if I had to. Good to know it will be OK. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
The new model versions are better than the older ones because with each new version, certain problems with the way they run (bugs) are put right in Oxford. But the most important thing is to try to finish the models we've started. Anyway, now you can relax! Mo Cpdn news |
Joined: 14 Aug 06 Posts: 22 Credit: 6,487,725 RAC: 13,143 |
"The new model versions are better than the older ones because with each new version, certain problems with the way they run (bugs) are put right in Oxford. But the most important thing is to try to finish the models we've started. Anyway, now you can relax!" Are there in fact work units with a planned approximate 500 hours of CPU time to complete? I have read on message boards somewhere that the slab(?) units fall within the 500-hour CPU processing criterion. If this is true, how do I locate and specifically download such units? I anticipate I am wrong in this regard. Bill |
©2024 cpdn.org