Message boards : Number crunching : sulphur seems slower than slab
Author | Message |
---|---|
Send message Joined: 27 Jun 05 Posts: 74 Credit: 199,198 RAC: 0 |
River
Thanks.
And I guess I am saying the project's requirements have just increased by a factor of three without warning, and without thanks to those who were welcome a month ago. When we say 'my box can't cope with this' it feels churlish when I am told to like it or leave. It does seem to me that a three-fold increase in the minimum acceptable commitment to the project should be accompanied by, at least, a comment from the project recognising that some users will inevitably have to leave as a result. The project's requirements are what they are - but they are not what they were a month ago.
Good point. And I guess I am saying that a separate experiment is bound to have different requirements, and therefore it is unfair to simply port everyone across from the old experiment assuming they will accept the new requirements. Some kind of user opt-in, or some kind of automated selection of the faster boxes, was needed. Moore's law seems to apply to the science too - every three years we can expect the computing needs of the scientists to quadruple, and looked at in that light the increase from slab to sulphur is just what we'd expect. I hope CPDN get a lot more grants in future; maybe they will need even faster boxes. If so, I hope the project team will put some thought into the responses from this time round, and figure out a more tactful way to break the news that the slower boxes are no longer considered helpful. What is exciting and new to those who can cope is the end of participation for those who can't. And "parting is such sweet sorrow", as the bard put it...
I will, thanks. Something may turn up. But finding where it is being talked about is the problem. Which is why my ideal solution is to have a preference setting saying I will accept work up to XXXX hours of predicted run time based on my benchmarks. Then I can stay connected forever, accept that I may not get any work, but then one day I'll look at the GUI and say "hey, I'm helping CPDN save the planet again!". But yes, I will keep checking back every few months or so. River~~ |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
How about joining us on the community forum, <a href="http://www.climateprediction.net/board/index.php">here?</a> Even if you stop crunching, you can still read and discuss. To post, you have to sign up for the board. Sorry. And there was some discussion about the increase in model sizes ages ago, with a recruiting thread for spinup which has ended up as a 10 page discussion. So far. Currently getting a lot of attention are the massive explosions in Hemel Hempstead. One of our crunchers is 2.5 miles due west. This is in "Movies, fun, etc." (which is for misc items). It's already up to 2 pages. edit: And don't forget to read my post <a href="http://www.climateprediction.net/board/viewtopic.php?t=3376">here,</a> about why really powerful computers may not be a good thing. |
Send message Joined: 23 Feb 05 Posts: 55 Credit: 240,119 RAC: 0 |
But yes, I will keep checking back every few months or so. ~~
Rome is not built in a day... It takes time to convince people of needed changes, especially if they had one of those sleepless nights (guess I'm right, Les). With the RAC you have at the moment it would be a loss for CPDN to see you go, and I know my machines will go the same way at the next change. (Coupled spinup alphas require 3.2G.) The short-term solution I see is to get your machines to max specs, provided you have the resources and authorisation to do that. For the long term, Carl has to come up with a parallel cpu/machine version for 32 bit. If not, then I can predict that CPDN is going to lose quite a lot of participants and crunchers. Recently a lot of crunchers have bought new, top of the range machines, but at the pace it is going they will soon have to buy mini-supercomputers. The general computer user does not need a state of the art machine for the work they have to do. Don't expect this user to buy such a machine, or machines, just for CPDN. Besides, if they did, it would only move the cost of hardware from science to the public, with the downside that it is probably less cost and energy effective to crunch on all those public machines. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
...I know my machines will go the same way at the next change. (Coupled spinup alphas require 3.2G.)
Yes, but we're making assumptions here. The current spinup supposedly requires a 3.2 GHz P4 or equivalent, but that is only because it is 200 years and they need the oceans pretty fast. Experiment 2 uses the coupled model, but only needs to run 50 years. The sulphur model right now is 75 years. So experiment 2 will likely take about the same time as a sulphur model (fewer years, but fewer optimizations, and additional, albeit relatively fast, ocean steps).
For the long term, Carl has to come up with a parallel cpu/machine version for 32 bit. If not, then I can predict that CPDN is going to lose quite a lot of participants and crunchers. Recently a lot of crunchers have bought new, top of the range machines, but at the pace it is going they will soon have to buy mini-supercomputers.
One would think a parallel version of the model would work okay, and a 64 bit version too, so there are potential speedups there. But the nice optimizations that Carl/Tolu compiled into the latest version of sulphur won't be there, since they can't get the ocean stable with those optimizations. |
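A back-of-envelope way to see why the two could come out similar in wall-clock time (the slowdown factor below is purely an assumption for illustration, not a measured figure):

    # Rough comparison of experiment 2 (coupled, 50 model years) with the current
    # sulphur runs (75 model years). The 1.5x factor is an assumed penalty for
    # losing the SSE2 optimizations plus the extra ocean steps - illustrative only.
    sulphur_years = 75
    coupled_years = 50
    assumed_slowdown = 1.5

    print(coupled_years * assumed_slowdown, "sulphur-equivalent years of work, vs", sulphur_years)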
Send message Joined: 23 Feb 05 Posts: 55 Credit: 240,119 RAC: 0 |
Geophi, thanks - indeed you remind me that 3.2G was due to time constraints.
I have been sketching a long term solution. I know Carl/Tolu have encountered problems, but have they given up on it (already)? Possibly it would work well on 32 bits, or by running the ocean on a networked machine, sort of asynchronously but controlled by the main machine. Creative minds create solutions. Frustrated minds throw up restrictions. |
Send message Joined: 7 Aug 04 Posts: 2187 Credit: 64,822,615 RAC: 5,275 |
I have been sketching a long term solution. I know Carl/Tolu have encountered problems, but have they given up on it (already)?
Already? Although the compiler is complex, there are only so many options. And the result (instability with SSE2 optimizations turned on) was not unexpected. Others before Carl had seen the same thing trying to run a coupled model on a PC. It's one of the reasons why the original model was 64-bit. From a post on the spinup forum...
Basically the past 4-6 weeks I have uncovered a lot of serious issues pertaining to running a coupled model as a 32-bit job a la our old slab model. The main problem is that the coupling between the atmosphere & ocean is very sensitive, and the atmosphere works fine in 32-bit but the ocean wants 64-bit; so when you couple them all hell can break loose! Most annoyingly, things seemed to drift and go unstable but you wouldn't see it until timestep 20,000 or so! But we seem to have worked these issues out OK; the problem is that the higher optimizations, such as the "-ax" options on the Intel Fortran compiler (SSE2 extensions, Pentium IV stuff, etc.), will truncate things enough on the 32-bit floats that eventually it wrecks things. So we are not using the "-ax" optimizations now, although we are able to use the version 9 compiler, which seems pretty fast anyway.
Possibly it would work well on 32 bits, or by running the ocean on a networked machine, sort of asynchronously but controlled by the main machine.
Well, you'd have to sketch that out for me. The way the model runs is one day of atmosphere, then the same day of ocean, next day of atmosphere, next day of ocean, etc. It's done that way for speed purposes.
Creative minds create solutions. Frustrated minds throw up restrictions.
True, but then again, the creative person coming up with the solution should probably have an in-depth knowledge of compilers and of the intricacies of coupled climate models. |
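To see the flavour of the precision problem described above, here is a toy sketch that has nothing to do with the model's real equations: a chaotic recurrence run in 32-bit and 64-bit floats agrees at first, then the tiny truncation differences grow until the two trajectories are unrelated - the same kind of drift that only shows up many thousands of timesteps into a run.

    # Toy illustration only: iterate the (chaotic) logistic map in float32 and
    # float64 and watch the results agree early on, then diverge completely.
    import numpy as np

    def iterate(steps, dtype):
        x = dtype(0.5)
        r = dtype(3.9)
        for _ in range(steps):
            x = r * x * (dtype(1.0) - x)
        return float(x)

    for steps in (10, 100, 1000):
        print(steps, iterate(steps, np.float32), iterate(steps, np.float64))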
Send message Joined: 27 Jun 05 Posts: 74 Credit: 199,198 RAC: 0 |
The short-term solution I see is to get your machines to max specs, provided you have the resources and authorisation to do that.
Neither. Most of the sub-GHz boxes are mine but can't be stretched to make sulphur viable, in my opinion. A couple of the sub-GHz boxes belong to friends, and I can't presume to occupy a friend's computer 24/7 for half a year ahead. The 2.8 GHz boxes belong to a local charity, and CPDN is running with the permission of one individual who has the discretion to allow it, but the organization would not contemplate running those boxes outside the working day. And I won't ask, for fear of rocking the boat and losing what we already have from them.
I'm with Les on this one - if the new science needs 64 bit or needs super-minis then that is what the new science needs. You can only break a given computing task down into so many chunks before it goes unstable, and science that borders on the unstable is no good to anyone. To take an extreme example, no amount of parallelization could get slab running on an infinite network of ZX Spectrums. You might get the code to run but the results would be pure noise.
So if sulphur and the oceans need the faster boxes, my suggestion is that the scientists also find some different science that can usefully be done on the slower boxes alongside the sulphur and ocean runs. One option might be to run yet more slabs in the short term, and in the longer term to find other interesting questions that can be split into slab-sized chunks. Running two sets of science at once, one for fast boxes and one for slower boxes, will make the best use of the resources the project already has. It entails (sorry to keep coming back to this) some kind of selection process between boxes, whether it is user driven by prefs or automated via benchmarks and %on stats.
If not, then I can predict that CPDN is going to lose quite a lot of participants and crunchers.
Agreed - and your comment suggests a lever for funding applications - "we have a huge computing resource that is zero cost to public funds, and which we will lose if it is not re-deployed at the end of the slab runs" could attract funding for a small, separate research grant - all the scientists need to do is identify some important question that could be answered in such a proposal. And the beauty of BOINC is that even if those boxes are lost to CPDN, very few will be lost to science. Predictor, for example, can do useful work on boxes too slow for the slab model (though it does need 128 MB of RAM). And in a year or so Orbit will be online, and current estimates are that Orbit WUs will be very short and so very suitable for slower boxes. Orbit might turn out to be totally unnecessary, but if it finds something it would have more [pun] impact on the planet [/pun] than CPDN. If Predictor finds a cure for mad cow disease or cancer, those potential benefits are not trivial either. Einstein & LHC benefit human curiosity, and you can see from my stats that I personally rate that as important too. And there are now other projects on the BOINC front page that will attract other CPDN retirees. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The fast machines are needed for the spinup simply because of their long, 200 year run, which was estimated to take 4 months on a P4 3.2G. When it started there was plenty of time. But the models weren't stable. By the time the deadline was a bit over 4 months away, Carl thought that he had cracked it, and started asking for volunteers willing to run 24/7 on fast computers. And still the models were unstable. By the time he had it worked out, there was only a bit over 2 months to do the 4 month run. The headache for the researchers will be deciding how much of the run will suffice to get started, although I suspect that they may use the data from the simpler data sets, which have very few of their parameters adjusted. Then, when the more complex runs start finishing, they can generate more data sets using those more complex runs. Just my idle musing on the problem. BUT! I also think that the coupled ocean runs, while more complex, will not take much more time to complete than a sulphur model, and that the time to completion will be long, perhaps the usual year. Which means that if you can dedicate a computer to CPDN for a few months with nothing else running, they won't be a problem. It's just the testers who need very fast, dedicated machines. We'll see come February. |
Send message Joined: 26 Aug 04 Posts: 19 Credit: 446,376 RAC: 0 |
Given the extra length required for these sulphur models and the termination of slab model generation, wouldn't it make sense to update the technical requirements page? I don't believe that an 800 MHz machine will be able to finish one before the deadline (based on the fact that I have a 1 GHz box that is estimating it will exceed the deadline by a few days or so). Or perhaps an update to provide model-specific requirements? |
Send message Joined: 23 Feb 05 Posts: 55 Credit: 240,119 RAC: 0 |
Carl has never said that the Slabs were terminated, but he has stopped Slabs because the science team wants more Sulphur results coming their way. I have asked for the status on slabs, but I'm afraid Carl missed that question. As 'Time to complete' is a bit dodgy nowadays, one has to run the model for a while and calculate the end time from the progress made. As an indication: on a 1 GHz PIII I expect a sulphur to complete within 200 days (this is 24/7 and on wintel32). (Don't like to say it, but it might take longer on the same box under Linux.) If Slabs are not needed any more and the science can't find shorter work a la slab, then the 'requirements page' should indeed be altered:
- 800 MHz to 1 GHz would be the minimum for 24/7 crunchers.
- Faster machines would be required for machines not running 24/7.
Personally, I would also add the requirement/advice to make regular backups, to run with 'Network activity suspended', and a hint to be careful with other cpu-intensive applications. |
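For anyone wanting to do the 'run it for a while and work out the end time from progress' estimate mentioned above, a minimal sketch; the elapsed hours and progress fraction below are made-up example numbers:

    # Extrapolate total and remaining run time from the progress made so far.
    from datetime import timedelta

    def remaining_hours(elapsed_hours, fraction_done):
        total_hours = elapsed_hours / fraction_done
        return total_hours - elapsed_hours

    # Example: 300 CPU-hours used with the model showing 7.5% complete.
    print(timedelta(hours=remaining_hours(300.0, 0.075)))   # about 154 days of CPU time still to go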
Send message Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0 |
I have the situation where a sulfur model has put one of my computers into EDF:
2005-11-27 06:55:21 [climateprediction.net] Restarting result 47wx_200297153_0 using sulphur_cycle version 4.19
2005-11-27 06:55:21 [climateprediction.net] Restarting result 1txs_000106329_1 using hadsm3 version 4.13
2005-11-27 06:55:22 [---] Suspending work fetch because computer is overcommitted.
2005-11-27 06:55:22 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
I have something like -1,800,000 seconds of long-term debt built up because of this. Not sure if I am going to let that stand or not yet ... But, ya gotta be careful ... |
Send message Joined: 27 Jun 05 Posts: 74 Credit: 199,198 RAC: 0 |
I have the situation where a sulfur model has put one of my computers into EDF:
There was a suggestion to cancel sulphur WUs using an app version before 4.22 -- but I don't know if this was endorsed by the project team or just thrown in by a participant, and I can't find the post I was thinking of -- would Carl or Tolu comment please? |
Send message Joined: 27 Jun 05 Posts: 74 Credit: 199,198 RAC: 0 |
Agree absolutely. If 800 is a sensible limit for slab then 2 GHz would be about right for sulphur.
I don't believe that an 800 MHz machine will be able to finish one before the deadline (based on the fact that I have a 1 GHz box that is estimating it will exceed the deadline by a few days or so). Or perhaps an update to provide model-specific requirements?
Depends if you want to be safe in all cases, or to avoid excluding machines that can do the work. Boxes at the same nominal speed vary considerably in their throughput. My 700 MHz PIII does over twice the throughput of my 700 MHz Celeron, for example, and AMD always outdo Intel at the same clock speed. The 800 limit seems to have been chosen so that all 800 MHz boxes were very comfortable with slab, and the equivalent limit for sulphur might be around 2 GHz. I would suggest the following for sulphur:
a) advise a 2 GHz limit
b) automatically enforce a 1 GHz limit, so that boxes slower than that are not given sulphur WUs (this project has never enforced its advice in the past, but BOINC does offer the facility to do so) |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
All of the FAQ / tech info / front end of the project were written in pre-BOINC days. This info is still applicable to slab, as it is used by the OU course, and will be for years to come. And Win 98SE still works there. As for the 800 MHz limit, people still tried running on machines as slow as 231 MHz. Another problem is that some people don't allocate much time to CPDN. One person took well over a year on a fairly fast computer. Another took a similar time on a very slow machine, but running 24/7. Useful results can be produced by 'slow' computers, so I don't see a need to complicate things more than they are. It's more necessary to know WHY the limits are there. But it's a good subject for discussion. :) |
Send message Joined: 27 Jun 05 Posts: 74 Credit: 199,198 RAC: 0 |
All of the FAQ / tech info / front end of the project were written in pre-BOINC days.
**Exactly** And when increasing the minimum level of commitment, the project *needs*, in my opinion, to update the words that describe that minimum commitment.
But on the BOINC incarnation of the project, slab is currently not being offered, so it makes no sense to still publish advice based on slab. By all means keep the classic advice for the classic platform - I agree that would be appropriate - but surely it is time for some separate words of advice for BOINC participants, now that the minimum practical hardware is so different between classic-slab and BOINC-sulphur?
I've successfully slabbed on 500 MHz boxes, and on classic ran a model on a 200 MHz Celeron, but that one ended early due to instability and I always wondered if it would have done so on a box within the spec...
An issue rather than a problem, I'd say. When updating the words, you address this issue by saying that the minimum spec machine is X GHz if on 24/7, and proportionally faster for machines that are only on part time (a short sketch of that scaling follows this post).
The complication comes when people who have been running borderline machines successfully on slab, or even sub-spec machines, then meet a sulphur. Many users, myself included, take a great deal of effort to read up on a project just before and just after joining, then leave it running on autopilot once it seems happy. As the slab models come in, you will continue to get people panicking when they see 6000 hour completion times, 22 million second completion times on the command line, etc. People will, as I did, detach such machines as a reflex action. Therefore an automated screen to prevent sulphur going to boxes below a certain threshold will save the project long waits for boxes that detach. Updated info will save users some anxiety, even if they don't see it until after they have the first sulphur on board and wonder what is happening. In short, if sulphur is a new experiment, then new experimental guidelines are in order.
Agreed, and if the explanation relates to an experiment (slab) that is no longer offered on the BOINC platform, then users/donors will "know" the wrong limits and will "know" only the outdated reasons.
I've taken you at your word here... ;-) |
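A short sketch of the 'proportionally faster' rule suggested in the post above; the 2 GHz figure is the advisory number floated in this thread, not an official project requirement:

    # If a box runs CPDN only part of the day, it needs a proportionally faster
    # clock to finish in the same wall-clock time as a 24/7 machine.
    def required_ghz(baseline_ghz_for_24_7, hours_on_per_day):
        return baseline_ghz_for_24_7 * 24.0 / hours_on_per_day

    for hours in (24, 12, 8):
        print(f"{hours} h/day needs roughly {required_ghz(2.0, hours):.1f} GHz")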
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
While in Sulphur Beta test, we anticipated the necessity of expanded documentation for processing Sulphur Work Units. I started such a page and others added to it and made it better. Crandles honed it and put it in the Wiki. (We don't have access to the CPDN FAQ as far as I know.) Admittedly, it isn't "up front" and easy to find. It should be. However, I don't fault Carl or Tolu. They've been "drinking from a firehose" since this project began. And if I may be forgiven another cliche, while they're up to their butts in alligators, it's hard for them to work on draining the (documentation) swamp. Edit: More on sulphur Work Unit requirements: http://boinc-doc.net/boinc-wiki/index.php?title=Sulphur_Cycle "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 30 Aug 04 Posts: 77 Credit: 1,785,934 RAC: 0 |
While my machines should have the power to complete the Sulphur Models, the "Computer is overcommitted" factor has hit me hard when moving to BOINC 5.2.13. From what I read, can I actually expect that all affected machines will restrict themselves to CPDN-only over the course of months?? That would be a pity; I'd like the other projects to get their fair share :( (Also, what will happen if the other projects' debt is 6 months' worth of CPU time - would CPDN not run for the next 6 months while the others work down their debt? Worst case, I would expect the scheduling/debt mechanism not to stabilize, but actually to destabilize, with ever-increasing debt cycles between CPDN and other projects *ugh*.) PS. Any effective and safe tips to overcome the "overcommitted" state BOINC thinks it is in? -- edit -- Just got a very nice hint from a team member: since I didn't run BOINC for several months (finishing SETI Classic), the efficiency values for each computer ("% of time BOINC client is running") had naturally dropped close to 0%. I completely forgot about that important scheduling factor. With a bit of luck, the overcommitted factor will vanish as soon as my systems are back to >95% values :) Scientific Network : 44800 MHz - 77824 MB - 1970 GB |
Send message Joined: 31 Aug 04 Posts: 239 Credit: 2,933,299 RAC: 0 |
If this starts to happen there are a couple of things you can try. One is to suspend various projects, forcing the system to spend some time on the other projects. The other is to edit the long-term debt numbers. I have the potential for a similar problem on one machine: it has an LTD of 1,084,000 seconds built up ... this machine has had both a Slab and a Sulfur model running on it. The good news is that both models are 80% done, with one having only 5 more days to go (another month?). |
Send message Joined: 5 Aug 04 Posts: 172 Credit: 4,023,611 RAC: 0 |
While my machines should have the power to complete the Sulphur Models, the "Computer is overcommitted" factor has hit me hard when moving to BOINC 5.2.13
You can edit the values for active_frac and on_frac in the client_state.xml file to both be 1.0, if this is closer to reality now (they will fall back a bit). Make certain that you shut BOINC down before opening the file, and that you start BOINC again after you save the edit. Yes, if CPDN takes a year running in EDF, then it would be expected not to have any work requested for (1 year / CPDN resource_frac) - 1 year. Ex: a CPDN resource fraction of 0.5 on a host that takes a year running 24/7 would run CPDN for a year and other things for a year (unless the other project had no work available and the computer ran dry at some point - then a CPDN result would be downloaded, thus delaying the balance). Repeat. BOINC WIKI |
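Written out as code, the resource-share arithmetic from that post looks like this (an idealised sketch that ignores new downloads, dry spells and benchmark changes):

    # After CPDN has monopolised the CPU in earliest-deadline-first mode, the
    # other projects get it to themselves until the debts even out again.
    def other_projects_catchup_years(cpdn_run_years, cpdn_resource_frac):
        return cpdn_run_years / cpdn_resource_frac - cpdn_run_years

    print(other_projects_catchup_years(1.0, 0.5))    # 1.0 year for the rest, as in the example
    print(other_projects_catchup_years(1.0, 0.25))   # 3.0 years of catching up if CPDN's share is 25%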