Message boards : Number crunching : WORTH THE TROUBLE????
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Another slant on "worth the trouble": it was noted somewhere on this board recently that tasks continue to crunch for a while after the last zip file is uploaded. Are there any worthwhile trickles after this point? Or, as there are over 4 hours of computing time left after the last zip file is uploaded on hadcm3n tasks, should I just abort and start another task? Would this mess up the system so that the "unfinished" task would get sent out to another computer? Four hours is a wild overestimate, as shortly after posting I closed the browser down and the task in question had in half an hour gone from almost 5 hours left to "ready to report." For almost 5 hours I would have thought about aborting, but not for half an hour. |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
Another slant on "worth the trouble": it was noted somewhere on this board recently that tasks continue to crunch for a while after the last zip file is uploaded. Are there any worthwhile trickles after this point? Or, as there are over 4 hours of computing time left after the last zip file is uploaded on hadcm3n tasks, should I just abort and start another task? Would this mess up the system so that the "unfinished" task would get sent out to another computer? There have been a few cases of models getting stuck in that final phase and that would be an appropriate situation in which to abort. Otherwise the model should be left to run: as you say, an aborted model will be marked as a failure and, I think, sent out again to be needlessly re-computed. I don't know what the 'overshoot' is for. Perhaps it's just to give the Zips a chance to be generated before the model ends ... |
Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944 |
Thanks Ian, always good to have the information, even though I have since discovered that the overshoot was only about half an hour's computing time at most, so I wouldn't have bothered aborting even if it didn't result in the task being sent out again. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
The last zip file is generated when the model is part way through 1 Dec of its last year. It then crunches on until the end of 6 Dec. That's the next checkpoint because these models checkpoint every 6 days. At 00:00 hrs on 7 Dec the model finishes. There may well be a reason for the model soldiering on until that last checkpoint. I think all the model types have always finished at a checkpoint. My computer doesn't take nearly as long as half an hour to crunch those last few days. Another bizarre anomaly with these models is that they complete successfully without ever reaching the full number of timesteps. I haven't spent time calculating where the number of TSs listed in the graphics window would get the model to if they were all crunched, but I expect it would be to the end of 30 Dec. This discrepancy doesn't matter in the slightest. Cpdn news |
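The timing described in the post above can be sketched as a small calculation. This is only an illustration, assuming the 6-day checkpoint grid is counted from the start of 1 Dec (the post implies this, since the end of 6 Dec is a checkpoint); the function name is hypothetical and not taken from the model code.

```python
import math

# From the post: these models checkpoint every 6 model days.
CHECKPOINT_INTERVAL_DAYS = 6


def next_checkpoint_day(day_of_december: int) -> int:
    """Return the December day at whose end the next checkpoint falls.

    Assumption: checkpoints are aligned so that the end of 6 Dec,
    12 Dec, 18 Dec, ... are checkpoint boundaries.
    """
    if day_of_december < 1:
        raise ValueError("day_of_december must be 1 or greater")
    intervals = math.ceil(day_of_december / CHECKPOINT_INTERVAL_DAYS)
    return intervals * CHECKPOINT_INTERVAL_DAYS


# The last zip is generated part way through 1 Dec, so the model
# carries on to the end of 6 Dec (00:00 on 7 Dec) before finishing.
final_day = next_checkpoint_day(1)
print(f"Model finishes at the end of {final_day} Dec")
```

Under that assumption the overshoot after the last zip is at most a few model days, which fits the reports in this thread of it taking well under an hour of computing time.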
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
I finally have an answer to my question, is it worth the trouble. IT'S NO! The instability of the hadcm3n's has struck again. Overnight I lost 2 more WU's, one on each of 2 different machines. I don't think they are worth the time and electricity. Running them and wondering when they are going to fail (I no longer expect them to finish) is just not worth it. Hadcm3n _u5d6_2070_40_008336030_0 died at 99.74%. Graphics showed that the crash happened at 00:00 on 7/12. I guess this is the point it stopped to generate the zip file and found itself wanting. The second failure is something of a mystery. It failed after only 21 hours. This would have put it at approx. 3.5%, nowhere near a decadal point. I plan to change my settings so as to exclude CM models and run other projects while waiting to pick up regional models whenever these are available. A word to the Scientists. If you are causing this by testing WU's with extreme parameter sets that are expected to fail because they don't yield viable climates, then you need to mix in more normal, stable ones. Otherwise after a while you are not going to have anyone running them. If I'm wrong about this then I apologize. |
Send message Joined: 7 Aug 04 Posts: 50 Credit: 548,730 RAC: 0 |
Hi JIM, Your hadcm3n_3d4i_1980_40_008349735_2 model failed with a new error that has been discussed in this thread and failed for your wingmen at the same 1st trickle point. I'm afraid there's nothing we can do about this until they find the cause. Your hadcm3n_u5d6_2020_40_008336020_0 model got past the 40th trickle so will have done its useful work. Time beyond this trickle has been described elsewhere as being like an athlete running past the finish line rather than stopping at the finish line itself, so I'm pretty sure that the error it has reported is not relevant to the actual science of the result. The model effectively finishes on 1st Dec with final uploads being generated shortly thereafter and around 4th Dec (I think) and finally finishes at midnight 6th-7th, as mo.V has said below (above). |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Respectfully disagree, in part -- I'll keep on keeping on.
Statistics from my own small group of servers, for tasks running or ended this April (hadcm3n only), based on the tasks page:
still running: 33%
completed since 01 April: 36%
failed with the cc+ memory thing: 15% (one queued up will fail when it starts)
failed with the misconfigured batch: 7.5% (it is a much worse problem for windows - a royal fu this batch but gone)
retries of borderline stable WUs: 10%
Seems decently productive to me. One phantom, a waste of bandwidth on the CC+ blunder, and for me using linux, not much problem with those evil, I mean evil, botched WUs that die on linux and Darwin but, according to this board, can't be killed from server-side and block further downloads for the 95% running windows on intel. Shame.
I will keep on downloading and running models. Including the rapid-rabit hadcm3n long-runners. Seems reasonable to me. I have moderate bandwidth.
I suggested some months back that it might be a good idea to try new batches on a local host at the project before sending them out to all volunteers -- just to see if a test run of the batch could succeed, didn't have gross configuration problems, didn't have impossible parameters, wouldn't waste users' bandwidth. That suggestion was ignored. I thought it was a good idea to pre-test batches for obvious problems before sending them out to the volunteers. I was right about that. The powers at the project were wrong about that.
I will keep on crunching. But throwing defective untested batches that can never work out to the web? How few minutes would it take to test a batch on a local PC to see if the batch is misconfigured? How many compute-days and how much volunteer time is wasted by not taking this one tiny precaution of testing the batch before putting it out on the web?
In other words, skipping a few-minute practical test on batches results not only in disgruntled volunteers, which is less important, but, as we have recently seen, in large numbers of volunteers whose work will be wasted until their tasks time out in a few months. And most of them won't even know about the waste. So dear mods -- just where can I write to the overworked but "not quite with it" project? Long-timer - wondering about viability of fave dist comp project. I finally have an answer to my question, is it worth the trouble. IT'S NO! |
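As a quick check on the figures in the post above, the shares can be tallied as follows. The percentages are copied as posted; grouping "still running" and "completed" together as the productive share is an assumption made here for illustration, not something the poster stated.

```python
# Percentages as posted for hadcm3n tasks running or ended in April.
shares = {
    "still running": 33.0,
    "completed since 01 April": 36.0,
    "failed (cc+ memory error)": 15.0,
    "failed (misconfigured batch)": 7.5,
    "retries of borderline stable WUs": 10.0,
}

total = sum(shares.values())

# Assumption: count running and completed tasks as productive work.
productive = shares["still running"] + shares["completed since 01 April"]
failed = shares["failed (cc+ memory error)"] + shares["failed (misconfigured batch)"]

print(f"total: {total}%  productive: {productive}%  failed: {failed}%")
```

The posted figures add up to slightly more than 100%, presumably because of rounding or an overlap between categories.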
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I agree with all your points, Eirik. About ten days ago I suggested to the programmers not for the first time that a handful of models from each new batch should be tried first on the Beta project. I was told that Oxford wants to deprecate the current Beta project and set up a new one where the only members would be invitees ie people whose computers are stable and who can be relied on to post on the forum about problems. The current Beta forum started off with open registration and the usual influx of people who want to join every project in existence but who in most cases never post and may even be running CPDN Beta with a low resource share. This situation is clearly a waste of time and effort. I suggested that for example everyone whose computers are hidden or who's never posted on the forum could have their computer's daily quota minussed. That would almost entirely limit new tasks to the computers of real testers. I don't know yet what will happen about that though. When the current Beta forum was set up it took a loooong time to get the transfer of Beta credits to this main project working properly. BOINC isn't really designed to do this and I feel some trepidation about a similar situation arising again. Anyway, the moderators aren't ignoring the situation. Cpdn news |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Thanks, mo. About pre-testing batches -- that doesn't need a special site. Before submitting a whole batch to users, just take a few - like 5 or 50 models - make a mini-batch of them, and put this mini-batch up for users to download from the main site. If they seem to be working after a day or so - good. If the whole mini test batch goes bad - reconsider, evaluate, fix. Does this idea make sense? |
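The mini-batch idea above could look roughly like the sketch below. It is a hypothetical illustration only: the function names, the default sample size and the failure threshold are assumptions made here, and nothing in it reflects how the project's servers actually release work.

```python
import random


def pick_mini_batch(batch, size=5, seed=None):
    """Pick a small random sample of models to release first."""
    rng = random.Random(seed)
    return rng.sample(batch, min(size, len(batch)))


def should_release_rest(outcomes, max_failure_rate=0.5):
    """Decide whether the rest of the batch should go out.

    outcomes: one boolean per mini-batch model, True if it is still
    running or finished cleanly after a day or so.
    max_failure_rate: hypothetical threshold; the post only says to
    reconsider if the whole mini-batch goes bad.
    """
    if not outcomes:
        return False  # no evidence yet, so hold the batch back
    failed = sum(1 for ok in outcomes if not ok)
    return failed / len(outcomes) <= max_failure_rate


batch = [f"model_{i:03d}" for i in range(200)]  # placeholder names
mini = pick_mini_batch(batch, size=5, seed=1)
print(mini)
print(should_release_rest([True, True, True, False, True]))
```

A check like this only catches problems that show up within the first day or so of running; errors that appear late in a model's run would still get through.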
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
This small test batch idea HAS been done a couple of times recently. As for "users", the mods would, quite frankly, like to be the ones to do the testing, as most other users are just "set and forget", so the project people are often left in the dark about what has happened. And some of these people are just casual crunchers, who feel that it's OK to run cpdn on very low resources and priority. But they seem to get hold of the limited test runs. :( Plus, if my idea of what the latest problem is is correct, then a small batch wouldn't have found that there's a problem. |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
Thank you everyone ... good informative posts ... here in this thread. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Eirik It's thought that the source of the FORTRAN errors has been found, so a small test batch was released. These were grabbed immediately, and are apparently running OK. Backups: Here |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Eirik Thanks for the information and your ongoing efforts. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Whatever the crew has done - bless their saintly competent hard-working souls. Current new batch of rapid-rabbit or hadcm3n or whatever -- "rabid rabbits or speedy (tongue in cheek) bunnies" if that is what we call them now. Latest batch looking good so far. Seems to be a good bunch to download. Still a bunch out there to download. And commit to the 3 weeks or so of running the jobs. Thanks all. Keep on crunching. Worth the trouble? I think so. |
Send message Joined: 5 Jun 06 Posts: 28 Credit: 2,790,048 RAC: 0 |
Latest batch looking good so far. Seems to be a good bunch to download. Still a bunch out there to download. Server says no:
Tasks ready to send 0
|
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
The Hadcm that crashed, hadcm3n_zff5_1960_40_008335834_2, developed NEGATIVE PRESSURE CREATED, which means it developed a climate that's impossible in the real world and did the right thing ie aborted itself. A proportion of models do this as a result of the researchers pushing as far as possible the boundaries of the parameter values they test out. Inevitably some will prove impossible but it's not always possible to know which ones before trying them out. This isn't the only type of impossible climate that can develop in models. Models of this type haven't been a waste of time as they show the researchers what doesn't work. (I wonder what's caused all the Skype numbers in skgiven's post? I somehow don't think he's encouraging us all to Skype him now.) Cpdn news |
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038 |
... (I wonder what's caused all the Skype numbers in skgiven's post? I somehow don't think he's encouraging us all to Skype him now.) ... That may be a problem at your end, Mo. Skype has an add-in for some browsers that attempts to convert text that could be a telephone number into a clickable link. You may have to turn something or other off in Skype or in the browser. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
You are right, Iain. All the workunit numbers have been turned into Skype phone numbers. If I do a copy and paste the proper workunit number shows up, not the corrupted number. The supposed phone number even shows up in blue with the blue Skype icon. I upgraded Skype the other day so the new version must contain this new thing. I refuse to call it a feature. What a cheek these people have. The other day something made me feel as if I'd been kidnapped by Facebook. Now the computer's been hijacked by Skype. This add-on can't even be accessed through the Skype window. It's a separate program, 75MB of it, that you have to uninstall through Control Panel. http://community.skype.com/t5/The-Skype-Lounge/How-to-Disable-or-Remove-the-Click-and-Call-Function/m-p/69602/thread-id/6466/message-uid/69602 Cpdn news |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Java is another one that tries to change things. It wants to change your browser, and to add another task bar to your browser. :( Backups: Here |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I've just been asked whether I wanted to update Adobe. I said Yes and it started installing. Before I knew what was happening the wizard said I was also getting Google Chrome and Google Toolbar. I know for sure I wasn't asked whether I wanted these and there was no way I could see to refuse. I aborted the update part way through and hope everything continues to work. The WU numbers in skgiven's post now display normally for me. Cpdn news |
©2024 cpdn.org