Thread 'WORTH THE TROUBLE????'

Author	Message
Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368	Message 46016 - Posted: 22 Apr 2013, 9:36:32 UTC Last modified: 22 Apr 2013, 9:42:02 UTC Another slant to, "worth the trouble" It was noted somewhere on this board recently that tasks continue to crunch for a while after the last zip file is uploaded. Are there any worthwhile trickles after this point or as there are over 4 hours of computing time left after the last zipfile is uploaded on hadamc3ns should I just abort and start another task? Would this mess up the system so that the, "unfinished" task would get sent out to another computer? Four hours is a wild overestimate as shortly after posting I closed browser down and the task in question had in half an hour gone from almost 5 hours left to, "ready to report." Almost 5 hours I would have thought about it - not for half an hour. ID: 46016 · Reply Quote

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192	Message 46019 - Posted: 22 Apr 2013, 11:10:20 UTC - in response to Message 46016. Another slant to, "worth the trouble" It was noted somewhere on this board recently that tasks continue to crunch for a while after the last zip file is uploaded. Are there any worthwhile trickles after this point or as there are over 4 hours of computing time left after the last zipfile is uploaded on hadamc3ns should I just abort and start another task? Would this mess up the system so that the, "unfinished" task would get sent out to another computer? Four hours is a wild overestimate as shortly after posting I closed browser down and the task in question had in half an hour gone from almost 5 hours left to, "ready to report." Almost 5 hours I would have thought about it - not for half an hour. There have been a few cases of models getting stuck in that final phase and that would be an appropriate situation in which to abort. Otherwise the model should be left to run: as you say, the model will be marked as a failure and, I think, sent out again to be needlessly re-computed. I don't know what the 'overshoot' is for. Perhaps it's just to give the Zips a chance to be generated before the model ends ... ID: 46019 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,011,472 RAC: 21,368	Message 46022 - Posted: 22 Apr 2013, 12:06:40 UTC - in response to Message 46019. Thanks Iain, always good to have the information, even though, having discovered that the overshoot was only about half an hour's computing time at most, I wouldn't have bothered aborting even if it didn't result in the task being sent out again. ID: 46022 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46024 - Posted: 22 Apr 2013, 18:22:55 UTC The last zip file is generated when the model is part way through 1 Dec of its last year. It then crunches on until the end of 6 Dec. That's the next checkpoint because these models checkpoint every 6 days. At 00:00 hrs on 7 Dec the model finishes. There may well be a reason for the model soldiering on until that last checkpoint. I think all the model types have always finished at a checkpoint. My computer doesn't take nearly as long as half an hour to crunch those last few days. Another bizarre anomaly with these models is that they complete successfully without ever reaching the full number of timesteps. I haven't spent time calculating where the number of TSs listed in the graphics window would get the model to if they were all crunched, but I expect it would be to the end of 30 Dec. This discrepancy doesn't matter in the slightest. Cpdn news ID: 46024 · Reply Quote
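The timing described in the post above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not anything taken from the model code: it uses the 6-day checkpoint interval from the post and assumes that checkpoints fall at the end of days 6, 12, 18, ... of each month (which the "end of 30 Dec" remark hints at but does not confirm). The function name is made up for the example.

```python
import math

CHECKPOINT_INTERVAL_DAYS = 6  # from the post: these models checkpoint every 6 days


def next_checkpoint_day(day_of_month, interval=CHECKPOINT_INTERVAL_DAYS):
    """Return the day of the month whose end is the next checkpoint.

    Assumption (not confirmed by the thread): checkpoints fall at the end
    of days 6, 12, 18, ... of each month.
    """
    return math.ceil(day_of_month / interval) * interval


last_zip_day = 1  # the last zip is generated part way through 1 Dec
final_day = next_checkpoint_day(last_zip_day)
print(f"Model crunches until the end of {final_day} Dec, "
      f"i.e. it finishes at 00:00 on {final_day + 1} Dec")
```

Under those assumptions the sketch gives the same dates as the post: the model runs on from 1 Dec to the end of 6 Dec and stops at 00:00 on 7 Dec.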

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 46033 - Posted: 23 Apr 2013, 17:45:01 UTC I finally have an answer to my question, is it worth the trouble. IT'S NO! The instability of the hadcm3n's has struck again. Overnight I lost 2 more WU's, one on each of 2 different machines. I don't think they are worth the time and electricity. Running them and wondering when they are going to fail (I no longer expect them to finish) is just not worth it. Hadcm3n _u5d6_2070_40_008336030_0 died at 99.74%. Graphics showed that the crash happened at 00:00 on 7/12. I guess this is the point it stopped to generate the zip file and found itself wanting. The second failure is something of a mystery. It failed after only 21 hours. This would have put it at approx. 3.5%, nowhere near a decadal point. I plan to change my settings so as to exclude CM model and run other projects while waiting to pick up regional models whenever these are available. A word to the Scientists. If you are causing this by testing WU's with extreme parameter sets that are expected to fail because they don't yield viable climates, then you need to mix in more normal, stable ones. Otherwise after a while you are not going to have anyone running them. If I'm wrong about this then I apologize. ID: 46033 · Reply Quote

Ray Murray Send message Joined: 7 Aug 04 Posts: 50 Credit: 548,730 RAC: 0	Message 46035 - Posted: 23 Apr 2013, 21:34:19 UTC - in response to Message 46033. Last modified: 23 Apr 2013, 21:40:50 UTC Hi JIM, Your hadcm3n_3d4i_1980_40_008349735_2 model failed with a new error that has been discussed in this thread and failed for your wingmen at the same 1st trickle point. I'm afraid there's nothing we can do about this until they find the cause. Your hadcm3n_u5d6_2020_40_008336020_0 model got past the 40th trickle so will have done its useful work. Time beyond this trickle has been described elsewhere as being like an athlete running past the finish line rather than stopping at the finish line itself, so I'm pretty sure that the error it has reported is not relevant to the actual science of the result. The model effectively finishes on 1st Dec, with final uploads being generated shortly thereafter and around 4th Dec (I think), and finally finishes at midnight 6th-7th, as mo.v has said above. ID: 46035 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 46062 - Posted: 27 Apr 2013, 8:37:30 UTC - in response to Message 46033. Last modified: 27 Apr 2013, 8:38:41 UTC Respectfully disagree, in part -- I'll keep on keeping on.
Statistics from my own small group of servers. Tasks running or ended this April (only hadcm3n), based on the tasks page:
still running 33%
completed since 01 April 36%
failed with the cc+ memory thing 15% (one queued up will fail when it starts)
failed with the misconfigured batch 7.5% (it is a much worse problem for windows - a royal fu this batch but gone)
re-tries of borderline stable wu-s 10%
Seems decently productive to me. One phantom, a waste of bandwidth on the CC+ blunder, and for me using linux, not much problem with those evil, I mean evil, botched wu's that die on linux and Darwin, but according to this board, can't be killed from server-side and block further downloads for the 95% running windows on intel. Shame. I will keep on downloading and running models. Including the rapid-rabit hadcm3n long-runners. Seems reasonable to me. I have moderate bandwidth. I suggested some months back - that it might be a good idea to try new batches on a local host at the project before sending them out to all volunteers -- just to see if a test run of the batch could succeed, didn't have gross configuration problems, didn't have impossible parameters, wouldn't waste users' bandwidth. That suggestion was ignored. I thought it was a good idea to pre-test batches for obvious problems before sending them out to the volunteers. I was right about that. The powers at the project were wrong about that. I will keep on crunching. But throwing defective untested batches that can never work out to the web? How few minutes would it take to test a batch on a local PC to see if the batch is misconfigured?
How many compute-days and how much volunteer time is wasted by not taking this one tiny precaution of testing the batch before putting it out on the web? In other words, skipping a few-minute practical test on batches results in, not so important disgruntled volunteers, but as we have recently seen, large numbers of volunteers whose work will be wasted until their tasks time out in a few months. And most of them won't even know about the waste. So dear mods -- just where can I write to the overworked but "not quite with it" project? Long-timer - wondering about viability of fave dist comp project. I finally have an answer to my question, is it worth the trouble. IT'S NO! The instability of the hadcm3n's has struck again. Overnight I lost 2 more WU's, one on each of 2 different machines. I don't think they are worth the time and electricity. Running them and wondering when they are going to fail (I no longer expect them to finish) is just not worth it. Hadcm3n _u5d6_2070_40_008336030_0 died at 99.74%. Graphics showed that the crash happened at 00:00 on 7/12. I guess this is the point it stopped to generate the zip file and found itself wanting. The second failure is something of a mystery. It failed after only 21 hours. This would have put it at approx. 3.5%, nowhere near a decadal point. I plan to change my settings so as to exclude CM model and run other projects while waiting to pick up regional models whenever these are available. A word to the Scientists. If you are causing this by testing WU's with extreme parameter sets that are expected to fail because they don't yield viable climates, then you need to mix in more normal, stable ones. Otherwise after a while you are not going to have anyone running them. If I'm wrong about this then I apologize. ID: 46062 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46075 - Posted: 27 Apr 2013, 18:13:07 UTC I agree with all your points, Eirik. About ten days ago I suggested to the programmers not for the first time that a handful of models from each new batch should be tried first on the Beta project. I was told that Oxford wants to deprecate the current Beta project and set up a new one where the only members would be invitees ie people whose computers are stable and who can be relied on to post on the forum about problems. The current Beta forum started off with open registration and the usual influx of people who want to join every project in existence but who in most cases never post and may even be running CPDN Beta with a low resource share. This situation is clearly a waste of time and effort. I suggested that for example everyone whose computers are hidden or who's never posted on the forum could have their computer's daily quota minussed. That would almost entirely limit new tasks to the computers of real testers. I don't know yet what will happen about that though. When the current Beta forum was set up it took a loooong time to get the transfer of Beta credits to this main project working properly. BOINC isn't really designed to do this and I feel some trepidation about a similar situation arising again. Anyway, the moderators aren't ignoring the situation. Cpdn news ID: 46075 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 46085 - Posted: 28 Apr 2013, 5:45:48 UTC Thanks, mo About pre-testing batches -- that doesn't need a special site. Before submitting a whole batch to users, just take a few - like 5 or 50 models - make a mini-batch of them, put this mini-batch up for users to download from the main site. If they seem to be working after a day or so - good. If all this mini-test-batch go bad - reconsider, evaluate, fix. Does this idea make sense? ID: 46085 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 46086 - Posted: 28 Apr 2013, 6:24:04 UTC - in response to Message 46085. This small test batch idea HAS been done a couple of times recently. As for "users", the mods would, quite frankly, like to be the ones to do the testing, as most other users are just "set and forget", so the project people are often left in the dark about what has happened. And some of these people are just casual crunchers, who feel that it's OK to run cpdn on very low resources and priority, but seem to get hold of the limited test runs. :( Plus, if my idea of what the latest problem is is correct, then a small batch wouldn't have found that there's a problem. ID: 46086 · Reply Quote

Byron Leigh Hatch @ team Carl ... Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 46087 - Posted: 28 Apr 2013, 11:02:42 UTC Last modified: 28 Apr 2013, 11:46:05 UTC Thank you everyone ... good informative posts ... here in this thread. ID: 46087 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 46134 - Posted: 30 Apr 2013, 20:25:00 UTC Eirik It's thought that the source of the FORTRAN errors has been found, so a small test batch was released. These were grabbed immediately, and are apparently running OK. Backups: Here ID: 46134 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 46139 - Posted: 1 May 2013, 2:10:46 UTC - in response to Message 46134. Eirik It's thought that the source of the FORTRAN errors has been found, so a small test batch was released. These were grabbed immediately, and are apparently running OK. Thanks for the information and your ongoing efforts. ID: 46139 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 46186 - Posted: 11 May 2013, 10:06:21 UTC Last modified: 11 May 2013, 10:17:46 UTC Whatever the crew has done - bless their saintly competent hard-working souls . Current new batch of rapid-rapit or hadcm3n or whatever -- "rabid rabbits or speedy (tongue in cheek) bunnies" if that is what we call them now. Latest batch looking good so far. Seems to be a good bunch to download. Still a bunch out there to download. And commit to the 3 weeks or so of running the jobs. Thanks all. Keep on crunching. Worth the trouble? I think so. ID: 46186 · Reply Quote

skgiven Send message Joined: 5 Jun 06 Posts: 28 Credit: 2,790,048 RAC: 0	Message 46191 - Posted: 11 May 2013, 23:14:03 UTC - in response to Message 46186. Last modified: 11 May 2013, 23:28:56 UTC Latest batch looking good so far. Seems to be a good bunch to download. Still a bunch out there to download. Server says no: Total Tasks ready to send 0. Got 5 on 7th May: 1 completed OK, two failed and two are still running (48% after 98h each):
hadcm3n_n31j_1920_40_008334669_2 8485530 7 May 2013 17:18:22 UTC 7 Aug 2013 0:45:33 UTC In progress --- --- 5,598.72 5,598.72 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadam3p_eu_qdyz_2007_1_008343896_2 8494757 7 May 2013 17:18:22 UTC 10 May 2013 6:01:42 UTC Completed 142,947.26 142,396.37 2,386.39 2,386.39 UK Met Office HADAM3P European Region v6.09
hadam3p_eu_qce6_2010_1_008341851_2 8492712 7 May 2013 17:18:22 UTC 7 May 2013 22:22:34 UTC Error while computing 6,487.24 6,446.66 0.00 --- UK Met Office HADAM3P European Region v6.09
hadcm3n_zff5_1960_40_008335834_2 8486695 7 May 2013 17:18:22 UTC 8 May 2013 14:30:38 UTC Error while computing 53,056.17 5.01 933.12 933.12 UK Met Office Coupled Model Full Resolution Ocean v6.07
hadcm3n_n3h0_1920_40_008335858_2 8486719 7 May 2013 17:18:22 UTC 7 Aug 2013 0:45:33 UTC In progress --- --- 5,598.72 5,598.72 UK Met Office Coupled Model Full Resolution Ocean v6.07
ID: 46191 · Reply Quote
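As a rough aside on the "48% after 98h" figure in the post above, the total run time can be extrapolated with simple arithmetic. The sketch below assumes that the progress percentage grows linearly with run time, which the thread does not state and which may not hold in practice; the function name is hypothetical.

```python
def estimate_total_hours(elapsed_hours, fraction_done):
    """Extrapolate the total run time from the progress so far.

    Assumption: progress is linear in run time, so total = elapsed / fraction.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    return elapsed_hours / fraction_done


elapsed = 98.0  # hours run so far, from the post
done = 0.48     # 48% complete, from the post

total = estimate_total_hours(elapsed, done)
remaining = total - elapsed
print(f"estimated total: {total:.0f} h, remaining: {remaining:.0f} h")
```

On that assumption each of the two running tasks would need a little over 200 hours in total, or roughly eight and a half days of continuous crunching on that machine.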

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46193 - Posted: 12 May 2013, 9:56:09 UTC Last modified: 12 May 2013, 10:02:15 UTC The Hadcm that crashed hadcm3n_zff5_1960_40_008335834_2 developed NEGATIVE PRESSURE CREATED which means it developed climate that's impossible in the real world and did the right thing ie aborted itself. A proportion of models do this as a result of the researchers pushing as far as possible the boundaries of the parameter values they test out. Inevitably some will prove impossible but it's not always possible to know which ones before trying them out. This isn't the only type of impossible climate that can develop in models. Models of this type haven't been a waste of time as they show the researchers what doesn't work. (I wonder what's caused all the Skype numbers in skgiven's post? I somehow don't think he's encouraging us all to Skype him now.) Cpdn news ID: 46193 · Reply Quote

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,808,726 RAC: 5,192	Message 46194 - Posted: 12 May 2013, 18:08:34 UTC - in response to Message 46193. ... (I wonder what's caused all the Skype numbers in skgiven's post? I somehow don't think he's encouraging us all to Skype him now.) ... that may be a problem at your end, Mo. Skype has an add-in for some browsers that attempts to convert text that could be a telephone number into a clickable link. You may have to turn something or other off in Skype or in the browser. ID: 46194 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46195 - Posted: 12 May 2013, 21:47:52 UTC Last modified: 12 May 2013, 21:48:15 UTC You are right, Iain. All the workunit numbers have been turned into Skype phone numbers. If I do a copy and paste the proper workunit number shows up, not the corrupted number. The supposed phone number even shows up in blue with the blue Skype icon. I upgraded Skype the other day so the new version must contain this new thing. I refuse to call it a feature. What a cheek these people have. The other day something made me feel as if I'd been kidnapped by Facebook. Now the computer's been hijacked by Skype. This add-on can't even be accessed through the Skype window. It's a separate program, 75MB of it, that you have to uninstall through Control Panel. http://community.skype.com/t5/The-Skype-Lounge/How-to-Disable-or-Remove-the-Click-and-Call-Function/m-p/69602/thread-id/6466/message-uid/69602 Cpdn news ID: 46195 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 46196 - Posted: 12 May 2013, 22:41:09 UTC Java is another one that tries to change things. It wants to change your browser, and to add another task bar to your browser. :( Backups: Here ID: 46196 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46197 - Posted: 13 May 2013, 0:02:56 UTC I've just been asked whether I wanted to update Adobe. I said Yes and it started installing. Before I knew what was happening the wizard said I was also getting Google Chrome and Google Toolbar. I know for sure I wasn't asked whether I wanted these and there was no way I could see to refuse. I aborted the update part way through and hope everything continues to work. The WU numbers in skgiven's post now display normally for me. Cpdn news ID: 46197 · Reply Quote