Thread 'DECLINING NUMBER OF TASKS'

Author	Message
JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 45677 - Posted: 21 Mar 2013, 18:18:25 UTC I just noticed that the "tasks in progress" is down to 95845. That is about as low as I have ever seen it. When you consider that many of these tasks are really lost and the server just doesn't know it yet, that means there are really a lot fewer tasks running. Maybe when the Hadam3p_ANZ are released in large numbers and the new Africa model comes online this will change. ID: 45677 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 45680 - Posted: 21 Mar 2013, 22:51:03 UTC A lot of these models won't ever return useful results to the project. Some of these people could have uninstalled BOINC or the computer could have broken without its model(s) ever having been reported as crashed to the server. Les once said he thought some people may give the computer away without the new owner even knowing that BOINC or CPDN are installed. We know how likely this is to lead to successful completion. We think some of the BBC people forgot about their models (that project and those models are shut down now, though, which is just as well for the lady whose computer sent up its first trickle after two years, with 159 trickles still to go). I don't know whether that number 95845 includes models that have timed out, i.e. passed their deadline. Cpdn news ID: 45680 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 45691 - Posted: 23 Mar 2013, 4:55:54 UTC I think I saw that "tasks in progress" at 0 (null, zilch, nada) several weeks (or months) ago. Remember thinking at the time that someone must have done some housecleaning on one of the servers. But maybe it was a dream. At this moment "tasks in progress" is 96,163, with a new mini-batch of rapid-rapit tasks on the way out and down to near 200 to send. Does anyone have any source for actual measures on distributed computing projects, like what fraction of the work units sent out is completed and returned? No doubt CPDN is an outlier amongst distributed computing projects on account of the [necessary] large size and long run-times. ID: 45691 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 45693 - Posted: 23 Mar 2013, 10:17:22 UTC It is certainly possible to obtain statistics of this type. When we were running the BBC project, Carl, who was one of the programmers then, told us what percentage of computers had reached particular points in the model crunching. Two or three years ago Iain Inglis used some sort of program that calculated completion percentages for a particular model type, with the figures analysed by the type of OS. The completion rates could be improved by solving particular known problems, but there is too much work for the two programmers we have. As a result they spend a lot of time troubleshooting, and they cannot delay developing new model types and batches needed by the researchers. Cpdn news ID: 45693 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 45694 - Posted: 23 Mar 2013, 15:00:22 UTC The programmers definitely need to tackle some of the problems. Problem number 1 on that list needs to be the tendency of the Hadcm3 models to crash after being properly shut down (suspending the model first and then exiting the boinc manager) and then restarted. I don't know about some of you, but it is not realistic for me never to shut down my computer over a period of several weeks. Software updates or clearing some computer problem will make it necessary to reboot at least once. ID: 45694 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 45695 - Posted: 23 Mar 2013, 15:28:02 UTC - in response to Message 45694. I now never shut down a hadcm3 even to do a backup -- unless I really, really need to. The hadcm3n "rapid-rapit" are vulnerable to every interruption, glitch, whatever - but very worthwhile to run. My procedure for doing a backup of one of these hadcm3n or rapid-rapit -- I'm so OCD, and it takes at least several hours to even START a backup -- getting the wu's to close all files and quiesce takes a while, and using the Unix lsof and whatever, it takes a LONG time to get a clean backup. BUT - like we were saying, most wu-s go to sorry cpu-s that don't ever get nowhere. BUT - the wu-s that do get completed -- to my mind that is worth a lot - the wu-s that fall upon barren ground -- well - that's how it goes -- compute - adjust - keep on with it. ID: 45695 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4540 Credit: 19,039,635 RAC: 18,944	Message 45696 - Posted: 23 Mar 2013, 15:55:52 UTC I would love to know exactly why these units sometimes fail when exited by suspending computation then using the file>exit dialogue. Understanding might not help me prevent it but relieving my curiosity would be nice! On this computer, I shut down almost every night using the hibernate or suspend to disk function and have managed to complete my last two hadcm3n tasks despite doing this. I have two more that so far have survived. I wondered about this in another post somewhere and was told it was probably just luck. How many do I need to complete like this to start thinking that perhaps something has changed in the Linux hibernate function? ID: 45696 · Reply Quote
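One rough way to put a number on the "how many do I need to complete" question is sketched below. It assumes a baseline chance that a single task survives nightly hibernation by luck alone, and assumes tasks succeed or fail independently of one another. Neither figure comes from this thread: the baseline values are hypothetical placeholders, and the 5% threshold is only a common convention. The function returns the smallest run of consecutive completions that would be unlikely under pure luck.

```python
import math


def runs_needed(baseline_success: float, significance: float = 0.05) -> int:
    """Smallest n with baseline_success**n <= significance.

    baseline_success: HYPOTHETICAL chance that one task completes by luck.
    significance: how unlikely a lucky streak must be before we doubt luck.
    Assumes each task is an independent trial (an assumption, not a fact
    established anywhere in this thread).
    """
    if not 0.0 < baseline_success < 1.0:
        raise ValueError("baseline_success must be strictly between 0 and 1")
    if not 0.0 < significance < 1.0:
        raise ValueError("significance must be strictly between 0 and 1")
    return math.ceil(math.log(significance) / math.log(baseline_success))


# Hypothetical baselines only; the true value is unknown.
for baseline in (0.3, 0.5, 0.9):
    print(baseline, runs_needed(baseline))
```

With a hypothetical 50% baseline, two completions in a row would happen by chance a quarter of the time, so two successes alone say very little. The higher the true baseline, the longer the streak needed before "something has changed" becomes the more plausible explanation.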

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,827,799 RAC: 5,038	Message 45698 - Posted: 23 Mar 2013, 16:02:32 UTC - in response to Message 45691. Last modified: 6 May 2014, 23:03:00 UTC [Eirik Redd wrote:]... Does anyone have any source for actual measures on distributed computing projects, like what fraction of the work units sent out is completed and returned? No doubt CPDN is an outlier amongst distributed computing projects on account of the [necessary] large size and long run-times. I don't know about DC projects in general but, as Mo mentioned, here is a graph of completion rates for an old batch of HADCM3N: ... ... in which the decade crash-steps are clearly visible. The graph also shows that, on average, only ~40% ever get to the first trickle. If the graph is filtered to exclude those machines that failed to get to the first trickle then we get ... ... showing that about 30-50% of those that get to the first trickle succeed in getting to the last trickle, which is pretty good given the size and length of the model. ID: 45698 · Reply Quote
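The two figures quoted above can be combined into a rough overall completion rate. The sketch below is only arithmetic on the numbers in the post (about 40% reaching the first trickle, and 30-50% of those reaching the last); the function name is made up for illustration, and the result is an estimate implied by those figures, not a number read off the graphs.

```python
def overall_completion(p_first_trickle: float, p_last_given_first: float) -> float:
    """Fraction of all tasks sent out that reach the last trickle.

    This is the product rule: P(last) = P(first) * P(last | first).
    """
    return p_first_trickle * p_last_given_first


P_FIRST = 0.40  # "only ~40% ever get to the first trickle"

low = overall_completion(P_FIRST, 0.30)   # lower end of the 30-50% range
high = overall_completion(P_FIRST, 0.50)  # upper end of the 30-50% range

print(f"Roughly {low:.0%} to {high:.0%} of tasks sent out finish")
```

On these figures, something like one task in five to one in eight of those sent out would run to completion for that batch.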

Lockleys Send message Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 45700 - Posted: 23 Mar 2013, 16:13:49 UTC I wish I had a "like" button for Iain Inglis's post! ID: 45700 · Reply Quote

Eirik Redd Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649	Message 45702 - Posted: 23 Mar 2013, 16:25:58 UTC - in response to Message 45698. Thanks much -- ID: 45702 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45708 - Posted: 23 Mar 2013, 21:13:11 UTC - in response to Message 45696. Dave It may be because this is one of the largest of the Met Office programs that we run. It wasn't easy to get the supercomputer programs to work on various types of desktops/laptops, and for some configurations it may be borderline touchiness. Perhaps it's as simple as when to Suspend a model: in the Ocean or Atmosphere phase. Or it may be as complicated as a many-line If-Else test. Backups: Here ID: 45708 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45709 - Posted: 23 Mar 2013, 21:16:41 UTC - in response to Message 45694. Jim The project programmers don't re-write the main program code, which is probably where the problem lies. They only change some of the auxiliary program(s), and adjust compiler options for building the final module. Backups: Here ID: 45709 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 45713 - Posted: 24 Mar 2013, 3:46:51 UTC I recently have started doing something that seems to help Hadcm3 WU's survive shut down/start ups. Don't be too fast to exit the boinc manager. After suspending the model, I wait 2 or 3 minutes before I exit the boinc manager. This provides time for it to write all the files that it needs to. This could be critical if the shut down is a little pokey. A half-written file when the manager shuts could doom any start up. I do the same when I restart. I wait a minute or two after the boinc manager is restarted and the WU's have reappeared in "tasks" before I unsuspend them, just to make sure that the manager is fully restarted. Maybe this really works, maybe I am just kidding myself. ID: 45713 · Reply Quote
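The suspend, wait, then exit routine described above is done by hand in the BOINC Manager. A scripted approximation might look like the sketch below, which is an assumption rather than anything posted in this thread: it uses the `boinccmd` command-line tool (`--set_run_mode never` to suspend all computation, `--quit` to stop the client, `--set_run_mode auto` to resume), which suspends everything rather than a single model. The function names and the default waits are hypothetical, chosen to mirror the "2 or 3 minutes" and "a minute or two" in the post. Whether the pause actually prevents crashes is not established.

```python
import subprocess
import time
from typing import Callable, List

# Assumed boinccmd invocations; they act on ALL tasks, not one model.
SUSPEND_CMD: List[str] = ["boinccmd", "--set_run_mode", "never"]
QUIT_CMD: List[str] = ["boinccmd", "--quit"]
RESUME_CMD: List[str] = ["boinccmd", "--set_run_mode", "auto"]


def careful_shutdown(
    wait_seconds: float = 180,
    run: Callable = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Suspend computation, pause so files can be written, then quit."""
    run(SUSPEND_CMD, check=True)
    sleep(wait_seconds)  # the "wait 2 or 3 minutes" step
    run(QUIT_CMD, check=True)


def careful_resume(
    wait_seconds: float = 120,
    run: Callable = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call after the client has been restarted: pause, then resume."""
    sleep(wait_seconds)  # the "wait a minute or two" step
    run(RESUME_CMD, check=True)
```

Calling `careful_shutdown()` before a reboot and `careful_resume()` once the client is running again reproduces the manual sequence. The `run` and `sleep` parameters exist only so the sequence can be checked without a BOINC client installed.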