climateprediction.net (CPDN) home page
Thread 'DECLINING NUMBER OF TASKS'

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 45677 - Posted: 21 Mar 2013, 18:18:25 UTC

I just noticed that the "tasks in progress" count is down to 95845. That is about as low as I have ever seen it. When you consider that many of these tasks are really lost and the server just doesn't know yet, that means there are really a lot fewer tasks running.

Maybe when the Hadam3p_ANZ are released in large numbers and the new Africa model comes on line this will change.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 45680 - Posted: 21 Mar 2013, 22:51:03 UTC

A lot of these models won't ever return useful results to the project. Some of these people could have uninstalled BOINC, or the computer could have broken without its model(s) ever having been reported to the server as crashed. Les once said he thought some people may give the computer away without the new owner even knowing that BOINC or CPDN is installed. We know how likely this is to lead to successful completion. We think some of the BBC people forgot about their models (that project and those models are shut down now, though, which is just as well for the lady whose computer sent up its first trickle after two years, with 159 trickles still to go).

I don't know whether that number, 95845, includes models that have timed out, i.e. passed their deadline.
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 45691 - Posted: 23 Mar 2013, 4:55:54 UTC

I think I saw that "tasks in progress" at 0 (null, zilch, nada) several weeks (or months) ago. Remember thinking at the time that someone must have done some housecleaning on one of the servers. But maybe it was a dream.

At this moment "tasks in progress" is 96,163, with a new mini-batch of rapid-rapit tasks on the way out and down to near 200 left to send.

Does anyone have any source for actual measures on distributed computing projects, like what fraction of the work units sent out is completed and returned?

No doubt CPDN is an outlier amongst distributed computing projects on account of the [necessary] large size and long run-times.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 45693 - Posted: 23 Mar 2013, 10:17:22 UTC

It is certainly possible to obtain statistics of this type. When we were running the BBC project, Carl, who was one of the programmers then, told us what percentage of computers had reached particular points in the model crunching.

Two or three years ago Iain Inglis used some sort of program that calculated completion percentages for a particular model type with the figures analysed by the type of OS.

The completion rates could be improved by solving particular known problems, but there is too much work for the two programmers we have. As a result they spend a lot of time troubleshooting and they cannot delay developing new model types and batches needed by the researchers.
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 45694 - Posted: 23 Mar 2013, 15:00:22 UTC

The programmers definitely need to tackle some of the problems. Problem number 1 on that list needs to be the tendency of the Hadcm3 models to crash after being properly shut down (suspending the model first and then exiting the boinc manager) and then restarted.

I don't know about some of you, but it is not realistic for me never to shut down my computer over a period of several weeks. Software updates or clearing some computer problem will make it necessary to reboot at least once.

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 45695 - Posted: 23 Mar 2013, 15:28:02 UTC - in response to Message 45694.  

I now never shut down a hadcm3, even to do a backup -- unless I really, really need to.

The hadcm3n "rapid-rapit" are vulnerable to every interruption, glitch, whatever - but very worthwhile to run.

My procedure for doing a backup of one of these hadcm3n or rapid-rapit -- I'm so OCD, and it takes at least several hours to even START a backup -- getting the wu's to close all files and quiesce takes a while, and using the Unix lsof and whatever, it takes a LONG time to get a clean backup.

BUT - like we were saying, most wu-s go to sorry cpu-s that don't ever get nowhere.

BUT - the wu-s that do get completed -- to my mind that is worth a lot - the wu-s that fall upon barren ground -- well - that's how it goes -- compute - adjust - keep on with it.


Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 45696 - Posted: 23 Mar 2013, 15:55:52 UTC

I would love to know exactly why these units sometimes fail when exited by suspending computation then using the file>exit dialogue. Understanding might not help me prevent it but relieving my curiosity would be nice! On this computer, I shut down almost every night using the hibernate or suspend to disk function and have managed to complete my last two hadcm3n tasks despite doing this. I have two more that so far have survived. I wondered about this in another post somewhere and was told it was probably just luck. How many do I need to complete like this to start thinking that perhaps something has changed in the linux hibernate function?
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,827,799
RAC: 5,038
Message 45698 - Posted: 23 Mar 2013, 16:02:32 UTC - in response to Message 45691.  
Last modified: 6 May 2014, 23:03:00 UTC

[Eirik Redd wrote:]... Does anyone have any source for actual measures on distributed computing projects, like what fraction of the work units sent out is completed and returned?

No doubt CPDN is an outlier amongst distributed computing projects on account of the [necessary] large size and long run-times.

I don't know about DC projects in general but, as Mo mentioned, here is a graph of completion rates for an old batch of HADCM3N: ...

... in which the decade crash-steps are clearly visible. The graph also shows that, on average, only ~40% ever get to the first trickle.

If the graph is filtered to exclude those machines that failed to get to the first trickle then we get ...

... showing that about 30-50% of those that get to the first trickle succeed in getting to the last trickle, which is pretty good given the size and length of the model.
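
To put the two graphs together: the fraction of all tasks sent out that reach the last trickle is the product of the two rates. A minimal sketch of that arithmetic in Python, using only the approximate figures quoted above (the function and variable names are made up for illustration, and the numbers are rough readings, not exact statistics):

```python
def overall_completion(first_trickle_rate, finish_given_first):
    """Fraction of all tasks sent out that reach the last trickle."""
    return first_trickle_rate * finish_given_first


# ~40% of tasks ever reach the first trickle (approximate figure from the post).
FIRST_TRICKLE_RATE = 0.40

# 30-50% of those that reach the first trickle go on to the last trickle.
low = overall_completion(FIRST_TRICKLE_RATE, 0.30)
high = overall_completion(FIRST_TRICKLE_RATE, 0.50)

print(f"Overall completion: roughly {low:.0%} to {high:.0%} of tasks sent out")
```

On those figures, somewhere around one task in eight to one task in five of this old HADCM3N batch would have run to completion.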
Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 45700 - Posted: 23 Mar 2013, 16:13:49 UTC

I wish I had a "like" button for Iain Inglis's post!
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 45702 - Posted: 23 Mar 2013, 16:25:58 UTC - in response to Message 45698.  

Thanks much --
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45708 - Posted: 23 Mar 2013, 21:13:11 UTC - in response to Message 45696.  

Dave

It may be because this is one of the largest of the Met Office programs that we run.
It wasn't easy to get the supercomputer programs to work on various types of desktops/laptops, and for some configurations it may be borderline touchy.

Perhaps it's as simple as when to Suspend a model: in the Ocean or Atmosphere phase. Or it may be as complicated as a many-line If-Else test.


Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45709 - Posted: 23 Mar 2013, 21:16:41 UTC - in response to Message 45694.  

Jim

The project programmers don't re-write the main program code, which is probably where the problem lies. They only change some of the auxiliary program(s), and adjust compiler options for building the final module.


JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 45713 - Posted: 24 Mar 2013, 3:46:51 UTC

I recently have started doing something that seems to help Hadcm3 WUs survive shutdowns and start-ups.

Don't be too fast to exit the boinc manager. After suspending the model, I wait 2 or 3 minutes before I exit the boinc manager. This provides time for it to write all the files that it needs to. This could be critical if the shutdown is a little pokey. A half-written file when the manager shuts down could doom any start-up.

I do the same when I restart. I wait a minute or two after the boinc manager is restarted and the WUs have reappeared in "tasks" before I unsuspend them, just to make sure that the manager is fully restarted. Maybe this really works, maybe I am just kidding myself.
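
For anyone who would rather script this routine than watch the clock, here is a rough sketch in Python. It is an illustration under assumptions, not a tested fix: it uses BOINC's command-line tool boinccmd (assumed to be on the PATH, with the client running on the same machine), and it suspends all computation with --set_run_mode never rather than suspending an individual model in the manager as described above. The function names and default waits are hypothetical.

```python
import subprocess
import time


def suspend_wait_quit(wait_seconds=180, run=subprocess.run, sleep=time.sleep):
    """Suspend computation, give the models time to write files, then quit."""
    # Stop all computation. With no duration given, boinccmd documents this
    # mode as permanent, so it should stay in force until changed again.
    run(["boinccmd", "--set_run_mode", "never"], check=True)
    # The 2-3 minute pause from the post; 180 s is an arbitrary default.
    sleep(wait_seconds)
    # Ask the BOINC client to exit.
    run(["boinccmd", "--quit"], check=True)


def resume_after_restart(wait_seconds=90, run=subprocess.run, sleep=time.sleep):
    """After the client is back up, wait a little, then resume computation."""
    sleep(wait_seconds)
    # "auto" hands control back to the normal computing preferences.
    run(["boinccmd", "--set_run_mode", "auto"], check=True)
```

Calling suspend_wait_quit() before a reboot and resume_after_restart() once BOINC is running again mirrors the manual routine. Note that boinccmd --quit stops the client itself, not the manager window, and whether any of this actually helps the models survive is just as uncertain scripted as it is by hand.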

