Computer wasting multiple models
Author | Message |
---|---|
Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
I just got paired with this guy: 796839. Holy suit! |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Please keep in mind that the forums of this project are intended to be child friendly, and that moderators can enforce this by deleting posts. As it's not too obvious this time, hopefully this thread will just sink quickly. However, thanks for telling us about this. There's a thread somewhere that people use for reporting these out-of-control computers, and a method has been devised to deal with them. I'll pass it on to the project people. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I wonder whether these are phantom tasks that have never shown up in the member's BOINC Manager. This phenomenon happened once when something went very wrong on a CPDN server. But whereas those people got dozens or hundreds of phantom tasks in a few minutes, this computer's getting its full quota every day even though there's nothing completed, nothing crunched, nothing crashed. The server isn't detecting what's happening. Looking through a few of this member's workunits I've found a couple of other computers that are also accumulating tasks, all with New status. No wonder the supply of HADAMs ran out recently and the work queue had to be replenished. |
Joined: 3 Mar 06 Posts: 96 Credit: 353,185 RAC: 0 |
"I wonder whether these are phantom tasks that have never shown up in the member's BOINC Manager. This phenomenon happened once when something went very wrong on a CPDN server. But whereas those people got dozens or hundreds of phantom tasks in a few minutes, this computer's getting its full quota every day even though there's nothing completed, nothing crunched, nothing crashed. The server isn't detecting what's happening." The first six computers in his hosts list show a few more tasks with New status than one would normally expect, but nothing too surprising if they're the shorter slab models. What I find a bit odd is that all of his hosts appear to be configured, probably via the ncpus option in cc_config.xml, to have twice as many cores as they actually have. They are all AMD, not Intel, so they don't have hyperthreading. He might have micro-managed those first 6 hosts to cache 2 or 4 more tasks than normal, to keep the "fake" cores busy. The seventh computer in his list is the one where things are very bizarre. That one has accumulated more than 400 tasks with New status (I stopped counting after 20 pages). That's far more than most people would be willing to force into cache by manually micro-managing. He's running Linux, which has plenty of software authoring tools, and that makes me wonder whether he has built a script or something that has gone awry and is now, automatically and unknown to him, micro-managing the client in a way that causes it to download new tasks every day. That happened to me a while back, though it didn't involve CPDN, so I know it can happen. Or he may be running one of the recent 6.x.x clients that have known scheduler problems which cause some projects to download far more tasks than are needed. This would be a new twist on the known bad behavior, but it's possible. Whatever the cause is, it certainly warrants a PM to the member. |
Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Prompted by NewtonianRefractor, I checked my own travelling partners and found http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_user.php?userid=550526 who seems similarly to be downloading models but getting nowhere. |
Joined: 8 Nov 06 Posts: 18 Credit: 2,425,895 RAC: 0 |
|
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Tolu has a script to detect computers that download more than a particular number of tasks per week/month. The number 'allowed' varies according to the number of CPUs. Some computers may stay just within these limits, avoiding detection by the script, while crashing lots of tasks. I've found two other computers sharing tasks from Hipparchus's workunits and accumulating lots of tasks, all with New status. This isn't normal. Occasionally on a properly functioning computer a task appears to download but never gets into the BOINC Manager to be crunched. It stays on the computer's web page, classified as New forever. But this shouldn't happen to scores of tasks on one computer. Hipparchus started downloading lots of models in early December while he was still crunching two models normally. I suspect, as Dagorath does, that he's overridden a BOINC setting, it's all gone wrong and he hasn't noticed. When this started he had BOINC 6.2.15, a normal version from Berkeley AFAIK, not one of the recent BOINC alpha versions with work fetch problems. I've found 5 other computers crashing lots of models, in most cases with easily correctable errors. They're probably just inside the download limits that trigger Tolu's script. We'll look at the other computers members have kindly reported. All these people will probably receive the email that Tolu's script triggers. Milo can also send it manually. It invites the recipients to post for advice on the forums, but we know from past experience that many won't. Their work quota will be curtailed. Members are welcome to post links to other computers using far too many models. NewtonianRefractor, I'll edit your thread title to make the discussion topic clearer. [Edit: done.] |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
I know there is a more recent thread about computers that crash too many work units, but I have been unable to locate it quickly as I am going to bed shortly. I have been paired with the following 4 hosts and they all appear to have problems: 940827 has been in constant error mode on all work units since late Dec '09; 1028360 has been in constant error mode since the 1st week of Jan '10; 980622 has constant errors on over 200 work units; 1001262 has well over 300 work unit errors with no successful results and zero total. Is it possible that they can be looked at, please, so the owners can be notified that things aren't working the way they should? Thanks, Conan. |
Joined: 2 Mar 06 Posts: 253 Credit: 363,646 RAC: 0 |
I've notified the owners of those machines. |
Joined: 4 Sep 06 Posts: 79 Credit: 5,583,517 RAC: 0 |
My computer failed to download new models. The message was "failed to download...". My computer has been very stable for a long time and other projects work fine. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
|
Joined: 15 Feb 06 Posts: 16 Credit: 7,341,604 RAC: 4,614 |
One here with 1166 tasks and only 60000 credits. Wasting several models each day. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=221382 |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Thanks John. I've passed it on. |
Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Have done a trawl and found several which are downloading many models that don't progress. The userids are 576014, 611561, 542015, 208327, 465460 and 602769. Of course, there could be good reasons for these, but they may need help. Rgds. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Thanks, Lockleys. A couple of these seem to have stopped trying to crunch for CPDN so I'm leaving them, but I'm passing the others on. |
Joined: 6 Jul 06 Posts: 147 Credit: 3,615,496 RAC: 420 |
I have found a swag of hosts to add to the list of model wasters: 1041824 (272 models, zero points, all compute errors); 941509 (49 models, no successes yet); 857475 (hundreds of compute errors); 1031365 (constant errors since 27/12/09, over 100); 1040276 (259 models, zero total, all errors); 1052894 (142 models, zero total, all errors); 1028006 (695 models, zero total, all errors); 948530 (all models start then fail after one or two trickles, over 100); 961803 (constant errors); 947688 (over 700 errors). Six of these hosts are associated with WU 6654409; that is how I noticed them. Thanks, Conan |
Joined: 2 Mar 06 Posts: 253 Credit: 363,646 RAC: 0 |
"I have found a swag of hosts to add to the list of model wasters" Thanks, Conan. I will deal with these, which means disconnecting them and sending a misconfiguration warning e-mail to the owners. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Thanks for the report, Conan. If you look at the web page for any of those computers you'll see that it's now allowed -1 tasks per day, i.e. nothing. You'd think that occasionally there'd be a forum post saying: "I've received an email from CPDN saying my computer's misconfigured, so what should I do?" But I've never seen a post like that. |
Joined: 31 Aug 04 Posts: 42 Credit: 547,031 RAC: 0 |
"Thanks for the report, Conan. If you look at the web page for any of those computers you'll see that it's now allowed -1 tasks per day, i.e. nothing." Why not sticky a thread where vigilant crunchers can report "out of control computers"? I check my progress more or less every day so it would be no great effort. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
The moderators already have a private forum thread with that exact title, but the title of this thread is also a good one. I did link to it in a reminder in a fairly recent News post. We could sticky the thread, but I think occasional reports by more members will prevent it from sinking out of view. Milo's probably going to add to his email to the owners of these computers a link to a forum section where they can ask for advice. That's to make it easier for them to do something about it. |
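
The thread describes a script of Tolu's that flags computers downloading more than an allowed number of tasks per week, with the allowance scaling with the number of CPUs, and notes that some hosts slip under that limit while still crashing every model. The real script is not shown anywhere in the thread, so the following is only a rough Python sketch of the idea. The per-CPU allowance, the minimum task count, the `HostStats` fields and the host IDs in the usage lines are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical values: the thread does not state the real limits.
TASKS_PER_CPU_PER_WEEK = 4   # assumed weekly download allowance per CPU
MIN_TASKS_FOR_REVIEW = 20    # assumed task count before "no successes" is suspicious


@dataclass
class HostStats:
    """Per-host counters; the field names are illustrative, not a real CPDN schema."""
    host_id: int
    ncpus: int
    downloaded_this_week: int
    completed: int
    errored: int
    still_new: int  # tasks the server still lists with New status


def weekly_limit(ncpus: int) -> int:
    """Allowed downloads per week, scaled by CPU count (at least one CPU assumed)."""
    return max(1, ncpus) * TASKS_PER_CPU_PER_WEEK


def classify(host: HostStats) -> str:
    """Return 'over_limit', 'no_successes' or 'ok' for one host."""
    if host.downloaded_this_week > weekly_limit(host.ncpus):
        return "over_limit"
    # Second check: hosts that stay inside the download limit but never
    # finish anything, whether the tasks error out or sit at New status.
    total = host.completed + host.errored + host.still_new
    if total >= MIN_TASKS_FOR_REVIEW and host.completed == 0:
        return "no_successes"
    return "ok"


if __name__ == "__main__":
    # Made-up hosts, not taken from the thread.
    for host in (
        HostStats(1, ncpus=2, downloaded_this_week=20, completed=0, errored=0, still_new=20),
        HostStats(2, ncpus=4, downloaded_this_week=10, completed=0, errored=272, still_new=0),
        HostStats(3, ncpus=2, downloaded_this_week=3, completed=5, errored=1, still_new=1),
    ):
        print(host.host_id, classify(host))
```

The second check is the part the thread suggests was missing: a download-rate limit alone would not catch a host that fetches a modest number of models each week and fails every one of them.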
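
One post above guesses that the hosts were told, probably via the ncpus option in cc_config.xml, to report twice as many cores as they have. As a hedged illustration of how an owner might check their own machine for that, here is a small Python sketch that reads an `<ncpus>` value from a cc_config.xml-style document and compares it with a core count the caller supplies. The sample XML is invented for the example, and treating a missing or non-positive value as "no override" is an assumption of this sketch rather than something stated in the thread.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Invented sample, shaped like a BOINC cc_config.xml with an ncpus override.
SAMPLE_CC_CONFIG = """<cc_config>
  <options>
    <ncpus>4</ncpus>
  </options>
</cc_config>"""


def configured_ncpus(xml_text: str) -> Optional[int]:
    """Return the ncpus override, or None if it is absent or non-positive."""
    root = ET.fromstring(xml_text)
    node = root.find("./options/ncpus")
    if node is None or node.text is None:
        return None
    value = int(node.text.strip())
    # Assumption: zero or negative means "use the real CPU count".
    return value if value > 0 else None


def overcommit_ratio(xml_text: str, actual_cpus: int) -> Optional[float]:
    """Ratio of configured CPUs to real CPUs, or None when there is no override."""
    override = configured_ncpus(xml_text)
    if override is None:
        return None
    return override / actual_cpus


if __name__ == "__main__":
    # A dual-core machine configured as four cores gives a ratio of 2.0,
    # matching the "2X as many cores" pattern described in the thread.
    print(overcommit_ratio(SAMPLE_CC_CONFIG, actual_cpus=2))
```

On a real machine the actual core count could come from `os.cpu_count()`, and the XML text from wherever the client's cc_config.xml lives, which varies by platform.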