climateprediction.net (CPDN) home page
Thread 'Computer wasting multiple models'

NewtonianRefractor

Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 36033 - Posted: 27 Jan 2009, 5:45:02 UTC

I just got paired with this guy: 796839. Holy suit!

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 36034 - Posted: 27 Jan 2009, 7:24:06 UTC

Please keep in mind that the forums of this project are intended to be child friendly, and that moderators can enforce this by deleting posts. As it's not too obvious this time, hopefully this thread will just sink quickly.

However, thanks for telling us about this. There's a thread somewhere that is being used by people for reporting these out-of-control computers, and a method has been devised to deal with them.
I'll pass it on to the project people.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36035 - Posted: 27 Jan 2009, 14:56:11 UTC

I wonder whether these are phantom tasks that have never shown up in the member's BOINC Manager. This phenomenon happened once when something went very wrong on a CPDN server. But whereas those people got dozens or hundreds of phantom tasks in a few minutes, this computer's getting its full quota every day even though there's nothing completed, nothing crunched, nothing crashed. The server isn't detecting what's happening.

Looking through a few of this member's workunits I've found a couple of other computers that are also accumulating tasks, all with New status.

No wonder the supply of HADAMs ran out recently and the work queue had to be replenished.
Cpdn news

old_user170894
Joined: 3 Mar 06
Posts: 96
Credit: 353,185
RAC: 0
Message 36036 - Posted: 27 Jan 2009, 16:45:54 UTC - in response to Message 36035.  

I wonder whether these are phantom tasks that have never shown up in the member's BOINC Manager. This phenomenon happened once when something went very wrong on a CPDN server. But whereas those people got dozens or hundreds of phantom tasks in a few minutes, this computer's getting its full quota every day even though there's nothing completed, nothing crunched, nothing crashed. The server isn't detecting what's happening.

Looking through a few of this member's workunits I've found a couple of other computers that are also accumulating tasks, all with New status.

No wonder the supply of HADAMs ran out recently and the work queue had to be replenished.


The first six computers in his hosts list show a few more tasks with New status than one would normally expect, but nothing too surprising if they're the shorter slab models. What I find a bit odd is that all of his hosts appear to be configured, probably via the ncpus option in cc_config.xml, to have 2X as many cores as they actually have. They are all AMD, not Intel, so they don't have hyperthreading. He might have micro-managed those first 6 hosts to cache 2 or 4 more tasks than normal, to keep the "fake" cores busy.
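
For anyone who wants to check their own machine for this kind of override, here is a minimal Python sketch. It assumes the override sits in an `<ncpus>` element under `<options>` in cc_config.xml, and it treats a non-positive value as "no override" (an assumption). The function names and the sample XML are made up for illustration; this is not a project tool.

```python
import os
import xml.etree.ElementTree as ET


def configured_ncpus(xml_text):
    """Return the <ncpus> override found in cc_config.xml text, or None."""
    root = ET.fromstring(xml_text)
    node = root.find("./options/ncpus")
    if node is None or node.text is None:
        return None
    value = int(node.text.strip())
    # Assumption: a non-positive value means "use the real CPU count".
    return value if value > 0 else None


def is_overstated(xml_text, real_cpus=None):
    """True if the config claims more CPUs than the machine really has."""
    real = real_cpus if real_cpus is not None else os.cpu_count()
    override = configured_ncpus(xml_text)
    return override is not None and real is not None and override > real


# Hypothetical config for a dual-core machine told to act as a quad-core.
SAMPLE = """<cc_config>
  <options>
    <ncpus>4</ncpus>
  </options>
</cc_config>"""

if __name__ == "__main__":
    print(is_overstated(SAMPLE, real_cpus=2))
```

To run it against a real file, read the file's contents into a string and pass that to `is_overstated`; the location of cc_config.xml depends on the installation.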

The seventh computer in his list is the one where things are very bizarre. That one has accumulated more than 400 tasks with New status (I stopped counting after 20 pages). That's far more than most people would be willing to force into cache by manually micro-managing. He's running Linux, which has plenty of software authoring tools, which makes me wonder if he hasn't built a script or something that has gone awry and is now, automatically and unknown to him, micro-managing the client in a way that is causing the client to download new tasks every day. That happened to me a while back, though it didn't involve CPDN, so I know it can happen. Or he may be running one of the recent 6.x.x clients that have known scheduler problems that cause some projects to download way more tasks than are needed. This would be a new twist on the known bad behavior but it's possible.

Whatever the cause is it certainly warrants a PM to the member.


Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 36037 - Posted: 27 Jan 2009, 17:50:19 UTC

Prompted by NewtonianRefractor, I checked my own travelling partners and found http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_user.php?userid=550526 who seems similarly to be downloading models but getting nowhere.

DaveG27
Joined: 8 Nov 06
Posts: 18
Credit: 2,425,895
RAC: 0
Message 36038 - Posted: 27 Jan 2009, 18:05:02 UTC

Checked my partners and found these two on the same WU:
185 tasks
65 tasks

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 36041 - Posted: 27 Jan 2009, 23:45:08 UTC
Last modified: 28 Jan 2009, 0:01:09 UTC

Tolu has a script to detect computers that download more than a particular number of tasks per week/month. The number 'allowed' varies according to the number of CPUs. Some computers may just be within these limits, avoiding detection by the script, while crashing lots of tasks.
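
To make the idea concrete, here is a rough Python sketch of a per-CPU download check. It is not Tolu's script, which isn't shown in this thread; the allowance of 3 tasks per CPU per week, the function name and the host IDs are all assumptions made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical allowance; the real limits are not stated in this thread.
TASKS_PER_CPU_PER_WEEK = 3


def over_quota_hosts(downloads, ncpus_by_host, now,
                     window=timedelta(days=7),
                     per_cpu=TASKS_PER_CPU_PER_WEEK):
    """Return {host_id: download_count} for hosts above their allowance.

    downloads     -- iterable of (host_id, download_time) pairs
    ncpus_by_host -- {host_id: number of CPUs}; unknown hosts count as 1 CPU
    """
    cutoff = now - window
    # Count only downloads that fall inside the time window.
    counts = Counter(host for host, when in downloads if cutoff <= when <= now)
    flagged = {}
    for host, n in counts.items():
        allowed = per_cpu * ncpus_by_host.get(host, 1)
        if n > allowed:
            flagged[host] = n
    return flagged


if __name__ == "__main__":
    now = datetime(2009, 1, 27)
    # Made-up host IDs and download times, for illustration only.
    downloads = [(1, now - timedelta(days=d % 7)) for d in range(10)]
    downloads += [(2, now - timedelta(days=d)) for d in range(6)]
    print(over_quota_hosts(downloads, {1: 2, 2: 2}, now))
```

In the demo, host 2 downloads exactly its allowance of 6 tasks and is not flagged, which is the blind spot described above: a host can stay just inside the limit and still waste everything it downloads.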

I've found two other computers sharing tasks from Hipparchus's workunits and accumulating lots of tasks all with New status. This isn't normal. Occasionally on a properly functioning computer a task appears to download but never gets into the BOINC manager to be crunched. It stays on the computer's web page, classified as New for ever. But this shouldn't happen to scores of tasks on one computer.

Hipparchus started downloading lots of models in early December while he was still crunching two models normally. I suspect, as Dagorath does, that he's overridden a BOINC setting, it's all gone wrong and he hasn't noticed. When this started he had BOINC 6.2.15, a normal version from Berkeley AFAIK; not one of the recent BOINC alpha versions with work fetch problems.

I've found 5 other computers crashing lots of models, in most cases with easily correctable errors. They're probably just inside the download limits that trigger Tolu's script.
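
A check on failure rate rather than download count would catch hosts like these. The Python sketch below is only an illustration of that idea, not an existing project script; the thresholds (at least 10 finished tasks, 80% or more failed), the function name and the host IDs are assumptions.

```python
def wasteful_hosts(outcomes, min_tasks=10, min_failure_ratio=0.8):
    """Return a sorted list of host IDs that fail most of their tasks.

    outcomes -- {host_id: (completed, failed)} task counts per host
    """
    flagged = []
    for host, (completed, failed) in sorted(outcomes.items()):
        total = completed + failed
        # Ignore hosts with too few tasks to judge fairly.
        if total >= min_tasks and failed / total >= min_failure_ratio:
            flagged.append(host)
    return flagged


if __name__ == "__main__":
    # Made-up host IDs and counts, for illustration only.
    sample = {101: (0, 40), 102: (9, 1), 103: (0, 3)}
    print(wasteful_hosts(sample))
```

The `min_tasks` floor is there so that a new host with two or three early failures isn't flagged before it has had a chance to settle down.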

We'll look at the other computers members have kindly reported.

All these people will probably receive the email that Tolu's script triggers. Milo can also send it manually. It invites the recipients to post for advice on the forums but we know from past experience that many won't. Their work quota will be curtailed.

Members are welcome to post links to other computers using far too many models. NewtonianRefractor, I'll edit your thread title to make the discussion topic clearer. [Edit: done.]
Cpdn news

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 38836 - Posted: 2 Feb 2010, 12:04:27 UTC

I know there is a more recent thread about computers that crash too many work units, but I have been unable to locate it quickly as I am going to bed shortly.

I have been paired with the following 4 hosts and they all appear to have problems

940827 has been in constant error mode on all work units since late Dec '09.
1028360 has been in constant error mode since 1st week of Jan '10.
980622 constant errors on over 200 work units.
1001262 well over 300 work unit errors with no successful results and zero total.

Is it possible that they can be looked at please so they can be notified that things aren't working the way they should.

Thanks
Conan.

Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 38838 - Posted: 2 Feb 2010, 15:43:05 UTC - in response to Message 38836.  


Is it possible that they can be looked at please so they can be notified that things aren't working the way they should.


I've notified the owners of those machines.

Steinar1965
Joined: 4 Sep 06
Posts: 79
Credit: 5,583,517
RAC: 0
Message 38853 - Posted: 3 Feb 2010, 17:22:54 UTC - in response to Message 36035.  

My computer failed to download new models. The message was "failed to download..."
My computer has been very stable for a long time and other projects work fine.

Les Bayliss
Volunteer moderator
Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 38854 - Posted: 3 Feb 2010, 18:01:05 UTC

There's a thread about this here.

The problem is being looked into.


Backups: Here

JohnofWem
Joined: 15 Feb 06
Posts: 16
Credit: 7,341,604
RAC: 4,614
Message 38884 - Posted: 9 Feb 2010, 0:28:37 UTC

One here with 1166 tasks and only 60000 credits. Wasting several models each day.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=221382

Les Bayliss
Volunteer moderator
Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 38885 - Posted: 9 Feb 2010, 0:43:47 UTC

Thanks John.
I've passed it on.

Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 38889 - Posted: 9 Feb 2010, 11:48:22 UTC

I have done a trawl and found several that are downloading many models which don't progress. The userids are 576014, 611561, 542015, 208327, 465460 and 602769. Of course, there could be good reasons for these, but they may need help. Rgds.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 38892 - Posted: 9 Feb 2010, 21:01:14 UTC

Thanks, Lockleys. A couple of these seem to have stopped trying to crunch for CPDN so I'm leaving them but am passing the others on.
Cpdn news

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 39168 - Posted: 7 Mar 2010, 9:07:10 UTC
Last modified: 7 Mar 2010, 9:09:11 UTC

I have found a swag of hosts to add to the list of model wasters;

1041824 272 models zero points all compute errors

941509 49 models no successes yet

857475 Hundreds of Compute Errors

1031365 Constant Errors since the 27/12/09 over 100

1040276 259 models Zero total all Errors

1052894 142 Models Zero total all Errors

1028006 695 Models Zero Total All Errors

948530 All Models start then fail after one or two trickles, over 100

961803 Constant errors

947688 Over 700 Errors

Six of these hosts are associated with WU 6654409; that is how I noticed them.

Thanks
Conan

Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 39170 - Posted: 7 Mar 2010, 11:36:26 UTC - in response to Message 39168.  

I have found a swag of hosts to add to the list of model wasters;


Thanks, Conan. I will deal with these, which means disconnecting them and sending a misconfiguration warning e-mail to the owners.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39178 - Posted: 7 Mar 2010, 21:10:41 UTC

Thanks for the report, Conan. If you look at the web page for any of those computers you'll see that it's now allowed -1 tasks per day, i.e. nothing.

You'd think that occasionally there'd be a forum post saying: I've received an email from CPDN saying my computer's misconfigured, so what should I do? But I've never seen a post like that.
Cpdn news

old_user5681
Joined: 31 Aug 04
Posts: 42
Credit: 547,031
RAC: 0
Message 39179 - Posted: 8 Mar 2010, 0:19:48 UTC - in response to Message 39178.  

Thanks for the report, Conan. If you look at the web page for any of those computers you'll see that it's now allowed -1 tasks per day, i.e. nothing.

You'd think that occasionally there'd be a forum post saying: I've received an email from CPDN saying my computer's misconfigured, so what should I do? But I've never seen a post like that.


Why not sticky a thread where vigilant crunchers can report "out of control computers"?

I check my progress more or less every day so it would be no great effort.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39180 - Posted: 8 Mar 2010, 1:09:50 UTC

The moderators already have a private forum thread with that exact title, but the title of this thread is also a good one. I did link to it in a reminder in a fairly recent News post.

We could sticky the thread but I think occasional reports by more members will prevent it from sinking out of view.

Milo's probably going to add in his email to the owners of these computers a link to a forum section where they can ask for advice. That's to make it easier for them to do something about it.
Cpdn news

©2024 cpdn.org