Thread 'Computer wasting multiple models'

Starfire
Joined: 5 Feb 05
Posts: 17
Credit: 1,582,791
RAC: 0
Message 39183 - Posted: 8 Mar 2010, 8:12:43 UTC

I found a few too (all with me on WU 6668123)

These two seem to have all errors (zero CPU time, Average turnaround time 0 days, Avg. credit 0.00):

host 1048960: 43 tasks
host 922470: 850 tasks

These two produce lots of errors (with zero CPU time), but once in a while they report a task successfully:
host 991013: 867 tasks
host 1042139: 75 tasks

Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 39184 - Posted: 8 Mar 2010, 9:22:25 UTC - in response to Message 39183.  
Last modified: 8 Mar 2010, 9:23:21 UTC

Thanks for the report Starfire.
These two produce lots of errors (with zero CPU time), but once in a while they report a task successfully:
host 991013: 867 tasks

The task list is ordered by task id, but that doesn't reflect the order in which tasks are sent out, because CPDN creates large batches of work which are sent out randomly. It looks like the user fixed the problem with that one on 11th February.
host 1042139: 75 tasks

Similarly, this one was fixed on 3rd February.
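
A minimal sketch of the ordering point above, assuming made-up task records (the ids, dates and field layout are hypothetical, not taken from the CPDN task pages): read in task-id order a host can look like it fails intermittently, while the same tasks ordered by sent time show a run of errors that stops once the problem is fixed.

```python
from datetime import datetime

# Hypothetical task records: (task_id, sent_time, outcome).
# These values are invented for illustration only.
tasks = [
    (1003, datetime(2010, 1, 20), "error"),
    (1001, datetime(2010, 2, 15), "success"),
    (1002, datetime(2010, 1, 5), "error"),
    (1004, datetime(2010, 2, 20), "success"),
]


def outcomes_by_sent_time(records):
    """Return outcomes ordered by when each task was sent, not by task id."""
    return [outcome for _, _, outcome in sorted(records, key=lambda r: r[1])]


def last_error_date(records):
    """Return the sent time of the most recent failed task, or None."""
    error_dates = [sent for _, sent, outcome in records if outcome == "error"]
    return max(error_dates) if error_dates else None


# Ordered by task id the outcomes alternate; ordered by sent time the
# errors all come first, which suggests the host was fixed afterwards.
print(outcomes_by_sent_time(tasks))
print(last_error_date(tasks))
```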
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Starfire
Joined: 5 Feb 05
Posts: 17
Credit: 1,582,791
RAC: 0
Message 39185 - Posted: 8 Mar 2010, 10:44:02 UTC - in response to Message 39184.  

The task list is ordered by task id but that doesn't reflect the order tasks are sent out

Sorry, I completely forgot about that :( I'll check the time frame more closely in the future.

old_user5681
Joined: 31 Aug 04
Posts: 42
Credit: 547,031
RAC: 0
Message 39187 - Posted: 8 Mar 2010, 11:49:49 UTC - in response to Message 39185.  
Last modified: 8 Mar 2010, 11:54:36 UTC

Here are three that seem to guzzle WU.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1044952

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=221382

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=793420

Quite happy to spend some time going through more if that's what's needed/wanted.

Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 39188 - Posted: 8 Mar 2010, 12:43:28 UTC - in response to Message 39187.  

Here are three that seem to guzzle WU.


Thanks - I'd already caught one of those but not the other two, and they have now been dealt with. If the max_results_day is set to "-1" then it means that they've been done. When they're fixed I'll set that back to a proper value.
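
A rough sketch of the convention described above, assuming a simple host record: only the max_results_day name and the "-1" marker come from the post. The Host class, the helper functions and the quota of 3 are hypothetical, and this is not the project's actual server code.

```python
from dataclasses import dataclass

BLOCKED = -1  # marker value mentioned in the post for a host that has been dealt with


@dataclass
class Host:
    host_id: int
    max_results_day: int


def block(host):
    """Mark the host as blocked and return its previous daily quota."""
    previous = host.max_results_day
    host.max_results_day = BLOCKED
    return previous


def is_blocked(host):
    """True if the host carries the -1 marker."""
    return host.max_results_day == BLOCKED


def unblock(host, quota):
    """Restore a positive daily quota once the owner reports a fix."""
    if quota <= 0:
        raise ValueError("quota must be a positive number of results per day")
    host.max_results_day = quota


host = Host(host_id=1, max_results_day=3)  # hypothetical host and quota
saved_quota = block(host)
print(is_blocked(host))
unblock(host, saved_quota)
print(is_blocked(host))
```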

old_user5681
Joined: 31 Aug 04
Posts: 42
Credit: 547,031
RAC: 0
Message 39190 - Posted: 8 Mar 2010, 13:55:11 UTC - in response to Message 39188.  

Here are three that seem to guzzle WU.


Thanks - I'd already caught one of those but not the other two, and they have now been dealt with. If the max_results_day is set to "-1" then it means that they've been done. When they're fixed I'll set that back to a proper value.


Thanks for the info. Last 5 for today.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=956437

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=950613

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1054067

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1026816

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1053686

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,903,241
RAC: 2,063
Message 39219 - Posted: 11 Mar 2010, 17:22:29 UTC
Last modified: 11 Mar 2010, 19:31:51 UTC

Here's a dodgy one: 798604

... and more: 813244, 940056, 1002470, 1019903, 1037709, 1039634, 1044352.

[Edit: Slight mess-up on the configuration management front, Milo - I've posted additions over the list you've already done. :-(]

Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 39220 - Posted: 11 Mar 2010, 17:55:43 UTC

All done - thanks, Iain.

old_user5681
Joined: 31 Aug 04
Posts: 42
Credit: 547,031
RAC: 0
Message 39222 - Posted: 11 Mar 2010, 18:14:45 UTC

A few more:

The first computer seems to have 60 WU in progress and the rest detached.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1001846

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=872876

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1040010

There seem to be a lot of crunchers who have only downloaded, say, 10 or fewer WU, but all have failed. Do you want us to report these as well? This one, for example:


http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1060272


mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39224 - Posted: 12 Mar 2010, 3:48:43 UTC

Hi Martin

We shouldn't report to Milo computers with failures over just a short period. For example, the computer #1060272 you asked about had a series of download errors with HadAM3P over two days, but there was something wrong at the server end and everybody was getting download failures with HadAM3P. Now that Milo's found a fix, the computer's crunching and has trickled.

Confusingly, the task page for that computer lists some of these download failures correctly but lists other identical failures as Error. So one doesn't see immediately what's going on.

We also need to give people time to notice the problem and try to sort it out. So we're reporting computers that are trashing models bigtime, longterm.

BTW, if you type [ url]paste web address here[/url] (leaving out the space inside the first two tags) you'll find your post contains a live link.
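
The reporting criteria described above (many failures, sustained over a long period, rather than a short burst that may be a server-side problem) could be sketched roughly as follows. The HostSummary fields and both thresholds are assumptions chosen for illustration; the thread does not give exact numbers.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical thresholds: the thread only says "bigtime, longterm".
MIN_FAILED_TASKS = 50
MIN_FAILURE_SPAN_DAYS = 14


@dataclass
class HostSummary:
    host_id: int
    failed_tasks: int
    successful_tasks: int
    first_failure: date
    last_failure: date


def worth_reporting(host):
    """True only for hosts failing in large numbers over a long period."""
    span_days = (host.last_failure - host.first_failure).days
    mostly_failing = host.failed_tasks > host.successful_tasks
    return (
        host.failed_tasks >= MIN_FAILED_TASKS
        and span_days >= MIN_FAILURE_SPAN_DAYS
        and mostly_failing
    )


# A short burst of failures (e.g. a two-day download problem) is not reported.
short_burst = HostSummary(1, 8, 0, date(2010, 3, 9), date(2010, 3, 11))
# Hundreds of failures spread over many months are reported.
long_term = HostSummary(2, 600, 3, date(2009, 5, 1), date(2010, 3, 1))

print(worth_reporting(short_burst))
print(worth_reporting(long_term))
```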
Cpdn news

old_user5681
Joined: 31 Aug 04
Posts: 42
Credit: 547,031
RAC: 0
Message 39250 - Posted: 16 Mar 2010, 18:39:16 UTC - in response to Message 39224.  
Last modified: 16 Mar 2010, 18:48:39 UTC

Thanks for the reply Mo.v - it's more laziness than ignorance when it comes to formatting posts. I'll make them clickable in the future though.

Just to say http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1001846 has wasted over twenty more WU since I pointed him out in my last post on this thread.

I do know you're busy, and I don't want to nag, but I thought it worth mentioning. :)

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 39251 - Posted: 16 Mar 2010, 21:25:37 UTC - in response to Message 39250.  

Thanks for the reply Mo.v - it's more laziness than ignorance when it comes to formatting posts. I'll make them clickable in the future though.

Just to say http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1001846 has wasted over twenty more WU since I pointed him out in my last post on this thread.

I do know you're busy, and I don't want to nag, but I thought it worth mentioning. :)

I put the 6 back in your URL in this reply. I clicked on it before and came up with an AMD "Pentium" PC from 2005. :)

That is a strange host as it is downloading, but never erroring or returning results. I wonder what the task list looks like in BOINC Manager. Not that the owner ever looks at it obviously...

old_user5681
Joined: 31 Aug 04
Posts: 42
Credit: 547,031
RAC: 0
Message 39252 - Posted: 16 Mar 2010, 21:30:13 UTC - in response to Message 39250.  
Last modified: 16 Mar 2010, 21:57:26 UTC

I really haven't a clue what's going on with the internal links on this forum, but my last post refers to a member called "marquexa".
When I click on my link I get "dolce". I'm confused, as the link is the same as in my original post.

FWIW, the clickable "this post" url tags also didn't work for me.

Jeez, you wouldn't think I've been online for the better part of 15 years given the state of my posts :)

old_user5681
Joined: 31 Aug 04
Posts: 42
Credit: 547,031
RAC: 0
Message 39254 - Posted: 17 Mar 2010, 0:03:35 UTC - in response to Message 39252.  

Thanks geophi,

Please ignore my last post. I'll keep taking the pills and have a few more early nights...

Ho hum...

Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 39256 - Posted: 17 Mar 2010, 10:28:29 UTC

I've done "marquexa" as the host mentioned is indeed a bit odd.
I have also hacked the script that e-mails the owners of dodgy hosts so that it automatically cuts them off, to save me a bit of time. If they report on the relevant thread that they've performed the steps asked of them to fix their host then I'll remove the block manually.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39271 - Posted: 19 Mar 2010, 20:41:55 UTC
Last modified: 19 Mar 2010, 21:18:19 UTC

It would be better if no more models were sent to

computer 984556
977058
1032269


1039628
1048252
1048246
1048245
1039633
1039629
1039641

The single owner of all the second group has already received an email about another of his computers but has not yet solved the problem or asked for advice.

I've created a Trac ticket in the hope that, if it's implemented, a lot of red crash messages in BOINC Manager will alert a few more members with problem computers to the need for action.
Cpdn news

metalius
Joined: 28 Nov 06
Posts: 89
Credit: 12,124,839
RAC: 4,320
Message 39282 - Posted: 22 Mar 2010, 6:59:34 UTC

One typical SABOTEUR - 1006007.

Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 39284 - Posted: 22 Mar 2010, 9:38:09 UTC

I've done another pass on the machines mentioned above.

metalius
Joined: 28 Nov 06
Posts: 89
Credit: 12,124,839
RAC: 4,320
Message 39289 - Posted: 22 Mar 2010, 12:33:28 UTC

790895
944328 - 768 tasks since May 2009!!!
961748
970346 - 599 tasks!
1011012
1022149
1032402
1039336
To be continued...

Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 39293 - Posted: 22 Mar 2010, 14:06:42 UTC

All done, thanks.