climateprediction.net (CPDN) home page
Thread 'Computer wasting multiple models'

Milo Thurston
Volunteer moderator
Volunteer developer

Send message
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 39819 - Posted: 1 Jun 2010, 10:26:36 UTC - in response to Message 39816.  

I think that this computer may now be a problem.
He's trying to run the models in a strange way: here and here



At least he is already asking for help!
An eye will have to be kept on these various machines, I think.
ID: 39819 · Report as offensive     Reply Quote
Byron Leigh Hatch @ team Carl ...
Avatar

Send message
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 39873 - Posted: 6 Jun 2010, 15:44:25 UTC

Is it possible that these computers' owners are having difficulties and need assistance?

1059389
1033089
ID: 39873 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39875 - Posted: 6 Jun 2010, 18:38:23 UTC

Yes, I'd say both need the email.
Cpdn news
ID: 39875 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39957 - Posted: 17 Jun 2010, 22:33:16 UTC

The first computer Byron mentioned now has models sending trickles so we'd better leave it alone. The second does need the email.

1038404
997756
1055530
1044825
1076628
1034106
221382
594400
1070985
1067727
Cpdn news
ID: 39957 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer

Send message
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 39960 - Posted: 18 Jun 2010, 7:51:08 UTC

All done, thanks.
ID: 39960 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 39984 - Posted: 20 Jun 2010, 22:40:26 UTC

Milo, some of that last batch accidentally didn't get minussed. They're still downloading and crashing. I'm posting them again:

997756
1055530
1044825
1034106
1070985
1067727

Plus a few new ones:

1063530
1006798
1056839
951134
Cpdn news
ID: 39984 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer

Send message
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 39985 - Posted: 21 Jun 2010, 7:44:13 UTC - in response to Message 39984.  

That's very strange. Anyway, all are done now.
ID: 39985 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40015 - Posted: 24 Jun 2010, 23:50:44 UTC

The first of these is wasting models bigtime.

845926
1074635
1072200
862535
1061054
1074854
1066771
1067084
1075433
1070985
Cpdn news
ID: 40015 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer

Send message
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 40016 - Posted: 25 Jun 2010, 8:38:53 UTC

Done, thanks.
ID: 40016 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40021 - Posted: 26 Jun 2010, 6:19:02 UTC
Last modified: 26 Jun 2010, 6:31:34 UTC

Milo, there's something wrong with how your script minusses computers. When I look at the last computer list, some are minussed but some are not. I'm surprised that the script works in some cases but not others. I wonder whether your script is perfectly OK but perhaps the new Boinc server version isn't handling all computers' daily quotas correctly. If so, this problem could conceivably have the same root cause as the inability of some computers to fetch work.

Perhaps your script's minussing works at the time but the server's Boinc undoes the minus value immediately or later. That's just my speculation.

The email part of your script is definitely working; Marta's annoyance in the thread where these people are invited to post seems to indicate that she had received the misconfiguration email twice: the first time and then again when you redid her computer. Yet she's still got a daily quota of 1 and is still downloading, and of course crashing, more models.

I'm going to trawl through the lists of computers I posted, going back in time to see when this problem of the script not always minussing computers started. If I were you I wouldn't try the script again yet.

My list will be in the next post.
Cpdn news
ID: 40021 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40022 - Posted: 26 Jun 2010, 7:24:24 UTC
Last modified: 26 Jun 2010, 8:22:56 UTC

Computers not minussed, by date

24 June

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=845926
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1074635
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1072200
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1061054
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1066771
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1075433
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1070985

20 June

Done twice by Milo's script:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=997756
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1055530
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1044825
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1034106 (this is Marta)
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1070985
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1067727

Done once only

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1063530
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1056839
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=951134

30 May - 1 June

I'm not checking Byron's list as I don't know which of them Milo minussed.

28 May

All still minussed (Milo reenabled certain computers from this and some subsequent lists. I've gone through our records to avoid listing these.)

26 May

All still minussed

18 May

All still minussed

14 May

All still minussed

22 April

All still minussed

21 April

All still minussed

20 April

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=896382


That's far enough back. I haven't listed computers that Milo reenabled. I can't identify the anonymous person (20 April) from our records.

There's certainly a difference in the server's treatment of daily quotas before and after the Boinc upgrade, but its strange new behaviour only affects certain computers. It may be, as Thyme Lawn is wondering in a moderator thread, that quota allocation now works by application type. We'll need to wait patiently until he has time to dig into the code to find out what it does.
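For illustration, the "done twice" group above can be recovered by intersecting the 17 June list with the first part of the 20 June repost. This is a minimal Python sketch that assumes only the host IDs and the URL pattern quoted in this thread; the function names are hypothetical and this is not Milo's script.

```python
# Host IDs copied from the 17 June post and the first block of the 20 June repost.
POSTED_17_JUNE = [1038404, 997756, 1055530, 1044825, 1076628,
                  1034106, 221382, 594400, 1070985, 1067727]
REPOSTED_20_JUNE = [997756, 1055530, 1044825, 1034106, 1070985, 1067727]

# URL pattern as it appears in the list above.
HOST_URL = ("http://climateapps2.oucs.ox.ac.uk/cpdnboinc/"
            "show_host_detail.php?hostid={hostid}")


def split_by_repost(first_list, repost_list):
    """Return (reposted, not_reposted), preserving the order of first_list."""
    repost = set(repost_list)
    reposted = [h for h in first_list if h in repost]
    not_reposted = [h for h in first_list if h not in repost]
    return reposted, not_reposted


def host_url(hostid):
    """Build the host detail link for one host ID."""
    return HOST_URL.format(hostid=hostid)


if __name__ == "__main__":
    reposted, _ = split_by_repost(POSTED_17_JUNE, REPOSTED_20_JUNE)
    for hostid in reposted:
        print(host_url(hostid))
```

Hosts from the first list that were not reposted were presumably minussed successfully, although the thread does not confirm that for each one.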
Cpdn news
ID: 40022 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer

Send message
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 40023 - Posted: 26 Jun 2010, 8:16:45 UTC

Thanks, Mo. There is indeed something amiss here; I've been checking as I run the 'minus' script and it does indeed update the database, so I suspect that the server upgrade is indeed behind this.

As for the e-mails, they will have to continue, as we should offer people a means to seek assistance rather than disconnecting them without warning. As I mentioned above, those who cannot or do not want to seek assistance would be best off detaching non-working machines, as they will then not be e-mailed about them.
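One way to tell "the script never wrote the value" apart from "something overwrote it afterwards" is to read the quota back immediately after the update and again later. The sketch below uses an in-memory SQLite table. The table name `host`, the column `daily_quota`, the host ID 1 and all function names are assumptions made for illustration; none of them is taken from the real CPDN/BOINC database or from the actual 'minus' script.

```python
import sqlite3


def set_quota(conn, hostid, quota):
    """Write a daily quota for one host (hypothetical schema)."""
    conn.execute("UPDATE host SET daily_quota = ? WHERE id = ?", (quota, hostid))
    conn.commit()


def read_quota(conn, hostid):
    """Read the stored quota back, or None if the host is unknown."""
    row = conn.execute(
        "SELECT daily_quota FROM host WHERE id = ?", (hostid,)
    ).fetchone()
    return None if row is None else row[0]


def demo():
    """Minus a host, check the value, then simulate a later overwrite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, daily_quota INTEGER)")
    conn.execute("INSERT INTO host (id, daily_quota) VALUES (1, 1)")

    set_quota(conn, 1, -1)
    immediately = read_quota(conn, 1)  # value right after the update

    # Stand-in for some other process resetting the quota afterwards.
    set_quota(conn, 1, 1)
    later = read_quota(conn, 1)  # value seen on a later check

    conn.close()
    return immediately, later
```

If the first read shows -1 and a later read shows 1 again, the update itself worked and something else changed the value afterwards, which is the pattern suspected in this thread.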
ID: 40023 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40024 - Posted: 26 Jun 2010, 8:33:29 UTC

Thanks for checking the script even on such a gloriously sunny Saturday morning.

I have a collection of links to quite a few more computers that crash all their models. I'll hold onto them for the time being to see whether the quota problem can be fixed before I post them.

Really, the computers' owners are, as you say, fortunate to receive the offer of assistance in the email, and it's better for CPDN to have fewer models crashed by misconfigured computers.
Cpdn news
ID: 40024 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40399 - Posted: 21 Aug 2010, 11:00:11 UTC
Last modified: 21 Aug 2010, 11:00:45 UTC

Milo, now that you can again minus computers that waste lots of models I'm going to repost the IDs of computers you tried to minus earlier but couldn't. I'll omit any computers no longer crashing models.

The following people should all have received your email without being minussed but took no action. So another email and a -1 quota seem very reasonable.

845926
1072200
1066771
1055530
1067727
1063530
Cpdn news
ID: 40399 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer

Send message
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 40402 - Posted: 23 Aug 2010, 10:02:53 UTC

Done - thanks.
ID: 40402 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40404 - Posted: 23 Aug 2010, 11:29:31 UTC

Thanks, Milo.

There's another group of computers I kept on hold while you were unable to minus them. They were getting the -226 code with the lockfile error. You sent them an email advising them to upgrade their Boinc to fix this. Unfortunately, many did not take your advice. I'm reporting members in this position who are still crashing models. They need the standard email this time and to be minussed.

I have great sympathy with the last member in the list who did upgrade his Boinc, after which another error type emerged. But he does need further advice.

911520
1007769
1006227
1012950
1072223
1005628
1014568
228135
997848

Cpdn news
ID: 40404 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer

Send message
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 40405 - Posted: 23 Aug 2010, 12:51:00 UTC

Those are all done as well.
ID: 40405 · Report as offensive     Reply Quote
old_user92639

Send message
Joined: 13 Aug 05
Posts: 54
Credit: 117,227
RAC: 0
Message 40413 - Posted: 24 Aug 2010, 16:50:22 UTC

938765 -2 & -185 (0x36b1)
1075365 -226
1087957 -2
1092307 -2 & (0x5) & -108
1095034 22 (0x16)

WU 6856284

:)


ID: 40413 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40414 - Posted: 24 Aug 2010, 18:28:18 UTC

That's a disastrous workunit apart from one computer. We will have to let the last two members you listed (computers 1092307 and 1095034) try for longer to see whether they can fix the problems themselves, but if after two or three more weeks they still can't process anything, Milo will have to minus them. (The last has dreadful error messages.)

The first three should definitely receive the email in my opinion.
Cpdn news
ID: 40414 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer

Send message
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 40419 - Posted: 25 Aug 2010, 8:22:12 UTC

OK, the first three are done.
ID: 40419 · Report as offensive     Reply Quote