climateprediction.net (CPDN) home page
Thread 'Computer wasting multiple models'

Message boards : Number crunching : Computer wasting multiple models
old_user596405

Joined: 4 Oct 09
Posts: 73
Credit: 7,242,427
RAC: 0
Message 40422 - Posted: 26 Aug 2010, 7:09:48 UTC

This newcomer may need assistance. Started on the 21st but has lost the first 4 models after just a few trickles. All with the general exit code 1.

Computer Id = 1095430
ID: 40422 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 40423 - Posted: 26 Aug 2010, 8:36:07 UTC - in response to Message 40422.  

It's a laptop, and I think that error 1 is: turned off the computer without first exiting from BOINC. Possibly: closed the lid and hibernated.

ID: 40423 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40424 - Posted: 26 Aug 2010, 9:33:25 UTC
Last modified: 26 Aug 2010, 9:37:17 UTC

Here's what Jorden says. Are these explanations possible when the computer doesn't appear to have a CUDA card? I don't think they're relevant in this case.
Cpdn news
ID: 40424 · Report as offensive     Reply Quote
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 40425 - Posted: 26 Aug 2010, 10:37:50 UTC

I've had a few exit code 1's in my time, but none for at least 2 years. Every one happened when the controller process stopped and left an orphaned worker process running. The error happened when the controller was restarted and tried to start a second worker.
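
A rough way to picture the orphaned-worker situation described above, as a minimal Python sketch. The `Proc` record, the process names "controller" and "worker", and the sample PIDs are all made up for illustration; they are not the real CPDN executable names, and this is not how BOINC itself checks for leftovers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """One row of a hypothetical process-table snapshot."""
    pid: int
    ppid: int
    name: str


def find_orphaned_workers(procs, controller_name, worker_name):
    """Return worker processes whose parent is not a running controller."""
    controller_pids = {p.pid for p in procs if p.name == controller_name}
    return [
        p for p in procs
        if p.name == worker_name and p.ppid not in controller_pids
    ]


# Hypothetical snapshot taken just after the controller was restarted.
snapshot = [
    Proc(pid=100, ppid=1, name="controller"),  # the restarted controller
    Proc(pid=57, ppid=42, name="worker"),      # parent 42 no longer exists
]

orphans = find_orphaned_workers(snapshot, "controller", "worker")
print([p.pid for p in orphans])  # [57]
```

Here the worker with PID 57 still points at a parent (42) that is no longer running, so it is reported as orphaned. A restarted controller trying to start a second worker alongside it is the scenario described above.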
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 40425 · Report as offensive     Reply Quote
old_user596405

Joined: 4 Oct 09
Posts: 73
Credit: 7,242,427
RAC: 0
Message 40426 - Posted: 26 Aug 2010, 10:38:23 UTC - in response to Message 40423.  

It's a laptop, and I think that error 1 is: turned off the computer without first exiting from BOINC. Possibly: closed the lid and hibernated.


Maybe depends on hibernation settings. Just tested closing the lid. No issue (Win XP). But pulling the plug would almost certainly mess up the models.

Noted that the member appears to be running Malaria in rotation (being a single core CPU) and has also lost WUs there with exit 1. Did complete two WUs though!

Times of last trickle for each of the 4 crashed CPDN tasks are at random times during the day.




ID: 40426 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40428 - Posted: 26 Aug 2010, 14:10:29 UTC
Last modified: 26 Aug 2010, 14:11:21 UTC

In a future News post I'll add a reminder about exiting from Boinc before shutting down a computer. I suspect that hundreds of people don't exit first. Usually a shutdown without a Boinc exit leaves the model intact but sooner or later one will crash.

I don't think we can do anything about this person at the moment as he needs time to notice the crashes and take action. But in a month or so we should check what's happening on this computer and if necessary ask Milo to send him the email.
Cpdn news
ID: 40428 · Report as offensive     Reply Quote
old_user596405

Joined: 4 Oct 09
Posts: 73
Credit: 7,242,427
RAC: 0
Message 40429 - Posted: 26 Aug 2010, 14:32:01 UTC - in response to Message 40428.  

This member is in our team - we like to see newcomers getting smoothly off the mark! When posting the computer id earlier, had forgotten that our leader gets email addresses. Subsequently the member has now been emailed with the BOINC exit warning as a first suggestion.

Hopefully we shall get a response from the newcomer soon but will keep monitoring anyway. Will update this thread if we can help the member resolve the issue.

Meantime, thanks anyway for your and the team's support!
ID: 40429 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40430 - Posted: 26 Aug 2010, 14:46:24 UTC

Tell him how to exit completely, as it would be very easy to think that closing the Boinc manager is enough.
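
For anyone comfortable with a command line, one way to shut the client down completely is the `boinccmd --quit` command that ships with BOINC. Below is a minimal Python sketch that wraps it. The helper names `quit_command` and `quit_client` are invented here, and it assumes `boinccmd` is on the PATH and can authenticate to the local client (it may need to be run from the BOINC data directory so that it can find the GUI RPC password).

```python
import shutil
import subprocess


def quit_command(boinccmd="boinccmd"):
    """Build the command line that asks the running BOINC client to quit."""
    return [boinccmd, "--quit"]


def quit_client(boinccmd="boinccmd"):
    """Ask the client to quit; return False if boinccmd cannot be found."""
    if shutil.which(boinccmd) is None:
        return False
    # check=True raises CalledProcessError if boinccmd reports a failure.
    subprocess.run(quit_command(boinccmd), check=True)
    return True
```

Calling `quit_client()` before shutting the computer down asks the client, and the models it is running, to stop. Closing the Manager window alone may leave the client running in the background.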

Belonging to an active and keen team is of great help to people.
Cpdn news
ID: 40430 · Report as offensive     Reply Quote
Strathpeffer
Joined: 9 Jan 07
Posts: 497
Credit: 342,899
RAC: 0
Message 40448 - Posted: 29 Aug 2010, 10:22:05 UTC
Last modified: 29 Aug 2010, 10:25:02 UTC

It might also be worth reminding people that it's a good idea to click on "No new tasks" when you've already got a task running (or as many tasks as your computer can cope with). When your task is completed and has reported, you can click on "Allow new tasks" to get another one.

By the way, when the server was down recently and I wanted to suspend network activity till it was working again, I couldn't remember how to do that and had quite a hard time finding out! It's in the "Activity" tab in BOINC Manager (which seemed obvious once I had found it).
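
The same two settings can also be changed from the command line with `boinccmd`. A minimal Python sketch that only builds the commands follows. The function names are invented for illustration, and the project URL is left as a parameter, since it has to match the URL shown for the project in your own client.

```python
NETWORK_MODES = ("always", "auto", "never")


def no_new_tasks(project_url, boinccmd="boinccmd"):
    """Command equivalent to clicking "No new tasks" for one project."""
    return [boinccmd, "--project", project_url, "nomorework"]


def allow_new_tasks(project_url, boinccmd="boinccmd"):
    """Command equivalent to clicking "Allow new tasks" for one project."""
    return [boinccmd, "--project", project_url, "allowmorework"]


def set_network_mode(mode, boinccmd="boinccmd"):
    """Command that sets network activity to always, auto or never."""
    if mode not in NETWORK_MODES:
        raise ValueError(f"mode must be one of {NETWORK_MODES}, got {mode!r}")
    return [boinccmd, "--set_network_mode", mode]
```

Each list can be passed to `subprocess.run(...)`. `set_network_mode("never")` corresponds to suspending network activity, and `set_network_mode("auto")` goes back to the preference-based setting.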
Visit the Scotland team
ID: 40448 · Report as offensive     Reply Quote
Strathpeffer
Joined: 9 Jan 07
Posts: 497
Credit: 342,899
RAC: 0
Message 40449 - Posted: 29 Aug 2010, 10:30:28 UTC - in response to Message 40430.  
Last modified: 29 Aug 2010, 10:40:08 UTC

mo.v wrote:
...
Belonging to an active and keen team is of great help to people.

Yes indeed - sometimes new people are hesitant about posting in open forums (I was myself, hard to believe now folks, isn't it!). But, if you're in a team, you can always email, or send a private message to, the team "founder". Or post in the team's own forum, if it has one.

And NOBODY should worry that they might ask a "silly" question and look foolish - we've all done that too, and CPDN people are very understanding.
Visit the Scotland team
ID: 40449 · Report as offensive     Reply Quote
Jord
Joined: 5 Aug 04
Posts: 250
Credit: 93,274
RAC: 0
Message 40455 - Posted: 31 Aug 2010, 6:16:49 UTC
Last modified: 31 Aug 2010, 6:28:43 UTC

I was checking the computers I was paired against. Found 3 bad ones.

Someone please check hostID 565904. Nice that it's an anonymous host, but it's wasting models by the hundreds.

As is hostID 941450. And hostID 1021107.
Jord.
ID: 40455 · Report as offensive     Reply Quote
Jord
Joined: 5 Aug 04
Posts: 250
Credit: 93,274
RAC: 0
Message 40457 - Posted: 31 Aug 2010, 6:27:22 UTC - in response to Message 40455.  

Whoa, more...

hostID 975193 (340 bad)
hostID 1087147 is dubious. What is this person doing?
hostID 843760 keeps wasting them.
hostID 866343 needs a slap on the fingers.
hostID 1026857 does as well.

Jord.
ID: 40457 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40458 - Posted: 31 Aug 2010, 7:42:53 UTC

We'll have to leave #1087147 at least for the time being. The tasks may have got the detached designation by the owner restoring a Boinc Data folder backup; all the detached models can be crunched.

All the others in Ageless's two posts need the email and to be minussed though.
Cpdn news
ID: 40458 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 40460 - Posted: 31 Aug 2010, 10:37:29 UTC - in response to Message 40458.  


All the others in Ageless's two posts need the email and to be minussed though.


Done.
ID: 40460 · Report as offensive     Reply Quote
Starfire

Joined: 5 Feb 05
Posts: 17
Credit: 1,582,791
RAC: 0
Message 40506 - Posted: 1 Sep 2010, 18:20:34 UTC
Last modified: 1 Sep 2010, 18:21:16 UTC

I had some time so I took another look at some of the WUs.

All of these computers have no successfully completed tasks listed:
975228
1063016
1080305

Last successfully completed a task on 13 Jan 2010:
644869

The last one I'm not so sure about:
1056295
This computer was only created in February and hasn't had a successfully completed task yet - the ones I've looked into failed with exit status -2 (Could not launch model process. Last Error=193).
However it belongs to a member who has more than 20 active computers and a high RAC. His other computers appear to be running smoothly. Maybe a simple info message about that particular computer would be enough in this case.
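
On the "Last Error=193" part: assuming this number is the Windows system error code returned when the model process failed to launch (the thread does not confirm this), 193 is ERROR_BAD_EXE_FORMAT, meaning the file is not a valid Win32 application. A small Python sketch for pulling the code out of such a message follows. The function name and the short lookup table are hypothetical and only cover a few well-known codes.

```python
import re

# A few standard Windows system error codes (not a complete list).
WINDOWS_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
    193: "ERROR_BAD_EXE_FORMAT",
}


def explain_last_error(message):
    """Extract the number after "Last Error=" and look up its name."""
    match = re.search(r"Last Error=(\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, WINDOWS_ERRORS.get(code, "unknown")


print(explain_last_error("Could not launch model process. Last Error=193"))
# (193, 'ERROR_BAD_EXE_FORMAT')
```

If that reading is right, the problem would lie with the model executable on that one machine, which would fit with the owner's other computers running smoothly.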

Starfire
ID: 40506 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40531 - Posted: 1 Sep 2010, 20:57:29 UTC

I agree completely about the first 4. Email and a -1 quota.

I think it would be better not to send the email to the owner of the last computer though as he hasn't downloaded any new models for two months. He has probably realised that there's a problem with this machine. As you say, he has lots of other problem-free computers to use for CPDN.
Cpdn news
ID: 40531 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40656 - Posted: 11 Sep 2010, 8:54:56 UTC

ID: 40656 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 40674 - Posted: 15 Sep 2010, 15:06:13 UTC

The most recent ones are done - apologies for the delay.
ID: 40674 · Report as offensive     Reply Quote
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40685 - Posted: 16 Sep 2010, 17:25:48 UTC

ID: 40685 · Report as offensive     Reply Quote
Milo Thurston
Volunteer moderator
Volunteer developer
Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 40688 - Posted: 17 Sep 2010, 7:43:42 UTC

All done, thanks.
ID: 40688 · Report as offensive     Reply Quote
©2024 cpdn.org