climateprediction.net (CPDN) home page
Thread 'What went wrong (crashed WU)'

old_user691968
Joined: 30 Dec 12
Posts: 4
Credit: 7,776
RAC: 0
Message 45455 - Posted: 15 Jan 2013, 11:30:46 UTC

stderr of Task 15527415

<core_client_version>7.0.42</core_client_version>
<![CDATA[
<message>
- exit code 193 (0xc1)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
00:11:14 (4080): Can't acquire lockfile (32) - waiting 35s
00:11:19 (7496): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
06:38:20 (8224): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
06:38:21 (8224): No heartbeat from core client for 30 sec - exiting
06:38:22 (8224): No heartbeat from core client for 30 sec - exiting
06:38:23 (8224): No heartbeat from core client for 30 sec - exiting
06:38:24 (8224): No heartbeat from core client for 30 sec - exiting
Suspended CPDN Monitor - Suspend request from BOINC...
09:54:20 (8800): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
CPDN Monitor - Quit request from BOINC...
16:35:15 (8244): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
17:43:20 (860): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
21:39:09 (8268): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
CPDN Monitor - Quit request from BOINC...
00:28:29 (8804): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
00:28:39 (8804): No heartbeat from core client for 30 sec - exiting
00:28:40 (8804): No heartbeat from core client for 30 sec - exiting
00:28:42 (8804): No heartbeat from core client for 30 sec - exiting
Suspended CPDN Monitor - Suspend request from BOINC...
03:56:02 (8824): No heartbeat from core client for 30 sec - exiting
Suspended CPDN Monitor - No 'heartbeat' from BOINC...
11:36:16 (4224): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
11:36:20 (4224): No heartbeat from core client for 30 sec - exiting
Atmos Hold Restart file rename failed on atmos_restart.hold
Suspended CPDN Monitor - Suspend request from BOINC...

</stderr_txt>
]]>

I haven't been able to run a single model to completion, and I've run 4 or 5 WUs by now...

Is a WU totally useless if it isn't completed or can the trickles be used to build a new WU where the old one left off?
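
For anyone trying to read a long stderr dump like the one above, a rough tally of message types can make the pattern easier to see. The following is a minimal Python sketch, not a CPDN or BOINC tool. The function name `tally_stderr` and the category labels are made up for illustration, and the matching relies only on the message wording visible in this log.

```python
from collections import Counter

# Substrings taken from the log above, mapped to hypothetical labels.
# Order matters: the first matching pattern wins for each line.
PATTERNS = [
    ("No heartbeat from core client", "client_heartbeat_lost"),
    ("No 'heartbeat' from BOINC", "monitor_heartbeat_lost"),
    ("Suspend request from BOINC", "suspend_request"),
    ("Quit request from BOINC", "quit_request"),
    ("Can't acquire lockfile", "lockfile_wait"),
    ("file rename failed", "rename_failed"),
]


def tally_stderr(text):
    """Count how often each known message type appears in a stderr dump."""
    counts = Counter()
    for line in text.splitlines():
        for needle, label in PATTERNS:
            if needle in line:
                counts[label] += 1
                break
    return counts


if __name__ == "__main__":
    sample = (
        "Suspended CPDN Monitor - Suspend request from BOINC...\n"
        "CPDN Monitor - Quit request from BOINC...\n"
        "00:11:19 (7496): No heartbeat from core client for 30 sec - exiting\n"
        "Atmos Hold Restart file rename failed on atmos_restart.hold\n"
    )
    for label, n in tally_stderr(sample).most_common():
        print(label, n)
```

A tally like this only shows what was logged and how often. It does not by itself say which message, if any, marks the cause of the crash.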
ID: 45455
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 45456 - Posted: 15 Jan 2013, 12:31:54 UTC - in response to Message 45455.  

Some models crash because an impossible climate is generated, e.g. negative pressure. However, the fact that all your models are crashing points to something else. It is worth making sure that your BOINC folder is excluded from any antivirus program, because if BOINC tries to write to a file while the antivirus has an exclusive lock on it, the task will crash. The models currently available have a habit of crashing at the 25, 50, 75 and 100% points, particularly if the computer is shut down and restarted around these points. When available, the regional models are much less prone to this.

Before long one of the moderators will be along to fill in the bits I have missed, of which there are quite a few.
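
To illustrate the antivirus point: on Windows, a rename or write can fail while another program holds the file open. A minimal Python sketch of the usual workaround, retrying briefly before giving up, is below. This is not how the CPDN application handles its restart files (nothing in this thread says how it does). The name `rename_with_retry`, the attempt count and the delay are all assumptions for illustration.

```python
import os
import time


def rename_with_retry(src, dst, attempts=5, delay=0.5):
    """Try to move src over dst, retrying if the OS reports the file is busy.

    Returns True on success, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            os.replace(src, dst)  # overwrites dst if it already exists
            return True
        except PermissionError:
            # On Windows this is typically raised when another process
            # (for example a scanner) has the file open without sharing.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Retrying only papers over short-lived locks. Excluding the BOINC folders from the scanner, as suggested above, removes the contention in the first place.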
ID: 45456
old_user691968
Joined: 30 Dec 12
Posts: 4
Credit: 7,776
RAC: 0
Message 45457 - Posted: 15 Jan 2013, 13:02:11 UTC

I've just excluded BOINC and ProgramData from my scanner. Looking at the log, it appears the error occurred when BOINC tried to suspend the task while I was away from my computer. There doesn't seem to be any reason for it to do so except for the scheduled project switching, so I've set the project switching interval to 99999 minutes (about 1,667 hours), hopefully long enough for one project to finish running in one go, barring any computer downtime. Anything else to look for? Thanks!
ID: 45457
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 45458 - Posted: 15 Jan 2013, 13:37:20 UTC - in response to Message 45457.  

I don't know about project switching, as when work is available I only run CPDN. The other thing I just remembered is that running another application that is very heavy on CPU time, e.g. video rendering, can cause a model to crash, so it is worth suspending tasks before doing that. It is also worth checking the times the other models crashed to see if they all did so while you were away. The last one to crash looks as if it may have been at the 25% point, where models are more vulnerable to crashing.
ID: 45458
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45459 - Posted: 15 Jan 2013, 20:56:33 UTC - in response to Message 45456.  
Last modified: 15 Jan 2013, 21:16:22 UTC

-


Dave Jackson wrote:

<quote>

It is worth making sure that your BOINC folder is excluded from any antivirus program
as if BOINC tries to write to a file while the antivirus has an exclusive lock on it the task will crash.

</quote>


Hi Dave,

I don't know how to do this.
Could you, or anyone, kindly give me some instructions on how I would do this?
I'm running Windows 7 Ultimate x86 Edition, Service Pack 1 (06.01.7601.00)
BOINC 7.0.28 (x86), running as a single installation (not as a service)
I'm using McAfee Anti Virus

On this computer (my computer #1167855, my fastest one) I only run CPDN.

Thanks in advance
Byron
ID: 45459
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45460 - Posted: 16 Jan 2013, 0:10:11 UTC - in response to Message 45459.  

Somewhere in the menu of your AV, there should be a place where you can specify exceptions.
It may be in an Options section, and the words used will most likely vary between AVs.

I've got separate logical drives for both parts of BOINC, so I just need to specify a drive letter, but others will need to have a longer string to define the locations. There may be a Browse option, which will allow you to hunt for the locations, and then click to specify them.

There may be 2 parts to this:
1) A regular, automated scan.
2) A manual scan.
Both need to be set if they're separate.


Backups: Here
ID: 45460
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,011,472
RAC: 21,368
Message 45461 - Posted: 16 Jan 2013, 9:07:04 UTC

All I know about sorting out Windows problems with BOINC and CPDN is from reading here. The last time my own machines had window$ on them was 13 years ago. I have been all Linux since then, so I can't tell you any more than Les has.
ID: 45461
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45462 - Posted: 16 Jan 2013, 21:39:38 UTC

As for Joe's problem, there's an awful lot of BOINC suspensions.

I get the feeling that the setting for "Suspend work if CPU usage is above" is still at the default of 25%, which means that BOINC, and the science apps, are constantly being stopped and started as Joe uses the computer.

Other projects' work may not mind, but the Coupled Ocean models are too touchy for this. Sooner or later they usually fail, especially if "Leave tasks in memory while suspended?" isn't set to Yes.
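
The two settings mentioned above can usually also be set through a local override file rather than the Manager. The Python sketch below builds such a file's contents. It assumes the override file is `global_prefs_override.xml` in the BOINC data directory and that the tag names are `suspend_cpu_usage` and `leave_apps_in_memory`. Check both assumptions against the BOINC documentation for your client version before relying on them. The helper name `build_override` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_override(suspend_cpu_usage=0.0, leave_apps_in_memory=True):
    """Return the text of a preferences override file.

    Tag names are assumed to match BOINC's global_prefs_override.xml;
    verify them against the documentation for your client version.
    """
    root = ET.Element("global_preferences")
    # 0 is taken to mean "no threshold", i.e. never suspend because of
    # CPU usage by other programs.
    ET.SubElement(root, "suspend_cpu_usage").text = f"{suspend_cpu_usage:.6f}"
    ET.SubElement(root, "leave_apps_in_memory").text = (
        "1" if leave_apps_in_memory else "0"
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_override())
```

The script only prints the XML. Saving it into the data directory and getting the client to re-read its preferences is left out, since both steps depend on the installation.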



ID: 45462
Chris
Joined: 9 Apr 12
Posts: 10
Credit: 2,700,404
RAC: 0
Message 45463 - Posted: 16 Jan 2013, 23:00:26 UTC

Any idea why these are so much worse than the regional models?

They take 3 weeks to run, so there is a lot of time for things to happen, but is there no way to make sure they suspend uneventfully? Even the shorter running models take a lot of faith to run, and the big ones, where it's possible, and even likely, that I'll lose a model after 20 days, aren't so good.
ID: 45463
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45464 - Posted: 17 Jan 2013, 1:19:13 UTC - in response to Message 45463.  

To start with, the models are from among those created by and for the UK Met Office, where they run on supercomputers. i.e. No interruptions.

The long models are Coupled Ocean models, whereas the regional are 'slab ocean'. i.e. a fixed set of values for the ocean.

Some years ago, an attempt was made to make a version of 'FAMOUS' models that would run multi-core. This didn't even get out of the alpha (in house) testing, apparently because of the coupling between the ocean and the atmosphere components. They quickly became unstable.

So it's possible that the hadcm3 models are more finicky because of this ocean-atmosphere bit. If treated carefully, they (mostly) run OK. If treated as 'just another Windows program which can be interrupted whenever', then they have problems.
Perhaps because of having a lot of files open, and being interrupted just as one has been updated and its matching partner(s) haven't been yet?

All just guess work, so your ideas are as good as mine. :)
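
To make that guess concrete: the general hazard with interrupting a program mid-update is that a file can be left half-written. A common defence, sketched here in Python, is to write to a temporary file and swap it into place in one step. This illustrates the general technique only. Nothing in this thread says whether the hadcm3 code does or does not do this, and `write_atomically` is a made-up helper.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write bytes to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the swap
        os.replace(tmp_path, path)  # single rename step
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This protects one file at a time. It does not keep several files consistent with each other, which is the multi-file case speculated about above.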


ID: 45464
Chris
Joined: 9 Apr 12
Posts: 10
Credit: 2,700,404
RAC: 0
Message 45469 - Posted: 18 Jan 2013, 3:29:24 UTC

Do you know which computers they run/ran on? With so many being multi-core it seems odd there is this much trouble trying to run them in parallel, either on cpu or gpu.

Wikipedia says the model is at least a decade old, but supercomputers had many cores by then.

ID: 45469
old_user691968
Joined: 30 Dec 12
Posts: 4
Credit: 7,776
RAC: 0
Message 45471 - Posted: 18 Jan 2013, 5:05:16 UTC - in response to Message 45462.  

Les Bayliss wrote:

<quote>

As for Joe's problem, there's an awful lot of BOINC suspensions.

I get the feeling that the setting for "Suspend work if CPU usage is above" is still at the default of 25%, which means that BOINC, and the science apps, are constantly being stopped and started as Joe uses the computer.

Other projects' work may not mind, but the Coupled Ocean models are too touchy for this. Sooner or later they usually fail, especially if "Leave tasks in memory while suspended?" isn't set to Yes.

</quote>




I'd set the suspend work threshold to 0 (i.e. no threshold) after noticing once that BOINC seesawed between running and not running every ten seconds or so while I was doing nothing at the computer. It didn't seem to affect my computer usage. But someone at the BOINC forums claimed that this may be one of my problems.

I've posted a lot about my efforts to eradicate the "Task exited with zero status but no 'finished' file" errors that I got in BOINC's log file corresponding with the time this CPDN model seemed to give up the ghost (11:36:20?), here: boinc.berkeley.edu/dev/forum_thread.php?id=8134&postid=47366

I'd be grateful if an expert from here took a look at that thread to see anything I've missed.

But I'd love to know if these errors are even the cause of the failure--or was it this line?
"Atmos Hold Restart file rename failed on atmos_restart.hold"

And was what I listed the complete stderr log or does it seem to be cut off in the middle?
ID: 45471
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45472 - Posted: 18 Jan 2013, 5:07:20 UTC - in response to Message 45469.  

ID: 45472
old_user691968
Joined: 30 Dec 12
Posts: 4
Credit: 7,776
RAC: 0
Message 45473 - Posted: 18 Jan 2013, 5:09:45 UTC

A question about backups: is it enough to copy the \ProgramData\BOINC\projects\climateprediction.net project folder or is it necessary to copy the whole programdata directory as the tutorial says? If I did the latter wouldn't it mean turning back the clock on every task I ran (including other projects) back to the backup time?
ID: 45473
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 45474 - Posted: 18 Jan 2013, 6:05:31 UTC - in response to Message 45473.  

No. In order to make a usable backup you need to copy everything in the ProgramData/BOINC folder. Yes, the clock will be reset on all projects, so it is good to make backups every few days.

ID: 45474
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45475 - Posted: 18 Jan 2013, 7:47:21 UTC

As Jim says - EVERYTHING.
This is because there are things in the BOINC part, such as client_state.xml, which could also be called: BOINC's To Do list.
And it MUST be done after both the manager and the client have been shut down.
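
As a rough illustration of "copy everything", here is a minimal Python sketch that copies a whole BOINC data directory into a timestamped folder. It is not the method from the backup tutorial. The function name `backup_boinc_data` and the folder naming are assumptions, and the only check it makes is that `client_state.xml` is present. It cannot tell whether the manager and client are still running, so shutting them down first is up to you.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir, backup_root):
    """Copy the entire BOINC data directory into a new timestamped folder.

    Run this only after both the BOINC Manager and the client have exited.
    Returns the path of the newly created backup folder.
    """
    data_dir = Path(data_dir)
    if not (data_dir / "client_state.xml").exists():
        raise FileNotFoundError(
            f"{data_dir} does not look like a BOINC data directory"
        )
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"BOINC-backup-{stamp}"
    # copytree copies every file and subfolder; it raises an error if the
    # target folder already exists, so an older backup is never overwritten.
    shutil.copytree(data_dir, target)
    return target
```

A call might look like `backup_boinc_data(r"C:\ProgramData\BOINC", r"D:\boinc_backups")`, where both paths are examples only and need adjusting for your own machine.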

These days, backups are mostly a protection against power failures in their many forms, such as "the dog tripped over the power cable and pulled it out".

As for clock problems, if you do a search of these boards you'll find very little mention of it. Except for this thread. :)

As well as 'clock going backwards for all projects', there'll also be the problem of the computers getting new IDs for all projects. Unless you know the secret of altering the aforementioned client_state.xml, 'which ain't easy'.

And speaking of 'clock going backwards for all projects', this will, depending on how bad it is, and the server settings for various projects, result in other projects aborting their WUs due to time problems.


ID: 45475
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45477 - Posted: 18 Jan 2013, 19:47:33 UTC
Last modified: 18 Jan 2013, 19:48:14 UTC

Stepping back a bit I caught sight of the forest. So, some other thoughts.

"Task exited with zero status but no finished file" messages are "mostly harmless". People have been getting this on and off for years, and most of the time the model can restart OK. It's not something that's considered a worry.

But there are many other things that can cause a model to fail, including the very nature of the models. (Mentioned elsewhere.)

"No heartbeat ..." is another message that often shows up, but it just means that BOINC got too busy to detect something it needed, and then complained. Probably to itself. Although, as a result, it's possible that BOINC can get itself so muddled that it deletes WUs because it thinks it's "their fault" rather than its own.

The computer clock being "out" (fast?), may be a worry when running other projects, but it would only be a problem here if it became "out" by several years. And even then I'm not sure what would happen, because most people's computers successfully resync regularly.

I suspect that a lot of CPDN failures in recent times may be due to people cramming as much as they can into a computer to run at the same time, and just plain overloading/overwhelming the hardware.
ID: 45477
tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 45528 - Posted: 1 Feb 2013, 15:28:49 UTC

This model (hadam3p_eu) failed after 34.8 s. Here is the reason:
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish

Another one is however running.
Tullio

ID: 45528
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45529 - Posted: 1 Feb 2013, 19:42:39 UTC - in response to Message 45528.  

I've had 2, and there are a few others as well.
I reported it earlier, and the project people are looking into it.

ID: 45529