climateprediction.net (CPDN) home page
Thread 'Sulpher model stopped running?'

[B^S] Paul@home

Joined: 31 Dec 04
Posts: 5
Credit: 191,799
RAC: 0
Message 19251 - Posted: 13 Jan 2006, 11:05:06 UTC
Last modified: 13 Jan 2006, 11:48:35 UTC

Hi folks,

I am currently crunching WorkUnit 844214. This host usually trickles about once a day, but I have noticed that it has not trickled in 2 days. I have taken a look at the WU on the host and noticed something odd...

BOINC has the model status = running, yet the process is not in Task Manager.

In the projects\climateprediction.net folder, the last modified file was from Jan 11th at about 10.44am. This is a few hours after my last trickle, and it is also the time a benchmark was attempted by BOINC.

In climateprediction.net\sulphur_dfit_000626645, the last modified file is stderr_um.txt, last modified at 04.25 on the 11th. This roughly corresponds to the last trickle time (allowing for +1 hour from UTC). In this file, there are a couple of warnings and then the last line of text is cut off:
OPEN:  File dataout/dfitba.da39810 Created on Unit 22
CLOSE: WARNING: Unit 66 Not Opened
OPEN:  File dataout/dfitba.pg39aug Created on Unit 66
CLOSE: WARNING: Unit 67 Not Opened
OPEN:  File dataout/dfitba.ph39aug Created on Unit 67
CLOSE: WARNING: Unit 68 Not Opened
OPEN:  File dataout/dfitba.pi39aug Created on Unit 68
OPEN:  File dataout/dfitba.da39840 Created on Unit 22
OPEN:  File dataout/dfitba.da39870 Created on Unit 22
OPEN:  File dataout/dfitba.da398a0 Created on Unit 22
OPEN:  File dataout/dfitba.da398d0 Created on Unit 22
OPEN:  File dataout/dfitba.da398g0 Created on Unit 22
OPEN:  File dataout/dfitba.da398j0 Created on Unit 22
OPEN:  File dataout/dfitba.da398m0 Created on Unit 22
OPEN:  File da


This is as-is from the file. I have not messed up the copy/paste!!
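
The check described above (finding the most recently modified file in the model folder and comparing it with the last trickle time) can be sketched in Python. This is only a rough illustration: the function names `newest_file` and `looks_stalled` and the idle threshold are assumptions made for the example, not anything provided by BOINC or CPDN.

```python
import os
import time


def newest_file(directory):
    """Return (path, mtime) of the most recently modified file under directory, or None."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if newest is None or mtime > newest[1]:
                newest = (path, mtime)
    return newest


def looks_stalled(directory, max_idle_hours, now=None):
    """True if nothing under directory was modified within the last max_idle_hours."""
    now = time.time() if now is None else now
    found = newest_file(directory)
    if found is None:
        # An empty folder has no evidence of progress at all.
        return True
    return (now - found[1]) > max_idle_hours * 3600
```

Pointing `looks_stalled` at the work unit folder with a threshold a little longer than the usual gap between trickles would flag a model that has gone quiet, although a long idle time on its own does not prove the process has died.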

Does this indicate that the model had some sort of issue at around the time of the last trickle?

Has anyone seen this before?


As an extra bit of info, the benchmarks (10ish am) failed with 'Aborting CPU benchmarks, one or more active tasks are still running.' I believe this was because the CP model had already hung at this point and BOINC seemed to think it could not stop it.

Now the current state:
BOINC thinks CPDN is running.
The sulphur app is not in Task Manager.
The load on the machine is 'missing' a BOINC science process (the host should always run 6 out of 8 processors, currently only 5 are busy).

I will stop/start BOINC in a while to see what happens, but I am just wondering if there is any information that would be of use to the project while the model is still in this state...

cheers,


Paul

Thyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 19255 - Posted: 13 Jan 2006, 13:03:17 UTC - in response to Message 19251.  
Last modified: 13 Jan 2006, 13:04:22 UTC

BOINC has the model status = running, yet the process is not in Task Manager.

As an extra bit of info, the benchmarks (10ish am) failed with 'Aborting CPU benchmarks, one or more active tasks are still running.' I believe this was because the CP model had already hung at this point and BOINC seemed to think it could not stop it.

Now the current state:
BOINC thinks CPDN is running.
The sulphur app is not in Task Manager.
The load on the machine is 'missing' a BOINC science process (the host should always run 6 out of 8 processors, currently only 5 are busy).

It sounds like you're still running BOINC 4.45, which has a known benchmark problem corresponding exactly to what you're seeing. Restarting BOINC is the only way you can get out of that state.

Upgrading to the latest version of BOINC is the best long-term solution, but if you don't want to do that you should download Chris Sutton's fixed version, which Arnaud is very kindly hosting here.

The truncated stderr_um.txt file is normal - I suspect it's due to unflushed writes to the file.
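
To illustrate what unflushed writes look like in general, here is a small Python sketch. It is an analogy only, assuming ordinary buffered file output; the model itself is not written in Python, and the helper name `on_disk_after_write` is made up for the example. Text written through a buffered file object stays in memory until the buffer is flushed, so a process that stops before flushing leaves a shorter file on disk than it "wrote".

```python
import os
import tempfile


def on_disk_after_write(text, flush):
    """Write text through a buffered file object and return what is on disk before close."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        writer = open(path, "w", buffering=8192)
        writer.write(text)
        if flush:
            # Push the buffered text out to the operating system.
            writer.flush()
        # Read the file through a second handle, as an outside observer would.
        with open(path, "r") as reader:
            seen = reader.read()
        writer.close()
        return seen
    finally:
        os.remove(path)
```

With a short line and `flush=False` the observer sees nothing yet, while `flush=True` makes the whole line visible. When a buffer is instead written out in fixed-size chunks, the cut can land in the middle of a line, which would be consistent with a last line ending at "OPEN:  File da".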
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
[B^S] Paul@home

Joined: 31 Dec 04
Posts: 5
Credit: 191,799
RAC: 0
Message 19258 - Posted: 13 Jan 2006, 13:40:23 UTC

Yes I am running 4.45. I cannot switch to 5.x on this host because of problems with proxy authentication on all 5.x versions.

I have downloaded Chris Sutton's version of 4.45 and it seems to be running fine. The sulphur model started right back up again as soon as I stopped and restarted the BOINC service...

Thanks for the help!

Paul.

Ananas
Volunteer moderator

Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 19263 - Posted: 13 Jan 2006, 14:59:38 UTC
Last modified: 13 Jan 2006, 15:10:08 UTC

I expected this problem when I heard about cURL - FaD uses cURL and cannot get through Squid with Auth either.

It's not cURL's fault though: wget uses cURL and has no problem with our Squid, whether on Linux, Windows or AIX.


The error I always had with FaD was HTTP/1.0 407 Proxy Authentication Required followed by an ERR_CACHE_ACCESS_DENIED. It looks as if it didn't even try to authenticate.
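
For reference, a client that does try to authenticate normally reacts to a 407 by repeating the request with a Proxy-Authorization header. The Python sketch below shows that step under the assumption that the proxy accepts the Basic scheme (a Squid setup may well require a different one); the function names and the credentials passed in are hypothetical, and no network traffic is involved.

```python
import base64


def parse_status_code(status_line):
    """Extract the numeric status code from an HTTP status line."""
    parts = status_line.split(None, 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError("not an HTTP status line: %r" % status_line)
    return int(parts[1])


def proxy_basic_auth_header(username, password):
    """Build the Proxy-Authorization header for the Basic scheme (an assumption here)."""
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return ("Proxy-Authorization", "Basic " + token)


def retry_headers(status_line, username, password):
    """Headers to add when retrying: credentials after a 407, otherwise nothing."""
    if parse_status_code(status_line) == 407:
        return dict([proxy_basic_auth_header(username, password)])
    return {}
```

A client that gets the 407 and then fails without ever sending such a header would produce exactly the pattern described: the proxy's challenge followed by an access-denied error page.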
______________________________

Your problem with the running CPDN task sounds familiar. I sometimes had it when I stopped BOINC 4.13 while another program was using a lot of CPU. The first CPDN task was gone but the worker task ( ..._um_....exe) was still there.

My workaround was to stop BOINC before starting anything that needs a lot of resources. I haven't had it since then - whether because I am being more careful or because of better handling in BOINC 4.19, who knows.

I'm not sure if it's exactly like your problem though; yours seems to be the other way round.
[B^S] Paul@home

Joined: 31 Dec 04
Posts: 5
Credit: 191,799
RAC: 0
Message 19267 - Posted: 13 Jan 2006, 15:32:39 UTC

Yes, there are a few people reporting problems with proxy authentication and 5.x.

I have a bug raised in relation to it and it is being looked at (hopefully!).

The patched 4.45 seems to be working nicely for me now. I will wait and see what happens next time BOINC does a benchmark to see if it craps out again!

Professor Desty Nova
Joined: 19 Sep 04
Posts: 92
Credit: 2,014,122
RAC: 399
Message 19282 - Posted: 14 Jan 2006, 8:39:37 UTC

When I had BOINC 4.45, and got the same problem, the patched version solved it.


Professor Desty Nova
Researching Karma the Hard Way
[B^S] Paul@home

Joined: 31 Dec 04
Posts: 5
Credit: 191,799
RAC: 0
Message 19317 - Posted: 14 Jan 2006, 23:24:04 UTC - in response to Message 19282.  

When I had BOINC 4.45, and got the same problem, the patched version solved it.



Cool. Hopefully it will for me too... Have another machine running 4.45 at work so if it works on this one host, I will put it on the other too...

thanks again for the help guys! :)

Paul.

Ananas
Volunteer moderator

Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 19319 - Posted: 14 Jan 2006, 23:58:34 UTC
Last modified: 15 Jan 2006, 0:00:37 UTC

Be careful when switching from 5.x to 4.x.

Both are basically 4.x cores, but while the 5.x core knows that this is not a major version change, the 4.x core will most likely assume that it is a major version change and reset all work units.

So going from 4.x to 5.x is no problem, but the switch is lossless only in this direction.
[B^S] Paul@home

Joined: 31 Dec 04
Posts: 5
Credit: 191,799
RAC: 0
Message 19364 - Posted: 16 Jan 2006, 15:10:24 UTC

Don't worry, 5.x is going nowhere near that host until the proxy issue is resolved!

Just an update - Chris Sutton's fixed version of 4.45 seems to have done the trick. Benchmarks ran this morning: the sulphur model was paused, removed from memory and restarted successfully once the benchmarks finished.

Cheers!

Paul.
