Message boards : Number crunching : Sulphur model stopped running?
Author | Message |
---|---|
Joined: 31 Dec 04 Posts: 5 Credit: 191,799 RAC: 0 |
Hi folks, I am currently crunching WorkUnit 844214. This host usually trickles about once every day, but I have noticed that it has not trickled in 2 days. I have taken a look at the WU on the host and have noticed something odd... BOINC has the model status= running yet the process is not in the taskmanager. In the projects\climateprediction.net folder, the last modified file was from Jan 11th at about 10.44am. This is a few hours after my last trickle, and it is the time a benchmark was attempted by BOINC. In climateprediction.net\sulphur_dfit_000626645, the last modified file is stderr_um.txt, last modified at 04.25 on the 11th. This roughly corresponds to the last trickle time (give +1 hour UTC). In this file, there are a couple of warnings and then the last line of text is cut off: OPEN: File dataout/dfitba.da39810 Created on Unit 22 CLOSE: WARNING: Unit 66 Not Opened OPEN: File dataout/dfitba.pg39aug Created on Unit 66 CLOSE: WARNING: Unit 67 Not Opened OPEN: File dataout/dfitba.ph39aug Created on Unit 67 CLOSE: WARNING: Unit 68 Not Opened OPEN: File dataout/dfitba.pi39aug Created on Unit 68 OPEN: File dataout/dfitba.da39840 Created on Unit 22 OPEN: File dataout/dfitba.da39870 Created on Unit 22 OPEN: File dataout/dfitba.da398a0 Created on Unit 22 OPEN: File dataout/dfitba.da398d0 Created on Unit 22 OPEN: File dataout/dfitba.da398g0 Created on Unit 22 OPEN: File dataout/dfitba.da398j0 Created on Unit 22 OPEN: File dataout/dfitba.da398m0 Created on Unit 22 OPEN: File da This is as-is from the file. I have not messed up the copy/paste!! Does this indicate that the model had some sort of issue at around the time of the last trickle? Has anyone seen this before? As an extra bit of info, the benchmarks (10ish am) failed due to 'Aborting CPU benchmarks, one or more active tasks are still running.'. I believe this was because the CP model had already hung at this point and BOINC seemed to think it could not stop it. 
Now the current state: BOINC thinks CPDN is running. The sulphur app is not in Task Manager. The load on the machine is 'missing' BOINC science processes (the host should always run 6 out of 8 processors; currently only 5 are busy). I will stop/start BOINC in a while to see what happens, but I am just wondering if there is any information that would be of use to the project while the model is still in this state... cheers, Paul Click my stats to visit BOINC Synergy site! Join BOINC Synergy |
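The manual check described in this post (finding the most recently modified file in the model's folder and comparing it with the last trickle time) could be sketched in Python roughly as follows. This is only an illustration: the function names `newest_file_age` and `looks_stalled` and the two-hour idle threshold are assumptions made for the example, not part of BOINC or the CPDN application.

```python
import os
import time


def newest_file_age(folder, now=None):
    """Return (path, age_in_seconds) for the most recently modified file
    under `folder`, or (None, None) if no files are found."""
    now = time.time() if now is None else now
    newest_path, newest_mtime = None, None
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if newest_mtime is None or mtime > newest_mtime:
                newest_path, newest_mtime = path, mtime
    if newest_path is None:
        return None, None
    return newest_path, now - newest_mtime


def looks_stalled(folder, max_idle_seconds=2 * 60 * 60):
    """Hypothetical heuristic: treat the model as possibly hung if nothing
    in its folder has been written for `max_idle_seconds` (assumed value)."""
    _path, age = newest_file_age(folder)
    return age is not None and age > max_idle_seconds
```

A long idle time is only a hint. It does not prove the process has gone, so it would still need to be confirmed against the process list, as was done here with Task Manager.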
Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
"BOINC has the model status= running yet the process is not in the taskmanager." It sounds like you're still running BOINC 4.45, which has a known benchmark problem corresponding exactly to what you're seeing. Restarting BOINC is the only way you can get out of that state. Upgrading to the latest version of BOINC is the best long-term solution, but if you don't want to do that you should download Chris Sutton's fixed version, which Arnaud is very kindly hosting here. The truncated stderr_um.txt file is normal - I suspect it's due to unflushed writes to the file. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
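As a general illustration of how unflushed writes can leave a log file looking cut off, here is a small Python sketch. It is not the model's actual I/O code. The function name `demo_unflushed_write` and the log line it writes are made up for the example.

```python
import os
import tempfile


def demo_unflushed_write():
    """Return what a second reader sees in a file before and after the
    writer calls flush()."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    writer = open(path, "w")  # default buffering: data sits in memory first
    try:
        # Hypothetical log line, written into the buffer only.
        writer.write("OPEN: File dataout/example Created on Unit 22\n")
        with open(path) as reader:
            before = reader.read()  # nothing has reached the disk yet
        writer.flush()              # push the buffered text to the file
        with open(path) as reader:
            after = reader.read()
    finally:
        writer.close()
        os.remove(path)
    return before, after
```

If the writing process stops before its buffer is flushed, whatever was still in memory never reaches the file. With larger outputs the cut can fall in the middle of a line, which would match the "OPEN: File da" ending reported above.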
Joined: 31 Dec 04 Posts: 5 Credit: 191,799 RAC: 0 |
Yes, I am running 4.45. I cannot switch to 5.x on this host because of problems with proxy authentication on all 5.x versions. I have downloaded Chris Sutton's version of 4.45 and it seems to be running fine. The sulphur model started right back up again as soon as I stopped and restarted the BOINC service... Thanks for the help! Paul. |
Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
I expected this problem when I heard about cURL - FaD uses cURL and cannot get through Squid with Auth either. It's not cURL's fault though, wget uses cURL and has no problem with our Squid, neither on Linux nor on Windows or AIX. The error I always had with FaD was HTTP/1.0 407 Proxy Authentication Required followed by an ERR_CACHE_ACCESS_DENIED; it looks as if it didn't even try to authenticate. ______________________________ Your problem with the running CPDN task sounds familiar. I had it sometimes when I stopped BOINC 4.13 while another program used a lot of CPU. The first CPDN task was gone but the worker task ( ..._um_....exe) was still there. My workaround was to stop BOINC before I started anything that needs many resources. I haven't had it since then - whether because I am being more careful or because of better handling in BOINC 4.19, who knows. Not sure if it's exactly like your problem though; yours seems to be just the other way round. |
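For context on the 407 response mentioned above: a proxy that answers 407 expects the client to repeat the request with a `Proxy-Authorization` header. The Python sketch below shows that step for the Basic scheme only. That is an assumption, since the thread does not say which scheme the Squid proxy was configured for, and the function names `basic_proxy_auth_header` and `retry_headers` are hypothetical helpers, not part of cURL or BOINC.

```python
import base64


def basic_proxy_auth_header(user, password):
    """Build the value of a Proxy-Authorization header for the Basic
    scheme: the word 'Basic' followed by base64("user:password")."""
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token


def retry_headers(status, headers, user, password):
    """If the proxy answered 407, return a copy of `headers` with proxy
    credentials added; otherwise return `headers` unchanged."""
    if status != 407:
        return headers
    updated = dict(headers)
    updated["Proxy-Authorization"] = basic_proxy_auth_header(user, password)
    return updated
```

A client that "didn't even try to authenticate" would correspond to one that never performs this retry after receiving the 407.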
Joined: 31 Dec 04 Posts: 5 Credit: 191,799 RAC: 0 |
Yes, there are a few people reporting problems with proxy authentication and 5.x. I have a bug raised in relation to it and it is being looked at (hopefully!). The patched 4.45 seems to be working nicely for me now. I will wait and see what happens next time BOINC does a benchmark to see if it craps out again! |
Joined: 19 Sep 04 Posts: 92 Credit: 2,011,637 RAC: 351 |
When I had BOINC 4.45, and got the same problem, the patched version solved it. Professor Desty Nova Researching Karma the Hard Way |
Joined: 31 Dec 04 Posts: 5 Credit: 191,799 RAC: 0 |
"When I had BOINC 4.45, and got the same problem, the patched version solved it." Cool. Hopefully it will for me too... I have another machine running 4.45 at work, so if it works on this one host, I will put it on the other too... thanks again for the help guys! :) Paul. |
Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
Be careful with switching from 5.x to 4.x. Both are basically 4.x cores, but while the 5.x core knows that it is not a major version change, the 4.x core will most likely assume that it's a major version change and reset all work units. So going from 4.x to 5.x is no problem, but the switch is lossless only in that direction. |
Joined: 31 Dec 04 Posts: 5 Credit: 191,799 RAC: 0 |
Don't worry, 5.x is going nowhere near that host until the proxy issue is resolved! Just an update - Chris Sutton's fixed version of 4.45 seems to have done the trick. Benchmarks ran this morning; the sulphur model was paused, removed from memory and restarted successfully once the BM finished. Cheers! Paul. |
©2024 cpdn.org