Computation problem hadcm3s

Niall

Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 51683 - Posted: 24 Mar 2015, 12:09:50 UTC
Last modified: 24 Mar 2015, 12:26:10 UTC

I've had to suspend computation here due to a series of errors.

Yesterday BOINC crashed for an unknown reason, twice in the space of a couple of hours. Computation errors on two hadcm3s units followed within a few minutes.

Suspecting BOINC was faulty, I suspended work and updated the BOINC software to the latest version. Because of download errors in the installer, I was unable to download the new VirtualBox software, so I updated BOINC only.

I then restarted. Two more hadcm3s units began looping their calculations, advancing for about ten seconds before restarting at the previous position. After ten or fifteen minutes of this I aborted computation on these units.

Two more hadcm3s units then exhibited the same behaviour, starting from the beginning, then giving error:
Task hadcm3s_8gne_2002_2_009680232_0 exited with zero status but no 'finished' file before crashing.

Two hadam3p_eu units completed overnight and are ready to report, but I have four hadcm3s units on the system that I suspect are going to crash unless I fix the bug.

I've had a good completion record for several weeks before this, so it's a new problem.

Please advise.

EDIT: I know that this kind of error can be caused by interference from other software, such as antivirus. BOINC has been excluded from antivirus for some time.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 51684 - Posted: 24 Mar 2015, 12:46:41 UTC

I suspect it may be a problem with the models rather than anything to do with your computer. The task you link to is showing as still in progress, but that may be because not everything is working fully after the database server went down earlier today.

Niall
Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 51686 - Posted: 24 Mar 2015, 13:39:08 UTC - in response to Message 51684.  

The event log still says "feeder not running". I have several units trying to report.

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 51688 - Posted: 24 Mar 2015, 14:30:23 UTC - in response to Message 51686.  

The event log still says "feeder not running".


The feeder shows as running now.

Niall
Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 51697 - Posted: 25 Mar 2015, 14:12:55 UTC

Just had the same error with 4 units from the hadam3p_pnw_wjXX_2008 series. I think this is starting to look like my problem again:

Task hadam3p_pnw_wjqa_2008_1_009705531_0 exited with zero status but no 'finished' file

Do I reset the project?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 51698 - Posted: 25 Mar 2015, 14:56:54 UTC - in response to Message 51697.  

If corrupted files are causing the problem, resetting the project will make BOINC download them afresh. Any running CPDN tasks will be lost if you do that, so I would set the project to no new tasks and wait until everything you have has either finished or errored out before doing the reset.

It is also worth checking that you have followed the advice in the sticky at the top of the preferences section of the message boards. This matters especially if you sometimes run anything very processor-intensive.
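
For anyone who prefers the command line, the "no new tasks, then reset" sequence described above could be sketched in Python roughly as follows. This is only a sketch: it assumes BOINC's `boinccmd` tool is installed and on the PATH, `PROJECT_URL` is a placeholder that must be replaced with the project URL exactly as BOINC Manager shows it, and the helper names `project_command` and `run_project_command` are made up for illustration.

```python
import subprocess

# Placeholder (assumption): replace with the project URL shown in BOINC Manager.
PROJECT_URL = "http://example.org/"

# A subset of the operations accepted by `boinccmd --project URL <op>`.
ALLOWED_OPS = {"nomorework", "allowmorework", "update", "reset"}


def project_command(url, op):
    """Build the boinccmd argument list for one project operation."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operation: {op!r}")
    return ["boinccmd", "--project", url, op]


def run_project_command(url, op):
    """Run the command; raises CalledProcessError if boinccmd reports failure."""
    subprocess.run(project_command(url, op), check=True)
```

A cautious order would be `run_project_command(PROJECT_URL, "nomorework")` first, then `"reset"` only once every task has finished or errored out, and finally `"allowmorework"`.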

Niall
Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 51700 - Posted: 25 Mar 2015, 16:39:03 UTC

Now I'm getting really frustrated.

I did these things (well, I checked the settings, which were set to the recommended ones some time ago), and then reset the project.

Now I'm getting:
Task hadam3p_afr_uag9_2013_1_009442494_1 exited with zero status but no 'finished' file

This isn't funny. Any thoughts?

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2183
Credit: 64,822,615
RAC: 5,275
Message 51701 - Posted: 25 Mar 2015, 17:14:00 UTC - in response to Message 51700.  

Hi,

Not sure what is going on, but the PNWs that crashed today at 1500 UTC still show 7.4.36 as your BOINC version in stderr, whereas the latest is 7.4.42. Perhaps the install didn't take?
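
The version check described here can be done by eye, but as a rough sketch it might look like this in Python. The stderr line format in the usage note is an assumption, not copied from a real task log, and the function names are hypothetical.

```python
import re


def parse_version(text):
    """Return the first x.y.z version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def is_outdated(reported, latest):
    """True if the reported version is older than the latest one."""
    reported_v, latest_v = parse_version(reported), parse_version(latest)
    if reported_v is None or latest_v is None:
        raise ValueError("no x.y.z version number found")
    # Tuples compare element by element, so (7, 4, 36) < (7, 4, 42).
    return reported_v < latest_v
```

For example, `is_outdated("BOINC client version 7.4.36", "7.4.42")` returns `True`, which would suggest the newer install did not take effect.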

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51702 - Posted: 25 Mar 2015, 18:40:40 UTC - in response to Message 51700.  

The 'If this happens repeatedly you may need to reset the project' message is nearly always a red herring; resetting almost never fixes it.
The cause is a loss of connection between the app and BOINC.

So resetting the project is a waste of time. There's a problem with something on your computer,
e.g. intensive use of the computer by you to do something else.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,661,594
RAC: 14,529
Message 51705 - Posted: 25 Mar 2015, 19:29:57 UTC

At which point I shall bow out of the discussion, not having had a Windows computer for over 10 years.

Niall
Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 51713 - Posted: 27 Mar 2015, 23:28:25 UTC

Okay, thanks for that, everyone. I've just completed four hadcm3s units, and think it's probably fixed.

I did virus and hardware scans, both of which turned up clean.

Geophi was correct to suspect the BOINC update didn't install properly. I downloaded and installed the software again. This made it run, but very slowly.

I had another look at the settings. Under Computing preferences > Processor usage > Other options, I changed the settings to allow it to use 100% of the processors and 100% of the CPU time, NOT 0% (no restriction): the two values seem to behave differently, even though it looks like they shouldn't.

The timings on the units then reset to zero, and I thought I was about to lose the next set of work units.

The system is running quite warm, and I'm having to watch the heat, which is fine in my den in March, but may not be come July. I may have to tell it to leave one processor alone, just to keep the heat down.
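
For reference, the same two settings can also be written to BOINC's local override file instead of being set through the Manager. The sketch below only builds the XML text; it assumes the file is `global_prefs_override.xml` in the BOINC data directory and that the `max_ncpus_pct` and `cpu_usage_limit` tags correspond to the "% of the processors" and "% of CPU time" options. The function name and the 4-core figure in the usage note are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def build_override(max_ncpus_pct, cpu_usage_limit):
    """Return override XML with explicit percentages (0 is rejected on purpose)."""
    for name, value in (("max_ncpus_pct", max_ncpus_pct),
                        ("cpu_usage_limit", cpu_usage_limit)):
        # Require an explicit value rather than relying on 0 = "no restriction".
        if not 0 < value <= 100:
            raise ValueError(f"{name} must be in (0, 100], got {value}")
    root = ET.Element("global_preferences")
    ET.SubElement(root, "max_ncpus_pct").text = f"{max_ncpus_pct:.1f}"
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_usage_limit:.1f}"
    return ET.tostring(root, encoding="unicode")
```

`build_override(100, 100)` matches the settings described above. On a hypothetical 4-core machine, `build_override(75, 100)` would leave one processor idle to keep the heat down.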

Anyway, it works again. Thanks again.

