Computation problem hadcm3s
Author | Message |
---|---|
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
I've had to suspend computation here due to a series of errors. Yesterday BOINC crashed for an unknown reason, twice in the space of a couple of hours. Computation errors on two hadcm3s units followed within a few minutes. Suspecting BOINC was faulty, I suspended work and updated the BOINC software to the latest version. Due to download errors in the installer, I was unable to download the new VirtualBox software, so I updated BOINC only. I then restarted. Two more hadcm3s units began looping their calculations, advancing for about ten seconds before restarting at the previous position. After ten or fifteen minutes of this I aborted computation on these units. Two more hadcm3s units then exhibited the same behaviour, starting from the beginning, then crashing with the error: Task hadcm3s_8gne_2002_2_009680232_0 exited with zero status but no 'finished' file. Two hadam3p_eu units completed overnight and are ready to report, but I have four hadcm3s units on the system that I suspect will crash unless I fix the bug. I've had a good completion record for several weeks before this, so it's a new problem. Please advise. EDIT: I know that this kind of error can be caused by interference from other software, such as antivirus. BOINC has been excluded from antivirus for some time. |
Joined: 15 May 09 Posts: 4529 Credit: 18,661,594 RAC: 14,529 |
I suspect it may be a problem with the models rather than anything to do with your computer. The task you link to is showing as still in progress, but this may be because not everything is working fully after the database server was down earlier today. |
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
The event log still says "feeder not running". I have several units trying to report. |
Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
"The event log still says 'feeder not running'." The feeder shows as running now. |
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
Just had the same error with four units of the hadam3p_pnw_wjXX_2008 series. I think this is starting to look like my problem again: Task hadam3p_pnw_wjqa_2008_1_009705531_0 exited with zero status but no 'finished' file. Do I reset the project? |
Joined: 15 May 09 Posts: 4529 Credit: 18,661,594 RAC: 14,529 |
If corrupted files are causing the problem, resetting the project will cause them to be downloaded afresh. Any running tasks for CPDN will be lost if you do that, so I would probably set the project to no new tasks and wait until all the tasks you have are either finished or errored out before doing the reset. It is also worth checking that you have carried out the advice from the sticky at the top of the preferences section of the message boards. This is especially so if you sometimes have anything very processor-intensive running. |
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
Now I'm getting really frustrated. I did these things (well, I checked the settings, which were set to the recommended ones some time ago) and then reset the project. Now I'm getting: Task hadam3p_afr_uag9_2013_1_009442494_1 exited with zero status but no 'finished' file. This isn't funny. Any thoughts? |
Joined: 7 Aug 04 Posts: 2183 Credit: 64,822,615 RAC: 5,275 |
Hi, not sure what is going on, but the PNWs that crashed today at 1500 UTC/GMT still show 7.4.36 as your BOINC version in stderr, whereas the latest is 7.4.42. Perhaps the install didn't take? |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The 'If this happens repeatedly you may need to reset the project' message is nearly always a red herring; resetting almost never fixes it. The cause is a loss of connection between the app and BOINC, so resetting the project is a waste of time. There's a problem with something on or in your computer, e.g. intensive use of the computer by you to do something else. |
Joined: 15 May 09 Posts: 4529 Credit: 18,661,594 RAC: 14,529 |
At which point I shall bow out of the discussion, not having had a Windows computer for over 10 years. |
Joined: 18 Dec 13 Posts: 62 Credit: 1,078,935 RAC: 0 |
Okay, thanks for that, everyone. I've just completed four hadcm3s units, and think it's probably fixed. I did virus and hardware scans, both of which turned up clean. Geophi was correct to suspect the BOINC update didn't install properly. I downloaded and installed the software again. This made it run, but very slowly. I had another look at the settings. Under Computing preferences > Processor usage > Other options I changed the settings to allow it to use 100% of the processors and 100% of the CPU time, rather than 0% (no restriction); the two seem to behave differently, even though it looks like they shouldn't. The timings on the units then reset to zero, and I thought I was about to lose the next set of work units. The system is running quite warm, and I'm having to watch the heat, which is fine in my den in March, but may not be come July. I may have to tell it to leave one processor alone, just to keep the heat down. Anyway, it works again. Thanks again. |
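
For anyone who ends up wanting to "leave one processor alone" as mentioned in the last post, here is a rough sketch of the arithmetic behind the "use at most N% of the processors" setting. It is not taken from the thread: the function names and the 4-core machine are made up for illustration, and it assumes the BOINC client reads local overrides from a `global_prefs_override.xml` file with `max_ncpus_pct` and `cpu_usage_limit` elements. Check that against the documentation for your BOINC version before relying on it.

```python
def percent_leaving_one_core_free(n_cores: int) -> float:
    """Percentage of processors to allow so that one core stays idle.

    Hypothetical helper. The value is rounded UP to two decimals using
    integer arithmetic, so it never lands just short of a whole number
    of cores (e.g. 66.67 rather than 66.66 on a 3-core machine).
    """
    if n_cores <= 1:
        # A single-core machine cannot spare a core; keep it at 100%.
        return 100.0
    hundredths = -(-(n_cores - 1) * 10000 // n_cores)  # ceiling division
    return hundredths / 100


def render_override(max_ncpus_pct: float, cpu_usage_limit: float = 100.0) -> str:
    """Build a preferences override fragment as text.

    Assumption: element names follow BOINC's global_prefs_override.xml.
    cpu_usage_limit is left at 100% ("100% of the CPU time" in the post).
    """
    return (
        "<global_preferences>\n"
        f"   <max_ncpus_pct>{max_ncpus_pct:.2f}</max_ncpus_pct>\n"
        f"   <cpu_usage_limit>{cpu_usage_limit:.2f}</cpu_usage_limit>\n"
        "</global_preferences>\n"
    )


if __name__ == "__main__":
    cores = 4  # hypothetical machine; substitute your own core count
    print(render_override(percent_leaving_one_core_free(cores)))
```

The same numbers can simply be typed into the Processor usage dialog. Lowering the processor percentage frees a whole core, which is probably the gentler way to cut heat; lowering the CPU time percentage instead makes every task pause and resume repeatedly.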
©2024 cpdn.org