Thread 'Several WU's fail after a few seconds - see log below'

Mike Molson

Joined: 29 Mar 14
Posts: 20
Credit: 1,281,898
RAC: 0
Message 50412 - Posted: 8 Oct 2014, 16:14:14 UTC

9/10/2014 2:07:37 AM | climateprediction.net | task hadam3p_pnw_sbg2_2011_1_009084100_0 suspended by user
9/10/2014 2:07:37 AM | climateprediction.net | task hadam3p_pnw_sbop_2011_1_009084411_0 suspended by user
9/10/2014 2:07:37 AM | climateprediction.net | task hadam3p_pnw_sbg3_2011_1_009084101_0 suspended by user
9/10/2014 2:07:37 AM | climateprediction.net | task hadam3p_pnw_sbox_2011_1_009084419_0 suspended by user
9/10/2014 2:07:37 AM | World Community Grid | task MCM1_0008064_9989_1 suspended by user
9/10/2014 2:07:37 AM | World Community Grid | task MCM1_0008064_9853_0 suspended by user
9/10/2014 2:07:57 AM | climateprediction.net | task hadcm3s_2wrx_2003_2_009071888_1 resumed by user
9/10/2014 2:07:58 AM | climateprediction.net | Starting task hadcm3s_2wrx_2003_2_009071888_1
9/10/2014 2:08:10 AM | climateprediction.net | Task hadcm3s_2wrx_2003_2_009071888_1 exited with zero status but no 'finished' file
9/10/2014 2:08:10 AM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 2:08:17 AM | climateprediction.net | task hadcm3s_2wrx_2003_2_009071888_1 suspended by user
9/10/2014 2:08:21 AM | climateprediction.net | Task hadcm3s_2wrx_2003_2_009071888_1 exited with zero status but no 'finished' file
9/10/2014 2:08:21 AM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 2:08:30 AM | climateprediction.net | task hadam3p_pnw_sbox_2011_1_009084419_0 resumed by user
9/10/2014 2:08:31 AM | climateprediction.net | Starting task hadam3p_pnw_sbox_2011_1_009084419_0
9/10/2014 2:08:46 AM | climateprediction.net | Task hadam3p_pnw_sbox_2011_1_009084419_0 exited with zero status but no 'finished' file
9/10/2014 2:08:46 AM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 2:08:57 AM | climateprediction.net | Computation for task hadam3p_pnw_sbox_2011_1_009084419_0 finished
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_1.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_2.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_3.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_4.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_5.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_6.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_7.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_8.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_9.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_10.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_11.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_12.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:08:57 AM | climateprediction.net | Output file hadam3p_pnw_sbox_2011_1_009084419_0_13.zip for task hadam3p_pnw_sbox_2011_1_009084419_0 absent
9/10/2014 2:09:01 AM | climateprediction.net | task hadam3p_pnw_sbox_2011_1_009084419_0 suspended by user
ID: 50412
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50418 - Posted: 8 Oct 2014, 20:57:58 UTC - in response to Message 50412.  

Mike

That's not the error log, just a BOINC log.
The real error log is called Stderr, and is found on the page for each individual model. Click the + (plus) to expand the list.

What you posted is:
1) A list of the files that BOINC couldn't find when the model crashed. And that's because the model didn't run for long enough to reach a zip creation point.

2) An indication that you're still using the default setting for one of the options. OK for other projects, often fatal here. Climate models do NOT like being constantly interrupted. As has been said zillions of times.

Finally, please read the thread here about the problems with the latest batch of PNW models, especially the first post in the thread.


ID: 50418
Mike Molson

Joined: 29 Mar 14
Posts: 20
Credit: 1,281,898
RAC: 0
Message 50420 - Posted: 9 Oct 2014, 2:17:19 UTC - in response to Message 50418.  

Is there more information useful to you in the Error Log you mentioned in your last post? In any event the following stuff from Event Log is clearly not good. I've done 'reset project' many times - it makes no difference to the WU processing. I've had this type of WU failure for the past 2 weeks at least.
What should I do now, Les?
Mike


9/10/2014 12:07:24 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:07:24 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:07:36 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:07:36 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:07:48 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:07:48 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:07:59 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:07:59 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:08:11 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:08:11 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:08:22 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:08:22 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:08:34 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:08:34 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:08:45 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:08:45 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:08:56 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:08:56 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:09:08 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:09:08 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:09:19 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:09:19 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 12:09:31 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:09:31 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.
ID: 50420
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50421 - Posted: 9 Oct 2014, 2:33:31 UTC

You could try changing the setting "Suspend work if CPU usage is above" to zero.

But also, leaving the models alone may work. That message doesn't indicate a fatal error is imminent, whereas a project reset is ALWAYS fatal for running models, because that's what it's supposed to do.

And we can see the stderr list for your models, so there's no need to post them.

ID: 50421
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50423 - Posted: 9 Oct 2014, 5:31:56 UTC

Another thing to check: "Leave tasks in memory while suspended?"

This is best set to Yes for this project.

ID: 50423
Mike Molson

Joined: 29 Mar 14
Posts: 20
Credit: 1,281,898
RAC: 0
Message 50424 - Posted: 9 Oct 2014, 5:35:24 UTC - in response to Message 50421.  

Where do I find "Suspend work if CPU usage is above"?
ID: 50424
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,008,987
RAC: 21,524
Message 50425 - Posted: 9 Oct 2014, 5:56:24 UTC - in response to Message 50424.  

Tools > Computing preferences > Processor usage
ID: 50425
Mike Molson

Joined: 29 Mar 14
Posts: 20
Credit: 1,281,898
RAC: 0
Message 50427 - Posted: 9 Oct 2014, 7:04:06 UTC - in response to Message 50425.  

I must have an old version of BOINC. I can't find this option under Tools / Computing Preferences.
ID: 50427
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,008,987
RAC: 21,524
Message 50428 - Posted: 9 Oct 2014, 7:20:28 UTC - in response to Message 50427.  

It has been there for a long time; I just realised that the format is slightly different. In the "Computing allowed" section, underneath "Only after computer idle for ... mins", you will see "While computer usage is less than ... percent (0 means no restriction)".

For some people, setting this to 0 reduces the incidence of the messages you posted.
ID: 50428
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50430 - Posted: 9 Oct 2014, 7:33:33 UTC

The options are also in the Computing Preferences section on your Account page on the project's server.

ID: 50430
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,703,308
RAC: 9,860
Message 50431 - Posted: 9 Oct 2014, 8:00:29 UTC

You have to use either the web-based preferences set through this web site, or local preferences set locally via BOINC Manager. You can't mix the two.

If you set even one value locally via BOINC Manager, the whole set is 'frozen in' and web-based preferences are ignored from that point on.
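
As a rough illustration of checking which values are frozen in locally, here is a minimal Python sketch. It assumes the local preferences live in a file called global_prefs_override.xml in the BOINC data directory, and that the two settings suggested earlier in this thread are stored under the element names `suspend_cpu_usage` and `leave_apps_in_memory`. Treat those names as assumptions and compare them with your own file. The `check_override` helper and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed element names for the two settings discussed in this thread.
CPDN_FRIENDLY = {
    "suspend_cpu_usage": 0.0,     # 0 means no restriction
    "leave_apps_in_memory": 1.0,  # 1 means Yes
}

def check_override(xml_text):
    """Return a list of settings that differ from the suggested values."""
    root = ET.fromstring(xml_text)
    problems = []
    for tag, wanted in CPDN_FRIENDLY.items():
        node = root.find(tag)
        if node is None or node.text is None:
            problems.append(f"{tag}: not set locally")
        elif float(node.text) != wanted:
            problems.append(f"{tag}: is {node.text.strip()}, suggested {wanted:g}")
    return problems

# Hypothetical file contents, not taken from any real host.
sample = """<global_preferences>
   <suspend_cpu_usage>25.000000</suspend_cpu_usage>
   <leave_apps_in_memory>0</leave_apps_in_memory>
</global_preferences>"""

for problem in check_override(sample):
    print(problem)
```

To check a real host, read the file's text and pass it to `check_override`. An empty list means both values already match the suggestions.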
ID: 50431
Mike Molson

Joined: 29 Mar 14
Posts: 20
Credit: 1,281,898
RAC: 0
Message 50432 - Posted: 9 Oct 2014, 8:21:35 UTC - in response to Message 50421.  

You said "That message doesn't indicate a fatal error is imminent, whereas a project reset is ALWAYS fatal for running models, because that's what it's supposed to do."

Les,
Does this mean that I should not have done a "Project Reset"? (I did it after suspending the failing WU.) Is there something else I should do now that I have done a "Project Reset", e.g. reload the Climate Change programs?

Mike
ID: 50432
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,008,987
RAC: 21,524
Message 50433 - Posted: 9 Oct 2014, 8:33:24 UTC - in response to Message 50432.  

Probably not an issue. You said that models were failing anyway.

When you get the occasional message of the type

9/10/2014 12:07:24 PM | climateprediction.net | Task hadcm3s_32ih_2003_2_009079324_1 exited with zero status but no 'finished' file
9/10/2014 12:07:24 PM | climateprediction.net | If this happens repeatedly you may need to reset the project.


it isn't a problem, but a long stream of them almost inevitably means the task ends up failing.

The main thing to do is check the settings in Computing preferences and then see if that resolves the issue.

Other than making the settings more cpdn friendly, I don't think there is much else you could have done.
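
To tell an occasional message from a long stream, something like the following Python sketch could count the messages per task in a saved copy of the Event Log. The regular expression matches the log line quoted above. The function names and the threshold value are made up for illustration and are not an official cut-off.

```python
import re
from collections import Counter

# Matches the BOINC event-log line quoted in this thread.
NO_FINISHED = re.compile(r"Task (\S+) exited with zero status but no 'finished' file")

def count_restarts(log_text):
    """Count 'no finished file' exits per task name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NO_FINISHED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def flag_tasks(log_text, threshold=5):
    """Return tasks whose count reaches the (hypothetical) threshold."""
    return sorted(t for t, n in count_restarts(log_text).items() if n >= threshold)

# A few lines copied from the logs posted earlier in this thread.
sample = """\
9/10/2014 2:08:10 AM | climateprediction.net | Task hadcm3s_2wrx_2003_2_009071888_1 exited with zero status but no 'finished' file
9/10/2014 2:08:10 AM | climateprediction.net | If this happens repeatedly you may need to reset the project.
9/10/2014 2:08:21 AM | climateprediction.net | Task hadcm3s_2wrx_2003_2_009071888_1 exited with zero status but no 'finished' file
9/10/2014 2:08:46 AM | climateprediction.net | Task hadam3p_pnw_sbox_2011_1_009084419_0 exited with zero status but no 'finished' file
"""

print(count_restarts(sample))
print(flag_tasks(sample, threshold=2))
```

A task that shows up once or twice is probably fine to leave alone. One that repeats every ten seconds or so, as in the second log above, is the case to worry about.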
ID: 50433
