climateprediction.net (CPDN) home page
Thread 'Batch 774 (safr50)'

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59114 - Posted: 27 Nov 2018, 19:52:42 UTC

My test run crashed at 6 seconds.
The Fortran pop-up starts:

forrtl: severe (19): invalid reference to variable in NAMELIST input, unit 4, file C:\ProgramData\BOINC\projects\climateprediction.net\wah2_safr50_b3e4_200212_16_774_011680311\jobs\xadae.stashc, line 60, position 13

and then waffles on for a bit.
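
For anyone wanting to look at the spot the runtime is complaining about, here is a minimal Python sketch that pulls the file, line and position out of a message shaped like the one above and marks the reported column. The field layout is assumed from that single example, and the helper names (parse_runtime_error, show_offending_line) are made up for illustration; this is not a project tool.

```python
import re

# Field layout assumed from the single error message quoted in this thread.
ERROR_RE = re.compile(
    r"(?P<level>\w+)\s*\((?P<code>\d+)\):\s*(?P<text>.*?),\s*"
    r"unit\s+(?P<unit>\d+),\s*file\s+(?P<file>.*?),\s*"
    r"line\s+(?P<line>\d+),\s*position\s+(?P<pos>\d+)"
)


def parse_runtime_error(message):
    """Return the fields of the runtime error as a dict, or None if no match."""
    match = ERROR_RE.search(message)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("code", "unit", "line", "pos"):
        info[key] = int(info[key])
    return info


def show_offending_line(namelist_text, line, pos):
    """Return the reported line with a caret under the reported column.

    Both line and pos are treated as 1-based, which is an assumption.
    """
    lines = namelist_text.splitlines()
    if not 1 <= line <= len(lines):
        return None
    return lines[line - 1] + "\n" + " " * (pos - 1) + "^"
```

To use it, read the .stashc file named in the message as text and pass its contents to show_offending_line together with the parsed line and position.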

**************

There are a heck of a lot of files involved in putting these things together, and the person running this batch must have made a mistake when moving from a handful of short test models to a full-blown batch. The full batch also runs on different servers, so the server names are different too.

Oh well. Research is full of trial and error.
Hopefully no one was injured in this initial run. :)

(There's been a flurry of news stories recently in my local area about medical blunders, which is a worry.)
ID: 59114
Thyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 59116 - Posted: 27 Nov 2018, 22:51:14 UTC - in response to Message 59114.  

If you installed BOINC to run as a service, the batch 774 tasks seem to get stuck in the initialisation of the regional model instead of crashing out with a Fortran runtime error pop-up dialog box.

In that case BOINC Manager will show the elapsed time and progress increasing as expected, but if you open the task properties dialog box the CPU time stays at "---". If checkpoint or task debug logging is enabled, BOINC's event log shows that no checkpoints are being made, and the elapsed time and progress will revert to 0 if you restart BOINC.

If this applies to your system you should abort all of its batch 774 tasks.
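
If you have a lot of tasks queued, a rough Python sketch like the one below can pick out the batch 774 ones and build the matching boinccmd abort commands. It only builds the commands and does not run anything. The assumption that the batch number is the sixth underscore-separated field comes from the one task directory name quoted earlier in this thread, the second task name in the example is hypothetical, and the project URL shown is a placeholder that should be replaced with the master URL your own client reports.

```python
def batch_number(task_name):
    """Guess the batch number from a task name.

    Assumes the layout seen in wah2_safr50_b3e4_200212_16_774_011680311,
    where the sixth underscore-separated field is the batch.
    """
    fields = task_name.split("_")
    if len(fields) > 5 and fields[5].isdigit():
        return int(fields[5])
    return None


def abort_commands(task_names, batch, project_url):
    """Build one 'boinccmd --task URL NAME abort' argument list per matching task."""
    return [
        ["boinccmd", "--task", project_url, name, "abort"]
        for name in task_names
        if batch_number(name) == batch
    ]


# Example input: the first name is taken from this thread (with a _1 resend
# suffix added), the second is hypothetical.
EXAMPLE_TASKS = [
    "wah2_safr50_b3e4_200212_16_774_011680311_1",
    "wah2_eu25_a1b2_201012_12_770_011600000_0",
]

# Placeholder URL, an assumption; check it against your client before use.
EXAMPLE_URL = "http://climateprediction.net/"

EXAMPLE_COMMANDS = abort_commands(EXAMPLE_TASKS, 774, EXAMPLE_URL)
```

Each entry in the result is an argument list you can review and then run by hand, or pass to subprocess.run once you are happy with it.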
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 59116
Paul

Joined: 14 Feb 06
Posts: 31
Credit: 4,507,116
RAC: 2,013
Message 59117 - Posted: 28 Nov 2018, 4:30:16 UTC

Thanks for confirming the problem. I had 4 of these.
ID: 59117
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4538
Credit: 19,005,674
RAC: 21,647
Message 59118 - Posted: 28 Nov 2018, 7:56:53 UTC

I understand this batch has been paused/withdrawn until the problem is sorted. I'm not sure, however, whether this has been done in a way that stops the retreads going out after a failure. Hopefully there will be more news in an hour or two once Oxford wakes up.

(It isn't that far west of Cambridge so shouldn't be too long ;) )
ID: 59118
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 59122 - Posted: 28 Nov 2018, 15:12:13 UTC - in response to Message 59118.  

I had 4 of the retreads last night. Three were _1s and one was a _2. They all crashed after about 90 seconds. At least they don’t waste a lot of computer cycles before they buy the farm.
ID: 59122
Alan K

Joined: 22 Feb 06
Posts: 491
Credit: 30,975,898
RAC: 14,500
Message 59123 - Posted: 28 Nov 2018, 23:08:16 UTC - in response to Message 59122.  

I had 5 overnight and this morning. All failed within 90 seconds. When I get problem sets like this, I pause anything already running and start the rogue ones. It gets through the failures quicker.
ID: 59123
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 59124 - Posted: 1 Dec 2018, 10:14:56 UTC - in response to Message 59123.  

Right - push the likely misconfigured units out the door, clear the queue for the next good batch.
ID: 59124
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59125 - Posted: 1 Dec 2018, 11:41:39 UTC

Batch 774 was closed yesterday.
ID: 59125
