Message boards : Number crunching : Batch 774 (safr50)
Author | Message |
---|---|
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
My test run crashed at 6 seconds. The Fortran pop-up starts: forrtl: severe (19): invalid reference to variable in NAMELIST input, unit 4, file C:\ProgramData\BOINC\projects\climateprediction.net\wah2_safr50_b3e4_200212_16_774_011680311\jobs\xadae.stashc, line 60, position 13 and then waffles on for a bit. ************** There are a heck of a lot of files involved in putting these things together, and the person running this must have made a mistake in moving from a handful of test models running for a short period to a full-blown batch, which is also on different servers, so different server names. Oh well. Research is full of trial and error. Hopefully no one was injured in this initial run. :) (There's been a flurry of news stories recently in my local area about medical blunders, which is a worry.) |
Send message Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0 |
If you installed BOINC to run as a service, the batch 774 tasks seem to get stuck in the initialisation of the regional model instead of crashing out with a Fortran runtime error pop-up dialog box. In that case BOINC Manager will show the elapsed time and progress increasing as expected, but if you open the task properties dialog box, the CPU time stays at "---". If checkpoint or task debug is enabled, BOINC's event log shows that no checkpoints are being made, and the elapsed time and progress will revert to 0 if you restart BOINC. If this applies to your system, you should abort all of its batch 774 tasks. "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
Send message Joined: 14 Feb 06 Posts: 31 Credit: 4,507,116 RAC: 2,013 |
Thanks for confirming the problem. I had 4 of these. |
Send message Joined: 15 May 09 Posts: 4538 Credit: 19,002,360 RAC: 21,497 |
I understand this batch has been paused/withdrawn till the problem is sorted. Not sure if this has been done in a way to stop the retreads going out after failing however. Hopefully more news in an hour or two once Oxford wakes up. (It isn't that far west of Cambridge so shouldn't be too long ;) ) |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022 |
"I understand this batch has been paused/withdrawn till the problem is sorted. Not sure if this has been done in a way to stop the retreads going out after failing however. Hopefully more news in an hour or two once Oxford wakes up." I had 4 of the retreads last night. Three were _1's and one was a _2. They all crashed after about 90 seconds. At least they don’t waste a lot of computer cycles before they buy the farm. |
Send message Joined: 22 Feb 06 Posts: 491 Credit: 30,971,756 RAC: 14,149 |
I had 5 overnight and this morning. All failed within 90 seconds. When I get problem sets like this I pause anything already running and start the rogue ones. Gets through the failures quicker. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
"I had 5 overnight and this morning. All failed within 90 seconds. When I get problem sets like this I pause anything already running and start the rogue ones. Gets through the failures quicker." Right - push the likely misconfigured units out the door, clear the queue for the next good batch. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Batch 774 was closed yesterday. |
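
For anyone who wants to apply the service-install symptom described earlier in the thread (elapsed time climbing while CPU time stays at "---") to a handful of tasks, here is a rough Python sketch. It is only an illustration: `Sample`, `looks_stalled`, and both thresholds are made-up names and values, not anything from BOINC, and the numbers are assumed to be typed in by hand from the task properties dialog, with "---" entered as 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One hand-recorded observation of a task (hypothetical structure)."""
    elapsed_s: float  # elapsed time shown for the task, in seconds
    cpu_s: float      # CPU time shown for the task; enter 0.0 when it shows "---"


def looks_stalled(samples: List[Sample],
                  min_elapsed_gain_s: float = 60.0,
                  max_cpu_gain_s: float = 1.0) -> bool:
    """Return True if elapsed time advanced but CPU time essentially did not.

    Compares the first and last samples only. Both thresholds are
    arbitrary assumptions chosen for illustration.
    """
    if len(samples) < 2:
        return False  # not enough observations to judge
    first, last = samples[0], samples[-1]
    elapsed_gain = last.elapsed_s - first.elapsed_s
    cpu_gain = last.cpu_s - first.cpu_s
    return elapsed_gain >= min_elapsed_gain_s and cpu_gain <= max_cpu_gain_s


if __name__ == "__main__":
    stuck = [Sample(0.0, 0.0), Sample(300.0, 0.0)]
    healthy = [Sample(0.0, 0.0), Sample(300.0, 290.0)]
    print("stuck task flagged:", looks_stalled(stuck))
    print("healthy task flagged:", looks_stalled(healthy))
```

A task flagged this way still needs a look at the event log before aborting; the sketch only restates the elapsed-versus-CPU comparison and knows nothing about checkpoints.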
©2024 cpdn.org