climateprediction.net (CPDN) home page
Thread 'HadCM3 short - errors galore'


alanb1951
Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 50328 - Posted: 27 Sep 2014, 4:31:24 UTC - in response to Message 50324.  

Interested to know if anyone else is getting this with the short models. I noticed disk usage seemed to be getting a bit high for BOINC and on checking, the last 4 short models hadn't cleaned up after themselves though they had sent all zips and cleared from Tasks In Progress view. If others have had this is it only on nix boxen or a global issue?

Edit: Just to be completely clear, this is not crashed tasks leaving their detritus on my disk which I know is a problem but models that have finished without a hitch other than having to wait to report/upload zips on some occasions.


Just to confirm Eirik Redd's earlier reply to your post...

I run CPDN on two Ubuntu machines, and I don't think any of the HADCM3S tasks that downloaded successfully have ever cleared up after themselves - it's been a regular task to recover the disc space!

I'm seeing this with both BOINC 7.2.33 (on 12.04) and BOINC 7.4.8 (on 14.04) which I got from costamagnagianfranco's PPA. So if it is the client rather than the application causing the clean-up problem, it's in the latest release candidate as well as the current production version.

Hopefully someone will chip in with something definitive about why it might be happening (or perhaps we need a separate thread to motivate that?) The current behaviour is certainly a nuisance!

Al.
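For anyone recovering the disc space by hand, a minimal Python sketch like the one below shows how much each leftover directory holds. It assumes the leftovers are sub-directories of the project directory whose names start with hadcm3s. That prefix and the function names are illustrative assumptions, not something confirmed in this thread. It only reports sizes and deletes nothing, and it cannot tell a finished model from one that is still running, so check the Tasks view before removing anything.

```python
import os
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Return the total size of all regular files below path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = Path(root) / name
            if not fp.is_symlink() and fp.is_file():
                total += fp.stat().st_size
    return total


def leftover_report(project_dir: Path, prefix: str = "hadcm3s"):
    """List (directory name, size in bytes) for sub-directories whose
    names start with prefix, largest first.

    The default prefix is an assumption about how the task
    directories are named; adjust it to what you see on disk.
    """
    rows = [
        (p.name, dir_size_bytes(p))
        for p in project_dir.iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling leftover_report(Path(...)) with the path of your own climateprediction.net project directory returns the list, which you can then compare against the tasks BOINC still shows as in progress.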


ID: 50328

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50330 - Posted: 27 Sep 2014, 6:29:59 UTC

Thanks, that's what I needed to know, i.e. that it is nothing to do with my machine.

Curiosity makes me ask whether it happens on Windows and Mac machines as well?
ID: 50330

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50332 - Posted: 27 Sep 2014, 6:45:23 UTC - in response to Message 50330.  

Hi Dave

I'm afraid that "leave stuff behind" thing is a problem with the model type.
I'm hopeful that some more testing will get done on these with a new compile to clear up some of the things that got rushed through testing.


ID: 50332

Werinbert
Joined: 29 Jul 13
Posts: 4
Credit: 1,008,021
RAC: 0
Message 50384 - Posted: 7 Oct 2014, 11:47:54 UTC
Last modified: 7 Oct 2014, 11:49:35 UTC

I am getting different errors for my short runs.
Stderr: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=17125779

I am also getting a windows error dialog (Visual Fortran run-time error):
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file
E:\...\projects\climateprediction.net\hadcm3s_2
nvt_1981_2_009047449\jobs\climate.cpdc, line 393, position 19
followed by a stack trace.
ID: 50384

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,826,970
RAC: 5,066
Message 50386 - Posted: 7 Oct 2014, 12:05:24 UTC - in response to Message 50384.  
Last modified: 7 Oct 2014, 12:06:10 UTC

I am getting different errors for my short runs.
Stderr: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=17125779

I am also getting a windows error dialog (Visual Fortran run-time error):
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file
E:\...\projects\climateprediction.net\hadcm3s_2
nvt_1981_2_009047449\jobs\climate.cpdc, line 393, position 19
followed by a stack trace.


... as I understand it, the Visual Fortran message appears only if the BOINC installation is not a service. For service installs the model still fails but the message box is suppressed. There is a small anomaly close to the reported location in the NAMELIST file (tab instead of space) but a compliant FORTRAN compiler would not, as far as I know, treat that as an error. So the error message appears to be a dead end.
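
To check a NAMELIST file for stray tabs like the one mentioned above, a small Python sketch such as this lists the line and column of every tab character. The function name and the sample text are made up for illustration, and the column it reports is a plain 1-based character count, which may not match exactly how the Fortran run-time counts "position".

```python
def find_tabs(text: str):
    """Return a list of (line_number, column) pairs, both 1-based,
    one for every tab character found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                hits.append((lineno, col))
    return hits


# Hypothetical namelist fragment with a tab between two assignments.
sample = "&NLIST\n A=1,\tB=2\n/\n"
for lineno, col in find_tabs(sample):
    print(f"tab at line {lineno}, column {col}")
```

To run it on a real file, read the file's contents as text and pass them to find_tabs. An empty result would suggest looking elsewhere for the cause.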
ID: 50386

SuperSluether
Joined: 6 Jul 14
Posts: 11
Credit: 367,660
RAC: 0
Message 50388 - Posted: 7 Oct 2014, 13:15:10 UTC - in response to Message 50386.  

Just got 4 more errors on the 'short' application.
ID: 50388

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50390 - Posted: 7 Oct 2014, 13:19:25 UTC - in response to Message 50388.  

Just got 4 more errors on the 'short' application.


From the new batch or not?

There was some talk of more testing, which should hopefully stop this in the next release. I'm also waiting to see whether the new batch cleans up after itself.
ID: 50390

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,718,239
RAC: 8,054
Message 50393 - Posted: 7 Oct 2014, 16:04:19 UTC - in response to Message 50392.  

Short models created today still show download errors:

<core_client_version>6.10.43</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>hadcm3s_7.24_windows_intelx86.exe</file_name>
<error_code>-120</error_code>
<error_message>signature verification error</error_message>
</file_xfer_error>
</message>
]]>

No, that's not a problem with models created today - that's the application file, unchanged since 23 July 2014.

Signature errors like that are usually the result of an anti-virus program or some other malware blocker interfering with the download. Some people reported success with that same file, using the technique I described in message 50291.
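
If you have a lot of failed tasks to go through, a rough Python sketch like the following pulls the file name, error code and message out of a stderr fragment shaped like the one quoted above. It uses a regular expression rather than an XML parser because the fragment is not a complete XML document. The function name is my own, and the pattern assumes the three tags always appear in this order, which is only what this one example shows.

```python
import re

# Matches one <file_xfer_error> block as laid out in the quoted fragment.
XFER_ERROR_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<file_name>.*?)</file_name>\s*"
    r"<error_code>(?P<error_code>-?\d+)</error_code>\s*"
    r"<error_message>(?P<error_message>.*?)</error_message>\s*"
    r"</file_xfer_error>",
    re.DOTALL,
)


def parse_xfer_errors(stderr_text: str):
    """Return one dict per transfer error found in stderr_text."""
    return [
        {
            "file_name": m.group("file_name").strip(),
            "error_code": int(m.group("error_code")),
            "error_message": m.group("error_message").strip(),
        }
        for m in XFER_ERROR_RE.finditer(stderr_text)
    ]
```

Grouping the results by file_name makes it easy to see whether every failure points at the same application file, as it does here, rather than at the individual models.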
ID: 50393

Anubischick
Joined: 11 Dec 05
Posts: 5
Credit: 1,653,433
RAC: 0
Message 50398 - Posted: 8 Oct 2014, 3:11:41 UTC

Yep just got 4 errors myself, and downloaded another 4 'short' and they are doing the same thing.


ID: 50398

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50399 - Posted: 8 Oct 2014, 4:07:15 UTC

It's not a problem on Linux.

ID: 50399

Harri Liljeroos
Joined: 9 Dec 05
Posts: 116
Credit: 12,547,934
RAC: 2,738
Message 50400 - Posted: 8 Oct 2014, 7:27:07 UTC

The new short tasks are also giving the INVALID THETA error on my Win XP64 machine. It has never finished any short tasks successfully. I have now disabled short tasks for that machine. Below is stderr from one of the latest tasks.

<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

Model crashed: ATM_DYN : INVALID THETA DETECTED.                                                                                                                                                                                                                               tmp/pipe_dummy                                                                  2048    

Model crashed: ATM_DYN : INVALID THETA DETECTED.                                                                                                                                                                                                                               tmp/pipe_dummy                                                                  2048    

Model crashed: ATM_DYN : INVALID THETA DETECTED.                                                                                                                                                                                                                               tmp/pipe_dummy                                                                  2048    

Model crashed: ATM_DYN : INVALID THETA DETECTED.                                                                                                                                                                                                                               tmp/pipe_dummy                                                                  2048    

Model crashed: ATM_DYN : INVALID THETA DETECTED.                                                                                                                                                                                                                               tmp/pipe_dummy                                                                  2048    

Model crashed: ATM_DYN : INVALID THETA DETECTED.                                                                                                                                                                                                                               tmp/pipe_dummy                                                                  2048    
Sorry, too many model crashes! :-(
07:22:24 (2292): called boinc_finish

</stderr_txt>
]]>
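
For comparing several of these logs, a minimal Python sketch along these lines counts the "Model crashed" lines per routine and reason, and checks whether the task gave up. The function names are illustrative, and the pattern is based only on the line format visible in the stderr above, so other crash messages may need a different expression.

```python
import re
from collections import Counter

# "Model crashed: <ROUTINE> : <reason>" with the reason ending at the
# first full stop or at the end of the line.
CRASH_RE = re.compile(r"Model crashed:\s*(\w+)\s*:\s*([^.\n]+)")


def summarise_crashes(stderr_text: str) -> Counter:
    """Count crash lines, keyed by (routine, reason)."""
    counts = Counter()
    for match in CRASH_RE.finditer(stderr_text):
        counts[(match.group(1), match.group(2).strip())] += 1
    return counts


def gave_up(stderr_text: str) -> bool:
    """True if the log contains the 'too many model crashes' line."""
    return "too many model crashes" in stderr_text
```

Feeding it the stderr text of each failed task gives a quick tally to compare across machines or batches.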

ID: 50400

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50401 - Posted: 8 Oct 2014, 8:08:38 UTC

Harri, I see that all your wingmen also seemed to fail with their bits of those work units. They were all using Windows as well. I still have a couple of the old batch to complete before knowing whether the new ones complete on Linux or not.
ID: 50401

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50402 - Posted: 8 Oct 2014, 9:08:29 UTC - in response to Message 50400.  

INVALID THETA means that the model's physics has become unstable.
Which is what the researchers are looking for. All perfectly normal.

If there's a lot of them, then the values for the forcing parameters used for each of the models must be pushing the model's physics close to the limits of stability. This is called research.

And only the researcher knows what he's trying to find out at the moment.

ID: 50402

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50403 - Posted: 8 Oct 2014, 9:29:34 UTC

I recognise that, Les, but all of the ones I looked at fell over at about 5 seconds, which makes me wonder whether something else is going on.
ID: 50403

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50405 - Posted: 8 Oct 2014, 10:49:11 UTC - in response to Message 50403.  

Sorry Dave. That was meant for Harri.

I didn't look too far into failures.
My first time models start with "30", so the new series may be 30, 31, 32, etc.
They're also for 2003. And should finish soon.

My other machine picked up 4 rejects, 5 if you count a "permanent download failure".
Not sure how far these will run.

So, a mix of old/bad and new/good are floating around.


ID: 50405

nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 50443 - Posted: 9 Oct 2014, 15:36:07 UTC

I have had 3 of these short models fail with

Model crashed: ATM_DYN : INVALID THETA DETECTED

I presume we will get some credit(s) for the work done. Also I hope there is some scientific value gained.
ID: 50443

[AF>Le_Pommier] Jerome_C2005
Joined: 21 Oct 10
Posts: 53
Credit: 2,101,753
RAC: 3,985
Message 50446 - Posted: 9 Oct 2014, 19:43:46 UTC

None of the UK Met Office HadCM3 short v7.24 tasks has ever finished successfully on my iMac : http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1314566

On the PC the situation is not brilliant but still better : http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1304501
ID: 50446

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 50447 - Posted: 9 Oct 2014, 19:50:55 UTC

On the PC the situation is not brilliant but still better :


And as those who have followed this and other threads on the subject will know, many PCs have not been able to complete a single task of this type, whereas most Linux boxes seem to finish them all. They certainly seem pretty bomb-proof on my machine so far, even surviving power failures without problems.
ID: 50447

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 50448 - Posted: 9 Oct 2014, 19:51:55 UTC - in response to Message 50443.  

Credits are awarded per Trickle received at the server. If your tasks sent Trickles, you will get credit eventually (not instantaneously). Of course, there is typically a period between last Trickle and crash point that won't be compensated.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 50448

[AF>Le_Pommier] Jerome_C2005
Joined: 21 Oct 10
Posts: 53
Credit: 2,101,753
RAC: 3,985
Message 50449 - Posted: 9 Oct 2014, 20:14:15 UTC

I'm not worried about credits, and I know about the trickle mechanism that's used here (and it's great!). It's more that I wonder "what the heck with this application" :D
ID: 50449